Re: Strange behavior with 'not' and filter pushdown

2017-02-13 Thread Takeshi Yamamuro
Oh, Thanks for checking!

On Tue, Feb 14, 2017 at 12:32 PM, Xiao Li  wrote:

> https://github.com/apache/spark/pull/16894
>
> Already backported to Spark 2.0
>
> Thanks!
>
> Xiao
>
> 2017-02-13 17:41 GMT-08:00 Takeshi Yamamuro :
>
>> cc: xiao
>>
>> IIUC, Xiao's commit below fixed this issue in master.
>> https://github.com/apache/spark/commit/2eb093decb5e87a1ea71bbaa28092876a8c84996
>>
>> Is this fix worth backporting to the v2.0 branch?
>> I checked that I could reproduce it there:
>>
>> ---
>>
>> scala> Seq((1, "a"), (2, "b"), (3, null)).toDF("c0",
>> "c1").write.mode("overwrite").parquet("/Users/maropu/Desktop/data")
>> scala> spark.read.parquet("/Users/maropu/Desktop/data").createOrReplaceTempView("t")
>> scala> val df = sql("SELECT c0 FROM t WHERE NOT(c1 IS NOT NULL)")
>> scala> df.explain(true)
>> == Parsed Logical Plan ==
>> 'Project ['c0]
>> +- 'Filter NOT isnotnull('c1)
>>+- 'UnresolvedRelation `t`
>>
>> == Analyzed Logical Plan ==
>> c0: int
>> Project [c0#16]
>> +- Filter NOT isnotnull(c1#17)
>>+- SubqueryAlias t
>>   +- Relation[c0#16,c1#17] parquet
>>
>> == Optimized Logical Plan ==
>> Project [c0#16]
>> +- Filter (isnotnull(c1#17) && NOT isnotnull(c1#17))
>>    +- Relation[c0#16,c1#17] parquet
>>
>> == Physical Plan ==
>> *Project [c0#16]
>> +- *Filter (isnotnull(c1#17) && NOT isnotnull(c1#17))
>>+- *BatchedScan parquet [c0#16,c1#17] Format: ParquetFormat,
>> InputPaths: file:/Users/maropu/Desktop/data, PartitionFilters: [],
>> PushedFilters: [IsNotNull(c1), Not(IsNotNull(c1))], ReadSchema:
>> struct
>>
>> scala> df.show
>> +---+
>> | c0|
>> +---+
>> +---+
>>
>>
>>
>>
>> // maropu
>>
>>
>> On Sun, Feb 12, 2017 at 10:01 AM, Everett Anderson <
>> ever...@nuna.com.invalid> wrote:
>>
>>> On the plus side, looks like this may be fixed in 2.1.0:
>>>
>>> == Physical Plan ==
>>> *HashAggregate(keys=[], functions=[count(1)])
>>> +- Exchange SinglePartition
>>>+- *HashAggregate(keys=[], functions=[partial_count(1)])
>>>   +- *Project
>>>  +- *Filter NOT isnotnull(username#14)
>>> +- *FileScan parquet [username#14] Batched: true, Format:
>>> Parquet, Location: InMemoryFileIndex[file:/tmp/test_table],
>>> PartitionFilters: [], PushedFilters: [Not(IsNotNull(username))],
>>> ReadSchema: struct
>>>
>>>
>>>
>>> On Fri, Feb 10, 2017 at 11:26 AM, Everett Anderson 
>>> wrote:
>>>
 Bumping this thread.

 Translating "where not(username is not null)" into a filter of  
 [IsNotNull(username),
 Not(IsNotNull(username))] seems like a rather severe bug.

 Spark 1.6.2:

 explain select count(*) from parquet_table where not( username is not
 null)

 == Physical Plan ==
 TungstenAggregate(key=[], 
 functions=[(count(1),mode=Final,isDistinct=false)],
 output=[_c0#1822L])
 +- TungstenExchange SinglePartition, None
  +- TungstenAggregate(key=[], 
 functions=[(count(1),mode=Partial,isDistinct=false)],
 output=[count#1825L])
  +- Project
  +- Filter NOT isnotnull(username#1590)
  +- Scan ParquetRelation[username#1590] InputPaths: ,
 PushedFilters: [Not(IsNotNull(username))]

 Spark 2.0.2

 explain select count(*) from parquet_table where not( username is not
 null)

 == Physical Plan ==
 *HashAggregate(keys=[], functions=[count(1)])
 +- Exchange SinglePartition
  +- *HashAggregate(keys=[], functions=[partial_count(1)])
  +- *Project
  +- *Filter (isnotnull(username#35) && NOT isnotnull(username#35))
  +- *BatchedScan parquet default.[username#35] Format:
 ParquetFormat, InputPaths: , PartitionFilters: [],
 PushedFilters: [IsNotNull(username), Not(IsNotNull(username))],
 ReadSchema: struct

 Example to generate the above:

 // Create some fake data

 import org.apache.spark.sql.Row
 import org.apache.spark.sql.Dataset
 import org.apache.spark.sql.types._

 val rowsRDD = sc.parallelize(Seq(
 Row(1, "fred"),
 Row(2, "amy"),
 Row(3, null)))

 val schema = StructType(Seq(
 StructField("id", IntegerType, nullable = true),
 StructField("username", StringType, nullable = true)))

 val data = sqlContext.createDataFrame(rowsRDD, schema)

 val path = "SOME PATH HERE"

 data.write.mode("overwrite").parquet(path)

 val testData = sqlContext.read.parquet(path)

 testData.registerTempTable("filter_test_table")


 %sql
 explain select count(*) from filter_test_table where not( username is
 not null)


 On Wed, Feb 8, 2017 at 4:56 PM, Alexi Kostibas <
 akosti...@nuna.com.invalid> wrote:

> Hi,
>
> I have an application where I’m filtering data with SparkSQL with
> simple WHERE clauses. I also want the ability to show the unmatched rows
> for any 

Re: Strange behavior with 'not' and filter pushdown

2017-02-13 Thread Xiao Li
https://github.com/apache/spark/pull/16894

Already backported to Spark 2.0

Thanks!

Xiao

2017-02-13 17:41 GMT-08:00 Takeshi Yamamuro :

> cc: xiao
>
> IIUC, Xiao's commit below fixed this issue in master.
> https://github.com/apache/spark/commit/2eb093decb5e87a1ea71bbaa28092876a8c84996
>
> Is this fix worth backporting to the v2.0 branch?
> I checked that I could reproduce it there:
>
> ---
>
> scala> Seq((1, "a"), (2, "b"), (3, null)).toDF("c0",
> "c1").write.mode("overwrite").parquet("/Users/maropu/Desktop/data")
> scala> spark.read.parquet("/Users/maropu/Desktop/data").createOrReplaceTempView("t")
> scala> val df = sql("SELECT c0 FROM t WHERE NOT(c1 IS NOT NULL)")
> scala> df.explain(true)
> == Parsed Logical Plan ==
> 'Project ['c0]
> +- 'Filter NOT isnotnull('c1)
>+- 'UnresolvedRelation `t`
>
> == Analyzed Logical Plan ==
> c0: int
> Project [c0#16]
> +- Filter NOT isnotnull(c1#17)
>+- SubqueryAlias t
>   +- Relation[c0#16,c1#17] parquet
>
> == Optimized Logical Plan ==
> Project [c0#16]
> +- Filter (isnotnull(c1#17) && NOT isnotnull(c1#17))
>    +- Relation[c0#16,c1#17] parquet
>
> == Physical Plan ==
> *Project [c0#16]
> +- *Filter (isnotnull(c1#17) && NOT isnotnull(c1#17))
>+- *BatchedScan parquet [c0#16,c1#17] Format: ParquetFormat,
> InputPaths: file:/Users/maropu/Desktop/data, PartitionFilters: [],
> PushedFilters: [IsNotNull(c1), Not(IsNotNull(c1))], ReadSchema:
> struct
>
> scala> df.show
> +---+
> | c0|
> +---+
> +---+
>
>
>
>
> // maropu
>
>
> On Sun, Feb 12, 2017 at 10:01 AM, Everett Anderson <
> ever...@nuna.com.invalid> wrote:
>
>> On the plus side, looks like this may be fixed in 2.1.0:
>>
>> == Physical Plan ==
>> *HashAggregate(keys=[], functions=[count(1)])
>> +- Exchange SinglePartition
>>+- *HashAggregate(keys=[], functions=[partial_count(1)])
>>   +- *Project
>>  +- *Filter NOT isnotnull(username#14)
>> +- *FileScan parquet [username#14] Batched: true, Format:
>> Parquet, Location: InMemoryFileIndex[file:/tmp/test_table],
>> PartitionFilters: [], PushedFilters: [Not(IsNotNull(username))],
>> ReadSchema: struct
>>
>>
>>
>> On Fri, Feb 10, 2017 at 11:26 AM, Everett Anderson 
>> wrote:
>>
>>> Bumping this thread.
>>>
>>> Translating "where not(username is not null)" into a filter of  
>>> [IsNotNull(username),
>>> Not(IsNotNull(username))] seems like a rather severe bug.
>>>
>>> Spark 1.6.2:
>>>
>>> explain select count(*) from parquet_table where not( username is not
>>> null)
>>>
>>> == Physical Plan ==
>>> TungstenAggregate(key=[], 
>>> functions=[(count(1),mode=Final,isDistinct=false)],
>>> output=[_c0#1822L])
>>> +- TungstenExchange SinglePartition, None
>>>  +- TungstenAggregate(key=[], 
>>> functions=[(count(1),mode=Partial,isDistinct=false)],
>>> output=[count#1825L])
>>>  +- Project
>>>  +- Filter NOT isnotnull(username#1590)
>>>  +- Scan ParquetRelation[username#1590] InputPaths: ,
>>> PushedFilters: [Not(IsNotNull(username))]
>>>
>>> Spark 2.0.2
>>>
>>> explain select count(*) from parquet_table where not( username is not
>>> null)
>>>
>>> == Physical Plan ==
>>> *HashAggregate(keys=[], functions=[count(1)])
>>> +- Exchange SinglePartition
>>>  +- *HashAggregate(keys=[], functions=[partial_count(1)])
>>>  +- *Project
>>>  +- *Filter (isnotnull(username#35) && NOT isnotnull(username#35))
>>>  +- *BatchedScan parquet default.[username#35] Format:
>>> ParquetFormat, InputPaths: , PartitionFilters: [],
>>> PushedFilters: [IsNotNull(username), Not(IsNotNull(username))],
>>> ReadSchema: struct
>>>
>>> Example to generate the above:
>>>
>>> // Create some fake data
>>>
>>> import org.apache.spark.sql.Row
>>> import org.apache.spark.sql.Dataset
>>> import org.apache.spark.sql.types._
>>>
>>> val rowsRDD = sc.parallelize(Seq(
>>> Row(1, "fred"),
>>> Row(2, "amy"),
>>> Row(3, null)))
>>>
>>> val schema = StructType(Seq(
>>> StructField("id", IntegerType, nullable = true),
>>> StructField("username", StringType, nullable = true)))
>>>
>>> val data = sqlContext.createDataFrame(rowsRDD, schema)
>>>
>>> val path = "SOME PATH HERE"
>>>
>>> data.write.mode("overwrite").parquet(path)
>>>
>>> val testData = sqlContext.read.parquet(path)
>>>
>>> testData.registerTempTable("filter_test_table")
>>>
>>>
>>> %sql
>>> explain select count(*) from filter_test_table where not( username is
>>> not null)
>>>
>>>
>>> On Wed, Feb 8, 2017 at 4:56 PM, Alexi Kostibas <
>>> akosti...@nuna.com.invalid> wrote:
>>>
 Hi,

 I have an application where I’m filtering data with SparkSQL with
 simple WHERE clauses. I also want the ability to show the unmatched rows
 for any filter, and so am wrapping the previous clause in `NOT()` to get
 the inverse. Example:

 Filter:  username is not null
 Inverse filter:  NOT(username is not null)

 This worked fine in Spark 1.6. After upgrading to Spark 2.0.2, the
 inverse 

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Takuya UESHIN
Thank you very much everyone!
I really look forward to working with you!


On Tue, Feb 14, 2017 at 9:47 AM, Yanbo Liang  wrote:

> Congratulations!
>
> On Mon, Feb 13, 2017 at 3:29 PM, Kazuaki Ishizaki 
> wrote:
>
>> Congrats!
>>
>> Kazuaki Ishizaki
>>
>>
>>
>> From: Reynold Xin
>> To: "dev@spark.apache.org"
>> Date: 2017/02/14 04:18
>> Subject: welcoming Takuya Ueshin as a new Apache Spark committer
>> --
>>
>>
>>
>> Hi all,
>>
>> Takuya-san has recently been elected an Apache Spark committer. He's been
>> active in the SQL area and writes very small, surgical patches that are
>> high quality. Please join me in congratulating Takuya-san!
>>
>>
>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Strange behavior with 'not' and filter pushdown

2017-02-13 Thread Takeshi Yamamuro
cc: xiao

IIUC, Xiao's commit below fixed this issue in master.
https://github.com/apache/spark/commit/2eb093decb5e87a1ea71bbaa28092876a8c84996

Is this fix worth backporting to the v2.0 branch?
I checked that I could reproduce it there:

---

scala> Seq((1, "a"), (2, "b"), (3, null)).toDF("c0",
"c1").write.mode("overwrite").parquet("/Users/maropu/Desktop/data")
scala>
spark.read.parquet("/Users/maropu/Desktop/data").createOrReplaceTempView("t")
scala> val df = sql("SELECT c0 FROM t WHERE NOT(c1 IS NOT NULL)")
scala> df.explain(true)
== Parsed Logical Plan ==
'Project ['c0]
+- 'Filter NOT isnotnull('c1)
   +- 'UnresolvedRelation `t`

== Analyzed Logical Plan ==
c0: int
Project [c0#16]
+- Filter NOT isnotnull(c1#17)
   +- SubqueryAlias t
  +- Relation[c0#16,c1#17] parquet

== Optimized Logical Plan ==
Project [c0#16]
+- Filter (isnotnull(c1#17) && NOT isnotnull(c1#17))
   +- Relation[c0#16,c1#17] parquet

== Physical Plan ==
*Project [c0#16]
+- *Filter (isnotnull(c1#17) && NOT isnotnull(c1#17))
   +- *BatchedScan parquet [c0#16,c1#17] Format: ParquetFormat, InputPaths:
file:/Users/maropu/Desktop/data, PartitionFilters: [], PushedFilters:
[IsNotNull(c1), Not(IsNotNull(c1))], ReadSchema: struct

scala> df.show
+---+
| c0|
+---+
+---+
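
For reference, since NOT(c1 IS NOT NULL) is equivalent to c1 IS NULL, the
corrected plan is expected to push down only Not(IsNotNull(c1)) and the query
should return the row whose c1 is null. A sketch of the expected output after
the fix (not captured from an actual run):

scala> df.show
+---+
| c0|
+---+
|  3|
+---+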




// maropu


On Sun, Feb 12, 2017 at 10:01 AM, Everett Anderson  wrote:

> On the plus side, looks like this may be fixed in 2.1.0:
>
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *Project
>  +- *Filter NOT isnotnull(username#14)
> +- *FileScan parquet [username#14] Batched: true, Format:
> Parquet, Location: InMemoryFileIndex[file:/tmp/test_table],
> PartitionFilters: [], PushedFilters: [Not(IsNotNull(username))],
> ReadSchema: struct
>
>
>
> On Fri, Feb 10, 2017 at 11:26 AM, Everett Anderson 
> wrote:
>
>> Bumping this thread.
>>
>> Translating "where not(username is not null)" into a filter of  
>> [IsNotNull(username),
>> Not(IsNotNull(username))] seems like a rather severe bug.
>>
>> Spark 1.6.2:
>>
>> explain select count(*) from parquet_table where not( username is not
>> null)
>>
>> == Physical Plan ==
>> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)],
>> output=[_c0#1822L])
>> +- TungstenExchange SinglePartition, None
>>  +- TungstenAggregate(key=[], 
>> functions=[(count(1),mode=Partial,isDistinct=false)],
>> output=[count#1825L])
>>  +- Project
>>  +- Filter NOT isnotnull(username#1590)
>>  +- Scan ParquetRelation[username#1590] InputPaths: ,
>> PushedFilters: [Not(IsNotNull(username))]
>>
>> Spark 2.0.2
>>
>> explain select count(*) from parquet_table where not( username is not
>> null)
>>
>> == Physical Plan ==
>> *HashAggregate(keys=[], functions=[count(1)])
>> +- Exchange SinglePartition
>>  +- *HashAggregate(keys=[], functions=[partial_count(1)])
>>  +- *Project
>>  +- *Filter (isnotnull(username#35) && NOT isnotnull(username#35))
>>  +- *BatchedScan parquet default.[username#35] Format:
>> ParquetFormat, InputPaths: , PartitionFilters: [],
>> PushedFilters: [IsNotNull(username), Not(IsNotNull(username))],
>> ReadSchema: struct
>>
>> Example to generate the above:
>>
>> // Create some fake data
>>
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.Dataset
>> import org.apache.spark.sql.types._
>>
>> val rowsRDD = sc.parallelize(Seq(
>> Row(1, "fred"),
>> Row(2, "amy"),
>> Row(3, null)))
>>
>> val schema = StructType(Seq(
>> StructField("id", IntegerType, nullable = true),
>> StructField("username", StringType, nullable = true)))
>>
>> val data = sqlContext.createDataFrame(rowsRDD, schema)
>>
>> val path = "SOME PATH HERE"
>>
>> data.write.mode("overwrite").parquet(path)
>>
>> val testData = sqlContext.read.parquet(path)
>>
>> testData.registerTempTable("filter_test_table")
>>
>>
>> %sql
>> explain select count(*) from filter_test_table where not( username is not
>> null)
>>
>>
>> On Wed, Feb 8, 2017 at 4:56 PM, Alexi Kostibas <
>> akosti...@nuna.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I have an application where I’m filtering data with SparkSQL with simple
>>> WHERE clauses. I also want the ability to show the unmatched rows for any
>>> filter, and so am wrapping the previous clause in `NOT()` to get the
>>> inverse. Example:
>>>
>>> Filter:  username is not null
>>> Inverse filter:  NOT(username is not null)
>>>
>>> This worked fine in Spark 1.6. After upgrading to Spark 2.0.2, the
>>> inverse filter always returns zero results. It looks like this is a problem
>>> with how the filter is getting pushed down to Parquet. Specifically, the
>>> pushdown includes both the “is not null” filter, AND “not(is not null)”,
>>> which would obviously result in zero matches. An example below:
>>>
>>> pyspark:
>>> > x = spark.sql('select my_id from my_table where *username is not 
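
The quoted example above is truncated in the archive. A minimal workaround
sketch for the affected 2.0.x builds (assuming a table like the
filter_test_table repro above, with a nullable username column) is to state
the inverse predicate directly as IS NULL instead of wrapping the original
clause in NOT(), which avoids the contradictory pushed filters:

scala> // matched rows: pushes down IsNotNull(username)
scala> val matched = spark.sql("SELECT * FROM filter_test_table WHERE username IS NOT NULL")
scala> // unmatched rows: pushes down only IsNull(username), so no
scala> // [IsNotNull(username), Not(IsNotNull(username))] pair is generated
scala> val unmatched = spark.sql("SELECT * FROM filter_test_table WHERE username IS NULL")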

Re: Request for comments: Java 7 removal

2017-02-13 Thread Charles Allen
I think the biggest concern is enterprise users/operators who do not have
the authority or access to upgrade Hadoop/YARN clusters to Java 8. As a
reference point, apparently CDH 5.3 shipped with Java 8 in December 2014.
I would be surprised if such users were active consumers of the dev mailing
list, though. Unfortunately there's a bit of a selection bias in this list.

The other concern is whether there is guaranteed compatibility between Scala
and Java 8 for all versions you want to use (which is somewhat touched upon in
the PR). Are you thinking about supporting Scala 2.10 against Java 8 bytecode?

See https://groups.google.com/d/msg/druid-user/aTGQlnF1KLk/NvBPfmigAAAJ for
the similar discussion that went forward in the Druid community.


On Fri, Feb 10, 2017 at 8:47 AM Sean Owen  wrote:

> As you have seen, there's a WIP PR to implement removal of Java 7 support:
> https://github.com/apache/spark/pull/16871
>
> I have heard several +1s at
> https://issues.apache.org/jira/browse/SPARK-19493 but am asking for
> concerns too, now that there's a concrete change to review.
>
> If this goes in for 2.2 it can be followed by more extensive update of the
> Java code to take advantage of Java 8; this is more or less the baseline
> change.
>
> We also just removed Hadoop 2.5 support. I know there was talk about
> removing Python 2.6. I have no opinion on that myself, but, might be time
> to revive that conversation too.
>


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Yanbo Liang
Congratulations!

On Mon, Feb 13, 2017 at 3:29 PM, Kazuaki Ishizaki 
wrote:

> Congrats!
>
> Kazuaki Ishizaki
>
>
>
> From: Reynold Xin
> To: "dev@spark.apache.org"
> Date: 2017/02/14 04:18
> Subject: welcoming Takuya Ueshin as a new Apache Spark committer
> --
>
>
>
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's been
> active in the SQL area and writes very small, surgical patches that are
> high quality. Please join me in congratulating Takuya-san!
>
>
>
>


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Kazuaki Ishizaki
Congrats!

Kazuaki Ishizaki



From:   Reynold Xin 
To: "dev@spark.apache.org" 
Date:   2017/02/14 04:18
Subject:welcoming Takuya Ueshin as a new Apache Spark committer



Hi all,

Takuya-san has recently been elected an Apache Spark committer. He's been 
active in the SQL area and writes very small, surgical patches that are 
high quality. Please join me in congratulating Takuya-san!






Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Asher Krim
Congrats!

Asher Krim
Senior Software Engineer

On Mon, Feb 13, 2017 at 6:24 PM, Kousuke Saruta 
wrote:

> Congratulations,  Takuya!
>
> - Kousuke
> On 2017/02/14 7:38, Herman van Hövell tot Westerflier wrote:
>
> Congrats Takuya!
>
> On Mon, Feb 13, 2017 at 11:27 PM, Neelesh Salian  > wrote:
>
>> Congratulations, Takuya!
>>
>> On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin 
>> wrote:
>>
>>> Hi all,
>>>
>>> Takuya-san has recently been elected an Apache Spark committer. He's
>>> been active in the SQL area and writes very small, surgical patches that
>>> are high quality. Please join me in congratulating Takuya-san!
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Neelesh S. Salian
>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
>
>
>


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Kousuke Saruta

Congratulations,  Takuya!

- Kousuke

On 2017/02/14 7:38, Herman van Hövell tot Westerflier wrote:

Congrats Takuya!

On Mon, Feb 13, 2017 at 11:27 PM, Neelesh Salian wrote:


Congratulations, Takuya!

On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin wrote:

Hi all,

Takuya-san has recently been elected an Apache Spark
committer. He's been active in the SQL area and writes very
small, surgical patches that are high quality. Please join me
in congratulating Takuya-san!





-- 
Regards,

Neelesh S. Salian




--

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com 

+31 6 420 590 27

databricks.com 

http://databricks.com 





Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Holden Karau
It's a good question. Py4J seems to have been updated 5 times in 2016 and
is a bit involved (from a review point of view verifying the zip file
contents is somewhat tedious).

cloudpickle is a bit difficult to tell since we can have changes to
cloudpickle which aren't correctly tagged as backporting changes from the
fork (and this can take awhile to review since we don't always catch them
right away as being backports).

Another difficulty with looking at backports is that since our review
process for PySpark has historically been on the slow side, changes
benefiting systems like dask or IPython parallel were not backported to
Spark unless they caused serious errors.

I think the key benefits are better test coverage of the forked version of
cloudpickle, more standardized packaging of dependencies, and simpler
dependency updates, which reduce the friction of gaining benefits from other
related projects' work - Python serialization really isn't our secret sauce.

If I'm missing any substantial benefits or costs I'd love to know :)

On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin  wrote:

> With any dependency update (or refactoring of existing code), I always ask
> this question: what's the benefit? In this case it looks like the benefit
> is to reduce efforts in backports. Do you know how often we needed to do
> those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau 
> wrote:
>
>> Hi PySpark Developers,
>>
>> Cloudpickle is a core part of PySpark, and is originally copied from (and
>> improved from) picloud. Since then other projects have found cloudpickle
>> useful and a fork of cloudpickle
>>  was created and is now
>> maintained as its own library  
>> (with
>> better test coverage and resulting bug fixes I understand). We've had a few
>> PRs backporting fixes from the cloudpickle project into Spark's local copy
>> of cloudpickle - how would people feel about moving to taking an explicit
>> (pinned) dependency on cloudpickle?
>>
>> We could add cloudpickle to the setup.py and a requirements.txt file for
>> users who prefer not to do a system installation of PySpark.
>>
>> Py4J is maybe even a simpler case, we currently have a zip of py4j in our
>> repo but could instead have a pinned version required. While we do depend
>> on a lot of py4j internal APIs, version pinning should be sufficient to
>> ensure functionality (and simplify the update process).
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Reynold Xin
With any dependency update (or refactoring of existing code), I always ask
this question: what's the benefit? In this case it looks like the benefit
is to reduce efforts in backports. Do you know how often we needed to do
those?


On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau  wrote:

> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, and is originally copied from (and
> improved from) picloud. Since then other projects have found cloudpickle
> useful and a fork of cloudpickle
>  was created and is now
> maintained as its own library  (with
> better test coverage and resulting bug fixes I understand). We've had a few
> PRs backporting fixes from the cloudpickle project into Spark's local copy
> of cloudpickle - how would people feel about moving to taking an explicit
> (pinned) dependency on cloudpickle?
>
> We could add cloudpickle to the setup.py and a requirements.txt file for
> users who prefer not to do a system installation of PySpark.
>
> Py4J is maybe even a simpler case, we currently have a zip of py4j in our
> repo but could instead have a pinned version required. While we do depend
> on a lot of py4j internal APIs, version pinning should be sufficient to
> ensure functionality (and simplify the update process).
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>


[PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Holden Karau
Hi PySpark Developers,

Cloudpickle is a core part of PySpark, and is originally copied from (and
improved from) picloud. Since then other projects have found cloudpickle
useful and a fork of cloudpickle  was
created and is now maintained as its own library
 (with better test coverage and
resulting bug fixes I understand). We've had a few PRs backporting fixes
from the cloudpickle project into Spark's local copy of cloudpickle - how
would people feel about moving to taking an explicit (pinned) dependency on
cloudpickle?

We could add cloudpickle to the setup.py and a requirements.txt file for
users who prefer not to do a system installation of PySpark.

Py4J is maybe even a simpler case, we currently have a zip of py4j in our
repo but could instead have a pinned version required. While we do depend
on a lot of py4j internal APIs, version pinning should be sufficient to
ensure functionality (and simplify the update process).

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Herman van Hövell tot Westerflier
Congrats Takuya!

On Mon, Feb 13, 2017 at 11:27 PM, Neelesh Salian 
wrote:

> Congratulations, Takuya!
>
> On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin  wrote:
>
>> Hi all,
>>
>> Takuya-san has recently been elected an Apache Spark committer. He's been
>> active in the SQL area and writes very small, surgical patches that are
>> high quality. Please join me in congratulating Takuya-san!
>>
>>
>>
>
>
> --
> Regards,
> Neelesh S. Salian
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com



Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Neelesh Salian
Congratulations, Takuya!

On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin  wrote:

> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's been
> active in the SQL area and writes very small, surgical patches that are
> high quality. Please join me in congratulating Takuya-san!
>
>
>


-- 
Regards,
Neelesh S. Salian


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Burak Yavuz
Congrats Takuya!

On Mon, Feb 13, 2017 at 2:17 PM, Dilip Biswal  wrote:

> Congratulations, Takuya!
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> - Original message -
> From: Takeshi Yamamuro 
> To: dev 
> Cc:
> Subject: Re: welcoming Takuya Ueshin as a new Apache Spark committer
> Date: Mon, Feb 13, 2017 2:14 PM
>
> congrats!
>
>
> On Tue, Feb 14, 2017 at 6:05 AM, Sam Elamin 
> wrote:
>
> Congrats Takuya-san! Clearly well deserved! Well done :)
>
> On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
> Congratulations!
>
>
> On 02/13/2017 08:16 PM, Reynold Xin wrote:
> > Hi all,
> >
> > Takuya-san has recently been elected an Apache Spark committer. He's
> > been active in the SQL area and writes very small, surgical patches
> > that are high quality. Please join me in congratulating Takuya-san!
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>
> - To
> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Dilip Biswal
Congratulations, Takuya!
 
Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com


- Original message -
From: Takeshi Yamamuro
To: dev
Cc:
Subject: Re: welcoming Takuya Ueshin as a new Apache Spark committer
Date: Mon, Feb 13, 2017 2:14 PM

congrats!


On Tue, Feb 14, 2017 at 6:05 AM, Sam Elamin wrote:

Congrats Takuya-san! Clearly well deserved! Well done :)

On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz wrote:

Congratulations!

On 02/13/2017 08:16 PM, Reynold Xin wrote:
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's
> been active in the SQL area and writes very small, surgical patches
> that are high quality. Please join me in congratulating Takuya-san!

-
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


--
---
Takeshi Yamamuro
 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Takeshi Yamamuro
congrats!


On Tue, Feb 14, 2017 at 6:05 AM, Sam Elamin  wrote:

> Congrats Takuya-san! Clearly well deserved! Well done :)
>
> On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Congratulations!
>>
>>
>> On 02/13/2017 08:16 PM, Reynold Xin wrote:
>> > Hi all,
>> >
>> > Takuya-san has recently been elected an Apache Spark committer. He's
>> > been active in the SQL area and writes very small, surgical patches
>> > that are high quality. Please join me in congratulating Takuya-san!
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Sam Elamin
Congrats Takuya-san! Clearly well deserved! Well done :)

On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz 
wrote:

> Congratulations!
>
>
> On 02/13/2017 08:16 PM, Reynold Xin wrote:
> > Hi all,
> >
> > Takuya-san has recently been elected an Apache Spark committer. He's
> > been active in the SQL area and writes very small, surgical patches
> > that are high quality. Please join me in congratulating Takuya-san!
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Maciej Szymkiewicz
Congratulations!


On 02/13/2017 08:16 PM, Reynold Xin wrote:
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's
> been active in the SQL area and writes very small, surgical patches
> that are high quality. Please join me in congratulating Takuya-san!
>


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Nicholas Chammas
Congratulations, Takuya! 

On Mon, Feb 13, 2017 at 2:34 PM Felix Cheung 
wrote:

> Congratulations!
>
>
> --
> *From:* Xuefu Zhang 
> *Sent:* Monday, February 13, 2017 11:29:12 AM
> *To:* Xiao Li
> *Cc:* Holden Karau; Reynold Xin; dev@spark.apache.org
> *Subject:* Re: welcoming Takuya Ueshin as a new Apache Spark committer
>
> Congratulations, Takuya!
>
> --Xuefu
>
> On Mon, Feb 13, 2017 at 11:25 AM, Xiao Li  wrote:
>
> Congratulations, Takuya!
>
> Xiao
>
> 2017-02-13 11:24 GMT-08:00 Holden Karau :
>
> Congratulations Takuya-san :D!
>
> On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin  wrote:
>
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's been
> active in the SQL area and writes very small, surgical patches that are
> high quality. Please join me in congratulating Takuya-san!
>
>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>
>
>


Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Felix Cheung
Congratulations!



From: Xuefu Zhang 
Sent: Monday, February 13, 2017 11:29:12 AM
To: Xiao Li
Cc: Holden Karau; Reynold Xin; dev@spark.apache.org
Subject: Re: welcoming Takuya Ueshin as a new Apache Spark committer

Congratulations, Takuya!

--Xuefu

On Mon, Feb 13, 2017 at 11:25 AM, Xiao Li 
> wrote:
Congratulations, Takuya!

Xiao

2017-02-13 11:24 GMT-08:00 Holden Karau 
>:
Congratulations Takuya-san :D!

On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin 
> wrote:
Hi all,

Takuya-san has recently been elected an Apache Spark committer. He's been 
active in the SQL area and writes very small, surgical patches that are high 
quality. Please join me in congratulating Takuya-san!





--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau




Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Xuefu Zhang
Congratulations, Takuya!

--Xuefu

On Mon, Feb 13, 2017 at 11:25 AM, Xiao Li  wrote:

> Congratulations, Takuya!
>
> Xiao
>
> 2017-02-13 11:24 GMT-08:00 Holden Karau :
>
>> Congratulations Takuya-san :D!
>>
>> On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin 
>> wrote:
>>
>>> Hi all,
>>>
>>> Takuya-san has recently been elected an Apache Spark committer. He's
>>> been active in the SQL area and writes very small, surgical patches that
>>> are high quality. Please join me in congratulating Takuya-san!
>>>
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Reynold Xin
Hi all,

Takuya-san has recently been elected an Apache Spark committer. He's been
active in the SQL area and writes very small, surgical patches that are
high quality. Please join me in congratulating Takuya-san!


Re: Add hive-site.xml at runtime

2017-02-13 Thread Ryan Blue
Shivam,

We add hive-site.xml at runtime. We use --driver-class-path to add it to
the driver and --jars to add it for the executors.

rb
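
For readers who prefer a programmatic route, here is a minimal sketch
(Spark 2.x; the app name and metastore URI below are hypothetical) of
supplying individual Hive settings when building the session; the file-based
approach described above still uses --driver-class-path and --jars to ship
hive-site.xml itself:

import org.apache.spark.sql.SparkSession

// Sketch only: pass selected Hive properties directly instead of (or in
// addition to) putting a hive-site.xml on the classpath.
val spark = SparkSession.builder()
  .appName("hive-conf-example")                                  // hypothetical
  .enableHiveSupport()
  .config("hive.metastore.uris", "thrift://metastore-host:9083") // hypothetical value
  .getOrCreate()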

On Sun, Feb 12, 2017 at 9:10 AM, Shivam Sharma <28shivamsha...@gmail.com>
wrote:

> Hi,
>
> I have multiple Hive configurations (hive-site.xml), and because of that
> I am not able to put any one Hive configuration in the Spark *conf* directory.
> I want to add the configuration file at the start of any *spark-submit* or
> *spark-shell*. This conf file is huge, so *--conf* is not an option for me.
>
> Note: I have tried u...@spark.apache.org and my mail was bouncing each
> time, so Sean Owen suggested mailing dev
> (https://issues.apache.org/jira/browse/SPARK-19546). Please also give a
> solution on the above ticket if possible.
>
> Thanks
>
> --
> Shivam Sharma
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Spark Improvement Proposals

2017-02-13 Thread Reynold Xin
Here's a new draft that incorporated most of the feedback:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

I added a specific role for SPIP Author and another one for SPIP Shepherd.

On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li  wrote:

> During the summit, I also had a lot of discussions over similar topics
> with multiple Committers and active users. I heard many fantastic ideas. I
> believe Spark improvement proposals are good channels to collect the
> requirements/designs.
>
>
> IMO, we also need to consider the priority when working on these items.
> Even if the proposal is accepted, it does not mean it will be implemented
> and merged immediately. It is not a FIFO queue.
>
>
> Even after some PRs are merged, we sometimes still have to revert them
> if the design and implementation were not reviewed carefully. We have to
> ensure our quality. Spark is not application software. It is
> infrastructure software that is being used by many, many companies. We have
> to be very careful in the design and implementation, especially when
> adding/changing the external APIs.
>
>
> When I developed Mainframe infrastructure/middleware software over the
> past 6 years, I was involved in discussions with external/internal
> customers. The to-do feature list was always above 100. Sometimes the
> customers felt frustrated when we were unable to deliver on time
> due to resource limits and other constraints. Even when they paid us billions, we
> still needed to do it phase by phase, or sometimes they had to accept
> workarounds. That is the reality everyone has to face, I think.
>
>
> Thanks,
>
>
> Xiao Li
>
> 2017-02-11 7:57 GMT-08:00 Cody Koeninger :
>
>> At the spark summit this week, everyone from PMC members to users I had
>> never met before were asking me about the Spark improvement proposals
>> idea.  It's clear that it's a real community need.
>>
>> But it's been almost half a year, and nothing visible has been done.
>>
>> Reynold, are you going to do this?
>>
>> If so, when?
>>
>> If not, why?
>>
>> You already did the right thing by including long-deserved committers.
>> Please keep doing the right thing for the community.
>>
>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin  wrote:
>>
>>> +1 on all counts (consensus, time bound, define roles)
>>>
>>> I can update the doc in the next few days and share back. Then maybe we
>>> can just officially vote on this. As Tim suggested, we might not get it
>>> 100% right the first time and would need to re-iterate. But that's fine.
>>>
>>>
>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter 
>>> wrote:
>>>
 Hi Cody,
 thank you for bringing up this topic, I agree it is very important to
 keep a cohesive community around some common, fluid goals. Here are a few
 comments about the current document:

 1. name: it should not overlap with an existing one such as SIP. Can
 you imagine someone trying to discuss a scala spore proposal for spark?
 "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
 sounds great.

 2. roles: at a high level, SPIPs are meant to reach consensus for
 technical decisions with a lasting impact. As such, the template should
 emphasize the role of the various parties during this process:

  - the SPIP author is responsible for building consensus. She is the
 champion driving the process forward and is responsible for ensuring that
 the SPIP follows the general guidelines. The author should be identified in
 the SPIP. The authorship of a SPIP can be transferred if the current author
 is not interested and someone else wants to move the SPIP forward. There
 should probably be 2-3 authors at most for each SPIP.

  - someone with voting power should probably shepherd the SPIP (and be
 recorded as such): ensuring that the final decision over the SPIP is
 recorded (rejected, accepted, etc.), and advising about the technical
 quality of the SPIP: this person need not be a champion for the SPIP or
 contribute to it, but rather makes sure it stands a chance of being
 approved when the vote happens. Also, if the author cannot find anyone who
 would want to take this role, this proposal is likely to be rejected 
 anyway.

  - users, committers, contributors have the roles already outlined in
 the document

 3. timeline: ideally, once a SPIP has been offered for voting, it
 should move swiftly into either being accepted or rejected, so that we do
 not end up with a distracting long tail of half-hearted proposals.

 These rules are meant to be flexible, but the current document should
 be clear about who is in charge of a SPIP, and the state it is currently 
 in.

 We have had long discussions over some very important questions such as
 

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-13 Thread StanZhai
I've filed a JIRA about this problem:
https://issues.apache.org/jira/browse/SPARK-19532

I've tried to set `spark.speculation` to `false`, but the off-heap memory
still exceeds about 10G after triggering a full GC on the Executor
process (--executor-memory 30G), as follows:

test@test Online ~ $ ps aux | grep CoarseGrainedExecutorBackend
test  105371  106 21.5 67325492 42621992 ?   Sl   15:20  55:14
/home/test/service/jdk/bin/java -cp
/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/spark/conf/:/home/test/service/spark/jars/*:/home/test/service/hadoop/etc/hadoop/
-Xmx30720M -Dspark.driver.port=9835 -Dtag=spark_2_1_test -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:./gc.log -verbose:gc
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler@172.16.34.235:9835 --executor-id 4 --hostname
test-192 --cores 36 --app-id app-20170213152037-0043 --worker-url
spark://Worker@test-192:33890

So I think there are also other reasons for this problem.

We have been trying to upgrade our Spark since the release of Spark 2.1.0.

This version is unstable and unusable for us because of the memory
problems; we should pay attention to this.
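
For reference, a minimal sketch of how the spark.speculation setting
mentioned above is applied; it is read when the scheduler starts, so it has
to be set at session construction (or via --conf on spark-submit), not
toggled at runtime:

import org.apache.spark.sql.SparkSession

// Sketch: disable speculative task re-launch at session construction time.
val spark = SparkSession.builder()
  .appName("speculation-off")             // hypothetical app name
  .config("spark.speculation", "false")   // equivalent to --conf spark.speculation=false
  .getOrCreate()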


StanZhai wrote
> From the thread dump page of the Executor in the WebUI, I found that there
> are about 1300 threads named "DataStreamer for file
> /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
> in TIMED_WAITING state like this:
>
> [thread dump screenshot not preserved in the plain-text archive]
>
> The excess off-heap memory may be caused by these abnormal threads.
>
> This problem occurs only when writing data to Hadoop (tasks may be
> killed by the Executor during writing).
>
> Could this be related to https://issues.apache.org/jira/browse/HDFS-9812 ?
>
> It may be a bug in Spark when killing tasks while writing data. What's
> the difference between Spark 1.6.x and 2.1.0 in how tasks are killed?
> 
> This is a critical issue, I've worked on this for days.
> 
> Any help?






-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org