[jira] [Assigned] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22395:


Assignee: (was: Apache Spark)

> Fix the behavior of timestamp values for Pandas to respect session timezone
> ---
>
> Key: SPARK-22395
> URL: https://issues.apache.org/jira/browse/SPARK-22395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>
> When converting a Pandas DataFrame/Series from/to a Spark DataFrame using 
> {{toPandas()}} or pandas UDFs, timestamp values respect the Python system 
> timezone instead of the session timezone.
> For example, let's say we use {{"America/Los_Angeles"}} as the session 
> timezone and have a timestamp value {{"1970-01-01 00:00:01"}} in that 
> timezone. By the way, I'm in Japan, so the Python timezone would be 
> {{"Asia/Tokyo"}}.
> The timestamp value from the current {{toPandas()}} will be the following:
> {noformat}
> >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
> >>> df.show()
> +-------------------+
> |                 ts|
> +-------------------+
> |1970-01-01 00:00:01|
> +-------------------+
> >>> df.toPandas()
>                     ts
> 0 1970-01-01 17:00:01
> {noformat}
> As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it 
> respects the Python timezone.
> As we discussed in https://github.com/apache/spark/pull/18664, we consider 
> this behavior a bug and the value should be {{"1970-01-01 00:00:01"}}.






[jira] [Commented] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone

2017-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224399#comment-16224399
 ] 

Apache Spark commented on SPARK-22395:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19607

> Fix the behavior of timestamp values for Pandas to respect session timezone
> ---
>
> Key: SPARK-22395
> URL: https://issues.apache.org/jira/browse/SPARK-22395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>
> When converting a Pandas DataFrame/Series from/to a Spark DataFrame using 
> {{toPandas()}} or pandas UDFs, timestamp values respect the Python system 
> timezone instead of the session timezone.
> For example, let's say we use {{"America/Los_Angeles"}} as the session 
> timezone and have a timestamp value {{"1970-01-01 00:00:01"}} in that 
> timezone. By the way, I'm in Japan, so the Python timezone would be 
> {{"Asia/Tokyo"}}.
> The timestamp value from the current {{toPandas()}} will be the following:
> {noformat}
> >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
> >>> df.show()
> +-------------------+
> |                 ts|
> +-------------------+
> |1970-01-01 00:00:01|
> +-------------------+
> >>> df.toPandas()
>                     ts
> 0 1970-01-01 17:00:01
> {noformat}
> As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it 
> respects the Python timezone.
> As we discussed in https://github.com/apache/spark/pull/18664, we consider 
> this behavior a bug and the value should be {{"1970-01-01 00:00:01"}}.






[jira] [Assigned] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22395:


Assignee: Apache Spark

> Fix the behavior of timestamp values for Pandas to respect session timezone
> ---
>
> Key: SPARK-22395
> URL: https://issues.apache.org/jira/browse/SPARK-22395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> When converting a Pandas DataFrame/Series from/to a Spark DataFrame using 
> {{toPandas()}} or pandas UDFs, timestamp values respect the Python system 
> timezone instead of the session timezone.
> For example, let's say we use {{"America/Los_Angeles"}} as the session 
> timezone and have a timestamp value {{"1970-01-01 00:00:01"}} in that 
> timezone. By the way, I'm in Japan, so the Python timezone would be 
> {{"Asia/Tokyo"}}.
> The timestamp value from the current {{toPandas()}} will be the following:
> {noformat}
> >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
> >>> df.show()
> +-------------------+
> |                 ts|
> +-------------------+
> |1970-01-01 00:00:01|
> +-------------------+
> >>> df.toPandas()
>                     ts
> 0 1970-01-01 17:00:01
> {noformat}
> As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it 
> respects the Python timezone.
> As we discussed in https://github.com/apache/spark/pull/18664, we consider 
> this behavior a bug and the value should be {{"1970-01-01 00:00:01"}}.






[jira] [Commented] (SPARK-7019) Build docs on doc changes

2017-10-29 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224388#comment-16224388
 ] 

Xin Lu commented on SPARK-7019:
---

A recent PR here:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83085/consoleFull


Building Unidoc API Documentation

[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments:  
-Phadoop-2.6 -Phive-thriftserver -Pflume -Pkinesis-asl -Pyarn -Pkafka-0-8 
-Phive -Pmesos unidoc
Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.

> Build docs on doc changes
> -
>
> Key: SPARK-7019
> URL: https://issues.apache.org/jira/browse/SPARK-7019
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Brennon York
>
> Currently, when a pull request changes the {{docs/}} directory, the docs 
> aren't actually built. When a PR is submitted, the {{git}} history should be 
> checked to see whether any doc changes were made and, if so, the docs should 
> be built and any issues reported.
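
As an editorial illustration (not from the ticket), one way such a check could look, assuming the builder has the repository checked out; the diff range and the build command are only placeholders for whatever the PR builder actually uses:

{code:python}
# Hedged sketch: detect whether a pull request touches docs/ and, if so,
# trigger a docs build. Refs and the build command are illustrative assumptions.
import subprocess

def pr_touches_docs(base_ref="origin/master", head_ref="HEAD"):
    # List the files changed on the PR branch since its merge base with master.
    changed = subprocess.check_output(
        ["git", "diff", "--name-only", base_ref + "..." + head_ref],
        universal_newlines=True,
    ).splitlines()
    return any(path.startswith("docs/") for path in changed)

if pr_touches_docs():
    # Placeholder build step; Spark's docs are normally built with jekyll
    # from the docs/ directory.
    subprocess.check_call(["jekyll", "build"], cwd="docs")
{code}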






[jira] [Commented] (SPARK-7019) Build docs on doc changes

2017-10-29 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224386#comment-16224386
 ] 

Xin Lu commented on SPARK-7019:
---

It looks like unidoc is running on new PRs now. Maybe this issue can be closed?

> Build docs on doc changes
> -
>
> Key: SPARK-7019
> URL: https://issues.apache.org/jira/browse/SPARK-7019
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Brennon York
>
> Currently, when a pull request changes the {{docs/}} directory, the docs 
> aren't actually built. When a PR is submitted, the {{git}} history should be 
> checked to see whether any doc changes were made and, if so, the docs should 
> be built and any issues reported.






[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-29 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224374#comment-16224374
 ] 

Ohad Raviv commented on SPARK-21657:


OK, I found the relevant rule:
{code:java|title=Optimizer.scala.java|borderStyle=solid}
// Turn off `join` for Generate if no column from it's child is used
case p @ Project(_, g: Generate)
if g.join && !g.outer && p.references.subsetOf(g.generatedSet) =>
  p.copy(child = g.copy(join = false))
{code}
I'm not sure yet why it doesn't work.

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up the number of nested elements in each record 
> and, by the same factor, scales down the number of records in the table, so 
> the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.






[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-29 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224365#comment-16224365
 ] 

Ohad Raviv commented on SPARK-21657:


After further investigation I believe my assessment is correct: the former 
case creates a generator with join=true while the latter creates one with 
join=false, as you can see in the plans above (I also debugged it). This causes 
the very long array of size 100k to be duplicated 100k times and afterwards 
pruned, because its column is not in the final projection.
I'm not sure what the best way to address this issue is - perhaps amend the 
generate operator according to the projection.
In the meantime, in our case, I worked around it by manually adding the outer 
fields into each of the structs of the array and then exploding only the array. 
It's an ugly solution, but it reduces our query time from 6 hours to about 2 
minutes.
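
As an editorial sketch of that workaround (column names are made up, and it uses the {{transform}} higher-order function available from Spark 2.4, so the commenter's actual code on 2.x may well have used a UDF instead):

{code:python}
# Hedged sketch: copy the outer key fields into every element of the array
# first, then explode only the enriched array, so the wide outer row is not
# duplicated once per generated element. `df` is assumed to be the source table.
from pyspark.sql import functions as F

enriched = df.select(
    F.expr(
        "transform(amft, x -> named_struct("
        "'individ', individ, 'hholdid', hholdid, 'item', x)) AS amft_keyed"
    )
)
exploded = (enriched
            .select(F.explode("amft_keyed").alias("r"))
            .select("r.individ", "r.hholdid", "r.item"))
{code}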

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up the number of nested elements in each record 
> and, by the same factor, scales down the number of records in the table, so 
> the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.






[jira] [Comment Edited] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le

2017-10-29 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224358#comment-16224358
 ] 

Xin Lu edited comment on SPARK-20000 at 10/30/17 4:12 AM:
--

I checked the dependencies and it looks like lz4-java was already updated to 1.4.0: 
https://github.com/apache/spark/blob/master/pom.xml#L538

lz4-java 1.4.0 was released on August 2nd and it looks like it included the patch 
above. This is probably resolvable now.

This should be a duplicate of this issue, which will be fixed in 2.3.0: 
https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f

[SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0)


was (Author: xynny):
I checked the dependencies and it looks like lz4-java was already updated to 1.4.0: 
https://github.com/apache/spark/blob/master/pom.xml#L538

lz4-java 1.4.0 was released on August 2nd and it looks like it included the patch 
above. This is probably resolvable now.

This should be a duplicate of this: 
https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f

[SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0)

> Spark Hive tests aborted due to lz4-java on ppc64le
> ---
>
> Key: SPARK-20000
> URL: https://issues.apache.org/jira/browse/SPARK-20000
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>Priority: Minor
>  Labels: ppc64le
> Attachments: hs_err_pid.log
>
>
> The tests are getting aborted in the Spark Hive project with the following error:
> {code:borderStyle=solid}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0
> #
> # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 
> 1.8.0_111-8u111-b14-3~14.04.1-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x56f114]
> {code}
> In the thread log file, I found the following traces:
> Event: 3669.042 Thread 0x3fff89976800 Exception <a 'java/lang/NoClassDefFoundError': Could not initialize class 
> net.jpountz.lz4.LZ4JNI> (0x00079fcda3b8) thrown at 
> [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp,
>  line 890]
> This error is due to lz4-java (version 1.3.0), which doesn't have support 
> for ppc64le. PFA the thread log file.






[jira] [Created] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone

2017-10-29 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-22395:
-

 Summary: Fix the behavior of timestamp values for Pandas to 
respect session timezone
 Key: SPARK-22395
 URL: https://issues.apache.org/jira/browse/SPARK-22395
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Takuya Ueshin


When converting a Pandas DataFrame/Series from/to a Spark DataFrame using 
{{toPandas()}} or pandas UDFs, timestamp values respect the Python system 
timezone instead of the session timezone.


For example, let's say we use {{"America/Los_Angeles"}} as the session timezone 
and have a timestamp value {{"1970-01-01 00:00:01"}} in that timezone. By the 
way, I'm in Japan, so the Python timezone would be {{"Asia/Tokyo"}}.

The timestamp value from the current {{toPandas()}} will be the following:

{noformat}
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as ts")
>>> df.show()
+-------------------+
|                 ts|
+-------------------+
|1970-01-01 00:00:01|
+-------------------+

>>> df.toPandas()
                    ts
0 1970-01-01 17:00:01
{noformat}

As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it respects 
the Python timezone.


As we discussed in https://github.com/apache/spark/pull/18664, we consider this 
behavior a bug and the value should be {{"1970-01-01 00:00:01"}}.
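
As an editorial aside, until this is fixed, a minimal workaround sketch (continuing the example above and assuming the Python system timezone really is {{"Asia/Tokyo"}} as described) is to rebase the values that {{toPandas()}} returns onto the session timezone with pandas:

{code:python}
# Hedged workaround sketch: toPandas() currently renders the values in the
# Python system timezone, so tag them with that zone, convert them to the
# session timezone, and drop the tz info again so they match df.show().
pdf = df.toPandas()
session_tz = spark.conf.get("spark.sql.session.timeZone")
pdf["ts"] = (pdf["ts"]
             .dt.tz_localize("Asia/Tokyo")   # assumed system timezone from the example
             .dt.tz_convert(session_tz)
             .dt.tz_localize(None))
print(pdf)  # 0 1970-01-01 00:00:01
{code}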







[jira] [Comment Edited] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le

2017-10-29 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224358#comment-16224358
 ] 

Xin Lu edited comment on SPARK-20000 at 10/30/17 4:04 AM:
--

I checked the dependencies and it looks like lz4-java was already updated to 1.4.0: 
https://github.com/apache/spark/blob/master/pom.xml#L538

lz4-java 1.4.0 was released on August 2nd and it looks like it included the patch 
above. This is probably resolvable now.

This should be a duplicate of this: 
https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f

[SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0)


was (Author: xynny):
I checked the dependencies and it looks like lz4-java was already updated to 1.4.0: 
https://github.com/apache/spark/blob/master/pom.xml#L538

lz4-java 1.4.0 was released on August 2nd and it looks like it included the patch 
above. This is probably resolvable now.

> Spark Hive tests aborted due to lz4-java on ppc64le
> ---
>
> Key: SPARK-20000
> URL: https://issues.apache.org/jira/browse/SPARK-20000
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>Priority: Minor
>  Labels: ppc64le
> Attachments: hs_err_pid.log
>
>
> The tests are getting aborted in the Spark Hive project with the following error:
> {code:borderStyle=solid}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0
> #
> # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 
> 1.8.0_111-8u111-b14-3~14.04.1-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x56f114]
> {code}
> In the thread log file, I found the following traces:
> Event: 3669.042 Thread 0x3fff89976800 Exception <a 'java/lang/NoClassDefFoundError': Could not initialize class 
> net.jpountz.lz4.LZ4JNI> (0x00079fcda3b8) thrown at 
> [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp,
>  line 890]
> This error is due to lz4-java (version 1.3.0), which doesn't have support 
> for ppc64le. PFA the thread log file.






[jira] [Commented] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le

2017-10-29 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224358#comment-16224358
 ] 

Xin Lu commented on SPARK-20000:


I checked the dependencies and it looks like lz4-java was already updated to 1.4.0: 
https://github.com/apache/spark/blob/master/pom.xml#L538

lz4-java 1.4.0 was released on August 2nd and it looks like it included the patch 
above. This is probably resolvable now.

> Spark Hive tests aborted due to lz4-java on ppc64le
> ---
>
> Key: SPARK-20000
> URL: https://issues.apache.org/jira/browse/SPARK-20000
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>Priority: Minor
>  Labels: ppc64le
> Attachments: hs_err_pid.log
>
>
> The tests are getting aborted in the Spark Hive project with the following error:
> {code:borderStyle=solid}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0
> #
> # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 
> 1.8.0_111-8u111-b14-3~14.04.1-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x56f114]
> {code}
> In the thread log file, I found the following traces:
> Event: 3669.042 Thread 0x3fff89976800 Exception <a 'java/lang/NoClassDefFoundError': Could not initialize class 
> net.jpountz.lz4.LZ4JNI> (0x00079fcda3b8) thrown at 
> [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp,
>  line 890]
> This error is due to lz4-java (version 1.3.0), which doesn't have support 
> for ppc64le. PFA the thread log file.






[jira] [Commented] (SPARK-22333) ColumnReference should get higher priority than timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)

2017-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224342#comment-16224342
 ] 

Apache Spark commented on SPARK-22333:
--

User 'DonnyZone' has created a pull request for this issue:
https://github.com/apache/spark/pull/19606

> ColumnReference should get higher priority than 
> timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)
> -
>
> Key: SPARK-22333
> URL: https://issues.apache.org/jira/browse/SPARK-22333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Feng Zhu
>Assignee: Feng Zhu
> Fix For: 2.3.0
>
>
> In our cluster, there is a table "T" with a column named "current_date". 
> When we select data from this column with SQL:
> {code:sql}
> select current_date from T
> {code}
> We get the wrong answer, as the column is translated to the CURRENT_DATE() 
> function.
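
As an editorial note, a hedged workaround until the fix lands (reusing the table and column names from the description): quoting the identifier with backticks makes the parser treat it as a column reference rather than a call to the CURRENT_DATE() function.

{code:python}
# Hedged workaround sketch: backquote the column name so it is parsed as a
# column reference, not as the CURRENT_DATE() time function.
spark.sql("SELECT `current_date` FROM T").show()
{code}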






[jira] [Updated] (SPARK-21625) Add incompatible Hive UDF describe to DOC

2017-10-29 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21625:

Description: 
SQRT:

{code:sql}
hive> select SQRT(-10.0);
OK
NULL
Time taken: 0.384 seconds, Fetched: 1 row(s)
{code}

{code:sql}
spark-sql> select SQRT(-10.0);
NaN
Time taken: 0.096 seconds, Fetched 1 row(s)
17/10/30 10:52:50 INFO SparkSQLCLIDriver: Time taken: 0.096 seconds, Fetched 1 
row(s)
spark-sql> 
{code}


ACOS, ASIN:
https://issues.apache.org/jira/browse/HIVE-17240

  was:
Both Hive and MySQL return NULL:

{code:sql}
hive> select SQRT(-10.0);
OK
NULL
Time taken: 0.384 seconds, Fetched: 1 row(s)
{code}


{code:sql}
mysql> select sqrt(-10.0);
+-------------+
| sqrt(-10.0) |
+-------------+
|        NULL |
+-------------+
1 row in set (0.00 sec)
{code}



> Add incompatible Hive UDF describe to DOC
> -
>
> Key: SPARK-21625
> URL: https://issues.apache.org/jira/browse/SPARK-21625
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> SQRT:
> {code:sql}
> hive> select SQRT(-10.0);
> OK
> NULL
> Time taken: 0.384 seconds, Fetched: 1 row(s)
> {code}
> {code:sql}
> spark-sql> select SQRT(-10.0);
> NaN
> Time taken: 0.096 seconds, Fetched 1 row(s)
> 17/10/30 10:52:50 INFO SparkSQLCLIDriver: Time taken: 0.096 seconds, Fetched 
> 1 row(s)
> spark-sql> 
> {code}
> 
> ACOS, ASIN:
> https://issues.apache.org/jira/browse/HIVE-17240
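
As an editorial illustration (not part of the ticket), if Hive/MySQL-compatible behaviour is needed today, the NULL-for-negative-input semantics can be emulated on the Spark side; a minimal sketch:

{code:python}
# Hedged sketch: return NULL instead of NaN for negative inputs, mimicking
# Hive/MySQL SQRT semantics. A WHEN without OTHERWISE yields NULL for
# non-matching rows.
from pyspark.sql import functions as F

df = spark.range(1).selectExpr("-10.0 AS x")
hive_like_sqrt = F.when(F.col("x") >= 0, F.sqrt("x"))
df.select(hive_like_sqrt.alias("sqrt_x")).show()  # prints null
{code}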






[jira] [Updated] (SPARK-21625) sqrt(negative number) should be null

2017-10-29 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21625:

Component/s: (was: SQL)
 Documentation

> sqrt(negative number) should be null
> 
>
> Key: SPARK-21625
> URL: https://issues.apache.org/jira/browse/SPARK-21625
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> Both Hive and MySQL return NULL:
> {code:sql}
> hive> select SQRT(-10.0);
> OK
> NULL
> Time taken: 0.384 seconds, Fetched: 1 row(s)
> {code}
> {code:sql}
> mysql> select sqrt(-10.0);
> +-------------+
> | sqrt(-10.0) |
> +-------------+
> |        NULL |
> +-------------+
> 1 row in set (0.00 sec)
> {code}






[jira] [Updated] (SPARK-21625) Add incompatible Hive UDF describe to DOC

2017-10-29 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21625:

Summary: Add incompatible Hive UDF describe to DOC  (was: sqrt(negative 
number) should be null)

> Add incompatible Hive UDF describe to DOC
> -
>
> Key: SPARK-21625
> URL: https://issues.apache.org/jira/browse/SPARK-21625
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Yuming Wang
>
> Both Hive and MySQL return NULL:
> {code:sql}
> hive> select SQRT(-10.0);
> OK
> NULL
> Time taken: 0.384 seconds, Fetched: 1 row(s)
> {code}
> {code:sql}
> mysql> select sqrt(-10.0);
> +-------------+
> | sqrt(-10.0) |
> +-------------+
> |        NULL |
> +-------------+
> 1 row in set (0.00 sec)
> {code}






[jira] [Resolved] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-29 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-22379.
---
Resolution: Resolved

> Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
> 
>
> Key: SPARK-22379
> URL: https://issues.apache.org/jira/browse/SPARK-22379
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Looks like there is some duplication in sql/tests.py:
> {code}
> diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
> index 98afae662b4..6812da6b309 100644
> --- a/python/pyspark/sql/tests.py
> +++ b/python/pyspark/sql/tests.py
> @@ -179,6 +179,18 @@ class MyObject(object):
>  self.value = value
> +class ReusedSQLTestCase(ReusedPySparkTestCase):
> +@classmethod
> +def setUpClass(cls):
> +ReusedPySparkTestCase.setUpClass()
> +cls.spark = SparkSession(cls.sc)
> +
> +@classmethod
> +def tearDownClass(cls):
> +ReusedPySparkTestCase.tearDownClass()
> +cls.spark.stop()
> +
> +
>  class DataTypeTests(unittest.TestCase):
>  # regression test for SPARK-6055
>  def test_data_type_eq(self):
> @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
>  self.assertRaises(TypeError, struct_field.typeName)
> -class SQLTests(ReusedPySparkTestCase):
> +class SQLTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
>  os.unlink(cls.tempdir.name)
> -cls.spark = SparkSession(cls.sc)
>  cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
>  cls.df = cls.spark.createDataFrame(cls.testData)
>  @classmethod
>  def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  shutil.rmtree(cls.tempdir.name, ignore_errors=True)
>  def test_sqlcontext_reuses_sparksession(self):
> @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
>  self.assertTrue(os.path.exists(metastore_path))
> -class SQLTests2(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class SQLTests2(ReusedSQLTestCase):
>  # We can't include this test into SQLTests because we will stop class's 
> SparkContext and cause
>  # other tests failed.
> @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):
>  @unittest.skipIf(not _have_arrow, "Arrow not installed")
> -class ArrowTests(ReusedPySparkTestCase):
> +class ArrowTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
>  from datetime import datetime
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  # Synchronize default timezone between Python and Java
>  cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
> @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
>  os.environ["TZ"] = tz
>  time.tzset()
> -cls.spark = SparkSession(cls.sc)
>  cls.spark.conf.set("spark.sql.session.timeZone", tz)
>  cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>  cls.schema = StructType([
> @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  if cls.tz_prev is not None:
>  os.environ["TZ"] = cls.tz_prev
>  time.tzset()
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  def assertFramesEqual(self, df_with_arrow, df_without):
>  msg = ("DataFrame from Arrow is not equal" +
> @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
> installed")
> -class VectorizedUDFTests(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class VectorizedUDFTests(ReusedSQLTestCase):
>  def test_vectorized_udf_basic(self):
>  from pyspark.sql.functions import pandas_udf, col
> @@ -3478,16 

[jira] [Commented] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-29 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224333#comment-16224333
 ] 

Takuya Ueshin commented on SPARK-22379:
---

Issue resolved by pull request 19595
https://github.com/apache/spark/pull/19595

> Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
> 
>
> Key: SPARK-22379
> URL: https://issues.apache.org/jira/browse/SPARK-22379
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Looks like there is some duplication in sql/tests.py:
> {code}
> diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
> index 98afae662b4..6812da6b309 100644
> --- a/python/pyspark/sql/tests.py
> +++ b/python/pyspark/sql/tests.py
> @@ -179,6 +179,18 @@ class MyObject(object):
>  self.value = value
> +class ReusedSQLTestCase(ReusedPySparkTestCase):
> +@classmethod
> +def setUpClass(cls):
> +ReusedPySparkTestCase.setUpClass()
> +cls.spark = SparkSession(cls.sc)
> +
> +@classmethod
> +def tearDownClass(cls):
> +ReusedPySparkTestCase.tearDownClass()
> +cls.spark.stop()
> +
> +
>  class DataTypeTests(unittest.TestCase):
>  # regression test for SPARK-6055
>  def test_data_type_eq(self):
> @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
>  self.assertRaises(TypeError, struct_field.typeName)
> -class SQLTests(ReusedPySparkTestCase):
> +class SQLTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
>  os.unlink(cls.tempdir.name)
> -cls.spark = SparkSession(cls.sc)
>  cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
>  cls.df = cls.spark.createDataFrame(cls.testData)
>  @classmethod
>  def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  shutil.rmtree(cls.tempdir.name, ignore_errors=True)
>  def test_sqlcontext_reuses_sparksession(self):
> @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
>  self.assertTrue(os.path.exists(metastore_path))
> -class SQLTests2(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class SQLTests2(ReusedSQLTestCase):
>  # We can't include this test into SQLTests because we will stop class's 
> SparkContext and cause
>  # other tests failed.
> @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):
>  @unittest.skipIf(not _have_arrow, "Arrow not installed")
> -class ArrowTests(ReusedPySparkTestCase):
> +class ArrowTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
>  from datetime import datetime
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  # Synchronize default timezone between Python and Java
>  cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
> @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
>  os.environ["TZ"] = tz
>  time.tzset()
> -cls.spark = SparkSession(cls.sc)
>  cls.spark.conf.set("spark.sql.session.timeZone", tz)
>  cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>  cls.schema = StructType([
> @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  if cls.tz_prev is not None:
>  os.environ["TZ"] = cls.tz_prev
>  time.tzset()
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  def assertFramesEqual(self, df_with_arrow, df_without):
>  msg = ("DataFrame from Arrow is not equal" +
> @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
> installed")
> -class VectorizedUDFTests(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class VectorizedUDFTests(ReusedSQLTestCase):
>  def 

[jira] [Updated] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-29 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-22379:
--
Fix Version/s: 2.3.0

> Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
> 
>
> Key: SPARK-22379
> URL: https://issues.apache.org/jira/browse/SPARK-22379
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Looks like there is some duplication in sql/tests.py:
> {code}
> diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
> index 98afae662b4..6812da6b309 100644
> --- a/python/pyspark/sql/tests.py
> +++ b/python/pyspark/sql/tests.py
> @@ -179,6 +179,18 @@ class MyObject(object):
>  self.value = value
> +class ReusedSQLTestCase(ReusedPySparkTestCase):
> +@classmethod
> +def setUpClass(cls):
> +ReusedPySparkTestCase.setUpClass()
> +cls.spark = SparkSession(cls.sc)
> +
> +@classmethod
> +def tearDownClass(cls):
> +ReusedPySparkTestCase.tearDownClass()
> +cls.spark.stop()
> +
> +
>  class DataTypeTests(unittest.TestCase):
>  # regression test for SPARK-6055
>  def test_data_type_eq(self):
> @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
>  self.assertRaises(TypeError, struct_field.typeName)
> -class SQLTests(ReusedPySparkTestCase):
> +class SQLTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
>  os.unlink(cls.tempdir.name)
> -cls.spark = SparkSession(cls.sc)
>  cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
>  cls.df = cls.spark.createDataFrame(cls.testData)
>  @classmethod
>  def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  shutil.rmtree(cls.tempdir.name, ignore_errors=True)
>  def test_sqlcontext_reuses_sparksession(self):
> @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
>  self.assertTrue(os.path.exists(metastore_path))
> -class SQLTests2(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class SQLTests2(ReusedSQLTestCase):
>  # We can't include this test into SQLTests because we will stop class's 
> SparkContext and cause
>  # other tests failed.
> @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):
>  @unittest.skipIf(not _have_arrow, "Arrow not installed")
> -class ArrowTests(ReusedPySparkTestCase):
> +class ArrowTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
>  from datetime import datetime
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  # Synchronize default timezone between Python and Java
>  cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
> @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
>  os.environ["TZ"] = tz
>  time.tzset()
> -cls.spark = SparkSession(cls.sc)
>  cls.spark.conf.set("spark.sql.session.timeZone", tz)
>  cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>  cls.schema = StructType([
> @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  if cls.tz_prev is not None:
>  os.environ["TZ"] = cls.tz_prev
>  time.tzset()
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  def assertFramesEqual(self, df_with_arrow, df_without):
>  msg = ("DataFrame from Arrow is not equal" +
> @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
> installed")
> -class VectorizedUDFTests(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class VectorizedUDFTests(ReusedSQLTestCase):
>  def test_vectorized_udf_basic(self):
>  from pyspark.sql.functions import pandas_udf, col
> @@ -3478,16 

[jira] [Assigned] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests

2017-10-29 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-22379:
-

Assignee: Hyukjin Kwon

> Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
> 
>
> Key: SPARK-22379
> URL: https://issues.apache.org/jira/browse/SPARK-22379
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> Looks like there is some duplication in sql/tests.py:
> {code}
> diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
> index 98afae662b4..6812da6b309 100644
> --- a/python/pyspark/sql/tests.py
> +++ b/python/pyspark/sql/tests.py
> @@ -179,6 +179,18 @@ class MyObject(object):
>  self.value = value
> +class ReusedSQLTestCase(ReusedPySparkTestCase):
> +@classmethod
> +def setUpClass(cls):
> +ReusedPySparkTestCase.setUpClass()
> +cls.spark = SparkSession(cls.sc)
> +
> +@classmethod
> +def tearDownClass(cls):
> +ReusedPySparkTestCase.tearDownClass()
> +cls.spark.stop()
> +
> +
>  class DataTypeTests(unittest.TestCase):
>  # regression test for SPARK-6055
>  def test_data_type_eq(self):
> @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase):
>  self.assertRaises(TypeError, struct_field.typeName)
> -class SQLTests(ReusedPySparkTestCase):
> +class SQLTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
>  os.unlink(cls.tempdir.name)
> -cls.spark = SparkSession(cls.sc)
>  cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
>  cls.df = cls.spark.createDataFrame(cls.testData)
>  @classmethod
>  def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  shutil.rmtree(cls.tempdir.name, ignore_errors=True)
>  def test_sqlcontext_reuses_sparksession(self):
> @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests):
>  self.assertTrue(os.path.exists(metastore_path))
> -class SQLTests2(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class SQLTests2(ReusedSQLTestCase):
>  # We can't include this test into SQLTests because we will stop class's 
> SparkContext and cause
>  # other tests failed.
> @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase):
>  @unittest.skipIf(not _have_arrow, "Arrow not installed")
> -class ArrowTests(ReusedPySparkTestCase):
> +class ArrowTests(ReusedSQLTestCase):
>  @classmethod
>  def setUpClass(cls):
>  from datetime import datetime
> -ReusedPySparkTestCase.setUpClass()
> +ReusedSQLTestCase.setUpClass()
>  # Synchronize default timezone between Python and Java
>  cls.tz_prev = os.environ.get("TZ", None)  # save current tz if set
> @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase):
>  os.environ["TZ"] = tz
>  time.tzset()
> -cls.spark = SparkSession(cls.sc)
>  cls.spark.conf.set("spark.sql.session.timeZone", tz)
>  cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>  cls.schema = StructType([
> @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  if cls.tz_prev is not None:
>  os.environ["TZ"] = cls.tz_prev
>  time.tzset()
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +ReusedSQLTestCase.tearDownClass()
>  def assertFramesEqual(self, df_with_arrow, df_without):
>  msg = ("DataFrame from Arrow is not equal" +
> @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase):
>  @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
> installed")
> -class VectorizedUDFTests(ReusedPySparkTestCase):
> -
> -@classmethod
> -def setUpClass(cls):
> -ReusedPySparkTestCase.setUpClass()
> -cls.spark = SparkSession(cls.sc)
> -
> -@classmethod
> -def tearDownClass(cls):
> -ReusedPySparkTestCase.tearDownClass()
> -cls.spark.stop()
> +class VectorizedUDFTests(ReusedSQLTestCase):
>  def test_vectorized_udf_basic(self):
>  from pyspark.sql.functions import pandas_udf, col
> @@ -3478,16 +3466,7 @@ class 

[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224301#comment-16224301
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

Well, uninstall is just removing `sparkCachePath()/` -- should be 
relatively easy to put together?

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package, it leaves behind files in /tmp, 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below:
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}






[jira] [Assigned] (SPARK-22394) Redundant synchronization for metastore access

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22394:


Assignee: (was: Apache Spark)

> Redundant synchronization for metastore access
> --
>
> Key: SPARK-22394
> URL: https://issues.apache.org/jira/browse/SPARK-22394
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Before Spark 2.x, metastore access was synchronized at 
> [line229 in ClientWrapper  
> |https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229]
>  (now it's at [line203 in HiveClientWrapper 
> |https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]).
>  After Spark 2.x, HiveExternalCatalog was introduced by 
> [SPARK-13080|https://github.com/apache/spark/pull/11293], where an extra 
> level of synchronization was added at 
> [line95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95].
>  That is, now we have two levels of synchronization: one is 
> HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl. 
> But since both HiveExternalCatalog and IsolatedClientLoader are shared among 
> all Spark sessions, I think the extra level of synchronization in 
> HiveExternalCatalog is redundant and can be removed.






[jira] [Commented] (SPARK-22394) Redundant synchronization for metastore access

2017-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224296#comment-16224296
 ] 

Apache Spark commented on SPARK-22394:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19605

> Redundant synchronization for metastore access
> --
>
> Key: SPARK-22394
> URL: https://issues.apache.org/jira/browse/SPARK-22394
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Before Spark 2.x, metastore access was synchronized at 
> [line229 in ClientWrapper  
> |https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229]
>  (now it's at [line203 in HiveClientWrapper 
> |https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]).
>  After Spark 2.x, HiveExternalCatalog was introduced by 
> [SPARK-13080|https://github.com/apache/spark/pull/11293], where an extra 
> level of synchronization was added at 
> [line95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95].
>  That is, now we have two levels of synchronization: one is 
> HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl. 
> But since both HiveExternalCatalog and IsolatedClientLoader are shared among 
> all Spark sessions, I think the extra level of synchronization in 
> HiveExternalCatalog is redundant and can be removed.






[jira] [Assigned] (SPARK-22394) Redundant synchronization for metastore access

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22394:


Assignee: Apache Spark

> Redundant synchronization for metastore access
> --
>
> Key: SPARK-22394
> URL: https://issues.apache.org/jira/browse/SPARK-22394
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> Before Spark 2.x, metastore access was synchronized at 
> [line229 in ClientWrapper  
> |https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229]
>  (now it's at [line203 in HiveClientWrapper 
> |https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]).
>  After Spark 2.x, HiveExternalCatalog was introduced by 
> [SPARK-13080|https://github.com/apache/spark/pull/11293], where an extra 
> level of synchronization was added at 
> [line95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95].
>  That is, now we have two levels of synchronization: one is 
> HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl. 
> But since both HiveExternalCatalog and IsolatedClientLoader are shared among 
> all Spark sessions, I think the extra level of synchronization in 
> HiveExternalCatalog is redundant and can be removed.






[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224289#comment-16224289
 ] 

Apache Spark commented on SPARK-22291:
--

User 'jmchung' has created a pull request for this issue:
https://github.com/apache/spark/pull/19604

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Jen-Ming Chung
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, so I'm getting the error above when trying to save the data to 
> Cassandra.
> However, creating this same table on Cassandra works fine, with user_ids as a 
> list.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in the class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at 

[jira] [Commented] (SPARK-22365) Spark UI executors empty list with 500 error

2017-10-29 Thread guoxiaolongzte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224284#comment-16224284
 ] 

guoxiaolongzte commented on SPARK-22365:


Could you provide a screenshot to help other people understand the problem? 
Thank you.

> Spark UI executors empty list with 500 error
> 
>
> Key: SPARK-22365
> URL: https://issues.apache.org/jira/browse/SPARK-22365
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Jakub Dubovsky
>
> No data is loaded on the "executors" tab in the Spark UI; the stack trace is below. Apart 
> from the exception I have nothing more, but if I can test something to make this 
> easier to resolve I am happy to help.
> {noformat}
> java.lang.NullPointerException
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-22308:
--

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
> Fix For: 2.3.0
>
>
> External codebases that contain Spark code can test it using SharedSparkContext 
> no matter how they write their ScalaTest suites - whether based on FunSuite, FunSpec, 
> FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.
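Not part of the ticket itself, but for illustration: a minimal sketch of the kind of suite the reporter describes, written against ScalaTest's WordSpec and managing its own local SparkSession (class and app names here are made up, and this is not the SharedSQLContext trait):

{code}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, WordSpec}

class WordSpecExampleSuite extends WordSpec with BeforeAndAfterAll {
  private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    // Local session so the suite stays self-contained.
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("wordspec-example")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    if (spark != null) spark.stop()
  }

  "a local SparkSession" should {
    "run a trivial query" in {
      assert(spark.range(10).count() === 10L)
    }
  }
}
{code}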



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224248#comment-16224248
 ] 

Jen-Ming Chung commented on SPARK-22291:


Thank you all :)

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Jen-Ming Chung
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error above when I try to save the data 
> to Cassandra.
> However, creating the same table on Cassandra (with user_ids as a list) works fine.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
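As a stop-gap for anyone hitting this before picking up the fix, one hedged workaround sketch (not the committed change; the JDBC URL, credentials, and table/column names are placeholders, and a SparkSession named {{spark}} is assumed, as in spark-shell): cast the uuid[] column to text[] inside a pushed-down subquery so the PostgreSQL driver hands Spark plain strings rather than java.util.UUID values.

{code}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")   // placeholder URL
  .option("user", "spark")                                  // placeholder credentials
  .option("password", "secret")
  // Push the cast into PostgreSQL via a subquery so the driver returns text[].
  .option("dbtable", "(SELECT id, user_ids::text[] AS user_ids FROM my_table) AS t")
  .load()
{code}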

[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224247#comment-16224247
 ] 

Liang-Chi Hsieh commented on SPARK-22291:
-

Thanks [~hyukjin.kwon].

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Jen-Ming Chung
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error above when I try to save the data 
> to Cassandra.
> However, creating the same table on Cassandra (with user_ids as a list) works fine.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 

[jira] [Assigned] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22291:


Assignee: Jen-Ming Chung  (was: Fabio J. Walter)

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Jen-Ming Chung
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error above when I try to save the data 
> to Cassandra.
> However, creating the same table on Cassandra (with user_ids as a list) works fine.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 

[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224246#comment-16224246
 ] 

Hyukjin Kwon commented on SPARK-22291:
--

I happened to see this comment first and have just updated the assignee.

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Jen-Ming Chung
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error above when I try to save the data 
> to Cassandra.
> However, creating the same table on Cassandra (with user_ids as a list) works fine.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 

[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224245#comment-16224245
 ] 

Liang-Chi Hsieh commented on SPARK-22291:
-

[~cloud_fan] The Assignee should be [~jmchung]. Thanks.

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Fabio J. Walter
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error above when I try to save the data 
> to Cassandra.
> However, creating the same table on Cassandra (with user_ids as a list) works fine.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 

[jira] [Reopened] (SPARK-15689) Data source API v2

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-15689:
-

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating v2 of the data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface and depends on 
> DataFrame/SQLContext, making data source API compatibility depend on 
> the upper-level API. The current data source API is also row-oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-10-29 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224224#comment-16224224
 ] 

Wenchen Fan commented on SPARK-15689:
-

Ah, I missed this one; reopening this ticket. My concern is that follow-ups should 
not block the 2.3 release, while the basic data source v2 infrastructure should.

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating v2 of the data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface and depends on 
> DataFrame/SQLContext, making data source API compatibility depend on 
> the upper-level API. The current data source API is also row-oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-10-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224194#comment-16224194
 ] 

Sean Owen commented on SPARK-22393:
---

I'm guessing it's something to do with how spark-shell overrides the shell 
initialization or the classloader. It could be worth trying the 2.12 build and 
shell, as that shell integration is a little less hacky. But really, no idea off 
the top of my head.

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-10-29 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224158#comment-16224158
 ] 

Ryan Williams commented on SPARK-22393:
---

Everything works fine in a Scala shell ({{scala -cp 
$SPARK_HOME/jars/spark-core_2.11-2.2.0.jar}}) and via {{sbt console}} in a 
project that depends on Spark, so the problem seems specific to {{spark-shell}}.

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224139#comment-16224139
 ] 

Felix Cheung commented on SPARK-22344:
--

Kind of, though we don't have any uninstall feature.



> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package, it leaves behind files in /tmp, 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below:
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2017-10-29 Thread Matteo Cossu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224124#comment-16224124
 ] 

Matteo Cossu commented on SPARK-2465:
-

For example, with this limitation it is not possible to use 
_monotonically_increasing_id_ to generate the IDs, since the values it produces are longs. 
Therefore, one has to go back to the RDD API and use zipWithIndex.
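A minimal sketch of that RDD-based workaround (illustrative only, assuming a spark-shell session where {{sc}} exists; the toy data and the narrowing of the Long index to Int, which assumes fewer than 2^31 distinct keys, are my own additions, not from this thread):

{code}
// Hypothetical (userKey, itemKey, rating) triples with non-numeric ids.
val raw = sc.parallelize(Seq(("alice", "item-1", 5.0), ("bob", "item-2", 3.0)))

// zipWithIndex returns Long indices, so narrow to Int for the current ALS API.
// collectAsMap assumes the distinct key sets fit in driver memory.
val userIds = raw.map(_._1).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap()
val itemIds = raw.map(_._2).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap()

import org.apache.spark.mllib.recommendation.{ALS, Rating}
val ratings = raw.map { case (u, i, r) => Rating(userIds(u), itemIds(i), r) }
val model = ALS.train(ratings, 10, 5, 0.01)   // rank, iterations, lambda
{code}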

> Use long as user / item ID for ALS
> --
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Priority: Minor
> Attachments: ALS using MEMORY_AND_DISK.png, ALS using 
> MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason for is that identifiers are not generally numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-10-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224122#comment-16224122
 ] 

Reynold Xin commented on SPARK-15689:
-

Why not put all of them as subtasks here?

Also https://issues.apache.org/jira/browse/SPARK-22078 is not done.


> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating v2 of the data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface and depends on 
> DataFrame/SQLContext, making data source API compatibility depend on 
> the upper-level API. The current data source API is also row-oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22394) Redundant synchronization for metastore access

2017-10-29 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224120#comment-16224120
 ] 

Wenchen Fan commented on SPARK-22394:
-

Looks like it. Can you send a PR? Thanks!

> Redundant synchronization for metastore access
> --
>
> Key: SPARK-22394
> URL: https://issues.apache.org/jira/browse/SPARK-22394
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Before Spark 2.x, metastore access was synchronized at 
> [line 229 in ClientWrapper|https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229] 
> (it is now at [line 203 in HiveClientImpl|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]). 
> After Spark 2.x, HiveExternalCatalog was introduced by 
> [SPARK-13080|https://github.com/apache/spark/pull/11293], which added an extra 
> level of synchronization at 
> [line 95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95]. 
> That is, we now have two levels of synchronization: one in 
> HiveExternalCatalog and the other in IsolatedClientLoader, used by HiveClientImpl. 
> Since both HiveExternalCatalog and IsolatedClientLoader are shared among 
> all Spark sessions, I think the extra level of synchronization in 
> HiveExternalCatalog is redundant and can be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22291.
-
   Resolution: Fixed
 Assignee: Fabio J. Walter
Fix Version/s: 2.3.0

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Fabio J. Walter
>  Labels: patch, postgresql, sql
> Fix For: 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error above when I try to save the data 
> to Cassandra.
> However, creating the same table on Cassandra (with user_ids as a list) works fine.
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 

[jira] [Updated] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22393:
--
Priority: Minor  (was: Major)

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-10-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224040#comment-16224040
 ] 

Sean Owen commented on SPARK-22393:
---

That's a weird one. {{class P(p: org.apache.spark.Partition)}} works fine, as 
does {{ {import org.apache.spark.Partition; class P(p: Partition)} }}. I think 
this is some subtlety of how the Scala shell interpreter works.
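For reference, the two workarounds mentioned above as spark-shell input (REPL output omitted; behavior may vary across the affected versions):

{code}
scala> class P(p: org.apache.spark.Partition)                         // fully qualified type name compiles

scala> { import org.apache.spark.Partition; class P(p: Partition) }   // import inside the block compiles
{code}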

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22394) Redundant synchronization for metastore access

2017-10-29 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224028#comment-16224028
 ] 

Zhenhua Wang commented on SPARK-22394:
--

[~cloud_fan] [~smilegator] [~rxin] Do I understand it correctly, or am I missing 
something?

> Redundant synchronization for metastore access
> --
>
> Key: SPARK-22394
> URL: https://issues.apache.org/jira/browse/SPARK-22394
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Before Spark 2.x, metastore access was synchronized at 
> [line 229 in ClientWrapper|https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229] 
> (it is now at [line 203 in HiveClientImpl|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]). 
> After Spark 2.x, HiveExternalCatalog was introduced by 
> [SPARK-13080|https://github.com/apache/spark/pull/11293], which added an extra 
> level of synchronization at 
> [line 95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95]. 
> That is, we now have two levels of synchronization: one in 
> HiveExternalCatalog and the other in IsolatedClientLoader, used by HiveClientImpl. 
> Since both HiveExternalCatalog and IsolatedClientLoader are shared among 
> all Spark sessions, I think the extra level of synchronization in 
> HiveExternalCatalog is redundant and can be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22394) Redundant synchronization for metastore access

2017-10-29 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-22394:


 Summary: Redundant synchronization for metastore access
 Key: SPARK-22394
 URL: https://issues.apache.org/jira/browse/SPARK-22394
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang


Before Spark 2.x, metastore access was synchronized at 
[line 229 in ClientWrapper|https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229] 
(it is now at [line 203 in HiveClientImpl|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]). 
After Spark 2.x, HiveExternalCatalog was introduced by 
[SPARK-13080|https://github.com/apache/spark/pull/11293], which added an extra level 
of synchronization at 
[line 95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95]. 
That is, we now have two levels of synchronization: one in HiveExternalCatalog 
and the other in IsolatedClientLoader, used by HiveClientImpl. Since both 
HiveExternalCatalog and IsolatedClientLoader are shared among all Spark 
sessions, I think the extra level of synchronization in HiveExternalCatalog is 
redundant and can be removed.
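A schematic sketch of the double-locking pattern described above (simplified stand-ins, not Spark's actual classes or method names):

{code}
class Client {
  private val clientLock = new Object
  // Inner lock: every metastore call is already serialized here.
  def getTable(name: String): String = clientLock.synchronized {
    s"table:$name"   // stand-in for the real metastore call
  }
}

class Catalog(client: Client) {
  // Outer lock: redundant if `client` already guards every call on its own lock.
  private def withClient[T](body: => T): T = synchronized { body }
  def getTable(name: String): String = withClient(client.getTable(name))
}
{code}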



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-10-29 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams updated SPARK-22393:
--
Affects Version/s: (was: 2.0.0)
   2.0.2

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-10-29 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-22393:
-

 Summary: spark-shell can't find imported types in class 
constructors, extends clause
 Key: SPARK-22393
 URL: https://issues.apache.org/jira/browse/SPARK-22393
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.2.0, 2.1.2, 2.0.0
Reporter: Ryan Williams


{code}
$ spark-shell
…
scala> import org.apache.spark.Partition
import org.apache.spark.Partition

scala> class P(p: Partition)
:11: error: not found: type Partition
   class P(p: Partition)
  ^

scala> class P(val index: Int) extends Partition
:11: error: not found: type Partition
   class P(val index: Int) extends Partition
   ^
{code}

Any class that I {{import}} gives "not found: type ___" when used as a 
parameter to a class, or in an extends clause; this applies to classes I import 
from JARs I provide via {{--jars}} as well as core Spark classes as above.

This worked in 1.6.3 but has been broken since 2.0.0.
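A workaround worth trying (untested here, so treat it as an assumption) is to bypass the 
import and refer to the type by its fully qualified name, which does not depend on how 
the REPL handles imports:
{code}
scala> // possible workaround (untested here): use the fully qualified name instead of the import
scala> class P(p: org.apache.spark.Partition)

scala> class Q(val index: Int) extends org.apache.spark.Partition
{code}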



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15689) Data source API v2

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15689.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating v2 of the data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface that depends on 
> DataFrame/SQLContext, making its compatibility depend on the upper-level API. 
> The current data source API is also only row-oriented and has to go through 
> an expensive conversion from external data types to internal data types.
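As a rough illustration of these goals, the sketch below shows how small such a reader 
contract could be. The trait and method names are illustrative only and are not the 
interfaces that actually shipped:
{code:java}
// Conceptual sketch only -- illustrative names, not the real Data Source V2 interfaces.
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Goal 2: stand-in for a columnar batch of rows.
trait SketchColumnBatch

// Goal 1: a tiny surface with no dependency on DataFrame/SQLContext.
trait SketchReadSupport {
  def createReader(options: Map[String, String]): SketchReader
}

trait SketchReader {
  def schema: StructType
  // Goal 3: push filters down; return the ones the source could not handle.
  def pushFilters(filters: Array[Filter]): Array[Filter]
  // Goal 2: data is produced as column batches for performance.
  def planBatches(): Seq[Iterator[SketchColumnBatch]]
}
{code}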



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-10-29 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223988#comment-16223988
 ] 

Wenchen Fan commented on SPARK-15689:
-

The basic read/write interfaces are done. I'm resolving this ticket and tracking the 
follow-ups in https://issues.apache.org/jira/browse/SPARK-22386

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating v2 of the data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface that depends on 
> DataFrame/SQLContext, making its compatibility depend on the upper-level API. 
> The current data source API is also only row-oriented and has to go through 
> an expensive conversion from external data types to internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22392) columnar reader interface

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22392:
---

 Summary: columnar reader interface 
 Key: SPARK-22392
 URL: https://issues.apache.org/jira/browse/SPARK-22392
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22391) add `MetadataCreationSupport` trait to separate data and metadata handling at write path

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22391:

Description: please refer to the discussion in the dev list with this 
email: *[discuss] Data Source V2 write path*  (was: please refer to the 
discussion and the dev list with this email: *[discuss] Data Source V2 write 
path*)

> add `MetadataCreationSupport` trait to separate data and metadata handling at 
> write path
> 
>
> Key: SPARK-22391
> URL: https://issues.apache.org/jira/browse/SPARK-22391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> please refer to the discussion in the dev list with this email: *[discuss] 
> Data Source V2 write path*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-29 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223987#comment-16223987
 ] 

Wenchen Fan commented on SPARK-21657:
-

I'd say they are different issues. I haven't figured out the cause of this issue yet, 
and I want to fix that smaller issue first.

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> After 1000 elements per nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22391) add `MetadataCreationSupport` trait to separate data and metadata handling at write path

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22391:

Description: please refer to the discussion and the dev list with this 
email: *[discuss] Data Source V2 write path*

> add `MetadataCreationSupport` trait to separate data and metadata handling at 
> write path
> 
>
> Key: SPARK-22391
> URL: https://issues.apache.org/jira/browse/SPARK-22391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> please refer to the discussion and the dev list with this email: *[discuss] 
> Data Source V2 write path*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22391) add `MetadataCreationSupport` trait to separate data and metadata handling at write path

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22391:
---

 Summary: add `MetadataCreationSupport` trait to separate data and 
metadata handling at write path
 Key: SPARK-22391
 URL: https://issues.apache.org/jira/browse/SPARK-22391
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22389) partitioning reporting

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22389:
---

Assignee: Wenchen Fan

> partitioning reporting
> --
>
> Key: SPARK-22389
> URL: https://issues.apache.org/jira/browse/SPARK-22389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> We should allow data sources to report their partitioning so that Spark can 
> avoid an unnecessary shuffle on its side.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22390) Aggregate push down

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22390:
---

 Summary: Aggregate push down
 Key: SPARK-22390
 URL: https://issues.apache.org/jira/browse/SPARK-22390
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22389) partitioning reporting

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22389:
---

 Summary: partitioning reporting
 Key: SPARK-22389
 URL: https://issues.apache.org/jira/browse/SPARK-22389
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan


We should allow data sources to report their partitioning so that Spark can avoid an 
unnecessary shuffle on its side.
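To make the idea concrete, here is a purely illustrative sketch (not the actual 
interface): a source declares how its output is already partitioned, and the planner 
skips the exchange when that partitioning matches the keys it needs:
{code:java}
// Illustrative sketch only -- names and logic are placeholders, not the real interface.
trait SketchReportsPartitioning {
  def partitionColumns: Seq[String]   // how the source's output is already partitioned
  def numPartitions: Int
}

object SketchShufflePlanning {
  // Simplified rule: treat only an exact match on the required keys as "no shuffle needed".
  def needsShuffle(source: SketchReportsPartitioning, requiredKeys: Seq[String]): Boolean =
    source.partitionColumns.toSet != requiredKeys.toSet
}
{code}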



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22387:

Description: 
This is an open discussion. The general idea is that we should allow users to set 
some common configs in the session conf so that they don't need to type them again 
and again for each data source operation.

Proposal 1:
Propagate every session config which starts with {{spark.datasource.config.}} 
to the data source options. The downside is that users may only want to set some 
common configs for a specific data source.

Proposal 2:
Propagate session configs which start with 
{{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. 
One downside is that some data sources may not have a short name, which makes the 
config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}.

Proposal 3:
Introduce a trait `WithSessionConfig` which defines the session config key prefix. 
Then we can pick the session configs with this key prefix and propagate them to this 
particular data source.

Another thing worth thinking about: sometimes it's really annoying if users 
have a typo in the config key and spend a lot of time figuring out why things 
don't work as expected. We should allow data sources to validate the given 
options and throw an exception if an option can't be recognized.

  was:
This is an open discussion. The general idea is we should allow users to set 
some common configs in session conf so that they don't need to type them again 
and again for each data source operations.

Proposal 1:
propagate every session config which starts with {{spark.datasource.config.}} 
to data source options. The downside is, users may only want to set some common 
configs for a specific data source.

Proposal 2:
propagate session config which starts with 
{{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. 
One downside is, some data source may not have a short name and makes the 
config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}.

Proposal 3:
Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key-prefix and propagate it to this 
particular data source.

One another thing also worth to think: sometimes it's really annoying if users 
have a type in the config key and spent 


> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> This is an open discussion. The general idea is that we should allow users to set 
> some common configs in the session conf so that they don't need to type them 
> again and again for each data source operation.
> Proposal 1:
> Propagate every session config which starts with {{spark.datasource.config.}} 
> to the data source options. The downside is that users may only want to set some 
> common configs for a specific data source.
> Proposal 2:
> Propagate session configs which start with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is that some data sources may not have a short name, 
> which makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines the session config key 
> prefix. Then we can pick the session configs with this key prefix and propagate 
> them to this particular data source.
> Another thing worth thinking about: sometimes it's really annoying if 
> users have a typo in the config key and spend a lot of time figuring out why 
> things don't work as expected. We should allow data sources to validate the 
> given options and throw an exception if an option can't be recognized.
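To make Proposal 3 concrete, here is a hypothetical sketch of what such a trait and the 
config propagation could look like; the names and the key prefix are illustrative only, 
not an agreed API:
{code:java}
// Hypothetical sketch of Proposal 3 -- names and keys are illustrative only.
trait WithSessionConfig {
  // e.g. "spark.datasource.myDataSource." for a source registered under the short name "myDataSource"
  def keyPrefix: String
}

object SessionConfigPropagation {
  // Pick the session configs under the source's prefix and expose them as data source options.
  def extract(sessionConf: Map[String, String], source: WithSessionConfig): Map[String, String] =
    sessionConf.collect {
      case (key, value) if key.startsWith(source.keyPrefix) =>
        key.stripPrefix(source.keyPrefix) -> value
    }
}
{code}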



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22388) Limit push down

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22388:
---

 Summary: Limit push down
 Key: SPARK-22388
 URL: https://issues.apache.org/jira/browse/SPARK-22388
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22387:

Description: 
This is an open discussion. The general idea is we should allow users to set 
some common configs in session conf so that they don't need to type them again 
and again for each data source operations.

Proposal 1:
propagate every session config which starts with {{spark.datasource.config.}} 
to data source options. The downside is, users may only want to set some common 
configs for a specific data source.

Proposal 2:
propagate session config which starts with 
{{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. 
One downside is, some data source may not have a short name and makes the 
config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}.

Proposal 3:
Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key-prefix and propagate it to this 
particular data source.

One another thing also worth to think: sometimes it's really annoying if users 
have a type in the config key and spent 

  was:
This is an open discussion. The general idea is we should allow users to set 
some common configs in session conf so that they don't need to type them again 
and again for each data source operations.

Proposal 1:
propagate every session config which starts with {{spark.datasource.config.}} 
to data source options. The downside is, users may only want to set some common 
configs for a specific data source.

Proposal 2:
propagate session config which starts with 
{{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. 
One downside is, some data source may not have a short name and makes the 
config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}.

Proposal 3:
Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key-prefix and propagate it to this 
particular data source.

One another thing also worth to think: sometimes it's really awful if users 
have a type in the config key and spent 


> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> This is an open discussion. The general idea is we should allow users to set 
> some common configs in session conf so that they don't need to type them 
> again and again for each data source operations.
> Proposal 1:
> propagate every session config which starts with {{spark.datasource.config.}} 
> to data source options. The downside is, users may only want to set some 
> common configs for a specific data source.
> Proposal 2:
> propagate session config which starts with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is, some data source may not have a short name and 
> makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines session config key 
> prefix. Then we can pick session configs with this key-prefix and propagate 
> it to this particular data source.
> One another thing also worth to think: sometimes it's really annoying if 
> users have a type in the config key and spent 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22387:

Description: 
This is an open discussion. The general idea is we should allow users to set 
some common configs in session conf so that they don't need to type them again 
and again for each data source operations.

Proposal 1:
propagate every session config which starts with {{spark.datasource.config.}} 
to data source options. The downside is, users may only want to set some common 
configs for a specific data source.

Proposal 2:
propagate session config which starts with 
{{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. 
One downside is, some data source may not have a short name and makes the 
config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}.

Proposal 3:
Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key-prefix and propagate it to this 
particular data source.

One another thing also worth to think: sometimes it's really awful if users 
have a type in the config key and spent 

  was:
This is an open discussion. The general idea is we should allow users to set 
some common configs in session conf so that they don't need to type them again 
and again for each data source operations.

Proposal 1:
propagate every session config which starts with {{spark.datasource.config.}} 
to data source options. The downside is, users may only want to set some common 
configs for a specific data source.

Proposal 2:
propagate session config which starts with 
{{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. 
One downside is, some data source may not have a short name and makes the 
config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}.

Proposal 3:
Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key-prefix and propagate it to this 
particular data sourcde.


> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> This is an open discussion. The general idea is we should allow users to set 
> some common configs in session conf so that they don't need to type them 
> again and again for each data source operations.
> Proposal 1:
> propagate every session config which starts with {{spark.datasource.config.}} 
> to data source options. The downside is, users may only want to set some 
> common configs for a specific data source.
> Proposal 2:
> propagate session config which starts with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is, some data source may not have a short name and 
> makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines session config key 
> prefix. Then we can pick session configs with this key-prefix and propagate 
> it to this particular data source.
> One another thing also worth to think: sometimes it's really awful if users 
> have a type in the config key and spent 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22387:

Description: 
This is an open discussion. The general idea is we should allow users to set 
some common configs in session conf so that they don't need to type them again 
and again for each data source operations.

Proposal 1:
propagate every session config which starts with {{spark.datasource.config.}} 
to data source options. The downside is, users may only want to set some common 
configs for a specific data source.

Proposal 2:
propagate session config which starts with 
{{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. 
One downside is, some data source may not have a short name and makes the 
config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}.

Proposal 3:
Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key-prefix and propagate it to this 
particular data sourcde.

> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> This is an open discussion. The general idea is we should allow users to set 
> some common configs in session conf so that they don't need to type them 
> again and again for each data source operations.
> Proposal 1:
> propagate every session config which starts with {{spark.datasource.config.}} 
> to data source options. The downside is, users may only want to set some 
> common configs for a specific data source.
> Proposal 2:
> propagate session config which starts with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is, some data source may not have a short name and 
> makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines session config key 
> prefix. Then we can pick session configs with this key-prefix and propagate 
> it to this particular data sourcde.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22387) propagate session configs to data source read/write options

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22387:
---

 Summary: propagate session configs to data source read/write 
options
 Key: SPARK-22387
 URL: https://issues.apache.org/jira/browse/SPARK-22387
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22386) Data Source V2 improvements

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22386:
---

 Summary: Data Source V2 improvements
 Key: SPARK-22386
 URL: https://issues.apache.org/jira/browse/SPARK-22386
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223971#comment-16223971
 ] 

Sean Owen commented on SPARK-21657:
---

Thanks [~cloud_fan] for the fast look. You're saying that 
https://issues.apache.org/jira/browse/SPARK-22385 is a superset of this issue?

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> After 1000 elements per nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22385) MapObjects should not access list element by index

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22385:


Assignee: Wenchen Fan  (was: Apache Spark)

> MapObjects should not access list element by index
> --
>
> Key: SPARK-22385
> URL: https://issues.apache.org/jira/browse/SPARK-22385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22385) MapObjects should not access list element by index

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22385:


Assignee: Apache Spark  (was: Wenchen Fan)

> MapObjects should not access list element by index
> --
>
> Key: SPARK-22385
> URL: https://issues.apache.org/jira/browse/SPARK-22385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22385) MapObjects should not access list element by index

2017-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223958#comment-16223958
 ] 

Apache Spark commented on SPARK-22385:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19603

> MapObjects should not access list element by index
> --
>
> Key: SPARK-22385
> URL: https://issues.apache.org/jira/browse/SPARK-22385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22385) MapObjects should not access list element by index

2017-10-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22385:

Issue Type: Improvement  (was: Bug)

> MapObjects should not access list element by index
> --
>
> Key: SPARK-22385
> URL: https://issues.apache.org/jira/browse/SPARK-22385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22385) MapObjects should not access list element by index

2017-10-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22385:
---

 Summary: MapObjects should not access list element by index
 Key: SPARK-22385
 URL: https://issues.apache.org/jira/browse/SPARK-22385
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-29 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223946#comment-16223946
 ] 

Ohad Raviv commented on SPARK-21657:


Sure,
the plan for
{code:java}
val df_exploded = df.select(expr("c1"), 
explode($"c_arr").as("c2")).selectExpr("c1" ,"c2.*")
{code}
is 
{noformat}
== Parsed Logical Plan ==
'Project [unresolvedalias('c1, None), ArrayBuffer(c2).*]
+- Project [c1#6, c2#25]
   +- Generate explode(c_arr#7), true, false, [c2#25]
  +- Project [_1#3 AS c1#6, _2#4 AS c_arr#7]
 +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true) AS _1#3, 
mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class 
scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))) null else 
named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._1, true), _2, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._2, true), _3, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._3, true), _4, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._4, true)), 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#4]
+- ExternalRDD [obj#2]

== Analyzed Logical Plan ==
c1: string, _1: string, _2: string, _3: string, _4: string
Project [c1#6, c2#25._1 AS _1#40, c2#25._2 AS _2#41, c2#25._3 AS _3#42, 
c2#25._4 AS _4#43]
+- Project [c1#6, c2#25]
   +- Generate explode(c_arr#7), true, false, [c2#25]
  +- Project [_1#3 AS c1#6, _2#4 AS c_arr#7]
 +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true) AS _1#3, 
mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class 
scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))) null else 
named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._1, true), _2, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._2, true), _3, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._3, true), _4, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._4, true)), 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#4]
+- ExternalRDD [obj#2]

== Optimized Logical Plan ==
Project [c1#6, c2#25._1 AS _1#40, c2#25._2 AS _2#41, c2#25._3 AS _3#42, 
c2#25._4 AS _4#43]
+- Generate explode(c_arr#7), true, false, [c2#25]
   +- Project [_1#3 AS c1#6, _2#4 AS c_arr#7]
  +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, scala.Tuple2, true])._1, true) AS _1#3, 
mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class 
scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))) null else 
named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._1, true), _2, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._2, true), _3, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, 
MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), 

[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223927#comment-16223927
 ] 

Sean Owen commented on SPARK-21657:
---

Can you paste the plans? This difference might be down to a different cause.

The linear-time-access List issue still looks worth solving. [~hvanhovell] 
[~cloud_fan], are either of you familiar with how the explode code is generated? 
I also couldn't quite figure out what was generating access to a linked list 
(immutable.List) where a random-access collection looks more appropriate.
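For reference, here is a small standalone sketch (illustrative only, not the generated 
code) of why per-element indexed access into an immutable.List hurts: List.apply(i) walks 
i nodes, so indexing every element is quadratic, while a single conversion to an 
IndexedSeq keeps the loop linear:
{code:java}
// Illustrative only: cost of indexed access into a linked List vs an IndexedSeq.
val n = 50000
val asList: List[Int] = (1 to n).toList              // linked list: apply(i) walks i nodes
val asIndexed: IndexedSeq[Int] = asList.toIndexedSeq // one O(n) conversion, then O(1) access

def millis[T](body: => T): Long = {
  val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1000000
}

val slow = millis { var s = 0L; var i = 0; while (i < n) { s += asList(i); i += 1 }; s }    // ~O(n^2)
val fast = millis { var s = 0L; var i = 0; while (i < n) { s += asIndexed(i); i += 1 }; s } // ~O(n)
println(s"indexed List: $slow ms, IndexedSeq: $fast ms")
{code}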

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> After 1000 elements per nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0

2017-10-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22380.
---
Resolution: Won't Fix

You need to shade your dependencies in your app, not in Spark; look at the 
maven-shade-plugin.
I think this kind of update would have to follow an update in Hadoop as well, 
which may happen in 3.0, but that's something that would only take place far 
down the line, for Spark 3.x.
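For sbt users, the same relocation idea can be sketched with sbt-assembly as below 
(assuming the sbt-assembly plugin is enabled; Maven users would configure a corresponding 
relocation in maven-shade-plugin):
{code:java}
// build.sbt sketch: relocate protobuf classes inside the application jar so the app's
// protobuf 3.x does not clash with the protobuf 2.5.0 shipped with Spark/Hadoop.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "myapp.shaded.protobuf.@1").inAll
)
{code}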

> Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
> ---
>
> Key: SPARK-22380
> URL: https://issues.apache.org/jira/browse/SPARK-22380
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Deploy
>Affects Versions: 1.6.1, 2.2.0
> Environment: Cloudera 5.13.x
> Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354
> And anything beyond Spark 2.2.0
>Reporter: Maziyar PANAHI
>Priority: Blocker
>
> Hi,
> This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and 
> 2.2+) due to an incompatibility between the protobuf version used by Spark 
> (com.google.protobuf) and the one used in the latest Stanford CoreNLP (3.8). The 
> version of protobuf has been set to 2.5.0 in the global properties, and this 
> is stated in the pom.xml file.
> The error that refers to this dependency:
> {code:java}
> java.lang.VerifyError: Bad type on operand stack
> Exception Details:
>   Location:
> 
> com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object;
>  @3: invokevirtual
>   Reason:
> Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current 
> frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
>   Current Frame:
> bci: @3
> flags: { }
> locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
> stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 
> 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
>   Bytecode:
> 0x000: 2a2b 1cb6 0024 b0
>   at edu.stanford.nlp.simple.Document.(Document.java:433)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:118)
>   at edu.stanford.nlp.simple.Sentence.(Sentence.java:126)
>   ... 56 elided
> {code}
> Is it possible to upgrade this dependency to the latest (3.4), or is there any 
> workaround besides manually removing protobuf-java-2.5.0.jar and adding 
> protobuf-java-3.4.0.jar?
> You can follow the discussion of how this upgrade would fix the issue:
> https://github.com/stanfordnlp/CoreNLP/issues/556
> Many thanks,
> Maziyar



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-29 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223902#comment-16223902
 ] 

Ohad Raviv commented on SPARK-21657:


I switched to toArray instead of toList in the above code and got an 
improvement by a factor of 2, but the main bottleneck remains.
Now the diff in the above example between:
{code:java}
val df_exploded = df.select(expr("c1"), explode($"c_arr").as("c2"))
{code}
and:
{code:java}
val df_exploded = df.select(explode($"c_arr").as("c2"))
{code}
is 128 secs vs. 3 secs.

Again I profiled the former and saw that nearly all the time is consumed in:
org.apache.spark.unsafe.Platform.copyMemory()   97.548096   23,991 ms 
(97.5%)   

The obvious difference between the execution plans is that the former has two 
WholeStageCodegen plans and the latter just one.
I haven't fully understood the generated code, but my guess is that in the 
problematic case the generated explode code actually copies the whole array into 
every exploded row and only filters it out at the end.
Please see if you can verify this or think of a workaround.



> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> After 1000 elements per nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process

2017-10-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22375.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Test script can fail if eggs are installed by setup.py during test process
> --
>
> Key: SPARK-22375
> URL: https://issues.apache.org/jira/browse/SPARK-22375
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6
>Reporter: Joel Croteau
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Running ./dev/run-tests may install missing Python packages as part of its 
> setup process. setup.py can cache these in python/.eggs, and since the 
> lint-python script checks any file with the .py extension anywhere in the 
> Spark project, it will check files in .eggs and will fail if any of these do 
> not meet the style criteria, even though they are not part of the project. 
> lint-spark should exclude python/.eggs from its search directories.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast

2017-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223880#comment-16223880
 ] 

Apache Spark commented on SPARK-22384:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/19602

> Refine partition pruning when attribute is wrapped in Cast
> --
>
> Key: SPARK-22384
> URL: https://issues.apache.org/jira/browse/SPARK-22384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>
> The SQL below will get all partitions from the metastore, which puts much burden on 
> the metastore:
> {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}}
> {{SELECT * from test where dt=2017}}
> The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} 
> and {{HiveShim}} fails to generate a proper partition filter.
> Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in 
> my warehouse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22384:


Assignee: (was: Apache Spark)

> Refine partition pruning when attribute is wrapped in Cast
> --
>
> Key: SPARK-22384
> URL: https://issues.apache.org/jira/browse/SPARK-22384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>
> The SQL below will get all partitions from the metastore, which puts much burden on 
> the metastore:
> {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}}
> {{SELECT * from test where dt=2017}}
> The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} 
> and {{HiveShim}} fails to generate a proper partition filter.
> Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in 
> my warehouse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast

2017-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22384:


Assignee: Apache Spark

> Refine partition pruning when attribute is wrapped in Cast
> --
>
> Key: SPARK-22384
> URL: https://issues.apache.org/jira/browse/SPARK-22384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Assignee: Apache Spark
>
> The SQL below will get all partitions from the metastore, which puts much burden on 
> the metastore:
> {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}}
> {{SELECT * from test where dt=2017}}
> The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} 
> and {{HiveShim}} fails to generate a proper partition filter.
> Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in 
> my warehouse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast

2017-10-29 Thread jin xing (JIRA)
jin xing created SPARK-22384:


 Summary: Refine partition pruning when attribute is wrapped in Cast
 Key: SPARK-22384
 URL: https://issues.apache.org/jira/browse/SPARK-22384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: jin xing


The SQL below will get all partitions from the metastore, which puts much burden on 
the metastore:
{{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}}
{{SELECT * from test where dt=2017}}

The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} and 
{{HiveShim}} fails to generate a proper partition filter.

Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in 
my warehouse.
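Until pruning handles the cast, one possible workaround sketch (assuming an exact 
partition value is intended) is to compare the STRING partition column against a string 
literal, so the attribute is not wrapped in a cast and the filter can be pushed to the 
metastore:
{code:java}
// Possible workaround sketch, not a fix for HiveShim itself.
// Comparing the STRING partition column to a string literal leaves `dt` unwrapped,
// so the partition predicate can be pushed down to the metastore:
spark.sql("SELECT * FROM test WHERE dt = '2017'")

// Comparing to an integer literal makes the analyzer wrap `dt` in a cast, defeating pruning:
spark.sql("SELECT * FROM test WHERE dt = 2017")
{code}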



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process

2017-10-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223877#comment-16223877
 ] 

Hyukjin Kwon commented on SPARK-22375:
--

Fixed in https://github.com/apache/spark/pull/19597

> Test script can fail if eggs are installed by setup.py during test process
> --
>
> Key: SPARK-22375
> URL: https://issues.apache.org/jira/browse/SPARK-22375
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6
>Reporter: Joel Croteau
>Priority: Trivial
>
> Running ./dev/run-tests may install missing Python packages as part of its 
> setup process. setup.py can cache these in python/.eggs, and since the 
> lint-python script checks any file with the .py extension anywhere in the 
> Spark project, it will check files in .eggs and will fail if any of these do 
> not meet the style criteria, even though they are not part of the project. 
> lint-spark should exclude python/.eggs from its search directories.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process

2017-10-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22375:
-
External issue URL: https://github.com/pypa/setuptools/issues/391

> Test script can fail if eggs are installed by setup.py during test process
> --
>
> Key: SPARK-22375
> URL: https://issues.apache.org/jira/browse/SPARK-22375
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
> Environment: OSX 10.12.6
>Reporter: Joel Croteau
>Priority: Trivial
>
> Running ./dev/run-tests may install missing Python packages as part of its 
> setup process. setup.py can cache these in python/.eggs, and since the 
> lint-python script checks any file with the .py extension anywhere in the 
> Spark project, it will check files in .eggs and will fail if any of these do 
> not meet the style criteria, even though they are not part of the project. 
> lint-spark should exclude python/.eggs from its search directories.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org