[jira] [Assigned] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22395: Assignee: (was: Apache Spark) > Fix the behavior of timestamp values for Pandas to respect session timezone > --- > > Key: SPARK-22395 > URL: https://issues.apache.org/jira/browse/SPARK-22395 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > When converting Pandas DataFrame/Series from/to Spark DataFrame using > {{toPandas()}} or pandas udfs, timestamp values behave to respect Python > system timezone instead of session timezone. > For example, let's say we use {{"America/Los_Angeles"}} as session timezone > and have a timestamp value {{"1970-01-01 00:00:01"}} in the timezone. Btw, > I'm in Japan so Python timezone would be {{"Asia/Tokyo"}}. > The timestamp value from current {{toPandas()}} will be the following: > {noformat} > >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) > >>> as ts") > >>> df.show() > +---+ > | ts| > +---+ > |1970-01-01 00:00:01| > +---+ > >>> df.toPandas() >ts > 0 1970-01-01 17:00:01 > {noformat} > As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it > respects Python timezone. > As we discussed in https://github.com/apache/spark/pull/18664, we consider > this behavior is a bug and the value should be {{"1970-01-01 00:00:01"}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
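The 17-hour shift in the report is exactly the offset between the session timezone (America/Los_Angeles, UTC-8 in January 1970) and the reporter's system timezone (Asia/Tokyo, UTC+9). A standalone Python sketch, no Spark required (it uses only the stdlib `zoneinfo` module), showing why epoch second 28801 renders as the two different wall-clock strings from the report:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Epoch second 28801 is the long value from the report: 08:00:01 UTC.
epoch_seconds = 28801
utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# Rendered in the session timezone (what the fix says toPandas() should show):
session = utc.astimezone(ZoneInfo("America/Los_Angeles"))
print(session.strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 00:00:01

# Rendered in the reporter's system timezone (what toPandas() actually showed):
system = utc.astimezone(ZoneInfo("Asia/Tokyo"))
print(system.strftime("%Y-%m-%d %H:%M:%S"))   # 1970-01-01 17:00:01
```

The underlying instant is the same in both cases; only the timezone used for rendering differs, which is why the issue treats this as a display/conversion bug rather than data corruption.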
[jira] [Commented] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224399#comment-16224399 ] Apache Spark commented on SPARK-22395: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/19607 > Fix the behavior of timestamp values for Pandas to respect session timezone > --- > > Key: SPARK-22395 > URL: https://issues.apache.org/jira/browse/SPARK-22395 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > When converting Pandas DataFrame/Series from/to Spark DataFrame using > {{toPandas()}} or pandas udfs, timestamp values behave to respect Python > system timezone instead of session timezone. > For example, let's say we use {{"America/Los_Angeles"}} as session timezone > and have a timestamp value {{"1970-01-01 00:00:01"}} in the timezone. Btw, > I'm in Japan so Python timezone would be {{"Asia/Tokyo"}}. > The timestamp value from current {{toPandas()}} will be the following: > {noformat} > >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) > >>> as ts") > >>> df.show() > +---+ > | ts| > +---+ > |1970-01-01 00:00:01| > +---+ > >>> df.toPandas() >ts > 0 1970-01-01 17:00:01 > {noformat} > As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it > respects Python timezone. > As we discussed in https://github.com/apache/spark/pull/18664, we consider > this behavior is a bug and the value should be {{"1970-01-01 00:00:01"}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22395: Assignee: Apache Spark > Fix the behavior of timestamp values for Pandas to respect session timezone > --- > > Key: SPARK-22395 > URL: https://issues.apache.org/jira/browse/SPARK-22395 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark > > When converting Pandas DataFrame/Series from/to Spark DataFrame using > {{toPandas()}} or pandas udfs, timestamp values behave to respect Python > system timezone instead of session timezone. > For example, let's say we use {{"America/Los_Angeles"}} as session timezone > and have a timestamp value {{"1970-01-01 00:00:01"}} in the timezone. Btw, > I'm in Japan so Python timezone would be {{"Asia/Tokyo"}}. > The timestamp value from current {{toPandas()}} will be the following: > {noformat} > >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) > >>> as ts") > >>> df.show() > +---+ > | ts| > +---+ > |1970-01-01 00:00:01| > +---+ > >>> df.toPandas() >ts > 0 1970-01-01 17:00:01 > {noformat} > As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it > respects Python timezone. > As we discussed in https://github.com/apache/spark/pull/18664, we consider > this behavior is a bug and the value should be {{"1970-01-01 00:00:01"}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7019) Build docs on doc changes
[ https://issues.apache.org/jira/browse/SPARK-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224388#comment-16224388 ] Xin Lu commented on SPARK-7019: --- recent pr here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83085/consoleFull Building Unidoc API Documentation [info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Phive-thriftserver -Pflume -Pkinesis-asl -Pyarn -Pkafka-0-8 -Phive -Pmesos unidoc Using /usr/java/jdk1.8.0_60 as default JAVA_HOME. > Build docs on doc changes > - > > Key: SPARK-7019 > URL: https://issues.apache.org/jira/browse/SPARK-7019 > Project: Spark > Issue Type: New Feature > Components: Build >Reporter: Brennon York > > Currently when a pull request changes the {{docs/}} directory, the docs > aren't actually built. When a PR is submitted the {{git}} history should be > checked to see if any doc changes were made and, if so, properly build the > docs and report any issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7019) Build docs on doc changes
[ https://issues.apache.org/jira/browse/SPARK-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224386#comment-16224386 ] Xin Lu commented on SPARK-7019: --- It looks like unidoc is running on new PRs now. Maybe this can be closed now? > Build docs on doc changes > - > > Key: SPARK-7019 > URL: https://issues.apache.org/jira/browse/SPARK-7019 > Project: Spark > Issue Type: New Feature > Components: Build >Reporter: Brennon York > > Currently when a pull request changes the {{docs/}} directory, the docs > aren't actually built. When a PR is submitted the {{git}} history should be > checked to see if any doc changes were made and, if so, properly build the > docs and report any issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224374#comment-16224374 ] Ohad Raviv commented on SPARK-21657: ok i found the relevant rule: {code:java|title=Optimizer.scala.java|borderStyle=solid} // Turn off `join` for Generate if no column from it's child is used case p @ Project(_, g: Generate) if g.join && !g.outer && p.references.subsetOf(g.generatedSet) => p.copy(child = g.copy(join = false)) {code} I'm not sure yet why it doesn't work. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m). > On a recent Xeon processors. > See attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print sqlc.count() > {code} > This script generate a number of tables, with the same total number of > records across all nested collection (see `scaling` variable in loops). > `scaling` variable scales up how many nested elements in each record, but by > the same factor scales down number of records in the table. So total number > of records stays the same. > Time grows exponentially (notice log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At scaling of 50,000 (see attached pyspark script), it took 7 hours to > explode the nested collections (\!) of 8k records. > After 1000 elements in nested collection, time grows exponentially. 
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224365#comment-16224365 ] Ohad Raviv commented on SPARK-21657: After further investigation I believe that my assessment is correct: the former case creates a generator with join=true while the latter with join=false, as you can see in the plans above (I also debugged). This causes the very long array of size 100k to be duplicated 100k times and afterwards get pruned because its column is not in the final projection. I'm not sure what's the best way to address this issue - amend the generate operator according to the projection. In the meantime, in our case, I worked around it by manually adding the outer fields into each of the structs of the array and then exploding only the array. It's an ugly solution but it reduces our query time from 6 hours to about 2 minutes. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m). > On a recent Xeon processors. > See attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print sqlc.count() > {code} > This script generate a number of tables, with the same total number of > records across all nested collection (see `scaling` variable in loops). > `scaling` variable scales up how many nested elements in each record, but by > the same factor scales down number of records in the table. 
So total number > of records stays the same. > Time grows exponentially (notice log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At scaling of 50,000 (see attached pyspark script), it took 7 hours to > explode the nested collections (\!) of 8k records. > After 1000 elements in nested collection, time grows exponentially. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
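The duplication described in the comment can be modeled outside Spark. Below is a toy Python sketch (the `generate` function and its `join` flag only mirror the shape of the optimizer's Generate operator; they are not a real Spark API) showing why join=true over a large array materializes quadratic work when the array column is later pruned anyway:

```python
def generate(rows, explode_key, join):
    """Toy model of a Generate (explode) step over an array column."""
    out = []
    for row in rows:
        for elem in row[explode_key]:
            if join:
                # join=true: every output row carries a full copy of the input
                # row, including the big array itself -- quadratic cells for a
                # row holding an n-element array.
                out.append({**row, "col": elem})
            else:
                # join=false: only the generated element survives -- linear.
                out.append({"col": elem})
    return out

rows = [{"id": 1, "arr": list(range(1000))}]
joined = generate(rows, "arr", join=True)
plain = generate(rows, "arr", join=False)

# Count materialized cells; each retained array element counts individually.
def cells(rs):
    return sum(len(r["arr"]) + 2 if "arr" in r else 1 for r in rs)

print(cells(joined))  # 1002000 -- quadratic blow-up before pruning
print(cells(plain))   # 1000
```

This is consistent with the reported behavior: the cost is paid while the duplicated array column still exists, before projection pruning removes it, which is why the workaround of packing the outer fields into the structs (so only the array needs to be exploded) helps so dramatically.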
[jira] [Comment Edited] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224358#comment-16224358 ] Xin Lu edited comment on SPARK-2 at 10/30/17 4:12 AM: -- I checked the dependencies and it looks like lz4-java already updated to 1.4.0: https://github.com/apache/spark/blob/master/pom.xml#L538 lz4 1.4.0 was released august 2nd and looks like it included the patch above. This is probably resolvable now. This should be a dupe of this issue which will be fixed in 2.3.0: https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) was (Author: xynny): I checked the dependencies and it looks like lz4-java already updated to 1.4.0: https://github.com/apache/spark/blob/master/pom.xml#L538 lz4 1.4.0 was released august 2nd and looks like it included the patch above. This is probably resolvable now. This should be a dupe of this: https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) > Spark Hive tests aborted due to lz4-java on ppc64le > --- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.0 > Environment: Ubuntu 14.04 ppc64le > $ java -version > openjdk version "1.8.0_111" > OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14) > OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) >Reporter: Sonia Garudi >Priority: Minor > Labels: ppc64le > Attachments: hs_err_pid.log > > > The tests are getting aborted in Spark Hive project with the following error : > {code:borderStyle=solid} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0 > # > # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build > 1.8.0_111-8u111-b14-3~14.04.1-b14) > # Java VM: 
OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 > compressed oops) > # Problematic frame: > # V [libjvm.so+0x56f114] > {code} > In the thread log file, I found the following traces : > Event: 3669.042 Thread 0x3fff89976800 Exception 'java/lang/NoClassDefFoundError': Could not initialize class > net.jpountz.lz4.LZ4JNI> (0x00079fcda3b8) thrown at > [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp, > line 890] > This error is due to the lz4-java (version 1.3.0), which doesn’t have support > for ppc64le.PFA the thread log file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone
Takuya Ueshin created SPARK-22395: - Summary: Fix the behavior of timestamp values for Pandas to respect session timezone Key: SPARK-22395 URL: https://issues.apache.org/jira/browse/SPARK-22395 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Takuya Ueshin When converting Pandas DataFrame/Series from/to Spark DataFrame using {{toPandas()}} or pandas udfs, timestamp values behave to respect Python system timezone instead of session timezone. For example, let's say we use {{"America/Los_Angeles"}} as session timezone and have a timestamp value {{"1970-01-01 00:00:01"}} in the timezone. Btw, I'm in Japan so Python timezone would be {{"Asia/Tokyo"}}. The timestamp value from current {{toPandas()}} will be the following: {noformat} >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) as >>> ts") >>> df.show() +---+ | ts| +---+ |1970-01-01 00:00:01| +---+ >>> df.toPandas() ts 0 1970-01-01 17:00:01 {noformat} As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it respects Python timezone. As we discussed in https://github.com/apache/spark/pull/18664, we consider this behavior is a bug and the value should be {{"1970-01-01 00:00:01"}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224358#comment-16224358 ] Xin Lu edited comment on SPARK-2 at 10/30/17 4:04 AM: -- I checked the dependencies and it looks like lz4-java already updated to 1.4.0: https://github.com/apache/spark/blob/master/pom.xml#L538 lz4 1.4.0 was released august 2nd and looks like it included the patch above. This is probably resolvable now. This should be a dupe of this: https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f [SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) was (Author: xynny): I checked the dependencies and it looks like lz4-java already updated to 1.4.0: https://github.com/apache/spark/blob/master/pom.xml#L538 lz4 1.4.0 was released august 2nd and looks like it included the patch above. This is probably resolvable now. > Spark Hive tests aborted due to lz4-java on ppc64le > --- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.0 > Environment: Ubuntu 14.04 ppc64le > $ java -version > openjdk version "1.8.0_111" > OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14) > OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) >Reporter: Sonia Garudi >Priority: Minor > Labels: ppc64le > Attachments: hs_err_pid.log > > > The tests are getting aborted in Spark Hive project with the following error : > {code:borderStyle=solid} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0 > # > # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build > 1.8.0_111-8u111-b14-3~14.04.1-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 > compressed oops) > # Problematic frame: > # V [libjvm.so+0x56f114] > {code} > In the thread log file, I found the following traces : > Event: 
3669.042 Thread 0x3fff89976800 Exception 'java/lang/NoClassDefFoundError': Could not initialize class > net.jpountz.lz4.LZ4JNI> (0x00079fcda3b8) thrown at > [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp, > line 890] > This error is due to the lz4-java (version 1.3.0), which doesn’t have support > for ppc64le.PFA the thread log file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-20000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224358#comment-16224358 ] Xin Lu commented on SPARK-20000: I checked the dependencies and it looks like lz4-java already updated to 1.4.0: https://github.com/apache/spark/blob/master/pom.xml#L538 lz4 1.4.0 was released August 2nd and looks like it included the patch above. This is probably resolvable now. > Spark Hive tests aborted due to lz4-java on ppc64le > --- > > Key: SPARK-20000 > URL: https://issues.apache.org/jira/browse/SPARK-20000 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.0 > Environment: Ubuntu 14.04 ppc64le > $ java -version > openjdk version "1.8.0_111" > OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14) > OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode) >Reporter: Sonia Garudi >Priority: Minor > Labels: ppc64le > Attachments: hs_err_pid.log > > > The tests are getting aborted in Spark Hive project with the following error : > {code:borderStyle=solid} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0 > # > # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build > 1.8.0_111-8u111-b14-3~14.04.1-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 > compressed oops) > # Problematic frame: > # V [libjvm.so+0x56f114] > {code} > In the thread log file, I found the following traces : > Event: 3669.042 Thread 0x3fff89976800 Exception 'java/lang/NoClassDefFoundError': Could not initialize class > net.jpountz.lz4.LZ4JNI> (0x00079fcda3b8) thrown at > [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp, > line 890] > This error is due to the lz4-java (version 1.3.0), which doesn’t have support > for ppc64le.PFA the thread log file. 
[jira] [Commented] (SPARK-22333) ColumnReference should get higher priority than timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP)
[ https://issues.apache.org/jira/browse/SPARK-22333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224342#comment-16224342 ] Apache Spark commented on SPARK-22333: -- User 'DonnyZone' has created a pull request for this issue: https://github.com/apache/spark/pull/19606 > ColumnReference should get higher priority than > timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) > - > > Key: SPARK-22333 > URL: https://issues.apache.org/jira/browse/SPARK-22333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0 >Reporter: Feng Zhu >Assignee: Feng Zhu > Fix For: 2.3.0 > > > In our cluster, there is a table "T" with column named as "current_date". > When we select data from this column with SQL: > {code:sql} > select current_date from T > {code} > We get the wrong answer, as the column is translated as CURRENT_DATE() > function. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
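Until the parser-priority fix lands, a common workaround in affected versions is to quote the identifier so it cannot be parsed as a time function call. A sketch in Spark SQL (`T` is the table from the report, assumed to have a column literally named `current_date`):

```sql
-- Unquoted, this resolves to the CURRENT_DATE() function in affected versions:
select current_date from T;

-- Backtick-quoting forces resolution as a plain column reference:
select `current_date` from T;
```

Backticks are Spark SQL's standard identifier-quoting syntax, so the second form is unambiguous regardless of how the function-call precedence question is resolved.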
[jira] [Updated] (SPARK-21625) Add incompatible Hive UDF describe to DOC
[ https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21625: Description: SQRT: {code:sql} hive> select SQRT(-10.0); OK NULL Time taken: 0.384 seconds, Fetched: 1 row(s) {code} {code:sql} spark-sql> select SQRT(-10.0); NaN Time taken: 0.096 seconds, Fetched 1 row(s) 17/10/30 10:52:50 INFO SparkSQLCLIDriver: Time taken: 0.096 seconds, Fetched 1 row(s) spark-sql> {code} ACOS, ASIN: https://issues.apache.org/jira/browse/HIVE-17240 was: Both Hive and MySQL are null: {code:sql} hive> select SQRT(-10.0); OK NULL Time taken: 0.384 seconds, Fetched: 1 row(s) {code} {code:sql} mysql> select sqrt(-10.0); +---+ | sqrt(-10.0) | +---+ | NULL | +---+ 1 row in set (0.00 sec) {code} > Add incompatible Hive UDF describe to DOC > - > > Key: SPARK-21625 > URL: https://issues.apache.org/jira/browse/SPARK-21625 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Yuming Wang > > SQRT: > {code:sql} > hive> select SQRT(-10.0); > OK > NULL > Time taken: 0.384 seconds, Fetched: 1 row(s) > {code} > {code:sql} > spark-sql> select SQRT(-10.0); > NaN > Time taken: 0.096 seconds, Fetched 1 row(s) > 17/10/30 10:52:50 INFO SparkSQLCLIDriver: Time taken: 0.096 seconds, Fetched > 1 row(s) > spark-sql> > {code} > > ACOS, ASIN: > https://issues.apache.org/jira/browse/HIVE-17240 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
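The divergence shown above comes down to Java floating-point semantics versus Hive's SQL semantics: `java.lang.Math.sqrt(-10.0)` returns NaN, which Spark SQL surfaces directly, while Hive maps the out-of-domain input to NULL. A small Python model of the two behaviors (the function names are illustrative only, not real APIs; `None` stands in for SQL NULL):

```python
import math

def spark_style_sqrt(x):
    # Spark follows Java's Math.sqrt: out-of-domain input yields NaN.
    return math.nan if x < 0 else math.sqrt(x)

def hive_style_sqrt(x):
    # Hive instead returns NULL (modeled as None) for negative input.
    return None if x < 0 else math.sqrt(x)

print(spark_style_sqrt(-10.0))  # nan
print(hive_style_sqrt(-10.0))   # None
print(spark_style_sqrt(4.0))    # 2.0
```

The distinction matters downstream: NaN propagates through arithmetic and compares unequal to everything, whereas NULL participates in three-valued logic and is skipped by most aggregates, which is why the incompatibility is worth documenting.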
[jira] [Updated] (SPARK-21625) sqrt(negative number) should be null
[ https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21625: Component/s: (was: SQL) Documentation > sqrt(negative number) should be null > > > Key: SPARK-21625 > URL: https://issues.apache.org/jira/browse/SPARK-21625 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Yuming Wang > > Both Hive and MySQL are null: > {code:sql} > hive> select SQRT(-10.0); > OK > NULL > Time taken: 0.384 seconds, Fetched: 1 row(s) > {code} > {code:sql} > mysql> select sqrt(-10.0); > +---+ > | sqrt(-10.0) | > +---+ > | NULL | > +---+ > 1 row in set (0.00 sec) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21625) Add incompatible Hive UDF describe to DOC
[ https://issues.apache.org/jira/browse/SPARK-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-21625: Summary: Add incompatible Hive UDF describe to DOC (was: sqrt(negative number) should be null) > Add incompatible Hive UDF describe to DOC > - > > Key: SPARK-21625 > URL: https://issues.apache.org/jira/browse/SPARK-21625 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Yuming Wang > > Both Hive and MySQL are null: > {code:sql} > hive> select SQRT(-10.0); > OK > NULL > Time taken: 0.384 seconds, Fetched: 1 row(s) > {code} > {code:sql} > mysql> select sqrt(-10.0); > +---+ > | sqrt(-10.0) | > +---+ > | NULL | > +---+ > 1 row in set (0.00 sec) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
[ https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-22379. --- Resolution: Resolved > Reduce duplication setUpClass and tearDownClass in PySpark SQL tests > > > Key: SPARK-22379 > URL: https://issues.apache.org/jira/browse/SPARK-22379 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.3.0 > > > Looks there are some duplication in sql/tests.py: > {code} > diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py > index 98afae662b4..6812da6b309 100644 > --- a/python/pyspark/sql/tests.py > +++ b/python/pyspark/sql/tests.py > @@ -179,6 +179,18 @@ class MyObject(object): > self.value = value > +class ReusedSQLTestCase(ReusedPySparkTestCase): > +@classmethod > +def setUpClass(cls): > +ReusedPySparkTestCase.setUpClass() > +cls.spark = SparkSession(cls.sc) > + > +@classmethod > +def tearDownClass(cls): > +ReusedPySparkTestCase.tearDownClass() > +cls.spark.stop() > + > + > class DataTypeTests(unittest.TestCase): > # regression test for SPARK-6055 > def test_data_type_eq(self): > @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase): > self.assertRaises(TypeError, struct_field.typeName) > -class SQLTests(ReusedPySparkTestCase): > +class SQLTests(ReusedSQLTestCase): > @classmethod > def setUpClass(cls): > -ReusedPySparkTestCase.setUpClass() > +ReusedSQLTestCase.setUpClass() > cls.tempdir = tempfile.NamedTemporaryFile(delete=False) > os.unlink(cls.tempdir.name) > -cls.spark = SparkSession(cls.sc) > cls.testData = [Row(key=i, value=str(i)) for i in range(100)] > cls.df = cls.spark.createDataFrame(cls.testData) > @classmethod > def tearDownClass(cls): > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +ReusedSQLTestCase.tearDownClass() > shutil.rmtree(cls.tempdir.name, ignore_errors=True) > def 
test_sqlcontext_reuses_sparksession(self): > @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests): > self.assertTrue(os.path.exists(metastore_path)) > -class SQLTests2(ReusedPySparkTestCase): > - > -@classmethod > -def setUpClass(cls): > -ReusedPySparkTestCase.setUpClass() > -cls.spark = SparkSession(cls.sc) > - > -@classmethod > -def tearDownClass(cls): > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +class SQLTests2(ReusedSQLTestCase): > # We can't include this test into SQLTests because we will stop class's > SparkContext and cause > # other tests failed. > @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase): > @unittest.skipIf(not _have_arrow, "Arrow not installed") > -class ArrowTests(ReusedPySparkTestCase): > +class ArrowTests(ReusedSQLTestCase): > @classmethod > def setUpClass(cls): > from datetime import datetime > -ReusedPySparkTestCase.setUpClass() > +ReusedSQLTestCase.setUpClass() > # Synchronize default timezone between Python and Java > cls.tz_prev = os.environ.get("TZ", None) # save current tz if set > @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase): > os.environ["TZ"] = tz > time.tzset() > -cls.spark = SparkSession(cls.sc) > cls.spark.conf.set("spark.sql.session.timeZone", tz) > cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true") > cls.schema = StructType([ > @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase): > if cls.tz_prev is not None: > os.environ["TZ"] = cls.tz_prev > time.tzset() > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +ReusedSQLTestCase.tearDownClass() > def assertFramesEqual(self, df_with_arrow, df_without): > msg = ("DataFrame from Arrow is not equal" + > @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase): > @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not > installed") > -class VectorizedUDFTests(ReusedPySparkTestCase): > - > -@classmethod > -def setUpClass(cls): > 
-ReusedPySparkTestCase.setUpClass() > -cls.spark = SparkSession(cls.sc) > - > -@classmethod > -def tearDownClass(cls): > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +class VectorizedUDFTests(ReusedSQLTestCase): > def test_vectorized_udf_basic(self): > from pyspark.sql.functions import pandas_udf, col > @@ -3478,16
[jira] [Commented] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
[ https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224333#comment-16224333 ] Takuya Ueshin commented on SPARK-22379: --- Issue resolved by pull request 19595 https://github.com/apache/spark/pull/19595 > Reduce duplication setUpClass and tearDownClass in PySpark SQL tests > > > Key: SPARK-22379 > URL: https://issues.apache.org/jira/browse/SPARK-22379 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.3.0 > > > Looks there are some duplication in sql/tests.py: > {code} > diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py > index 98afae662b4..6812da6b309 100644 > --- a/python/pyspark/sql/tests.py > +++ b/python/pyspark/sql/tests.py > @@ -179,6 +179,18 @@ class MyObject(object): > self.value = value > +class ReusedSQLTestCase(ReusedPySparkTestCase): > +@classmethod > +def setUpClass(cls): > +ReusedPySparkTestCase.setUpClass() > +cls.spark = SparkSession(cls.sc) > + > +@classmethod > +def tearDownClass(cls): > +ReusedPySparkTestCase.tearDownClass() > +cls.spark.stop() > + > + > class DataTypeTests(unittest.TestCase): > # regression test for SPARK-6055 > def test_data_type_eq(self): > @@ -214,21 +226,19 @@ class DataTypeTests(unittest.TestCase): > self.assertRaises(TypeError, struct_field.typeName) > -class SQLTests(ReusedPySparkTestCase): > +class SQLTests(ReusedSQLTestCase): > @classmethod > def setUpClass(cls): > -ReusedPySparkTestCase.setUpClass() > +ReusedSQLTestCase.setUpClass() > cls.tempdir = tempfile.NamedTemporaryFile(delete=False) > os.unlink(cls.tempdir.name) > -cls.spark = SparkSession(cls.sc) > cls.testData = [Row(key=i, value=str(i)) for i in range(100)] > cls.df = cls.spark.createDataFrame(cls.testData) > @classmethod > def tearDownClass(cls): > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +ReusedSQLTestCase.tearDownClass() > 
shutil.rmtree(cls.tempdir.name, ignore_errors=True) > def test_sqlcontext_reuses_sparksession(self): > @@ -2623,17 +2633,7 @@ class HiveSparkSubmitTests(SparkSubmitTests): > self.assertTrue(os.path.exists(metastore_path)) > -class SQLTests2(ReusedPySparkTestCase): > - > -@classmethod > -def setUpClass(cls): > -ReusedPySparkTestCase.setUpClass() > -cls.spark = SparkSession(cls.sc) > - > -@classmethod > -def tearDownClass(cls): > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +class SQLTests2(ReusedSQLTestCase): > # We can't include this test into SQLTests because we will stop class's > SparkContext and cause > # other tests failed. > @@ -3082,12 +3082,12 @@ class DataTypeVerificationTests(unittest.TestCase): > @unittest.skipIf(not _have_arrow, "Arrow not installed") > -class ArrowTests(ReusedPySparkTestCase): > +class ArrowTests(ReusedSQLTestCase): > @classmethod > def setUpClass(cls): > from datetime import datetime > -ReusedPySparkTestCase.setUpClass() > +ReusedSQLTestCase.setUpClass() > # Synchronize default timezone between Python and Java > cls.tz_prev = os.environ.get("TZ", None) # save current tz if set > @@ -3095,7 +3095,6 @@ class ArrowTests(ReusedPySparkTestCase): > os.environ["TZ"] = tz > time.tzset() > -cls.spark = SparkSession(cls.sc) > cls.spark.conf.set("spark.sql.session.timeZone", tz) > cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true") > cls.schema = StructType([ > @@ -3116,8 +3115,7 @@ class ArrowTests(ReusedPySparkTestCase): > if cls.tz_prev is not None: > os.environ["TZ"] = cls.tz_prev > time.tzset() > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +ReusedSQLTestCase.tearDownClass() > def assertFramesEqual(self, df_with_arrow, df_without): > msg = ("DataFrame from Arrow is not equal" + > @@ -3169,17 +3167,7 @@ class ArrowTests(ReusedPySparkTestCase): > @unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not > installed") > -class VectorizedUDFTests(ReusedPySparkTestCase): > - > 
-@classmethod > -def setUpClass(cls): > -ReusedPySparkTestCase.setUpClass() > -cls.spark = SparkSession(cls.sc) > - > -@classmethod > -def tearDownClass(cls): > -ReusedPySparkTestCase.tearDownClass() > -cls.spark.stop() > +class VectorizedUDFTests(ReusedSQLTestCase): > def
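The refactoring in the diff above boils down to a single shared base class that owns the session lifecycle. A minimal standalone sketch of the pattern, using plain `unittest` with a stubbed session object standing in for `SparkSession` (this is a simplified model, not the actual pyspark test harness):

```python
import unittest


class FakeSession:
    """Stand-in for SparkSession; tracks whether stop() was called."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class ReusedSQLTestCase(unittest.TestCase):
    # Hypothetical mirror of the ReusedSQLTestCase added in the diff:
    # every subclass inherits one session created once per test class.
    @classmethod
    def setUpClass(cls):
        cls.spark = FakeSession()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()


class SQLTests(ReusedSQLTestCase):
    # Subclasses add only their own fixtures and delegate to super(),
    # instead of repeating the session setup/teardown boilerplate.
    @classmethod
    def setUpClass(cls):
        super(SQLTests, cls).setUpClass()
        cls.testData = [(i, str(i)) for i in range(100)]

    def test_session_is_shared(self):
        self.assertFalse(self.spark.stopped)
```

The point of the change is that `SQLTests`, `SQLTests2`, `ArrowTests`, and `VectorizedUDFTests` all stop redefining the same `setUpClass`/`tearDownClass` pair.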
[jira] [Updated] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
[ https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-22379: -- Fix Version/s: 2.3.0 > Reduce duplication setUpClass and tearDownClass in PySpark SQL tests > > > Key: SPARK-22379 > URL: https://issues.apache.org/jira/browse/SPARK-22379 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.3.0
[jira] [Assigned] (SPARK-22379) Reduce duplication setUpClass and tearDownClass in PySpark SQL tests
[ https://issues.apache.org/jira/browse/SPARK-22379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-22379: - Assignee: Hyukjin Kwon > Reduce duplication setUpClass and tearDownClass in PySpark SQL tests > > > Key: SPARK-22379 > URL: https://issues.apache.org/jira/browse/SPARK-22379 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial
[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp
[ https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224301#comment-16224301 ] Shivaram Venkataraman commented on SPARK-22344: --- well uninstall is just removing `sparkCachePath()/` -- Should be relatively easy to put together ? > Prevent R CMD check from using /tmp > --- > > Key: SPARK-22344 > URL: https://issues.apache.org/jira/browse/SPARK-22344 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0 >Reporter: Shivaram Venkataraman > > When R CMD check is run on the SparkR package it leaves behind files in /tmp > which is a violation of CRAN policy. We should instead write to Rtmpdir. > Notes from CRAN are below > {code} > Checking this leaves behind dirs >hive/$USER >$USER > and files named like >b4f6459b-0624-4100-8358-7aa7afbda757_resources > in /tmp, in violation of the CRAN Policy. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22394) Redundant synchronization for metastore access
[ https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22394: Assignee: (was: Apache Spark) > Redundant synchronization for metastore access > -- > > Key: SPARK-22394 > URL: https://issues.apache.org/jira/browse/SPARK-22394 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Before Spark 2.x, synchronization for metastore access was protected at > [line229 in ClientWrapper > |https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229] > (now it's at [line203 in HiveClientWrapper > |https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]). > After Spark 2.x, HiveExternalCatalog was introduced by > [SPARK-13080|https://github.com/apache/spark/pull/11293], where an extra > level of synchronization was added at > [line95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95]. > That is, now we have two levels of synchronization: one is > HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl. > But since both HiveExternalCatalog and IsolatedClientLoader are shared among > all spark sessions, I think the extra level of synchronization in > HiveExternalCatalog is redundant, thus can be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
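The two synchronization layers described above can be sketched in plain Python (a simplified model of the Scala code, with hypothetical class names mirroring the real ones): the client already serializes every metastore call with its own lock, so a second lock in the catalog wrapper adds contention without adding safety.

```python
import threading


class MetastoreClient:
    """Models HiveClientImpl: serializes every metastore call itself."""
    def __init__(self):
        self._lock = threading.Lock()   # inner synchronization
        self.calls = 0

    def get_table(self, name):
        with self._lock:
            self.calls += 1
            return {"name": name}


class ExternalCatalog:
    """Models HiveExternalCatalog: wraps the client in a second lock."""
    def __init__(self, client):
        self._lock = threading.Lock()   # outer, redundant synchronization
        self._client = client

    def get_table(self, name):
        # The inner lock already makes this call thread-safe, so the
        # outer lock only forces callers to queue twice.
        with self._lock:
            return self._client.get_table(name)
```

Since both objects are shared across all sessions, removing the outer lock (as the issue proposes) leaves the inner one as the single point of serialization.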
[jira] [Commented] (SPARK-22394) Redundant synchronization for metastore access
[ https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224296#comment-16224296 ] Apache Spark commented on SPARK-22394: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/19605 > Redundant synchronization for metastore access > -- > > Key: SPARK-22394 > URL: https://issues.apache.org/jira/browse/SPARK-22394 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang
[jira] [Assigned] (SPARK-22394) Redundant synchronization for metastore access
[ https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22394: Assignee: Apache Spark > Redundant synchronization for metastore access > -- > > Key: SPARK-22394 > URL: https://issues.apache.org/jira/browse/SPARK-22394 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark
[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224289#comment-16224289 ] Apache Spark commented on SPARK-22291: -- User 'jmchung' has created a pull request for this issue: https://github.com/apache/spark/pull/19604 > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter >Assignee: Jen-Ming Chung > Labels: patch, postgresql, sql > Fix For: 2.3.0 > > Attachments: > org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png > > > My job reads data from a PostgreSQL table that contains columns of user_ids > uuid[] type, so that I'm getting the error above when I'm trying to save data > on Cassandra. > However, the creation of this same table on Cassandra works fine! user_ids > list. > I can't change the type on the source table, because I'm reading data from a > legacy system. 
> I've been looking at point printed on log, on class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
> at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at
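The `ClassCastException` above comes from the JDBC reader assuming the array elements are already strings. In Python terms the failure mode is analogous to treating `uuid.UUID` objects as `str`; the sketch below is a hypothetical illustration of the fix (convert each element explicitly instead of casting the whole array), not the actual `JdbcUtils` code:

```python
import uuid


def to_string_array(elements):
    # Hypothetical analogue of the JdbcUtils fix: instead of assuming the
    # JDBC array already holds strings (the blind cast that raised
    # ClassCastException for uuid[] columns), convert element by element,
    # preserving SQL NULLs.
    return [None if e is None else str(e) for e in elements]


uuids = [uuid.UUID("b4f6459b-0624-4100-8358-7aa7afbda757"), None]
converted = to_string_array(uuids)
```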
[jira] [Commented] (SPARK-22365) Spark UI executors empty list with 500 error
[ https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224284#comment-16224284 ] guoxiaolongzte commented on SPARK-22365: You need to provide a snapshot to help other people understand your reason, thank you. > Spark UI executors empty list with 500 error > > > Key: SPARK-22365 > URL: https://issues.apache.org/jira/browse/SPARK-22365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > > No data loaded on "execturos" tab in sparkUI with stack trace below. Apart > from exception I have nothing more. But if I can test something to make this > easier to resolve I am happy to help. > {{java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748)}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite
[ https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-22308: -- > Support unit tests of spark code using ScalaTest using suites other than > FunSuite > - > > Key: SPARK-22308 > URL: https://issues.apache.org/jira/browse/SPARK-22308 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core, SQL, Tests >Affects Versions: 2.2.0 >Reporter: Nathan Kronenfeld >Assignee: Nathan Kronenfeld >Priority: Minor > Labels: scalatest, test-suite, test_issue > Fix For: 2.3.0 > > > External codebases that have spark code can test it using SharedSparkContext, > no matter how they write their scalatests - basing on FunSuite, FunSpec, > FlatSpec, or WordSpec. > SharedSQLContext only supports FunSuite. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224248#comment-16224248 ] Jen-Ming Chung commented on SPARK-22291: Thank you all :) > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter >Assignee: Jen-Ming Chung > Labels: patch, postgresql, sql > Fix For: 2.3.0
[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224247#comment-16224247 ] Liang-Chi Hsieh commented on SPARK-22291: - Thanks [~hyukjin.kwon]. > Postgresql UUID[] to Cassandra: Conversion Error > > > Key: SPARK-22291 > URL: https://issues.apache.org/jira/browse/SPARK-22291 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, > Cassandra 3 >Reporter: Fabio J. Walter >Assignee: Jen-Ming Chung > Labels: patch, postgresql, sql > Fix For: 2.3.0
> at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at
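The {{ClassCastException}} in the stack trace above comes from treating the JDBC driver's uuid[] result as if it were an array of strings. The failure mode, and the usual fix of converting each element explicitly instead of casting the container, can be sketched in Python as a hypothetical analogy (this is not Spark's actual JdbcUtils code):

```python
import uuid

# A PostgreSQL driver hands back uuid[] column values as UUID objects,
# not as strings.
rows = [uuid.UUID("00000000-0000-0000-0000-000000000001"),
        uuid.UUID("00000000-0000-0000-0000-000000000002")]

def as_strings_buggy(values):
    # Buggy assumption: the elements are already strings
    # (the analogue of the unchecked cast in the stack trace).
    return ",".join(values)

def as_strings_fixed(values):
    # Fix: stringify each element rather than assuming its runtime type.
    return ",".join(str(v) for v in values)

try:
    as_strings_buggy(rows)
except TypeError as e:
    print("buggy path failed:", type(e).__name__)

print(as_strings_fixed(rows))
```

The point is the same in both languages: an element-wise conversion is needed wherever the source type (UUID) differs from the expected type (String), because a blanket cast of the whole array cannot change the elements' runtime type.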
[jira] [Assigned] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22291: Assignee: Jen-Ming Chung (was: Fabio J. Walter)
[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224246#comment-16224246 ] Hyukjin Kwon commented on SPARK-22291: -- I happened to see this comment first and just updated.
[jira] [Commented] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224245#comment-16224245 ] Liang-Chi Hsieh commented on SPARK-22291: - [~cloud_fan] The Assignee should be [~jmchung]. Thanks.
[jira] [Reopened] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-15689: - > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Wenchen Fan > Labels: SPIP, releasenotes > Fix For: 2.3.0 > > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
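The design goals listed in the ticket (a small API surface, a column-batch interface, and filter pushdown) can be illustrated with a toy reader. This is a hypothetical sketch of the shape of such an interface, not the actual Data Source V2 API:

```python
# Hypothetical "small surface" reader: the engine only ever calls
# push_filters() and read_batches(); everything else is source-private,
# so the engine and the source can evolve independently.
class ToyReader:
    def __init__(self, rows):
        self._rows = rows        # row dicts standing in for source data
        self._filters = []

    def push_filters(self, filters):
        # Accept the filters this source can evaluate itself;
        # return the remainder for the engine to apply post-scan.
        self._filters = filters
        return []

    def read_batches(self, batch_size=2):
        # Column-oriented batches: dict of column name -> list of values,
        # instead of one row object per record.
        rows = [r for r in self._rows
                if all(f(r) for f in self._filters)]
        for i in range(0, len(rows), batch_size):
            chunk = rows[i:i + batch_size]
            yield {k: [r[k] for r in chunk] for k in chunk[0]}

reader = ToyReader([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}])
reader.push_filters([lambda r: r["id"] > 1])
batches = list(reader.read_batches())
print(batches)  # [{'id': [2, 3], 'v': ['b', 'c']}]
```

A two-method surface like this is easy to freeze and keep compatible, and the column-batch return shape avoids the per-row external/internal type conversion the ticket calls out in the V1 API.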
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224224#comment-16224224 ] Wenchen Fan commented on SPARK-15689: - Ah, I missed this one; reopening the ticket. My concern is that follow-ups should not block the 2.3 release, while the basic data source v2 infrastructure should.
[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224194#comment-16224194 ] Sean Owen commented on SPARK-22393: --- I'm guessing it's something to do with how it overrides the shell initialization or classloader. It could be worth trying the 2.12 build and shell as the shell integration is a little less hacky. But really no idea off the top of my head. > spark-shell can't find imported types in class constructors, extends clause > --- > > Key: SPARK-22393 > URL: https://issues.apache.org/jira/browse/SPARK-22393 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > {code} > $ spark-shell > … > scala> import org.apache.spark.Partition > import org.apache.spark.Partition > scala> class P(p: Partition) > :11: error: not found: type Partition >class P(p: Partition) > ^ > scala> class P(val index: Int) extends Partition > :11: error: not found: type Partition >class P(val index: Int) extends Partition >^ > {code} > Any class that I {{import}} gives "not found: type ___" when used as a > parameter to a class, or in an extends clause; this applies to classes I > import from JARs I provide via {{--jars}} as well as core Spark classes as > above. > This worked in 1.6.3 but has been broken since 2.0.0. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224158#comment-16224158 ] Ryan Williams commented on SPARK-22393: --- Everything works fine in a Scala shell ({{scala -cp $SPARK_HOME/jars/spark-core_2.11-2.2.0.jar}}) and via {{sbt console}} in a project that depends on Spark, so the problem seems specific to {{spark-shell}}.
[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp
[ https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224139#comment-16224139 ] Felix Cheung commented on SPARK-22344: -- Kinda we don't have any uninstall feature though > Prevent R CMD check from using /tmp > --- > > Key: SPARK-22344 > URL: https://issues.apache.org/jira/browse/SPARK-22344 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0 >Reporter: Shivaram Venkataraman > > When R CMD check is run on the SparkR package it leaves behind files in /tmp > which is a violation of CRAN policy. We should instead write to Rtmpdir. > Notes from CRAN are below > {code} > Checking this leaves behind dirs >hive/$USER >$USER > and files named like >b4f6459b-0624-4100-8358-7aa7afbda757_resources > in /tmp, in violation of the CRAN Policy. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
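The CRAN complaint above is about scratch files landing in the shared /tmp instead of the check's own Rtmpdir. The general technique — routing a library's temp-file helpers into a session-scoped directory that is removed in one sweep — can be sketched in Python as an analogy (the actual fix would live in SparkR's configuration, not in this code):

```python
import os
import shutil
import tempfile

# Create a session-scoped scratch dir and route temp files into it,
# instead of littering the shared /tmp.
scratch = tempfile.mkdtemp(prefix="session-scratch-")
tempfile.tempdir = scratch  # module-level override used by mkstemp & friends

fd, path = tempfile.mkstemp(suffix=".resources")
os.close(fd)
created_inside = path.startswith(scratch)
print(created_inside)  # True: nothing lands directly in /tmp

# Teardown removes everything the session created in one sweep,
# then restores the default temp-dir resolution.
shutil.rmtree(scratch)
tempfile.tempdir = None
```

The same pattern applies to the R case: point every temp-file call at a directory the harness owns, so cleanup is a single recursive delete and nothing is left behind for the policy checker to find.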
[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS
[ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224124#comment-16224124 ] Matteo Cossu commented on SPARK-2465: - For example, with this limitation it is not possible to use _monotonically_increasing_id_ to generate the ids, since they are longs. Therefore, one should fall back to the RDD API and use _zipWithIndex_. > Use long as user / item ID for ALS > -- > > Key: SPARK-2465 > URL: https://issues.apache.org/jira/browse/SPARK-2465 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 1.0.1 > Reporter: Sean Owen > Priority: Minor > Attachments: ALS using MEMORY_AND_DISK.png, ALS using > MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png > > > I'd like to float this for consideration: use longs instead of ints for user > and product IDs in the ALS implementation. > The main reason is that identifiers are not generally numeric at all, and > will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits > means collisions are likely after hundreds of thousands of users and items, > which is not unrealistic. Hashing to 64 bits pushes this back to billions. > It would also mean numeric IDs that happen to be larger than the largest int > can be used directly as identifiers. > On the downside of course: 8 bytes instead of 4 bytes of memory used per > Rating. > Thoughts? I will post a PR so as to show what the change would be. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
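The claim in the issue that 32-bit hashed IDs make "collisions likely after hundreds of thousands" of items follows from the birthday bound. A quick check, using the standard approximation p ≈ 1 − exp(−n² / 2^(b+1)) for n items hashed uniformly to b bits:

```python
import math

def collision_prob(n, bits):
    # Birthday approximation: probability that at least two of n items
    # collide when hashed uniformly into 2**bits buckets.
    return 1.0 - math.exp(-n * n / float(2 ** (bits + 1)))

# 32-bit IDs: roughly a coin flip already at ~77k items,
# near-certain collisions by a few hundred thousand.
print(round(collision_prob(77_000, 32), 2))           # ~0.5
print(round(collision_prob(500_000, 32), 2))          # ~1.0
# 64-bit IDs push the same risk out to billions of items.
print(round(collision_prob(1_000_000_000, 64), 3))    # ~0.027
```

This is exactly the trade the ticket describes: doubling the ID width from 4 to 8 bytes per Rating buys roughly 2^16 times more headroom before collisions become probable.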
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224122#comment-16224122 ] Reynold Xin commented on SPARK-15689: - Why not put all of them as subtasks here? Also https://issues.apache.org/jira/browse/SPARK-22078 is not done.
[jira] [Commented] (SPARK-22394) Redundant synchronization for metastore access
[ https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224120#comment-16224120 ] Wenchen Fan commented on SPARK-22394: - looks like so, can you send a PR? thanks! > Redundant synchronization for metastore access > -- > > Key: SPARK-22394 > URL: https://issues.apache.org/jira/browse/SPARK-22394 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Before Spark 2.x, synchronization for metastore access was protected at > [line229 in ClientWrapper > |https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229] > (now it's at [line203 in HiveClientWrapper > |https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]). > After Spark 2.x, HiveExternalCatalog was introduced by > [SPARK-13080|https://github.com/apache/spark/pull/11293], where an extra > level of synchronization was added at > [line95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95]. > That is, now we have two levels of synchronization: one is > HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl. > But since both HiveExternalCatalog and IsolatedClientLoader are shared among > all spark sessions, I think the extra level of synchronization in > HiveExternalCatalog is redundant, thus can be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
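The redundancy argument in the ticket — mutual exclusion is already guaranteed by the inner, process-wide lock, so the outer per-catalog lock adds nothing — can be sketched with two lock levels. This is a hypothetical analogy; the class names below are just labels, not Spark's code:

```python
import threading

class Client:
    # Inner level: one shared lock, like the synchronization inside the
    # shared Hive client (IsolatedClientLoader in the ticket).
    _lock = threading.Lock()

    def __init__(self):
        self.counter = 0

    def call(self):
        with Client._lock:
            v = self.counter        # critical section: read-modify-write
            self.counter = v + 1

class Catalog:
    # Outer level: a per-catalog lock here would be redundant, because
    # every access already funnels through Client._lock.
    def __init__(self, client):
        self.client = client

    def op(self):
        # No extra lock taken -- Client.call() is already serialized.
        self.client.call()

shared = Client()
catalogs = [Catalog(shared) for _ in range(4)]
threads = [threading.Thread(target=lambda c=c: [c.op() for _ in range(1000)])
           for c in catalogs]
for t in threads: t.start()
for t in threads: t.join()
print(shared.counter)  # 4000: race-free with only the inner lock
```

Since both lock levels guard the same shared client, the outer one only adds contention and lock-ordering risk without strengthening any invariant, which is the ticket's case for removing it.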
[jira] [Resolved] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22291. - Resolution: Fixed Assignee: Fabio J. Walter Fix Version/s: 2.3.0
> I've been looking at point printed on log, on class > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala > Stacktrace on Spark: > {noformat} > Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to > [Ljava.lang.String; > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
> at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at
[jira] [Updated] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22393: -- Priority: Minor (was: Major) > spark-shell can't find imported types in class constructors, extends clause > --- > > Key: SPARK-22393 > URL: https://issues.apache.org/jira/browse/SPARK-22393 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > {code} > $ spark-shell > … > scala> import org.apache.spark.Partition > import org.apache.spark.Partition > scala> class P(p: Partition) > :11: error: not found: type Partition >class P(p: Partition) > ^ > scala> class P(val index: Int) extends Partition > :11: error: not found: type Partition >class P(val index: Int) extends Partition >^ > {code} > Any class that I {{import}} gives "not found: type ___" when used as a > parameter to a class, or in an extends clause; this applies to classes I > import from JARs I provide via {{--jars}} as well as core Spark classes as > above. > This worked in 1.6.3 but has been broken since 2.0.0. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224040#comment-16224040 ] Sean Owen commented on SPARK-22393: --- That's a weird one. {{class P(p: org.apache.spark.Partition)}} works fine as does {{ {import org.apache.spark.Partition; class P(p: Partition)} }}. I think this is some subtlety of how the scala shell interpreter works. > spark-shell can't find imported types in class constructors, extends clause > --- > > Key: SPARK-22393 > URL: https://issues.apache.org/jira/browse/SPARK-22393 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Ryan Williams > > {code} > $ spark-shell > … > scala> import org.apache.spark.Partition > import org.apache.spark.Partition > scala> class P(p: Partition) > :11: error: not found: type Partition >class P(p: Partition) > ^ > scala> class P(val index: Int) extends Partition > :11: error: not found: type Partition >class P(val index: Int) extends Partition >^ > {code} > Any class that I {{import}} gives "not found: type ___" when used as a > parameter to a class, or in an extends clause; this applies to classes I > import from JARs I provide via {{--jars}} as well as core Spark classes as > above. > This worked in 1.6.3 but has been broken since 2.0.0. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22394) Redundant synchronization for metastore access
[ https://issues.apache.org/jira/browse/SPARK-22394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224028#comment-16224028 ] Zhenhua Wang commented on SPARK-22394: -- [~cloud_fan] [~smilegator] [~rxin] Do I understand it correctly, or do I miss something? > Redundant synchronization for metastore access > -- > > Key: SPARK-22394 > URL: https://issues.apache.org/jira/browse/SPARK-22394 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Before Spark 2.x, synchronization for metastore access was protected at > [line229 in ClientWrapper > |https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229] > (now it's at [line203 in HiveClientWrapper > |https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]). > After Spark 2.x, HiveExternalCatalog was introduced by > [SPARK-13080|https://github.com/apache/spark/pull/11293], where an extra > level of synchronization was added at > [line95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95]. > That is, now we have two levels of synchronization: one is > HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl. > But since both HiveExternalCatalog and IsolatedClientLoader are shared among > all spark sessions, I think the extra level of synchronization in > HiveExternalCatalog is redundant, thus can be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22394) Redundant synchronization for metastore access
Zhenhua Wang created SPARK-22394: Summary: Redundant synchronization for metastore access Key: SPARK-22394 URL: https://issues.apache.org/jira/browse/SPARK-22394 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang Before Spark 2.x, synchronization for metastore access was protected at [line229 in ClientWrapper |https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L229] (now it's at [line203 in HiveClientWrapper |https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L203]). After Spark 2.x, HiveExternalCatalog was introduced by [SPARK-13080|https://github.com/apache/spark/pull/11293], where an extra level of synchronization was added at [line95|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L95]. That is, now we have two levels of synchronization: one is HiveExternalCatalog and the other is IsolatedClientLoader in HiveClientImpl. But since both HiveExternalCatalog and IsolatedClientLoader are shared among all spark sessions, I think the extra level of synchronization in HiveExternalCatalog is redundant, thus can be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
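The redundancy argued in SPARK-22394 can be illustrated with a minimal pure-Python sketch. The class names below are stand-ins, not Spark's actual classes: because both the "external catalog" and the "Hive client" are process-wide singletons shared by every session, the outer lock in the catalog serializes exactly the same set of callers as the inner lock in the client, so one of the two levels adds nothing.

```python
import threading

class HiveClient:
    """Stand-in for the IsolatedClientLoader/HiveClientImpl level: one shared instance."""
    def __init__(self):
        self._lock = threading.Lock()  # level 2: client-side synchronization
        self.calls = 0

    def get_table(self, name):
        with self._lock:
            self.calls += 1
            return f"table:{name}"

class ExternalCatalog:
    """Stand-in for the HiveExternalCatalog level: also one shared instance."""
    def __init__(self, client):
        self._lock = threading.Lock()  # level 1: the arguably redundant lock
        self._client = client

    def get_table(self, name):
        with self._lock:
            return self._client.get_table(name)

client = HiveClient()
catalog = ExternalCatalog(client)

# Every "session" goes through the same catalog and the same client, so
# removing the catalog-level lock would not change which calls can race:
# the client-level lock already serializes them all.
results = []
threads = [threading.Thread(target=lambda: results.append(catalog.get_table("t")))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(client.calls)  # 4
```

This only models the locking topology; whether dropping the outer level is safe in Spark also depends on what other invariants HiveExternalCatalog's synchronization protects.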
[jira] [Updated] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-22393: -- Affects Version/s: (was: 2.0.0) 2.0.2 > spark-shell can't find imported types in class constructors, extends clause > --- > > Key: SPARK-22393 > URL: https://issues.apache.org/jira/browse/SPARK-22393 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Ryan Williams > > {code} > $ spark-shell > … > scala> import org.apache.spark.Partition > import org.apache.spark.Partition > scala> class P(p: Partition) > :11: error: not found: type Partition >class P(p: Partition) > ^ > scala> class P(val index: Int) extends Partition > :11: error: not found: type Partition >class P(val index: Int) extends Partition >^ > {code} > Any class that I {{import}} gives "not found: type ___" when used as a > parameter to a class, or in an extends clause; this applies to classes I > import from JARs I provide via {{--jars}} as well as core Spark classes as > above. > This worked in 1.6.3 but has been broken since 2.0.0. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
Ryan Williams created SPARK-22393: - Summary: spark-shell can't find imported types in class constructors, extends clause Key: SPARK-22393 URL: https://issues.apache.org/jira/browse/SPARK-22393 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.2.0, 2.1.2, 2.0.0 Reporter: Ryan Williams {code} $ spark-shell … scala> import org.apache.spark.Partition import org.apache.spark.Partition scala> class P(p: Partition) :11: error: not found: type Partition class P(p: Partition) ^ scala> class P(val index: Int) extends Partition :11: error: not found: type Partition class P(val index: Int) extends Partition ^ {code} Any class that I {{import}} gives "not found: type ___" when used as a parameter to a class, or in an extends clause; this applies to classes I import from JARs I provide via {{--jars}} as well as core Spark classes as above. This worked in 1.6.3 but has been broken since 2.0.0. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15689. - Resolution: Fixed Fix Version/s: 2.3.0 > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Wenchen Fan > Labels: SPIP, releasenotes > Fix For: 2.3.0 > > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223988#comment-16223988 ] Wenchen Fan commented on SPARK-15689: - The basic read/write interfaces are done, I'm resolving this ticket and track the follow-ups in https://issues.apache.org/jira/browse/SPARK-22386 > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Wenchen Fan > Labels: SPIP, releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
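Goal 2 above (a column batch interface with convenience conversions from row-oriented formats) can be sketched in a few lines of plain Python. The helper name and layout here are illustrative, not Spark's actual API: a "column batch" is simply one contiguous sequence per column instead of one record per row.

```python
def rows_to_column_batch(rows, schema):
    """Convert row-oriented records (list of tuples) into a column batch:
    a dict mapping each column name in `schema` to a list of its values."""
    columns = {name: [] for name in schema}
    for row in rows:
        for name, value in zip(schema, row):
            columns[name].append(value)
    return columns

batch = rows_to_column_batch([(1, "a"), (2, "b"), (3, "c")], schema=["id", "val"])
print(batch)  # {'id': [1, 2, 3], 'val': ['a', 'b', 'c']}
```

The columnar layout is what lets the engine scan a single column without touching the others and avoid per-row external-to-internal type conversion, which is the cost the v1 API pays.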
[jira] [Created] (SPARK-22392) columnar reader interface
Wenchen Fan created SPARK-22392: --- Summary: columnar reader interface Key: SPARK-22392 URL: https://issues.apache.org/jira/browse/SPARK-22392 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22391) add `MetadataCreationSupport` trait to separate data and metadata handling at write path
[ https://issues.apache.org/jira/browse/SPARK-22391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22391: Description: please refer to the discussion in the dev list with this email: *[discuss] Data Source V2 write path* (was: please refer to the discussion and the dev list with this email: *[discuss] Data Source V2 write path*) > add `MetadataCreationSupport` trait to separate data and metadata handling at > write path > > > Key: SPARK-22391 > URL: https://issues.apache.org/jira/browse/SPARK-22391 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > please refer to the discussion in the dev list with this email: *[discuss] > Data Source V2 write path* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223987#comment-16223987 ] Wenchen Fan commented on SPARK-21657: - I'd say they are different issues, and I haven't figured out the reason for this issue yet, and wanna fix that small issue first. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m). > On a recent Xeon processors. > See attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print sqlc.count() > {code} > This script generate a number of tables, with the same total number of > records across all nested collection (see `scaling` variable in loops). > `scaling` variable scales up how many nested elements in each record, but by > the same factor scales down number of records in the table. So total number > of records stays the same. > Time grows exponentially (notice log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At scaling of 50,000 (see attached pyspark script), it took 7 hours to > explode the nested collections (\!) of 8k records. > After 1000 elements in nested collection, time grows exponentially. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22391) add `MetadataCreationSupport` trait to separate data and metadata handling at write path
[ https://issues.apache.org/jira/browse/SPARK-22391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22391: Description: please refer to the discussion and the dev list with this email: *[discuss] Data Source V2 write path* > add `MetadataCreationSupport` trait to separate data and metadata handling at > write path > > > Key: SPARK-22391 > URL: https://issues.apache.org/jira/browse/SPARK-22391 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > please refer to the discussion and the dev list with this email: *[discuss] > Data Source V2 write path* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22391) add `MetadataCreationSupport` trait to separate data and metadata handling at write path
Wenchen Fan created SPARK-22391: --- Summary: add `MetadataCreationSupport` trait to separate data and metadata handling at write path Key: SPARK-22391 URL: https://issues.apache.org/jira/browse/SPARK-22391 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22389) partitioning reporting
[ https://issues.apache.org/jira/browse/SPARK-22389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22389: --- Assignee: Wenchen Fan > partitioning reporting > -- > > Key: SPARK-22389 > URL: https://issues.apache.org/jira/browse/SPARK-22389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > We should allow data source to report partitioning and avoid shuffle at Spark > side -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22390) Aggregate push down
Wenchen Fan created SPARK-22390: --- Summary: Aggregate push down Key: SPARK-22390 URL: https://issues.apache.org/jira/browse/SPARK-22390 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22389) partitioning reporting
Wenchen Fan created SPARK-22389: --- Summary: partitioning reporting Key: SPARK-22389 URL: https://issues.apache.org/jira/browse/SPARK-22389 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan We should allow data source to report partitioning and avoid shuffle at Spark side -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options
[ https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22387: Description: This is an open discussion. The general idea is we should allow users to set some common configs in session conf so that they don't need to type them again and again for each data source operations. Proposal 1: propagate every session config which starts with {{spark.datasource.config.}} to data source options. The downside is, users may only want to set some common configs for a specific data source. Proposal 2: propagate session config which starts with {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. One downside is, some data source may not have a short name and makes the config key pretty long, e.g. {{spark.datasource.config.com.company.foo.bar.key1}}. Proposal 3: Introduce a trait `WithSessionConfig` which defines session config key prefix. Then we can pick session configs with this key-prefix and propagate it to this particular data source. One another thing also worth to think: sometimes it's really annoying if users have a typo in the config key and spend a lot of time to figure out why things don't work as expected. We should allow data source to validate the given options and throw exception if an option can't be recognized. was: This is an open discussion. The general idea is we should allow users to set some common configs in session conf so that they don't need to type them again and again for each data source operations. Proposal 1: propagate every session config which starts with {{spark.datasource.config.}} to data source options. The downside is, users may only want to set some common configs for a specific data source. Proposal 2: propagate session config which starts with {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. One downside is, some data source may not have a short name and makes the config key pretty long, e.g. 
{{spark.datasource.config.com.company.foo.bar.key1}}. Proposal 3: Introduce a trait `WithSessionConfig` which defines session config key prefix. Then we can pick session configs with this key-prefix and propagate it to this particular data source. One another thing also worth to think: sometimes it's really annoying if users have a type in the config key and spent > propagate session configs to data source read/write options > --- > > Key: SPARK-22387 > URL: https://issues.apache.org/jira/browse/SPARK-22387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > This is an open discussion. The general idea is we should allow users to set > some common configs in session conf so that they don't need to type them > again and again for each data source operations. > Proposal 1: > propagate every session config which starts with {{spark.datasource.config.}} > to data source options. The downside is, users may only want to set some > common configs for a specific data source. > Proposal 2: > propagate session config which starts with > {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} > operations. One downside is, some data source may not have a short name and > makes the config key pretty long, e.g. > {{spark.datasource.config.com.company.foo.bar.key1}}. > Proposal 3: > Introduce a trait `WithSessionConfig` which defines session config key > prefix. Then we can pick session configs with this key-prefix and propagate > it to this particular data source. > One another thing also worth to think: sometimes it's really annoying if > users have a typo in the config key and spend a lot of time to figure out why > things don't work as expected. We should allow data source to validate the > given options and throw exception if an option can't be recognized. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
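Proposal 2 above, together with the option-validation point, can be sketched in pure Python. The prefix, helper name, and option names are illustrative only, not Spark's actual configuration keys: the idea is to pick session configs under a per-source prefix, strip the prefix, and let the source reject anything it does not recognize instead of silently ignoring a typo.

```python
PREFIX = "spark.datasource.config."

def options_for(source_name, session_conf, known_options):
    """Collect session configs under PREFIX + source_name + '.', strip the
    prefix, and fail fast on options the data source does not recognize."""
    prefix = f"{PREFIX}{source_name}."
    opts = {k[len(prefix):]: v for k, v in session_conf.items()
            if k.startswith(prefix)}
    unknown = set(opts) - set(known_options)
    if unknown:
        # surface typos early instead of letting them be silently dropped
        raise ValueError(f"unrecognized options for {source_name}: {sorted(unknown)}")
    return opts

conf = {
    "spark.datasource.config.myDataSource.host": "db1",
    "spark.datasource.config.myDataSource.port": "5432",
    "spark.datasource.config.otherSource.host": "db2",  # different source: ignored
    "spark.sql.shuffle.partitions": "200",              # not a data source config
}
print(options_for("myDataSource", conf, known_options={"host", "port"}))
# {'host': 'db1', 'port': '5432'}
```

Proposal 3 differs only in where the prefix comes from: instead of a fixed naming scheme, the data source itself would declare its prefix via a trait, which sidesteps the long-key problem for sources without a short name.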
[jira] [Created] (SPARK-22388) Limit push down
Wenchen Fan created SPARK-22388: --- Summary: Limit push down Key: SPARK-22388 URL: https://issues.apache.org/jira/browse/SPARK-22388 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options
[ https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22387: Description: This is an open discussion. The general idea is we should allow users to set some common configs in session conf so that they don't need to type them again and again for each data source operations. Proposal 1: propagate every session config which starts with {{spark.datasource.config.}} to data source options. The downside is, users may only want to set some common configs for a specific data source. Proposal 2: propagate session config which starts with {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. One downside is, some data source may not have a short name and makes the config key pretty long, e.g. {{spark.datasource.config.com.company.foo.bar.key1}}. Proposal 3: Introduce a trait `WithSessionConfig` which defines session config key prefix. Then we can pick session configs with this key-prefix and propagate it to this particular data source. One another thing also worth to think: sometimes it's really annoying if users have a type in the config key and spent was: This is an open discussion. The general idea is we should allow users to set some common configs in session conf so that they don't need to type them again and again for each data source operations. Proposal 1: propagate every session config which starts with {{spark.datasource.config.}} to data source options. The downside is, users may only want to set some common configs for a specific data source. Proposal 2: propagate session config which starts with {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. One downside is, some data source may not have a short name and makes the config key pretty long, e.g. {{spark.datasource.config.com.company.foo.bar.key1}}. Proposal 3: Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key-prefix and propagate it to this particular data source. One another thing also worth to think: sometimes it's really awful if users have a type in the config key and spent > propagate session configs to data source read/write options > --- > > Key: SPARK-22387 > URL: https://issues.apache.org/jira/browse/SPARK-22387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > This is an open discussion. The general idea is we should allow users to set > some common configs in session conf so that they don't need to type them > again and again for each data source operations. > Proposal 1: > propagate every session config which starts with {{spark.datasource.config.}} > to data source options. The downside is, users may only want to set some > common configs for a specific data source. > Proposal 2: > propagate session config which starts with > {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} > operations. One downside is, some data source may not have a short name and > makes the config key pretty long, e.g. > {{spark.datasource.config.com.company.foo.bar.key1}}. > Proposal 3: > Introduce a trait `WithSessionConfig` which defines session config key > prefix. Then we can pick session configs with this key-prefix and propagate > it to this particular data source. > One another thing also worth to think: sometimes it's really annoying if > users have a type in the config key and spent -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options
[ https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22387: Description: This is an open discussion. The general idea is we should allow users to set some common configs in session conf so that they don't need to type them again and again for each data source operations. Proposal 1: propagate every session config which starts with {{spark.datasource.config.}} to data source options. The downside is, users may only want to set some common configs for a specific data source. Proposal 2: propagate session config which starts with {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. One downside is, some data source may not have a short name and makes the config key pretty long, e.g. {{spark.datasource.config.com.company.foo.bar.key1}}. Proposal 3: Introduce a trait `WithSessionConfig` which defines session config key prefix. Then we can pick session configs with this key-prefix and propagate it to this particular data source. One another thing also worth to think: sometimes it's really awful if users have a type in the config key and spent was: This is an open discussion. The general idea is we should allow users to set some common configs in session conf so that they don't need to type them again and again for each data source operations. Proposal 1: propagate every session config which starts with {{spark.datasource.config.}} to data source options. The downside is, users may only want to set some common configs for a specific data source. Proposal 2: propagate session config which starts with {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. One downside is, some data source may not have a short name and makes the config key pretty long, e.g. {{spark.datasource.config.com.company.foo.bar.key1}}. Proposal 3: Introduce a trait `WithSessionConfig` which defines session config key prefix. 
Then we can pick session configs with this key prefix and propagate them to this particular data source. > propagate session configs to data source read/write options > --- > > Key: SPARK-22387 > URL: https://issues.apache.org/jira/browse/SPARK-22387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > This is an open discussion. The general idea is that we should allow users to set > some common configs in the session conf so that they don't need to type them > again and again for each data source operation. > Proposal 1: > propagate every session config that starts with {{spark.datasource.config.}} > to data source options. The downside is that users may only want to set some > common configs for a specific data source. > Proposal 2: > propagate session configs that start with > {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} > operations. One downside is that some data sources may not have a short name, which > makes the config key pretty long, e.g. > {{spark.datasource.config.com.company.foo.bar.key1}}. > Proposal 3: > Introduce a trait `WithSessionConfig` which defines a session config key > prefix. Then we can pick session configs with this key prefix and propagate > them to this particular data source. > One other thing worth thinking about: sometimes it's really awful if users > have a typo in the config key and spent -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options
[ https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22387: Description: This is an open discussion. The general idea is that we should allow users to set some common configs in the session conf so that they don't need to type them again and again for each data source operation. Proposal 1: propagate every session config that starts with {{spark.datasource.config.}} to data source options. The downside is that users may only want to set some common configs for a specific data source. Proposal 2: propagate session configs that start with {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} operations. One downside is that some data sources may not have a short name, which makes the config key pretty long, e.g. {{spark.datasource.config.com.company.foo.bar.key1}}. Proposal 3: Introduce a trait `WithSessionConfig` which defines a session config key prefix. Then we can pick session configs with this key prefix and propagate them to this particular data source. > propagate session configs to data source read/write options > --- > > Key: SPARK-22387 > URL: https://issues.apache.org/jira/browse/SPARK-22387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > > This is an open discussion. The general idea is that we should allow users to set > some common configs in the session conf so that they don't need to type them > again and again for each data source operation. > Proposal 1: > propagate every session config that starts with {{spark.datasource.config.}} > to data source options. The downside is that users may only want to set some > common configs for a specific data source. > Proposal 2: > propagate session configs that start with > {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} > operations. One downside is that some data sources may not have a short name, which > makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}. > Proposal 3: > Introduce a trait `WithSessionConfig` which defines a session config key > prefix. Then we can pick session configs with this key prefix and propagate > them to this particular data source.
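The prefix-based propagation in Proposals 2 and 3 can be sketched outside Spark. This is a hypothetical illustration, not Spark's API: `propagate_session_configs` and its arguments are invented names, and the real mechanism would live in the data source resolution path.

```python
# Hypothetical sketch of Proposals 2/3: pick session configs that share a
# per-source prefix and turn them into data source options.

def propagate_session_configs(session_conf, source_name, explicit_options=None):
    """Collect configs under spark.datasource.config.<source_name>. and merge
    them with explicitly passed options (explicit options win)."""
    prefix = "spark.datasource.config.%s." % source_name
    picked = {
        key[len(prefix):]: value
        for key, value in session_conf.items()
        if key.startswith(prefix)
    }
    picked.update(explicit_options or {})
    return picked
```

With a session conf containing both `spark.datasource.config.myDataSource.path` and `spark.datasource.config.other.path`, only the `myDataSource` keys would reach a `myDataSource` read or write, which is the point of scoping the prefix per source.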
[jira] [Created] (SPARK-22387) propagate session configs to data source read/write options
Wenchen Fan created SPARK-22387: --- Summary: propagate session configs to data source read/write options Key: SPARK-22387 URL: https://issues.apache.org/jira/browse/SPARK-22387 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan
[jira] [Created] (SPARK-22386) Data Source V2 improvements
Wenchen Fan created SPARK-22386: --- Summary: Data Source V2 improvements Key: SPARK-22386 URL: https://issues.apache.org/jira/browse/SPARK-22386 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223971#comment-16223971 ] Sean Owen commented on SPARK-21657: --- Thanks [~cloud_fan] for the fast look. You're saying that https://issues.apache.org/jira/browse/SPARK-22385 is a superset of this issue? > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables, with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, but by > the same factor scales down the number of records in the table, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially.
[jira] [Assigned] (SPARK-22385) MapObjects should not access list element by index
[ https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22385: Assignee: Wenchen Fan (was: Apache Spark) > MapObjects should not access list element by index > -- > > Key: SPARK-22385 > URL: https://issues.apache.org/jira/browse/SPARK-22385 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-22385) MapObjects should not access list element by index
[ https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22385: Assignee: Apache Spark (was: Wenchen Fan) > MapObjects should not access list element by index > -- > > Key: SPARK-22385 > URL: https://issues.apache.org/jira/browse/SPARK-22385 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Commented] (SPARK-22385) MapObjects should not access list element by index
[ https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223958#comment-16223958 ] Apache Spark commented on SPARK-22385: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19603 > MapObjects should not access list element by index > -- > > Key: SPARK-22385 > URL: https://issues.apache.org/jira/browse/SPARK-22385 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Updated] (SPARK-22385) MapObjects should not access list element by index
[ https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22385: Issue Type: Improvement (was: Bug) > MapObjects should not access list element by index > -- > > Key: SPARK-22385 > URL: https://issues.apache.org/jira/browse/SPARK-22385 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Created] (SPARK-22385) MapObjects should not access list element by index
Wenchen Fan created SPARK-22385: --- Summary: MapObjects should not access list element by index Key: SPARK-22385 URL: https://issues.apache.org/jira/browse/SPARK-22385 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
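The title above ("should not access list element by index") is about asymptotics: indexing into a linked list costs O(i) per access, so a loop that fetches element 0, then 1, then 2, ... becomes quadratic overall. The minimal plain-Python illustration below is not Spark's generated code; the class and function names are invented for the sketch.

```python
# Why MapObjects-style access by index is quadratic on a linked list,
# while a single traversal (or converting to an array first) is linear.

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def from_list(values):
    """Build a singly linked list from a Python list."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def get_by_index(head, i):
    node = head
    for _ in range(i):          # O(i) walk from the head on every call
        node = node.next
    return node.value

def map_by_index(head, n, fn):
    # n calls, each costing O(i): O(n^2) total, like List.apply in a loop
    return [fn(get_by_index(head, i)) for i in range(n)]

def map_by_iteration(head, fn):
    # one O(n) traversal, like using an iterator or toArray up front
    out = []
    node = head
    while node is not None:
        out.append(fn(node.value))
        node = node.next
    return out
```

Both functions produce the same result; only the access pattern differs, which is why switching away from by-index access (or to a random-access collection) removes the blowup without changing semantics.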
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223946#comment-16223946 ] Ohad Raviv commented on SPARK-21657: Sure, the plan for {code:java} val df_exploded = df.select(expr("c1"), explode($"c_arr").as("c2")).selectExpr("c1" ,"c2.*") {code} is {noformat} == Parsed Logical Plan == 'Project [unresolvedalias('c1, None), ArrayBuffer(c2).*] +- Project [c1#6, c2#25] +- Generate explode(c_arr#7), true, false, [c2#25] +- Project [_1#3 AS c1#6, _2#4 AS c_arr#7] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true) AS _1#3, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._1, true), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._2, true), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._3, true), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._4, true)), assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#4] +- ExternalRDD [obj#2] == Analyzed Logical Plan == c1: string, _1: string, _2: string, _3: string, _4: string 
Project [c1#6, c2#25._1 AS _1#40, c2#25._2 AS _2#41, c2#25._3 AS _3#42, c2#25._4 AS _4#43] +- Project [c1#6, c2#25] +- Generate explode(c_arr#7), true, false, [c2#25] +- Project [_1#3 AS c1#6, _2#4 AS c_arr#7] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true) AS _1#3, mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._1, true), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._2, true), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._3, true), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._4, true)), assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#4] +- ExternalRDD [obj#2] == Optimized Logical Plan == Project [c1#6, c2#25._1 AS _1#40, c2#25._2 AS _2#41, c2#25._3 AS _3#42, c2#25._4 AS _4#43] +- Generate explode(c_arr#7), true, false, [c2#25] +- Project [_1#3 AS c1#6, _2#4 AS c_arr#7] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true) AS _1#3, 
mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._1, true), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4), true))._2, true), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull0, ObjectType(class scala.Tuple4),
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223927#comment-16223927 ] Sean Owen commented on SPARK-21657: --- Can you paste the plans? This difference might be down to a different cause. The linear-time-access List issue still looks worth solving. [~hvanhovell] [~cloud_fan] Are either of you familiar with how the explode code is generated? I also couldn't quite figure out what was generating access to a linked list (immutable.List) where a random-access collection looks more appropriate. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables, with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, but by > the same factor scales down the number of records in the table, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially. 
[jira] [Resolved] (SPARK-22380) Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-22380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22380. --- Resolution: Won't Fix You need to shade your dependencies in your app, not Spark. Look at the maven-shade-plugin. I think this kind of update would have to follow an update in Hadoop as well, which may happen in 3.0, but then that's something that would take place far down the line for Spark 3.x. > Upgrade protobuf-java (com.google.protobuf) version from 2.5.0 to 3.4.0 > --- > > Key: SPARK-22380 > URL: https://issues.apache.org/jira/browse/SPARK-22380 > Project: Spark > Issue Type: Dependency upgrade > Components: Deploy >Affects Versions: 1.6.1, 2.2.0 > Environment: Cloudera 5.13.x > Spark 2.2.0.cloudera1-1.cdh5.12.0.p0.142354 > And anything beyond Spark 2.2.0 >Reporter: Maziyar PANAHI >Priority: Blocker > > Hi, > This upgrade is needed when we try to use CoreNLP 3.8 with Spark (1.6+ and > 2.2+) due to incompatibilities between the protobuf (com.google.protobuf) version used by > Spark and the one used in the latest Stanford CoreNLP (3.8). The > version of protobuf has been set to 2.5.0 in the global properties, and this > is stated in the pom.xml file. 
> The error that refers to this dependency: > {code:java} > java.lang.VerifyError: Bad type on operand stack > Exception Details: > Location: > > com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; > @3: invokevirtual > Reason: > Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current > frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite' > Current Frame: > bci: @3 > flags: { } > locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', > 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer } > stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', > 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer } > Bytecode: > 0x000: 2a2b 1cb6 0024 b0 > at edu.stanford.nlp.simple.Document.(Document.java:433) > at edu.stanford.nlp.simple.Sentence.(Sentence.java:118) > at edu.stanford.nlp.simple.Sentence.(Sentence.java:126) > ... 56 elided > {code} > Is it possible to upgrade this dependency to the latest (3.4) or any > workaround besides manually removing protobuf-java-2.5.0.jar and adding > protobuf-java-3.4.0.jar? > You can follow the discussion of how this upgrade would fix the issue: > https://github.com/stanfordnlp/CoreNLP/issues/556 > Many thanks, > Maziyar
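As a sketch of the shading approach suggested in the resolution above: relocating protobuf inside the application's own jar lets the app bundle protobuf 3.x while the Spark cluster keeps 2.5.0 on its classpath. This is an illustrative maven-shade-plugin fragment for the user's application pom, not Spark's build; the shaded package prefix `myapp.shaded` is an invented example.

```xml
<!-- Illustrative sketch: relocate protobuf inside YOUR application jar so
     CoreNLP 3.8 resolves the shaded 3.x classes while Spark's own
     protobuf-java 2.5.0 stays untouched on the cluster. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>myapp.shaded.com.google.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The relocation rewrites both the bundled protobuf 3.x classes and the bytecode references to them in the app's other dependencies (such as CoreNLP), so the two protobuf versions never collide at class-loading time.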
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223902#comment-16223902 ] Ohad Raviv commented on SPARK-21657: I switched to toArray instead of toList in the above code and did get an improvement by a factor of 2, but we still remain with the main bottleneck. Now the diff in the above example between: {code:java} val df_exploded = df.select(expr("c1"), explode($"c_arr").as("c2")) {code} and: {code:java} val df_exploded = df.select(explode($"c_arr").as("c2")) {code} is 128 secs vs. 3 secs. Again I profiled the former and saw that all the time was consumed in: org.apache.spark.unsafe.Platform.copyMemory() 97.548096 23,991 ms (97.5%). The obvious diff between the execution plans is that the former has two WholeStageCodegen plans and the latter just one. I didn't exactly understand the generated code, but I would guess that in the problematic case the generated explode code actually duplicates the long array across all the exploded rows and only filters it out at the end. Please see if you can verify this or think of a workaround for it. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. 
> {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables, with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, but by > the same factor scales down the number of records in the table, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially.
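The guess in the comment above (the source array being materialized for every exploded row before a final projection drops it) can be modeled in miniature. This toy Python sketch is not Spark's generated code; the function names are invented, and it only illustrates why carrying the whole array into each intermediate row makes the work per input row quadratic in array length, while a plain flattening pass is linear.

```python
# Toy model of the suspected blowup when extra columns are selected
# alongside explode(): each intermediate row drags along a copy of the
# full source array, and only a final projection drops it.

def explode_copying(rows):
    """rows: list of (c1, c_arr). Emits one output row per array element,
    but each intermediate row carries a whole-array copy (the slow shape)."""
    out = []
    for c1, c_arr in rows:
        for elem in c_arr:
            intermediate = (c1, list(c_arr), elem)  # full copy per output row
            out.append((intermediate[0], intermediate[2]))  # projection drops it
    return out

def explode_referencing(rows):
    """Same output rows, without materializing the array per output row."""
    return [(c1, elem) for c1, c_arr in rows for elem in c_arr]
```

Both produce identical results; the difference is purely in how much memory traffic each output row generates, which matches the profile above where nearly all time went to Platform.copyMemory().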
[jira] [Resolved] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process
[ https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22375. -- Resolution: Fixed Fix Version/s: 2.3.0 > Test script can fail if eggs are installed by setup.py during test process > -- > > Key: SPARK-22375 > URL: https://issues.apache.org/jira/browse/SPARK-22375 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 > Environment: OSX 10.12.6 >Reporter: Joel Croteau >Priority: Trivial > Fix For: 2.3.0 > > > Running ./dev/run-tests may install missing Python packages as part of its > setup process. setup.py can cache these in python/.eggs, and since the > lint-python script checks any file with the .py extension anywhere in the > Spark project, it will check files in .eggs and will fail if any of these do > not meet style criteria, even though these are not part of the project. > lint-python should exclude python/.eggs from its search directories.
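The proposed fix can be sketched as a file-collection step that prunes cached egg directories. The real dev/lint-python is a shell script, so this Python walker (`python_files` is an invented name) only illustrates the exclusion idea:

```python
# Sketch: collect .py files for a style check while skipping cached
# .eggs directories that setup.py may have created during the test run.
import os

def python_files(root, excluded_dirs=(".eggs",)):
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into
        # excluded directories at all.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            if name.endswith(".py"):
                found.append(os.path.join(dirpath, name))
    return found
```

The same effect is available directly in the style tools themselves (for example, pycodestyle and flake8 both accept an exclude option), which is the lighter-weight route for a lint script.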
[jira] [Commented] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast
[ https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223880#comment-16223880 ] Apache Spark commented on SPARK-22384: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/19602 > Refine partition pruning when attribute is wrapped in Cast > -- > > Key: SPARK-22384 > URL: https://issues.apache.org/jira/browse/SPARK-22384 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > The SQL below will get all partitions from the metastore, which puts much burden on > the metastore; > {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}} > {{SELECT * from test where dt=2017}} > The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} > and {{HiveShim}} fails to generate a proper partition filter. > Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in > my warehouse.
[jira] [Assigned] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast
[ https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22384: Assignee: (was: Apache Spark) > Refine partition pruning when attribute is wrapped in Cast > -- > > Key: SPARK-22384 > URL: https://issues.apache.org/jira/browse/SPARK-22384 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > The SQL below will get all partitions from the metastore, which puts much burden on > the metastore; > {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}} > {{SELECT * from test where dt=2017}} > The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} > and {{HiveShim}} fails to generate a proper partition filter. > Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in > my warehouse.
[jira] [Assigned] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast
[ https://issues.apache.org/jira/browse/SPARK-22384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22384: Assignee: Apache Spark > Refine partition pruning when attribute is wrapped in Cast > -- > > Key: SPARK-22384 > URL: https://issues.apache.org/jira/browse/SPARK-22384 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: Apache Spark > > The SQL below will get all partitions from the metastore, which puts much burden on > the metastore; > {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}} > {{SELECT * from test where dt=2017}} > The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} > and {{HiveShim}} fails to generate a proper partition filter. > Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in > my warehouse.
[jira] [Created] (SPARK-22384) Refine partition pruning when attribute is wrapped in Cast
jin xing created SPARK-22384: Summary: Refine partition pruning when attribute is wrapped in Cast Key: SPARK-22384 URL: https://issues.apache.org/jira/browse/SPARK-22384 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: jin xing The SQL below will get all partitions from the metastore, which puts much burden on the metastore; {{CREATE TABLE test (value INT) PARTITIONED BY (dt STRING)}} {{SELECT * from test where dt=2017}} The reason is that the analyzed attribute {{dt}} is wrapped in {{Cast}} and {{HiveShim}} fails to generate a proper partition filter. Could we fix this? SQL like {{SELECT * from test where dt=2017}} is common in my warehouse.
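The failure mode above can be sketched abstractly: once the planner sees the comparison as {{cast(dt as int) = 2017}}, the attribute side is no longer a bare partition column, so no filter reaches the metastore and every partition is fetched. Converting the literal to the column's type instead keeps the filter pushable. The sketch below is illustrative, not HiveShim's actual code; the function name and filter syntax are invented.

```python
# Illustrative sketch: rewrite "cast(col) = literal" into a pushable
# "col = cast(literal)" when the literal converts cleanly to the
# partition column's type.

def push_down_cast_filter(column, column_type, literal):
    """Return a metastore-pushable filter string, or None if the literal
    cannot be converted to the partition column's type."""
    if column_type == "string":
        # dt=2017 on a STRING partition column becomes dt = "2017"
        return '%s = "%s"' % (column, literal)
    if column_type == "int":
        try:
            return "%s = %d" % (column, int(literal))
        except (TypeError, ValueError):
            return None  # e.g. dt = 'abc' cannot be pruned by value
    return None  # unsupported type: fall back to fetching all partitions
```

In the ticket's example, `push_down_cast_filter("dt", "string", 2017)` would yield a filter on the raw `dt` column, so the metastore can prune instead of returning all partitions.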
[jira] [Commented] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process
[ https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223877#comment-16223877 ] Hyukjin Kwon commented on SPARK-22375: -- Fixed in https://github.com/apache/spark/pull/19597 > Test script can fail if eggs are installed by setup.py during test process > -- > > Key: SPARK-22375 > URL: https://issues.apache.org/jira/browse/SPARK-22375 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 > Environment: OSX 10.12.6 >Reporter: Joel Croteau >Priority: Trivial > > Running ./dev/run-tests may install missing Python packages as part of its > setup process. setup.py can cache these in python/.eggs, and since the > lint-python script checks any file with the .py extension anywhere in the > Spark project, it will check files in .eggs and will fail if any of these do > not meet style criteria, even though these are not part of the project. > lint-python should exclude python/.eggs from its search directories.
[jira] [Updated] (SPARK-22375) Test script can fail if eggs are installed by setup.py during test process
[ https://issues.apache.org/jira/browse/SPARK-22375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22375: - External issue URL: https://github.com/pypa/setuptools/issues/391 > Test script can fail if eggs are installed by setup.py during test process > -- > > Key: SPARK-22375 > URL: https://issues.apache.org/jira/browse/SPARK-22375 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 > Environment: OSX 10.12.6 >Reporter: Joel Croteau >Priority: Trivial > > Running ./dev/run-tests may install missing Python packages as part of its > setup process. setup.py can cache these in python/.eggs, and since the > lint-python script checks any file with the .py extension anywhere in the > Spark project, it will check files in .eggs and will fail if any of these do > not meet style criteria, even though these are not part of the project. > lint-python should exclude python/.eggs from its search directories.