[jira] [Updated] (SPARK-30120) LSH approxNearestNeighbors should use TopByKeyAggregator when numNearestNeighbors is small
[ https://issues.apache.org/jira/browse/SPARK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-30120: - Description: ping [~huaxingao] > LSH approxNearestNeighbors should use TopByKeyAggregator when > numNearestNeighbors is small > -- > > Key: SPARK-30120 > URL: https://issues.apache.org/jira/browse/SPARK-30120 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30120) LSH approxNearestNeighbors should use TopByKeyAggregator when numNearestNeighbors is small
zhengruifeng created SPARK-30120: Summary: LSH approxNearestNeighbors should use TopByKeyAggregator when numNearestNeighbors is small Key: SPARK-30120 URL: https://issues.apache.org/jira/browse/SPARK-30120 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
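For context on the proposal above: the point of a top-by-key aggregator is to keep only the k best candidates per query key instead of materializing and fully sorting every candidate distance. The sketch below only illustrates that pattern in PySpark on hypothetical (key, distance) pairs; it is not the ML implementation (which would use the Scala TopByKeyAggregator), just the idea.

{code:python}
import heapq

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical (query key, candidate distance) pairs standing in for LSH candidates.
pairs = spark.sparkContext.parallelize(
    [(0, 0.9), (0, 0.1), (0, 0.5), (1, 0.3), (1, 0.7)]
)

k = 2  # numNearestNeighbors

def keep_smallest(acc, dist):
    # Track at most k smallest distances seen so far for this key.
    return heapq.nsmallest(k, acc + [dist])

def merge(acc_a, acc_b):
    # Merge the per-partition top-k lists, again keeping only k values.
    return heapq.nsmallest(k, acc_a + acc_b)

top_k_per_key = pairs.aggregateByKey([], keep_smallest, merge)
print(sorted(top_k_per_key.collect()))  # [(0, [0.1, 0.5]), (1, [0.3, 0.7])]
{code}

For small k this keeps only k values per key at every step instead of sorting all candidate distances, which is the motivation stated in the summary.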
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987593#comment-16987593 ] jugosag commented on SPARK-28921: - We are also observing this on two of our clusters, both set up with Rancher, one on Kubernetes version 1.15.5 the other on 1.14.6. Interestingly, on Minikube with version 1.15.4 it works. Problem is that the Spark version which has the fix (2.4.5) has not been released yet and the 3.0.0 preview from Nov 6th has the same problem. When will 2.4.5 be released? > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Paul Schweigert >Assignee: Andy Grove >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
[ https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30099. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26734 [https://github.com/apache/spark/pull/26734] > Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming > > > Key: SPARK-30099 > URL: https://issues.apache.org/jira/browse/SPARK-30099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: jobit mathew >Priority: Minor > Fix For: 3.0.0 > > > Spark SQL > explain extended select * from any non existing table shows duplicate > AnalysisExceptions. > {code:java} > spark-sql>explain extended select * from wrong > == Parsed Logical Plan == > 'Project [*] > +- 'UnresolvedRelation `wrong` > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > Time taken: 6.0 seconds, Fetched 1 row(s) > 19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 > row > (s) > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
[ https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30099: --- Assignee: jobit mathew > Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming > > > Key: SPARK-30099 > URL: https://issues.apache.org/jira/browse/SPARK-30099 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: jobit mathew >Priority: Minor > > Spark SQL > explain extended select * from any non existing table shows duplicate > AnalysisExceptions. > {code:java} > spark-sql>explain extended select * from wrong > == Parsed Logical Plan == > 'Project [*] > +- 'UnresolvedRelation `wrong` > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line > 1 p > os 31 > Time taken: 6.0 seconds, Fetched 1 row(s) > 19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 > row > (s) > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30082) Zeros are being treated as NaNs
[ https://issues.apache.org/jira/browse/SPARK-30082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30082: Fix Version/s: 2.4.5 > Zeros are being treated as NaNs > --- > > Key: SPARK-30082 > URL: https://issues.apache.org/jira/browse/SPARK-30082 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: John Ayad >Assignee: John Ayad >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > If you attempt to run > {code:java} > df = df.replace(float('nan'), somethingToReplaceWith) > {code} > It will replace all {{0}} s in columns of type {{Integer}} > Example code snippet to repro this: > {code:java} > from pyspark.sql import SQLContext > spark = SQLContext(sc).sparkSession > df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value")) > df.show() > df = df.replace(float('nan'), 5) > df.show() > {code} > Here's the output I get when I run this code: > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > Using Python version 3.7.5 (default, Nov 1 2019 02:16:32) > SparkSession available as 'spark'. > >>> from pyspark.sql import SQLContext > >>> spark = SQLContext(sc).sparkSession > >>> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value")) > >>> df.show() > +-+-+ > |index|value| > +-+-+ > |1|0| > |2|3| > |3|0| > +-+-+ > >>> df = df.replace(float('nan'), 5) > >>> df.show() > +-+-+ > |index|value| > +-+-+ > |1|5| > |2|3| > |3|5| > +-+-+ > >>> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
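On the affected 2.4.x builds (before the fix versions noted above), one way to sidestep the unintended replacement in integer columns is to restrict replace to floating-point columns via its subset parameter. A minimal sketch; the column names are illustrative, not taken from the report:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 0, float("nan")), (2, 3, 2.5), (3, 0, 0.0)],
    ("index", "value", "score"),
)

# Only consider columns that are actually floating point, so the integer
# column `value` (which contains legitimate zeros) is left untouched.
float_cols = [f.name for f in df.schema.fields
              if f.dataType.typeName() in ("float", "double")]

df.replace(float("nan"), 5.0, subset=float_cols).show()
{code}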
[jira] [Updated] (SPARK-30119) Support pagination for spark streaming tab
[ https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jobit mathew updated SPARK-30119: - Affects Version/s: 3.0.0 > Support pagination for spark streaming tab > --- > > Key: SPARK-30119 > URL: https://issues.apache.org/jira/browse/SPARK-30119 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4, 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Support pagination for spark streaming tab -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30119) Support pagination for spark streaming tab
[ https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987522#comment-16987522 ] Rakesh Raushan commented on SPARK-30119: I will raise PR for this one. > Support pagination for spark streaming tab > --- > > Key: SPARK-30119 > URL: https://issues.apache.org/jira/browse/SPARK-30119 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > Support pagination for spark streaming tab -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30119) Support pagination for spark streaming tab
jobit mathew created SPARK-30119: Summary: Support pagination for spark streaming tab Key: SPARK-30119 URL: https://issues.apache.org/jira/browse/SPARK-30119 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: jobit mathew Support pagination for spark streaming tab -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27547) fix DataFrame self-join problems
[ https://issues.apache.org/jira/browse/SPARK-27547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27547. - Resolution: Duplicate > fix DataFrame self-join problems > > > Key: SPARK-27547 > URL: https://issues.apache.org/jira/browse/SPARK-27547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30118) ALTER VIEW QUERY does not work
[ https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987491#comment-16987491 ] Lantao Jin edited comment on SPARK-30118 at 12/4/19 3:42 AM: - I think it had been fixed. I cannot reproduce it in master code base. was (Author: cltlfcjin): I think it had been fixed in latest version. Could you try it in 2.4 or 3.0.0-preview? > ALTER VIEW QUERY does not work > -- > > Key: SPARK-30118 > URL: https://issues.apache.org/jira/browse/SPARK-30118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted > state. > {code:sql} > spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> TABLE jzhuge.v1; > Error in query: Attribute with name 'foo2' is not found in '(foo1)';; > SubqueryAlias `jzhuge`.`v1` > +- View (`jzhuge`.`v1`, [foo1#33]) >+- Project [foo AS foo1#34] > +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work
[ https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987491#comment-16987491 ] Lantao Jin commented on SPARK-30118: I think it had been fixed in latest version. Could you try it in 2.4 or 3.0.0-preview? > ALTER VIEW QUERY does not work > -- > > Key: SPARK-30118 > URL: https://issues.apache.org/jira/browse/SPARK-30118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted > state. > {code:sql} > spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> TABLE jzhuge.v1; > Error in query: Attribute with name 'foo2' is not found in '(foo1)';; > SubqueryAlias `jzhuge`.`v1` > +- View (`jzhuge`.`v1`, [foo1#33]) >+- Project [foo AS foo1#34] > +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30113) Document mergeSchema option in Python Orc APIs
[ https://issues.apache.org/jira/browse/SPARK-30113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30113. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26755 [https://github.com/apache/spark/pull/26755] > Document mergeSchema option in Python Orc APIs > -- > > Key: SPARK-30113 > URL: https://issues.apache.org/jira/browse/SPARK-30113 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30113) Document mergeSchema option in Python Orc APIs
[ https://issues.apache.org/jira/browse/SPARK-30113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30113: Assignee: Nicholas Chammas > Document mergeSchema option in Python Orc APIs > -- > > Key: SPARK-30113 > URL: https://issues.apache.org/jira/browse/SPARK-30113 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
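For reference, this is roughly how the newly documented option is used from PySpark. In the 3.0 API this ticket covers, mergeSchema is exposed as a keyword argument on DataFrameReader.orc; treat the exact signature as an assumption, and the paths below are illustrative:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two ORC datasets with different but compatible schemas.
spark.range(3).selectExpr("id", "id AS a").write.orc("/tmp/orc_merge_demo/part1")
spark.range(3).selectExpr("id", "id AS b").write.orc("/tmp/orc_merge_demo/part2")

# Ask the ORC reader to merge the per-file schemas instead of picking one of them.
merged = spark.read.orc(
    ["/tmp/orc_merge_demo/part1", "/tmp/orc_merge_demo/part2"],
    mergeSchema=True,
)
merged.printSchema()  # expected columns: id, a, b
{code}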
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987456#comment-16987456 ] huangtianhua commented on SPARK-29106: -- [~shaneknapp],:) So seems you're back:) Could you please to add the new arm instance into amplab, it has high performance, maven test costs about 5 hours, the instance info I have sent to your email few days ago. And if you have any problem please contact me, thank you very much! > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, > SparkR-and-pyspark36-testing.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have took test it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan test on other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the infos to shane knapp, thanks > shane to add the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > we can see there is no arm64 supporting. So we build an arm64 supporting > release of leveldbjni see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modified the spark pom.xml directly with something like > 'property'/'profile' to choose correct jar package on arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, the packages > depend on leveldbjni-all-1.8 too, unless hadoop release with new arm > supporting leveldbjni jar. Now we download the leveldbjni-al-1.8 of > openlabtesting and 'mvn install' to use it when arm testing for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30091) Document mergeSchema option directly in the Python Parquet APIs
[ https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30091: Assignee: Nicholas Chammas > Document mergeSchema option directly in the Python Parquet APIs > --- > > Key: SPARK-30091 > URL: https://issues.apache.org/jira/browse/SPARK-30091 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > > [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet] > Strangely, the `mergeSchema` option is mentioned in the docstring but not > implemented in the method signature. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30091) Document mergeSchema option directly in the Python Parquet APIs
[ https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30091. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26730 [https://github.com/apache/spark/pull/26730] > Document mergeSchema option directly in the Python Parquet APIs > --- > > Key: SPARK-30091 > URL: https://issues.apache.org/jira/browse/SPARK-30091 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 3.0.0 > > > [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet] > Strangely, the `mergeSchema` option is mentioned in the docstring but not > implemented in the method signature. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
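Similarly, for the Parquet reader on 2.4.4 (where the docstring mentions the option but the method signature does not take it), the generic option form works; a minimal sketch with illustrative paths:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(3).selectExpr("id", "id AS a").write.parquet("/tmp/parquet_merge_demo/part1")
spark.range(3).selectExpr("id", "id AS b").write.parquet("/tmp/parquet_merge_demo/part2")

# mergeSchema reconciles the schemas of all part files instead of sampling one.
merged = (spark.read
          .option("mergeSchema", "true")
          .parquet("/tmp/parquet_merge_demo/part1",
                   "/tmp/parquet_merge_demo/part2"))
merged.printSchema()  # expected columns: id, a, b
{code}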
[jira] [Commented] (SPARK-27547) fix DataFrame self-join problems
[ https://issues.apache.org/jira/browse/SPARK-27547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987444#comment-16987444 ] Nicholas Chammas commented on SPARK-27547: -- Should this be marked as resolved by [#25107|https://github.com/apache/spark/pull/25107]? > fix DataFrame self-join problems > > > Key: SPARK-27547 > URL: https://issues.apache.org/jira/browse/SPARK-27547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator
[ https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29667. -- Resolution: Duplicate I am resolving this as a duplicate of SPARK-29860. > implicitly convert mismatched datatypes on right side of "IN" operator > -- > > Key: SPARK-29667 > URL: https://issues.apache.org/jira/browse/SPARK-29667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jessie Lin >Priority: Minor > > Ran into error on this sql > Mismatched columns: > {code} > [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] > {code} > the sql and clause > {code} > AND a.id in (select id from db1.table1 where col1 = 1 group by id) > {code} > Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql > ran just fine. Can the sql engine cast implicitly in this case? > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator
[ https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987440#comment-16987440 ] Hyukjin Kwon commented on SPARK-29667: -- I think this issue is a duplicate of SPARK-29860. [The PR against SPARK-29860|https://github.com/apache/spark/pull/26485] fixes this issue. There are also related JIRAs and PRs such as [SPARK-25056|https://github.com/apache/spark/pull/22038] and [SPARK-22413|https://github.com/apache/spark/pull/19635]. It seems we should actively review these rather than leaving them open indefinitely. CC [~yumwang], [~mgaido], [~smilegator], [~cloud_fan]. > implicitly convert mismatched datatypes on right side of "IN" operator > -- > > Key: SPARK-29667 > URL: https://issues.apache.org/jira/browse/SPARK-29667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jessie Lin >Priority: Minor > > Ran into error on this sql > Mismatched columns: > {code} > [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] > {code} > the sql and clause > {code} > AND a.id in (select id from db1.table1 where col1 = 1 group by id) > {code} > Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql > ran just fine. Can the sql engine cast implicitly in this case? > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
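To make the workaround mentioned in the description concrete: casting the subquery side to the wider decimal type avoids the mismatch. A self-contained sketch with stand-in temp views (the view names and data are illustrative, not from the report):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-ins for table a (decimal(28,0) ids) and db1.table1 (decimal(18,0) ids).
spark.sql("CREATE OR REPLACE TEMP VIEW a AS SELECT CAST(1 AS DECIMAL(28,0)) AS id")
spark.sql("""CREATE OR REPLACE TEMP VIEW t1 AS
             SELECT CAST(1 AS DECIMAL(18,0)) AS id, 1 AS col1""")

# Explicitly widening the subquery column sidesteps the 'Mismatched columns' error.
spark.sql("""
    SELECT * FROM a
    WHERE a.id IN (SELECT CAST(id AS DECIMAL(28,0))
                   FROM t1
                   WHERE col1 = 1
                   GROUP BY id)
""").show()
{code}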
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987435#comment-16987435 ] Shane Knapp commented on SPARK-29106: - nice catch... fixed and relaunched. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, > SparkR-and-pyspark36-testing.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have took test it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan test on other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the infos to shane knapp, thanks > shane to add the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > we can see there is no arm64 supporting. So we build an arm64 supporting > release of leveldbjni see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modified the spark pom.xml directly with something like > 'property'/'profile' to choose correct jar package on arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, the packages > depend on leveldbjni-all-1.8 too, unless hadoop release with new arm > supporting leveldbjni jar. Now we download the leveldbjni-al-1.8 of > openlabtesting and 'mvn install' to use it when arm testing for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30110) Support type judgment for ArrayData
[ https://issues.apache.org/jira/browse/SPARK-30110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng resolved SPARK-30110. Resolution: Won't Fix > Support type judgment for ArrayData > --- > > Key: SPARK-30110 > URL: https://issues.apache.org/jira/browse/SPARK-30110 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: jiaan.geng >Priority: Major > Fix For: 3.0.0 > > > ArrayData only provides interfaces for getting data, such as getBoolean and > getByte. > When the element type of an array is unknown, I want to determine the element type > first. > I have a PR in progress: > [https://github.com/beliefer/spark/commit/5787c6f062795aa7931c58ac2302ba607d3a97aa] > The array function ArrayNDims is used to get the dimension of an array. > If ArrayData could support interfaces such as isBoolean, isByte, and isArray, my > job would be easier. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)
[ https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987432#comment-16987432 ] Guy Huinen commented on SPARK-30062: ok > bug with DB2Driver using mode("overwrite") option("truncate",True) > -- > > Key: SPARK-30062 > URL: https://issues.apache.org/jira/browse/SPARK-30062 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Guy Huinen >Priority: Major > Labels: db2, pyspark > > using DB2Driver using mode("overwrite") option("truncate",True) gives sql > error > > {code:java} > dfClient.write\ > .format("jdbc")\ > .mode("overwrite")\ > .option('driver', 'com.ibm.db2.jcc.DB2Driver')\ > .option("url","jdbc:db2://")\ > .option("user","xxx")\ > .option("password","")\ > .option("dbtable","")\ > .option("truncate",True)\{code} > > gives the error below > in summary i belief the semicolon is misplaced or malformated > > {code:java} > EXPO.EXPO#CMR_STG;IMMEDIATE{code} > > > full error > {code:java} > An error occurred while calling o47.save. : > com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, > SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, > DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at > com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) > at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at > com.ibm.db2.jcc.am.kh.d(kh.java:2776) at > com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) > at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) > at com.ibm.db2.jcc.t4.av.h(av.java:124) at > com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at > com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) > at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:282) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748){code} > -- This message was sent by Atlassian Jira
[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work
[ https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987429#comment-16987429 ] John Zhuge commented on SPARK-30118: I am running Spark master with Hive 1.2.1. Same issue in Spark 2.3. > ALTER VIEW QUERY does not work > -- > > Key: SPARK-30118 > URL: https://issues.apache.org/jira/browse/SPARK-30118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted > state. > {code:sql} > spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> TABLE jzhuge.v1; > Error in query: Attribute with name 'foo2' is not found in '(foo1)';; > SubqueryAlias `jzhuge`.`v1` > +- View (`jzhuge`.`v1`, [foo1#33]) >+- Project [foo AS foo1#34] > +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work
[ https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987428#comment-16987428 ] John Zhuge commented on SPARK-30118: {code:java} spark-sql> DESC FORMATTED jzhuge.v1; foo1string NULL # Detailed Table Information Databasejzhuge Table v1 Owner jzhuge Created TimeTue Dec 03 17:53:59 PST 2019 Last Access UNKNOWN Created By Spark 3.0.0-SNAPSHOT TypeVIEW View Text SELECT 'foo' foo1 View Original Text SELECT 'foo' foo1 View Default Database default View Query Output Columns [foo2] Table Properties[transient_lastDdlTime=1575424439, view.query.out.col.0=foo2, view.query.out.numCols=1, view.default.database=default] Locationfile://tmp/jzhuge/v1 Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormatorg.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [serialization.format=1] {code} > ALTER VIEW QUERY does not work > -- > > Key: SPARK-30118 > URL: https://issues.apache.org/jira/browse/SPARK-30118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted > state. > {code:sql} > spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> TABLE jzhuge.v1; > Error in query: Attribute with name 'foo2' is not found in '(foo1)';; > SubqueryAlias `jzhuge`.`v1` > +- View (`jzhuge`.`v1`, [foo1#33]) >+- Project [foo AS foo1#34] > +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30118) ALTER VIEW QUERY does not work
John Zhuge created SPARK-30118: -- Summary: ALTER VIEW QUERY does not work Key: SPARK-30118 URL: https://issues.apache.org/jira/browse/SPARK-30118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted state. {code:sql} spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; spark-sql> SHOW CREATE TABLE jzhuge.v1; CREATE VIEW `jzhuge`.`v1`(foo1) AS SELECT 'foo' foo1 spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; spark-sql> SHOW CREATE TABLE jzhuge.v1; CREATE VIEW `jzhuge`.`v1`(foo1) AS SELECT 'foo' foo1 spark-sql> TABLE jzhuge.v1; Error in query: Attribute with name 'foo2' is not found in '(foo1)';; SubqueryAlias `jzhuge`.`v1` +- View (`jzhuge`.`v1`, [foo1#33]) +- Project [foo AS foo1#34] +- OneRowRelation {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
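A possible workaround until this is fixed: recreate the view with CREATE OR REPLACE VIEW rather than ALTER VIEW ... AS, since recreation rebuilds the stored schema from the new query. Whether it fully avoids the stale output columns seen above is an assumption, not verified against this report; the sketch also assumes the jzhuge database from the repro exists.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Instead of: ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2
# recreate the view outright, so the stored output columns are rebuilt.
spark.sql("CREATE OR REPLACE VIEW jzhuge.v1 AS SELECT 'foo' AS foo2")
spark.sql("SELECT * FROM jzhuge.v1").show()
{code}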
[jira] [Resolved] (SPARK-30111) spark R dockerfile fails to build
[ https://issues.apache.org/jira/browse/SPARK-30111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp resolved SPARK-30111. - Resolution: Fixed > spark R dockerfile fails to build > - > > Key: SPARK-30111 > URL: https://issues.apache.org/jira/browse/SPARK-30111 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Ilan Filonenko >Priority: Major > > all recent k8s builds have been failing when trying to build the sparkR > dockerfile: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/] > [~ifilonenko] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30111) spark R dockerfile fails to build
[ https://issues.apache.org/jira/browse/SPARK-30111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp reassigned SPARK-30111: --- Assignee: Ilan Filonenko > spark R dockerfile fails to build > - > > Key: SPARK-30111 > URL: https://issues.apache.org/jira/browse/SPARK-30111 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Ilan Filonenko >Priority: Major > > all recent k8s builds have been failing when trying to build the sparkR > dockerfile: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/] > [~ifilonenko] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27025. -- Resolution: Duplicate Actually, I think this was done by SPARK-27659 .. > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
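For anyone landing here from the wish above: SPARK-27659 exposed the prefetching behaviour in PySpark as a prefetchPartitions flag on RDD.toLocalIterator (3.0+). A minimal sketch; treat the parameter name and exact semantics as an assumption based on that change:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000), numSlices=10)

# With prefetchPartitions=True the next partition is computed while the
# current one is being consumed on the driver, instead of strictly one
# partition at a time.
total = 0
for x in rdd.toLocalIterator(prefetchPartitions=True):
    total += x
print(total)
{code}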
[jira] [Reopened] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-27025: -- > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)
[ https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987424#comment-16987424 ] Hyukjin Kwon commented on SPARK-30062: -- Can you open a PR to fix DB2 dialect then? > bug with DB2Driver using mode("overwrite") option("truncate",True) > -- > > Key: SPARK-30062 > URL: https://issues.apache.org/jira/browse/SPARK-30062 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Guy Huinen >Priority: Major > Labels: db2, pyspark > > using DB2Driver using mode("overwrite") option("truncate",True) gives sql > error > > {code:java} > dfClient.write\ > .format("jdbc")\ > .mode("overwrite")\ > .option('driver', 'com.ibm.db2.jcc.DB2Driver')\ > .option("url","jdbc:db2://")\ > .option("user","xxx")\ > .option("password","")\ > .option("dbtable","")\ > .option("truncate",True)\{code} > > gives the error below > in summary i belief the semicolon is misplaced or malformated > > {code:java} > EXPO.EXPO#CMR_STG;IMMEDIATE{code} > > > full error > {code:java} > An error occurred while calling o47.save. : > com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, > SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, > DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at > com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) > at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at > com.ibm.db2.jcc.am.kh.d(kh.java:2776) at > com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) > at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) > at com.ibm.db2.jcc.t4.av.h(av.java:124) at > com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at > com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) > at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:282) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748){code} > --
[jira] [Updated] (SPARK-29591) Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgresql
[ https://issues.apache.org/jira/browse/SPARK-29591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29591: - Description: Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgre sql. *In postgre sql* {code:java} CREATE TABLE weather ( city varchar(80), temp_lo int, – low temperature temp_hi int, – high temperature prcp real, – precipitation date date ); {code} *You can list the columns in a different order if you wish or even omit some columns,* {code:java} INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37); {code} *Spark SQL* But in spark sql is not allowing to insert data in different order or omit any column.Better to support this as it can save time if we can not predict any specific column value or if some value is fixed always. {code:java} create table jobit(id int,name string); > insert into jobit values(1,"Ankit"); Time taken: 0.548 seconds spark-sql> *insert into jobit (id) values(1);* *Error in query:* mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (id) values(1) ---^^^ spark-sql> *insert into jobit (name,id) values("Ankit",1);* *Error in query:* mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (name,id) values("Ankit",1) ---^^^ spark-sql> {code} was: Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgre sql. *In postgre sql* {code:java} CREATE TABLE weather ( city varchar(80), temp_lo int, – low temperature temp_hi int, – high temperature prcp real, – precipitation date date ); {code} *You can list the columns in a different order if you wish or even omit some columns,* {code:java} INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37); {code} *Spark SQL* But in spark sql is not allowing to insert data in different order or omit any column.Better to support this as it can save time if we can not predict any specific column value or if some value is fixed always. {code:java} create table jobit(id int,name string); > insert into jobit values(1,"Ankit"); Time taken: 0.548 seconds spark-sql> *insert into jobit (id) values(1);* *Error in query:* mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (id) values(1) ---^^^ spark-sql> *insert into jobit (name,id) values("Ankit",1);* *Error in query:* mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (name,id) values("Ankit",1) ---^^^ spark-sql> {code} pos : point-of-sale > Support data insertion in a different order if you wish or even omit some > columns in spark sql also like postgresql > --- > > Key: SPARK-29591 > URL: https://issues.apache.org/jira/browse/SPARK-29591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Major > > Support data insertion in a different order if you wish or even omit some > columns in spark sql also like postgre sql. 
> *In postgre sql* > {code:java} > CREATE TABLE weather ( > city varchar(80), > temp_lo int, – low temperature > temp_hi int, – high temperature > prcp real, – precipitation > date date > ); > {code} > *You can list the columns in a different order if you wish or even omit some > columns,* > {code:java} > INSERT INTO weather (date, city, temp_hi, temp_lo) > VALUES ('1994-11-29', 'Hayward', 54, 37); > {code} > > *Spark SQL* > But in spark sql is not allowing to insert data in different order or omit > any column.Better to support this as it can save time if we can not predict > any specific column value or if some value is fixed always. > {code:java} > create table jobit(id int,name string); > > insert into jobit values(1,"Ankit"); > Time taken: 0.548 seconds > spark-sql> *insert into jobit (id) values(1);* > *Error in query:* > mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', > 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) > == SQL == > insert into jobit (id) values(1) > ---^^^ > spark-sql> *insert into jobit (name,id) values("Ankit",1);* > *Error in query:* > mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', > 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) >
[jira] [Updated] (SPARK-29591) Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgresql
[ https://issues.apache.org/jira/browse/SPARK-29591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29591: - Description: Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgre sql. *In postgre sql* {code} CREATE TABLE weather ( city varchar(80), temp_lo int, – low temperature temp_hi int, – high temperature prcp real, – precipitation date date ); {code} *You can list the columns in a different order if you wish or even omit some columns,* {code} INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37) {code}; *Spark SQL* But in spark sql is not allowing to insert data in different order or omit any column.Better to support this as it can save time if we can not predict any specific column value or if some value is fixed always. {code} create table jobit(id int,name string); > insert into jobit values(1,"Ankit"); Time taken: 0.548 seconds spark-sql> *insert into jobit (id) values(1);* *Error in query:* mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (id) values(1) ---^^^ spark-sql> *insert into jobit (name,id) values("Ankit",1);* *Error in query:* mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (name,id) values("Ankit",1) ---^^^ spark-sql> {code} was: Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgre sql. *In postgre sql* CREATE TABLE weather ( city varchar(80), temp_lo int, – low temperature temp_hi int, – high temperature prcp real, – precipitation date date ); *You can list the columns in a different order if you wish or even omit some columns,* INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37); *Spark SQL* But in spark sql is not allowing to insert data in different order or omit any column.Better to support this as it can save time if we can not predict any specific column value or if some value is fixed always. create table jobit(id int,name string); > insert into jobit values(1,"Ankit"); Time taken: 0.548 seconds spark-sql> *insert into jobit (id) values(1);* *Error in query:* mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (id) values(1) ---^^^ spark-sql> *insert into jobit (name,id) values("Ankit",1);* *Error in query:* mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (name,id) values("Ankit",1) ---^^^ spark-sql> > Support data insertion in a different order if you wish or even omit some > columns in spark sql also like postgresql > --- > > Key: SPARK-29591 > URL: https://issues.apache.org/jira/browse/SPARK-29591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Major > > Support data insertion in a different order if you wish or even omit some > columns in spark sql also like postgre sql. 
> *In postgre sql* > {code} > CREATE TABLE weather ( > city varchar(80), > temp_lo int, – low temperature > temp_hi int, – high temperature > prcp real, – precipitation > date date > ); > {code} > *You can list the columns in a different order if you wish or even omit some > columns,* > {code} > INSERT INTO weather (date, city, temp_hi, temp_lo) > VALUES ('1994-11-29', 'Hayward', 54, 37) > {code}; > *Spark SQL* > But in spark sql is not allowing to insert data in different order or omit > any column.Better to support this as it can save time if we can not predict > any specific column value or if some value is fixed always. > {code} > create table jobit(id int,name string); > > insert into jobit values(1,"Ankit"); > Time taken: 0.548 seconds > spark-sql> *insert into jobit (id) values(1);* > *Error in query:* > mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', > 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) > == SQL == > insert into jobit (id) values(1) > ---^^^ > spark-sql> *insert into jobit (name,id) values("Ankit",1);* > *Error in query:* > mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', > 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) > == SQL == > insert into jobit (name,id) values("Ankit",1) > ---^^^ > spark-sql> >
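Until an INSERT INTO ... (column list) syntax like the one requested above is supported by the parser, one DataFrame-level workaround is to align the incoming data to the target table's schema yourself, filling the omitted columns with NULL, before a positional insert. A rough sketch reusing the jobit table from the description (the USING parquet clause and the helper logic are assumptions for illustration):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS jobit (id INT, name STRING) USING parquet")

# Incoming data that only supplies some of the columns, in its own order.
incoming = spark.createDataFrame([("Ankit",)], ["name"])

# Build a projection in the table's column order, padding missing columns with NULL.
aligned = incoming.select(
    *[
        incoming[f.name].cast(f.dataType) if f.name in incoming.columns
        else F.lit(None).cast(f.dataType).alias(f.name)
        for f in spark.table("jobit").schema.fields
    ]
)
aligned.write.insertInto("jobit")  # positional insert, like INSERT INTO ... VALUES
spark.table("jobit").show()
{code}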
[jira] [Updated] (SPARK-29591) Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgresql
[ https://issues.apache.org/jira/browse/SPARK-29591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29591: - Description: Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgre sql. *In postgre sql* {code:java} CREATE TABLE weather ( city varchar(80), temp_lo int, – low temperature temp_hi int, – high temperature prcp real, – precipitation date date ); {code} *You can list the columns in a different order if you wish or even omit some columns,* {code:java} INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37); {code} *Spark SQL* But in spark sql is not allowing to insert data in different order or omit any column.Better to support this as it can save time if we can not predict any specific column value or if some value is fixed always. {code:java} create table jobit(id int,name string); > insert into jobit values(1,"Ankit"); Time taken: 0.548 seconds spark-sql> *insert into jobit (id) values(1);* *Error in query:* mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (id) values(1) ---^^^ spark-sql> *insert into jobit (name,id) values("Ankit",1);* *Error in query:* mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (name,id) values("Ankit",1) ---^^^ spark-sql> {code} pos : point-of-sale was: Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgre sql. *In postgre sql* {code} CREATE TABLE weather ( city varchar(80), temp_lo int, – low temperature temp_hi int, – high temperature prcp real, – precipitation date date ); {code} *You can list the columns in a different order if you wish or even omit some columns,* {code} INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37) {code}; *Spark SQL* But in spark sql is not allowing to insert data in different order or omit any column.Better to support this as it can save time if we can not predict any specific column value or if some value is fixed always. {code} create table jobit(id int,name string); > insert into jobit values(1,"Ankit"); Time taken: 0.548 seconds spark-sql> *insert into jobit (id) values(1);* *Error in query:* mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (id) values(1) ---^^^ spark-sql> *insert into jobit (name,id) values("Ankit",1);* *Error in query:* mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) == SQL == insert into jobit (name,id) values("Ankit",1) ---^^^ spark-sql> {code} > Support data insertion in a different order if you wish or even omit some > columns in spark sql also like postgresql > --- > > Key: SPARK-29591 > URL: https://issues.apache.org/jira/browse/SPARK-29591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Major > > Support data insertion in a different order if you wish or even omit some > columns in spark sql also like postgre sql. 
> *In postgre sql* > {code:java} > CREATE TABLE weather ( > city varchar(80), > temp_lo int, – low temperature > temp_hi int, – high temperature > prcp real, – precipitation > date date > ); > {code} > *You can list the columns in a different order if you wish or even omit some > columns,* > {code:java} > INSERT INTO weather (date, city, temp_hi, temp_lo) > VALUES ('1994-11-29', 'Hayward', 54, 37); > {code} > *Spark SQL* > But in spark sql is not allowing to insert data in different order or omit > any column.Better to support this as it can save time if we can not predict > any specific column value or if some value is fixed always. > {code:java} > create table jobit(id int,name string); > > insert into jobit values(1,"Ankit"); > Time taken: 0.548 seconds > spark-sql> *insert into jobit (id) values(1);* > *Error in query:* > mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', > 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) > == SQL == > insert into jobit (id) values(1) > ---^^^ > spark-sql> *insert into jobit (name,id) values("Ankit",1);* > *Error in query:* > mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', > 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19) > == SQL == >
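A minimal PySpark sketch of the usual workaround for SPARK-29591 while the column-list INSERT syntax is unsupported: project every column explicitly in a SELECT and supply NULL for the ones you want to omit. The table definition below (USING parquet) is only for illustration; the report used a plain Hive-style CREATE TABLE.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insert-column-order-workaround").getOrCreate()

# Illustrative table; the report's table was `create table jobit(id int, name string)`.
spark.sql("CREATE TABLE IF NOT EXISTS jobit (id INT, name STRING) USING parquet")

# Spark SQL (as of 2.4.x) rejects `INSERT INTO jobit (id) VALUES (1)`, so list
# every column in a SELECT and fill the omitted ones with NULL.
spark.sql("INSERT INTO jobit SELECT 1 AS id, CAST(NULL AS STRING) AS name")

# A reordered insert is handled the same way: project back into table order.
spark.sql("INSERT INTO jobit SELECT 1 AS id, 'Ankit' AS name")

spark.sql("SELECT * FROM jobit").show()
{code}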
[jira] [Assigned] (SPARK-30109) PCA use BLAS.gemv with sparse vector
[ https://issues.apache.org/jira/browse/SPARK-30109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30109: Assignee: zhengruifeng > PCA use BLAS.gemv with sparse vector > > > Key: SPARK-30109 > URL: https://issues.apache.org/jira/browse/SPARK-30109 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > When PCA was first implemented in > [SPARK-5521|https://issues.apache.org/jira/browse/SPARK-5521], > Matrix.multiply (BLAS.gemv internally) did not support sparse vectors, so the > implementation worked around this with a more complex matrix multiplication. > Since [SPARK-7681|https://issues.apache.org/jira/browse/SPARK-7681], > BLAS.gemv has supported sparse vectors, so we can now use Matrix.multiply directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30109) PCA use BLAS.gemv with sparse vector
[ https://issues.apache.org/jira/browse/SPARK-30109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30109. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26745 [https://github.com/apache/spark/pull/26745] > PCA use BLAS.gemv with sparse vector > > > Key: SPARK-30109 > URL: https://issues.apache.org/jira/browse/SPARK-30109 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > When PCA was first implemented in > [SPARK-5521|https://issues.apache.org/jira/browse/SPARK-5521], > Matrix.multiply (BLAS.gemv internally) did not support sparse vectors, so the > implementation worked around this with a more complex matrix multiplication. > Since [SPARK-7681|https://issues.apache.org/jira/browse/SPARK-7681], > BLAS.gemv has supported sparse vectors, so we can now use Matrix.multiply directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
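The change in SPARK-30109 is internal to the Scala side (Matrix.multiply / BLAS.gemv), so nothing changes at the API level; for context, a small PySpark sketch of PCA running over sparse input, with made-up toy vectors:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca-sparse-input-sketch").getOrCreate()

# Toy data mixing sparse and dense feature vectors; after SPARK-30109 the sparse
# ones go straight through Matrix.multiply (BLAS.gemv) during transform.
data = [
    (Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
model.transform(df).select("pca_features").show(truncate=False)
{code}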
[jira] [Created] (SPARK-30117) Improve limit only query on Hive table and view
Lantao Jin created SPARK-30117: -- Summary: Improve limit only query on Hive table and view Key: SPARK-30117 URL: https://issues.apache.org/jira/browse/SPARK-30117 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30116) Improve limit only query on views
Lantao Jin created SPARK-30116: -- Summary: Improve limit only query on views Key: SPARK-30116 URL: https://issues.apache.org/jira/browse/SPARK-30116 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30115) Improve limit only query on datasource table
Lantao Jin created SPARK-30115: -- Summary: Improve limit only query on datasource table Key: SPARK-30115 URL: https://issues.apache.org/jira/browse/SPARK-30115 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30114) Improve LIMIT only query by partial listing files
[ https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-30114: --- Summary: Improve LIMIT only query by partial listing files (was: Optimize LIMIT only query by partial listing files) > Improve LIMIT only query by partial listing files > - > > Key: SPARK-30114 > URL: https://issues.apache.org/jira/browse/SPARK-30114 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT > operation like > 1) SELECT * FROM TABLE_A LIMIT N > 2) SELECT colA FROM TABLE_A LIMIT N > 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N > If the TABLE_A is a large table (a RDD with thousands and thousands of > partitions), the execution time would be very big since it has to list all > files to build a RDD before execution. But almost time, the N is just like > 10, 100, 1000, not very big. We don't need to scan all files. This > optimization will create a *SinglePartitionReadRDD* to address it. > In our production result, this optimization benefits a lot. The duration time > of simple query with LIMIT could reduce 5~10 times. For example, before this > optimization, a query on a table which has about one hundred thousands files > would run over 30 seconds, after applying this optimization, the time > decreased to 5 seconds. > Should support both Spark DataSource Table and Hive Table which can be > converted to DataSource table. > Should support bucket table, partition table, normal table. > Should support different file formats like parquet, orc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29600) array_contains built in function is not backward compatible in 3.0
[ https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29600: - Description: {code} SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); {code} throws Exception in 3.0 where as in 2.3.2 is working fine. Spark 3.0 output: {code} 0: jdbc:hive2://10.18.19.208:23040/default> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); Error: org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array, decimal(1,1)].; line 1 pos 7; 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), cast(0.033 as decimal(13,3))), 0.2), None)] {code} Spark 2.3.2 output {code} 0: jdbc:hive2://10.18.18.214:23040/default> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), CAST(0.2 AS DECIMAL(13,3)))| |true| 1 row selected (0.18 seconds) {code} was: SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws Exception in 3.0 where as in 2.3.2 is working fine. Spark 3.0 output: 0: jdbc:hive2://10.18.19.208:23040/default> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); Error: org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array, decimal(1,1)].; line 1 pos 7; 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), cast(0.033 as decimal(13,3))), 0.2), None)] Spark 2.3.2 output 0: jdbc:hive2://10.18.18.214:23040/default> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), CAST(0.2 AS DECIMAL(13,3)))| |true| 1 row selected (0.18 seconds) > array_contains built in function is not backward compatible in 3.0 > -- > > Key: SPARK-29600 > URL: https://issues.apache.org/jira/browse/SPARK-29600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > {code} > SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > {code} > throws Exception in 3.0 where as in 2.3.2 is working fine. 
> Spark 3.0 output: > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), > CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS > DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS > DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function > array_contains should have been array followed by a value with same element > type, but it's [array, decimal(1,1)].; line 1 pos 7; > 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), > cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as > decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), > cast(0.033 as decimal(13,3))), 0.2), None)] > {code} > Spark 2.3.2 output > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), > CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS > DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), > CAST(0.2
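A small sketch of the workaround that appears to sidestep the 3.0 analysis error above: cast the search value to the array's coerced element type (decimal(13,3) in this example) so both sides match.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("array-contains-cast-sketch").getOrCreate()

# The literal array is coerced to array<decimal(13,3)>, so casting the needle to
# the same type satisfies the "same element type" check in the 3.0 analyzer.
spark.sql("""
    SELECT array_contains(
             array(0, 0.1, 0.2, 0.3, 0.5, 0.02, 0.033),
             CAST(0.2 AS DECIMAL(13,3))) AS contains_point_two
""").show()
{code}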
[jira] [Updated] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator
[ https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29667: - Description: Ran into error on this sql Mismatched columns: {code} [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] the sql and clause AND a.id in (select id from db1.table1 where col1 = 1 group by id) {code} Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql ran just fine. Can the sql engine cast implicitly in this case? was: Ran into error on this sql Mismatched columns: [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] the sql and clause AND a.id in (select id from db1.table1 where col1 = 1 group by id) Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just fine. Can the sql engine cast implicitly in this case? > implicitly convert mismatched datatypes on right side of "IN" operator > -- > > Key: SPARK-29667 > URL: https://issues.apache.org/jira/browse/SPARK-29667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jessie Lin >Priority: Minor > > Ran into error on this sql > Mismatched columns: > {code} > [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] > the sql and clause > AND a.id in (select id from db1.table1 where col1 = 1 group by id) > {code} > Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql > ran just fine. Can the sql engine cast implicitly in this case? > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30114) Optimize LIMIT only query by partial listing files
[ https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-30114: --- Description: We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT operation like 1) SELECT * FROM TABLE_A LIMIT N 2) SELECT colA FROM TABLE_A LIMIT N 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N If the TABLE_A is a large table (a RDD with thousands and thousands of partitions), the execution time would be very big since it has to list all files to build a RDD before execution. But almost time, the N is just like 10, 100, 1000, not very big. We don't need to scan all files. This optimization will create a *SinglePartitionReadRDD* to address it. In our production result, this optimization benefits a lot. The duration time of simple query with LIMIT could reduce 5~10 times. For example, before this optimization, a query on a table which has about one hundred thousands files would run over 30 seconds, after applying this optimization, the time decreased to 5 seconds. Should support both Spark DataSource Table and Hive Table which can be converted to DataSource table. Should support bucket table, partition table, normal table. Should support different file formats like parquet, orc. was: We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT operation. When we execute some queries like 1) SELECT * FROM TABLE_A LIMIT N 2) SELECT colA FROM TABLE_A LIMIT N 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N If the TABLE_A is a large table (a RDD with thousands and thousands of partitions), the execution time would be very big since it has to list all files to build a RDD before execution. But almost time, the N is just like 10, 100, 1000, not very big. We don't need to scan all files. This optimization will create a *SinglePartitionReadRDD* to address it. In our production result, this optimization benefits a lot. The duration time of simple query with LIMIT could reduce 5~10 times. For example, before this optimization, a query on a table which has about one hundred thousands files would run over 30 seconds, after applying this optimization, the time decreased to 5 seconds. Should support both Spark DataSource Table and Hive Table which can be converted to DataSource table. Should support bucket table, partition table, normal table. Should support different file formats like parquet, orc. > Optimize LIMIT only query by partial listing files > -- > > Key: SPARK-30114 > URL: https://issues.apache.org/jira/browse/SPARK-30114 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT > operation like > 1) SELECT * FROM TABLE_A LIMIT N > 2) SELECT colA FROM TABLE_A LIMIT N > 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N > If the TABLE_A is a large table (a RDD with thousands and thousands of > partitions), the execution time would be very big since it has to list all > files to build a RDD before execution. But almost time, the N is just like > 10, 100, 1000, not very big. We don't need to scan all files. This > optimization will create a *SinglePartitionReadRDD* to address it. > In our production result, this optimization benefits a lot. The duration time > of simple query with LIMIT could reduce 5~10 times. 
For example, before this > optimization, a query on a table which has about one hundred thousands files > would run over 30 seconds, after applying this optimization, the time > decreased to 5 seconds. > Should support both Spark DataSource Table and Hive Table which can be > converted to DataSource table. > Should support bucket table, partition table, normal table. > Should support different file formats like parquet, orc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
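The SinglePartitionReadRDD mentioned above lives in Spark's Scala internals; the idea itself is simply to stop listing once LIMIT N can be satisfied. A rough, hypothetical Python sketch of that idea using plain files (not Spark's actual FileIndex code):

{code:python}
import os

def take_n_rows(table_dir: str, n: int):
    """Hypothetical sketch of the partial-listing idea: stop as soon as n rows
    are collected instead of enumerating every file in a huge directory."""
    rows = []
    # os.scandir yields entries lazily, so listing stops once enough rows are read.
    with os.scandir(table_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            with open(entry.path) as f:
                for line in f:
                    rows.append(line.rstrip("\n"))
                    if len(rows) >= n:
                        return rows
    return rows
{code}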
[jira] [Updated] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator
[ https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29667: - Description: Ran into error on this sql Mismatched columns: {code} [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] {code} the sql and clause {code} AND a.id in (select id from db1.table1 where col1 = 1 group by id) {code} Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql ran just fine. Can the sql engine cast implicitly in this case? was: Ran into error on this sql Mismatched columns: {code} [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] the sql and clause AND a.id in (select id from db1.table1 where col1 = 1 group by id) {code} Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql ran just fine. Can the sql engine cast implicitly in this case? > implicitly convert mismatched datatypes on right side of "IN" operator > -- > > Key: SPARK-29667 > URL: https://issues.apache.org/jira/browse/SPARK-29667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jessie Lin >Priority: Minor > > Ran into error on this sql > Mismatched columns: > {code} > [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] > {code} > the sql and clause > {code} > AND a.id in (select id from db1.table1 where col1 = 1 group by id) > {code} > Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql > ran just fine. Can the sql engine cast implicitly in this case? > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
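For reference, a sketch of the explicit-cast workaround the reporter describes, with a hypothetical outer table and alias standing in for the real ones:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-subquery-decimal-cast-sketch").getOrCreate()

# Widen the subquery side to decimal(28,0) so it matches a.id exactly.
query = """
    SELECT a.*
    FROM fact_a a
    WHERE a.id IN (SELECT CAST(id AS DECIMAL(28,0))
                   FROM db1.table1
                   WHERE col1 = 1
                   GROUP BY id)
"""
# df = spark.sql(query)  # assumes fact_a and db1.table1 exist in the catalog
{code}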
[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987418#comment-16987418 ] Hyukjin Kwon commented on SPARK-30063: -- Closing this assuming the issue was resolved. > Failure when returning a value from multiple Pandas UDFs > > > Key: SPARK-30063 > URL: https://issues.apache.org/jira/browse/SPARK-30063 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3, 2.4.4 > Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on > both 2.4.3 and 2.4.4 >Reporter: Tim Kellogg >Priority: Major > Attachments: spark-debug.txt, variety-of-schemas.ipynb > > > I have 20 Pandas UDFs that I'm trying to evaluate all at the same time. > * PandasUDFType.GROUPED_AGG > * 3 columns in the input data frame being serialized over Arrow to Python > worker. See below for clarification. > * All functions take 2 parameters, some combination of the 3 received as > Arrow input. > * Varying return types, see details below. > _*I get an IllegalArgumentException on the Scala side of the worker when > deserializing from Python.*_ > h2. Exception & Stack Trace > {code:java} > 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) > java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, > localhost, executor driver): java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at >
[jira] [Resolved] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30063. -- Resolution: Not A Problem > Failure when returning a value from multiple Pandas UDFs > > > Key: SPARK-30063 > URL: https://issues.apache.org/jira/browse/SPARK-30063 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3, 2.4.4 > Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on > both 2.4.3 and 2.4.4 >Reporter: Tim Kellogg >Priority: Major > Attachments: spark-debug.txt, variety-of-schemas.ipynb > > > I have 20 Pandas UDFs that I'm trying to evaluate all at the same time. > * PandasUDFType.GROUPED_AGG > * 3 columns in the input data frame being serialized over Arrow to Python > worker. See below for clarification. > * All functions take 2 parameters, some combination of the 3 received as > Arrow input. > * Varying return types, see details below. > _*I get an IllegalArgumentException on the Scala side of the worker when > deserializing from Python.*_ > h2. Exception & Stack Trace > {code:java} > 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) > java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 19/11/27 11:38:36 WARN 
TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, > localhost, executor driver): java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at >
[jira] [Updated] (SPARK-30091) Document mergeSchema option directly in the Python Parquet APIs
[ https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-30091: - Summary: Document mergeSchema option directly in the Python Parquet APIs (was: Document mergeSchema option directly in the Python API) > Document mergeSchema option directly in the Python Parquet APIs > --- > > Key: SPARK-30091 > URL: https://issues.apache.org/jira/browse/SPARK-30091 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Priority: Minor > > [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet] > Strangely, the `mergeSchema` option is mentioned in the docstring but not > implemented in the method signature. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
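Whatever happens to the method signature, the option is always reachable through the generic reader option in PySpark, which is what the docstring describes; a short sketch with a placeholder path:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-schema-option-sketch").getOrCreate()

# The generic .option() form works regardless of whether parquet() ever exposes
# a mergeSchema keyword argument in its signature.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("/data/events/"))   # placeholder path
df.printSchema()
{code}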
[jira] [Created] (SPARK-30114) Optimize LIMIT only query by partial listing files
Lantao Jin created SPARK-30114: -- Summary: Optimize LIMIT only query by partial listing files Key: SPARK-30114 URL: https://issues.apache.org/jira/browse/SPARK-30114 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT operation. When we execute some queries like 1) SELECT * FROM TABLE_A LIMIT N 2) SELECT colA FROM TABLE_A LIMIT N 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N If the TABLE_A is a large table (a RDD with thousands and thousands of partitions), the execution time would be very big since it has to list all files to build a RDD before execution. But almost time, the N is just like 10, 100, 1000, not very big. We don't need to scan all files. This optimization will create a *SinglePartitionReadRDD* to address it. In our production result, this optimization benefits a lot. The duration time of simple query with LIMIT could reduce 5~10 times. For example, before this optimization, a query on a table which has about one hundred thousands files would run over 30 seconds, after applying this optimization, the time decreased to 5 seconds. Should support both Spark DataSource Table and Hive Table which can be converted to DataSource table. Should support bucket table, partition table, normal table. Should support different file formats like parquet, orc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30113) Document mergeSchema option in Python Orc APIs
Nicholas Chammas created SPARK-30113: Summary: Document mergeSchema option in Python Orc APIs Key: SPARK-30113 URL: https://issues.apache.org/jira/browse/SPARK-30113 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.4.4, 3.0.0 Reporter: Nicholas Chammas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987417#comment-16987417 ] Hyukjin Kwon commented on SPARK-30063: -- {quote} Set the PYTHONHASHSEED environment variable such that the whole cluster has the same value. I suggest that the original value is still random, for the same reason why it's random in the first place. This would alleviate a lot of crazy PySpark bugs that appear like concurrency bugs. {quote} {{PYTHONHASHSEED}} I think we already do it. {quote} Make Spark friendlier to different Arrow versions, or force override the PyArrow version so that it's harder to get it wrong. {quote} Yeah, we're trying to put a lot of efforts. https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x is one of the efforts we put. This incompatiblity was from Arrow for clarification. {quote} Consider upgrading Arrow. Newer versions are even friendlier to use {quote} We already did SPARK-29376. > Failure when returning a value from multiple Pandas UDFs > > > Key: SPARK-30063 > URL: https://issues.apache.org/jira/browse/SPARK-30063 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3, 2.4.4 > Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on > both 2.4.3 and 2.4.4 >Reporter: Tim Kellogg >Priority: Major > Attachments: spark-debug.txt, variety-of-schemas.ipynb > > > I have 20 Pandas UDFs that I'm trying to evaluate all at the same time. > * PandasUDFType.GROUPED_AGG > * 3 columns in the input data frame being serialized over Arrow to Python > worker. See below for clarification. > * All functions take 2 parameters, some combination of the 3 received as > Arrow input. > * Varying return types, see details below. > _*I get an IllegalArgumentException on the Scala side of the worker when > deserializing from Python.*_ > h2. 
Exception & Stack Trace > {code:java} > 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) > java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, > localhost, executor driver): java.lang.IllegalArgumentException > at
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987413#comment-16987413 ] Hyukjin Kwon commented on SPARK-29988: -- Sure, take your time. just as a reminder, Hive version can be configured by this env AMPLAB_JENKINS_BUILD_HIVE_PROFILE. (see also https://github.com/apache/spark/commit/4a73bed3180aeb79c92bb19aea2ac5a97899731a# ) > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > Now now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins > manually. (This should be added to SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is for preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987407#comment-16987407 ] huangtianhua commented on SPARK-29106: -- [~shaneknapp], ok, thanks, sorry and I found the mvn command of arm test still is "./build/mvn test -Paarch64 -Phadoop-2.7 {color:#de350b}-Phive1.2{color} -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos" , it should be {color:#de350b}-Phive-1.2{color}? > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, > SparkR-and-pyspark36-testing.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have took test it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan test on other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the infos to shane knapp, thanks > shane to add the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > we can see there is no arm64 supporting. So we build an arm64 supporting > release of leveldbjni see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modified the spark pom.xml directly with something like > 'property'/'profile' to choose correct jar package on arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, the packages > depend on leveldbjni-all-1.8 too, unless hadoop release with new arm > supporting leveldbjni jar. Now we download the leveldbjni-al-1.8 of > openlabtesting and 'mvn install' to use it when arm testing for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29903) Add documentation for recursiveFileLookup
[ https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29903. -- Fix Version/s: 3.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/26718 > Add documentation for recursiveFileLookup > - > > Key: SPARK-29903 > URL: https://issues.apache.org/jira/browse/SPARK-29903 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 3.0.0 > > > SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively > loading data from a source directory. There is currently no documentation for > this option. > We should document this both for the DataFrame API as well as for SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29903) Add documentation for recursiveFileLookup
[ https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29903: Assignee: Nicholas Chammas > Add documentation for recursiveFileLookup > - > > Key: SPARK-29903 > URL: https://issues.apache.org/jira/browse/SPARK-29903 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > > SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively > loading data from a source directory. There is currently no documentation for > this option. > We should document this both for the DataFrame API as well as for SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30112) Insert overwrite should be able to overwrite to same table under dynamic partition overwrite
L. C. Hsieh created SPARK-30112: --- Summary: Insert overwrite should be able to overwrite to same table under dynamic partition overwrite Key: SPARK-30112 URL: https://issues.apache.org/jira/browse/SPARK-30112 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Currently, INSERT OVERWRITE cannot write into the same table it reads from, even when dynamic partition overwrite is used. But under dynamic partition overwrite we do not delete partition directories ahead of time; we write to staging directories and then move the data into the final partition directories. So insert overwrite into the same table should be allowed under dynamic partition overwrite. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
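For readers unfamiliar with the mode being discussed: dynamic partition overwrite is turned on with spark.sql.sources.partitionOverwriteMode=dynamic, and SPARK-30112 asks that the overwritten table may also appear in the source query. A hedged sketch of the intended usage, with made-up table and column names; current Spark still refuses to overwrite a table that is also being read.

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-partition-overwrite-sketch")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id INT, value STRING, dt STRING)
    USING parquet
    PARTITIONED BY (dt)
""")

# Under dynamic overwrite only the partitions produced by the query are replaced,
# and data is first written to staging directories. SPARK-30112 proposes that a
# self-referencing query like this be allowed; today it is rejected.
spark.sql("""
    INSERT OVERWRITE TABLE events
    SELECT id, upper(value) AS value, dt
    FROM events
    WHERE dt = '2019-12-01'
""")
{code}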
[jira] [Commented] (SPARK-30111) spark R dockerfile fails to build
[ https://issues.apache.org/jira/browse/SPARK-30111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987372#comment-16987372 ] Ilan Filonenko commented on SPARK-30111: The error seems to be from: {code:yaml} Step 6/12 : RUN apt install -y python python-pip && apt install -y python3 python3-pip && rm -r /usr/lib/python*/ensurepip && pip install --upgrade pip setuptools && rm -r /root/.cache && rm -rf /var/cache/apt/* ---> Running in f3d520c3435b {code} so this relates to the PySpark Dockerfile. The underlying error is *404 Not Found [IP: 151.101.188.204 80]* > spark R dockerfile fails to build > - > > Key: SPARK-30111 > URL: https://issues.apache.org/jira/browse/SPARK-30111 > Project: Spark > Issue Type: Bug > Components: Build, jenkins, Kubernetes >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > all recent k8s builds have been failing when trying to build the sparkR > dockerfile: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/] > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/] > [~ifilonenko] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30051) Clean up hadoop-3.2 transitive dependencies
[ https://issues.apache.org/jira/browse/SPARK-30051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30051. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26742 [https://github.com/apache/spark/pull/26742] > Clean up hadoop-3.2 transitive dependencies > > > Key: SPARK-30051 > URL: https://issues.apache.org/jira/browse/SPARK-30051 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30051) Clean up hadoop-3.2 transitive dependencies
[ https://issues.apache.org/jira/browse/SPARK-30051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30051: - Assignee: Dongjoon Hyun > Clean up hadoop-3.2 transitive dependencies > > > Key: SPARK-30051 > URL: https://issues.apache.org/jira/browse/SPARK-30051 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30060) Uniform naming for Spark Metrics configuration parameters
[ https://issues.apache.org/jira/browse/SPARK-30060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30060. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26692 [https://github.com/apache/spark/pull/26692] > Uniform naming for Spark Metrics configuration parameters > - > > Key: SPARK-30060 > URL: https://issues.apache.org/jira/browse/SPARK-30060 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > Currently Spark has a few parameters to enable/disable metrics reporting. > Their naming pattern is not uniform and this can create confusion. Currently > we have: > {{spark.metrics.static.sources.enabled}} > {{spark.app.status.metrics.enabled}} > {{spark.sql.streaming.metricsEnabled}} > This proposes to introduce a naming convention for such parameters: > spark.metrics.sourceNameCamelCase.enabled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
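Purely as a hypothetical illustration of the proposed spark.metrics.<sourceName>.enabled convention (the final key spellings are whatever the merged change settles on), the flags above might be set like this:

{code:python}
from pyspark.sql import SparkSession

# Hypothetical key names following the proposed naming pattern; check the merged
# change and the docs for the final spelling before relying on them.
spark = (SparkSession.builder
         .appName("metrics-naming-sketch")
         .config("spark.metrics.staticSources.enabled", "true")
         .config("spark.metrics.appStatusSource.enabled", "false")
         .getOrCreate())
{code}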
[jira] [Assigned] (SPARK-30060) Uniform naming for Spark Metrics configuration parameters
[ https://issues.apache.org/jira/browse/SPARK-30060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30060: - Assignee: Luca Canali > Uniform naming for Spark Metrics configuration parameters > - > > Key: SPARK-30060 > URL: https://issues.apache.org/jira/browse/SPARK-30060 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > > Currently Spark has a few parameters to enable/disable metrics reporting. > Their naming pattern is not uniform and this can create confusion. Currently > we have: > {{spark.metrics.static.sources.enabled}} > {{spark.app.status.metrics.enabled}} > {{spark.sql.streaming.metricsEnabled}} > This proposes to introduce a naming convention for such parameters: > spark.metrics.sourceNameCamelCase.enabled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30106) DynamicPartitionPruningSuite#"no predicate on the dimension table" is not tested
[ https://issues.apache.org/jira/browse/SPARK-30106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30106: - Assignee: deshanxiao > DynamicPartitionPruningSuite#"no predicate on the dimension table" is not > tested > --- > > Key: SPARK-30106 > URL: https://issues.apache.org/jira/browse/SPARK-30106 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: deshanxiao >Assignee: deshanxiao >Priority: Minor > > The test "no predicate on the dimension table" does not actually exercise pruning: its join has no > partition key. We can change the SQL so that it does. > {code:java} > Given("no predicate on the dimension table") > withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") { > val df = sql( > """ > |SELECT * FROM fact_sk f > |JOIN dim_store s > |ON f.date_id = s.store_id > """.stripMargin) > checkPartitionPruningPredicate(df, false, false) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30106) DynamicPartitionPruningSuite#"no predicate on the dimension table" is not tested
[ https://issues.apache.org/jira/browse/SPARK-30106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30106. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26744 [https://github.com/apache/spark/pull/26744] > DynamicPartitionPruningSuite#"no predicate on the dimension table" is not > tested > --- > > Key: SPARK-30106 > URL: https://issues.apache.org/jira/browse/SPARK-30106 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: deshanxiao >Assignee: deshanxiao >Priority: Minor > Fix For: 3.0.0 > > > The test "no predicate on the dimension table" does not actually exercise pruning: its join has no > partition key. We can change the SQL so that it does. > {code:java} > Given("no predicate on the dimension table") > withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") { > val df = sql( > """ > |SELECT * FROM fact_sk f > |JOIN dim_store s > |ON f.date_id = s.store_id > """.stripMargin) > checkPartitionPruningPredicate(df, false, false) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987252#comment-16987252 ] Bryan Cutler commented on SPARK-29748: -- [~zero323] I made some updates to the PR with remove the _LegacyRow and option for OrderedDict, and also like you suggested for Python 3.6 will automatically fall back to legacy behavior of sorting and print a warning to the user. > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
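A small PySpark sketch of the behaviour under discussion: with the legacy sorting, the kwargs form reorders fields alphabetically, while the Row-as-factory form keeps the declared order (and is the recommended spelling for Python < 3.6).

{code:python}
from pyspark.sql import Row

# kwargs form: under the legacy behaviour the fields come back sorted, so this
# prints Row(age=30, name='Alice') rather than the order they were entered in.
r1 = Row(name="Alice", age=30)
print(r1)

# Factory form: field order is exactly as declared, on any Python version.
Person = Row("name", "age")
r2 = Person("Alice", 30)
print(r2)  # Row(name='Alice', age=30)
{code}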
[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987230#comment-16987230 ] Tim Kellogg commented on SPARK-30063: - Improvement suggestions * Set the PYTHONHASHSEED environment variable such that the whole cluster has the same value. I suggest that the original value is still random, for the same reason why it's random in the first place. This would alleviate a lot of crazy PySpark bugs that appear like concurrency bugs. * Make Spark friendlier to different Arrow versions, or force override the PyArrow version so that it's harder to get it wrong. * Consider upgrading Arrow. Newer versions are even friendlier to use. I'd love to help out, even contribute if possible. > Failure when returning a value from multiple Pandas UDFs > > > Key: SPARK-30063 > URL: https://issues.apache.org/jira/browse/SPARK-30063 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3, 2.4.4 > Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on > both 2.4.3 and 2.4.4 >Reporter: Tim Kellogg >Priority: Major > Attachments: spark-debug.txt, variety-of-schemas.ipynb > > > I have 20 Pandas UDFs that I'm trying to evaluate all at the same time. > * PandasUDFType.GROUPED_AGG > * 3 columns in the input data frame being serialized over Arrow to Python > worker. See below for clarification. > * All functions take 2 parameters, some combination of the 3 received as > Arrow input. > * Varying return types, see details below. > _*I get an IllegalArgumentException on the Scala side of the worker when > deserializing from Python.*_ > h2. Exception & Stack Trace > {code:java} > 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) > java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, > localhost, executor driver): java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at >
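For context, a grouped-aggregate pandas UDF of the kind described in this report looks roughly like the sketch below (a minimal example with made-up column names, not the reporter's 20-UDF job; it assumes pyarrow is installed):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("grouped-agg-sketch").getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# A GROUPED_AGG UDF receives a pandas Series per group and returns a scalar.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_v(v):
    return v.mean()

df.groupBy("id").agg(mean_v(df["v"]).alias("mean_v")).show()
{code}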
[jira] [Comment Edited] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator
[ https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987120#comment-16987120 ] Aman Omer edited comment on SPARK-29667 at 12/3/19 7:03 PM: Actually there is another JIRA for decimal type mismatching due to precision https://issues.apache.org/jira/browse/SPARK-29600 was (Author: aman_omer): Actually there is also JIRA for decimal type mismatching due to precision https://issues.apache.org/jira/browse/SPARK-29600 > implicitly convert mismatched datatypes on right side of "IN" operator > -- > > Key: SPARK-29667 > URL: https://issues.apache.org/jira/browse/SPARK-29667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jessie Lin >Priority: Minor > > Ran into error on this sql > Mismatched columns: > [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] > the sql and clause > AND a.id in (select id from db1.table1 where col1 = 1 group by id) > Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just > fine. Can the sql engine cast implicitly in this case? > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987154#comment-16987154 ] Bryan Cutler commented on SPARK-30063: -- I haven't looked at your bug report in detail but you are right that there was a change in the Arrow message format for 0.15.0+. To maintain compatibility with current versions of Spark an environment variable can be set by adding it to conf/spark-env.sh : {{ARROW_PRE_0_15_IPC_FORMAT=1}} It's described a little more in the docs here [https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x] Could you try this out and see if it solves your issue? > Failure when returning a value from multiple Pandas UDFs > > > Key: SPARK-30063 > URL: https://issues.apache.org/jira/browse/SPARK-30063 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3, 2.4.4 > Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on > both 2.4.3 and 2.4.4 >Reporter: Tim Kellogg >Priority: Major > Attachments: spark-debug.txt, variety-of-schemas.ipynb > > > I have 20 Pandas UDFs that I'm trying to evaluate all at the same time. > * PandasUDFType.GROUPED_AGG > * 3 columns in the input data frame being serialized over Arrow to Python > worker. See below for clarification. > * All functions take 2 parameters, some combination of the 3 received as > Arrow input. > * Varying return types, see details below. > _*I get an IllegalArgumentException on the Scala side of the worker when > deserializing from Python.*_ > h2. Exception & Stack Trace > {code:java} > 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) > java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, > localhost, executor driver): java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at >
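A hedged sketch of applying that setting from PySpark itself, in case editing conf/spark-env.sh is inconvenient: it assumes the variable must be visible both to the driver-side Python process and to the executors, and uses the standard spark.executorEnv.* mechanism for the latter. The documented route remains the spark-env.sh line quoted above.
{code:python}
import os
from pyspark.sql import SparkSession

# Make the legacy Arrow IPC format flag visible to this (driver-side) process...
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# ...and to the executors' Python workers.
spark = (SparkSession.builder
         .appName("arrow-legacy-ipc")
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
{code}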
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987148#comment-16987148 ] Shane Knapp commented on SPARK-29988: - i probably won't get around to this until next week... > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins > manually. (This should be added to SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is for preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30111) spark R dockerfile fails to build
Shane Knapp created SPARK-30111: --- Summary: spark R dockerfile fails to build Key: SPARK-30111 URL: https://issues.apache.org/jira/browse/SPARK-30111 Project: Spark Issue Type: Bug Components: Build, jenkins, Kubernetes Affects Versions: 3.0.0 Reporter: Shane Knapp all recent k8s builds have been failing when trying to build the sparkR dockerfile: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/] [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/] [~ifilonenko] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987143#comment-16987143 ] Shane Knapp commented on SPARK-29106: - got it, thanks! > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, > SparkR-and-pyspark36-testing.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have took test it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan test on other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the infos to shane knapp, thanks > shane to add the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > we can see there is no arm64 supporting. So we build an arm64 supporting > release of leveldbjni see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modified the spark pom.xml directly with something like > 'property'/'profile' to choose correct jar package on arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, the packages > depend on leveldbjni-all-1.8 too, unless hadoop release with new arm > supporting leveldbjni jar. Now we download the leveldbjni-al-1.8 of > openlabtesting and 'mvn install' to use it when arm testing for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator
[ https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987120#comment-16987120 ] Aman Omer commented on SPARK-29667: --- Actually there is also JIRA for decimal type mismatching due to precision https://issues.apache.org/jira/browse/SPARK-29600 > implicitly convert mismatched datatypes on right side of "IN" operator > -- > > Key: SPARK-29667 > URL: https://issues.apache.org/jira/browse/SPARK-29667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jessie Lin >Priority: Minor > > Ran into error on this sql > Mismatched columns: > [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] > the sql and clause > AND a.id in (select id from db1.table1 where col1 = 1 group by id) > Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just > fine. Can the sql engine cast implicitly in this case? > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator
[ https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987119#comment-16987119 ] Aman Omer commented on SPARK-29667: --- Hi [~jessielin], did you observe the same issue with any other data type? {code:java} SELECT * FROM (VALUES (1.0), (2.0) AS t1(col1)) WHERE col1 IN (SELECT * FROM (VALUES (1.0), (2.0), (3.00) AS t2(col2))); {code} The above query fails with the following error {noformat} Mismatched columns: [(__auto_generated_subquery_name.`col1`:decimal(2,1), __auto_generated_subquery_name.`col2`:decimal(3,2))] Left side: [decimal(2,1)]. Right side: [decimal(3,2)]. {noformat} This is because `3.00` in the subquery upcasts all elements in the column to decimal(3,2). > implicitly convert mismatched datatypes on right side of "IN" operator > -- > > Key: SPARK-29667 > URL: https://issues.apache.org/jira/browse/SPARK-29667 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jessie Lin >Priority: Minor > > Ran into error on this sql > Mismatched columns: > [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] > the sql and clause > AND a.id in (select id from db1.table1 where col1 = 1 group by id) > Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just > fine. Can the sql engine cast implicitly in this case? > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
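An explicit cast, as the original reporter did, makes the sample query above pass; a sketch reusing the inline tables from the comment (casting the left side up to decimal(3,2) so both sides of IN agree):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-subquery-cast").getOrCreate()

# Without the CAST this fails with "Mismatched columns ... decimal(2,1) vs decimal(3,2)".
spark.sql("""
    SELECT * FROM (VALUES (1.0), (2.0) AS t1(col1))
    WHERE CAST(col1 AS DECIMAL(3,2)) IN
        (SELECT * FROM (VALUES (1.0), (2.0), (3.00) AS t2(col2)))
""").show()
{code}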
[jira] [Resolved] (SPARK-30012) Change classes extending scala collection classes to work with 2.13
[ https://issues.apache.org/jira/browse/SPARK-30012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30012. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26728 [https://github.com/apache/spark/pull/26728] > Change classes extending scala collection classes to work with 2.13 > --- > > Key: SPARK-30012 > URL: https://issues.apache.org/jira/browse/SPARK-30012 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.0.0 > > > Several classes in the code base extend Scala collection classes, like: > - ExternalAppendOnlyMap > - CaseInsensitiveMap > - AttributeMap > - (probably more I haven't gotten to yet) > The collection hierarchy changed in 2.13, and like elsewhere, I don't see > that it's possible to use a compat library or clever coding to bridge the > difference. I think we need to maintain two implementations in parallel > source trees, possibly with a common ancestor with the commonalities. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30012) Change classes extending scala collection classes to work with 2.13
[ https://issues.apache.org/jira/browse/SPARK-30012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30012: - Assignee: Sean R. Owen > Change classes extending scala collection classes to work with 2.13 > --- > > Key: SPARK-30012 > URL: https://issues.apache.org/jira/browse/SPARK-30012 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Several classes in the code base extend Scala collection classes, like: > - ExternalAppendOnlyMap > - CaseInsensitiveMap > - AttributeMap > - (probably more I haven't gotten to yet) > The collection hierarchy changed in 2.13, and like elsewhere, I don't see > that it's possible to use a compat library or clever coding to bridge the > difference. I think we need to maintain two implementations in parallel > source trees, possibly with a common ancestor with the commonalities. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29477: - Priority: Minor (was: Major) > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.0.0 > > > The Active Batches and Completed Batches tables can be revisited to add proper > tooltips for the Batch Time and Record columns -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29477: Assignee: Rakesh Raushan > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Rakesh Raushan >Priority: Major > > The Active Batches and Completed Batches tables can be revisited to add proper > tooltips for the Batch Time and Record columns -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29477) Improve tooltip information for Streaming Tab
[ https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29477. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26467 [https://github.com/apache/spark/pull/26467] > Improve tooltip information for Streaming Tab > -- > > Key: SPARK-29477 > URL: https://issues.apache.org/jira/browse/SPARK-29477 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Rakesh Raushan >Priority: Major > Fix For: 3.0.0 > > > The Active Batches and Completed Batches tables can be revisited to add proper > tooltips for the Batch Time and Record columns -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30082) Zeros are being treated as NaNs
[ https://issues.apache.org/jira/browse/SPARK-30082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30082. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26738 [https://github.com/apache/spark/pull/26738] > Zeros are being treated as NaNs > --- > > Key: SPARK-30082 > URL: https://issues.apache.org/jira/browse/SPARK-30082 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: John Ayad >Priority: Major > Fix For: 3.0.0 > > > If you attempt to run > {code:java} > df = df.replace(float('nan'), somethingToReplaceWith) > {code} > It will replace all {{0}} s in columns of type {{Integer}} > Example code snippet to repro this: > {code:java} > from pyspark.sql import SQLContext > spark = SQLContext(sc).sparkSession > df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value")) > df.show() > df = df.replace(float('nan'), 5) > df.show() > {code} > Here's the output I get when I run this code: > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > Using Python version 3.7.5 (default, Nov 1 2019 02:16:32) > SparkSession available as 'spark'. > >>> from pyspark.sql import SQLContext > >>> spark = SQLContext(sc).sparkSession > >>> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value")) > >>> df.show() > +-+-+ > |index|value| > +-+-+ > |1|0| > |2|3| > |3|0| > +-+-+ > >>> df = df.replace(float('nan'), 5) > >>> df.show() > +-+-+ > |index|value| > +-+-+ > |1|5| > |2|3| > |3|5| > +-+-+ > >>> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
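A sketch of a workaround for the report above: restrict the replacement to the floating-point columns via the subset argument so integer columns are never touched (the score column is made up here to give the frame an actual NaN):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nan-replace-workaround").getOrCreate()
df = spark.createDataFrame(
    [(1, 0, float("nan")), (2, 3, 2.5), (3, 0, float("nan"))],
    ("index", "value", "score"))

# Replace NaN only in the double column; the integer zeros are left alone.
df.replace(float("nan"), 5.0, subset=["score"]).show()
{code}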
[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987037#comment-16987037 ] Thomas Graves commented on SPARK-18886: --- Note there is discussion on this subject on prs: [https://github.com/apache/spark/pull/26633] (hack to work around it for a particular RDD) PR with proposed solution - but really more discussion solution: [https://github.com/apache/spark/pull/26696] My proposal I believe is similar to Kay's where we use slots and track the delay per slot. I haven't looked at the code in specific detail, especially in the FairScheduler where most of the issues in the conversations above were mentioned. One way around this is we have different policies and allow users to configure, or have one for FairScheduler and one for fifo. > Delay scheduling should not delay some executors indefinitely if one task is > scheduled before delay timeout > --- > > Key: SPARK-18886 > URL: https://issues.apache.org/jira/browse/SPARK-18886 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Priority: Major > > Delay scheduling can introduce an unbounded delay and underutilization of > cluster resources under the following circumstances: > 1. Tasks have locality preferences for a subset of available resources > 2. Tasks finish in less time than the delay scheduling. > Instead of having *one* delay to wait for resources with better locality, > spark waits indefinitely. > As an example, consider a cluster with 100 executors, and a taskset with 500 > tasks. Say all tasks have a preference for one executor, which is by itself > on one host. Given the default locality wait of 3s per level, we end up with > a 6s delay till we schedule on other hosts (process wait + host wait). > If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks > get scheduled on _only one_ executor. This means you're only using a 1% of > your cluster, and you get a ~100x slowdown. You'd actually be better off if > tasks took 7 seconds. > *WORKAROUNDS*: > (1) You can change the locality wait times so that it is shorter than the > task execution time. You need to take into account the sum of all wait times > to use all the resources on your cluster. For example, if you have resources > on different racks, this will include the sum of > "spark.locality.wait.process" + "spark.locality.wait.node" + > "spark.locality.wait.rack". Those each default to "3s". The simplest way to > be to set "spark.locality.wait.process" to your desired wait interval, and > set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". > For example, if your tasks take ~3 seconds on average, you might set > "spark.locality.wait.process" to "1s". *NOTE*: due to SPARK-18967, avoid > setting the {{spark.locality.wait=0}} -- instead, use > {{spark.locality.wait=1ms}}. > Note that this workaround isn't perfect --with less delay scheduling, you may > not get as good resource locality. After this issue is fixed, you'd most > likely want to undo these configuration changes. > (2) The worst case here will only happen if your tasks have extreme skew in > their locality preferences. Users may be able to modify their job to > controlling the distribution of the original input data. > (2a) A shuffle may end up with very skewed locality preferences, especially > if you do a repartition starting from a small number of partitions. 
(Shuffle > locality preference is assigned if any node has more than 20% of the shuffle > input data -- by chance, you may have one node just above that threshold, and > all other nodes just below it.) In this case, you can turn off locality > preference for shuffle data by setting > {{spark.shuffle.reduceLocality.enabled=false}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
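Workaround (1) above expressed as session configuration, for reference (the values are illustrative; pick them relative to your task duration):
{code:python}
from pyspark.sql import SparkSession

# Shorten delay scheduling so a single preferred executor cannot capture the
# whole task set: keep a small process-level wait and disable the other levels.
spark = (SparkSession.builder
         .appName("delay-scheduling-workaround")
         .config("spark.locality.wait.process", "1s")
         .config("spark.locality.wait.node", "0")
         .config("spark.locality.wait.rack", "0")
         .getOrCreate())
{code}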
[jira] [Assigned] (SPARK-30082) Zeros are being treated as NaNs
[ https://issues.apache.org/jira/browse/SPARK-30082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30082: --- Assignee: John Ayad > Zeros are being treated as NaNs > --- > > Key: SPARK-30082 > URL: https://issues.apache.org/jira/browse/SPARK-30082 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: John Ayad >Assignee: John Ayad >Priority: Major > Fix For: 3.0.0 > > > If you attempt to run > {code:java} > df = df.replace(float('nan'), somethingToReplaceWith) > {code} > It will replace all {{0}} s in columns of type {{Integer}} > Example code snippet to repro this: > {code:java} > from pyspark.sql import SQLContext > spark = SQLContext(sc).sparkSession > df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value")) > df.show() > df = df.replace(float('nan'), 5) > df.show() > {code} > Here's the output I get when I run this code: > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > Using Python version 3.7.5 (default, Nov 1 2019 02:16:32) > SparkSession available as 'spark'. > >>> from pyspark.sql import SQLContext > >>> spark = SQLContext(sc).sparkSession > >>> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value")) > >>> df.show() > +-+-+ > |index|value| > +-+-+ > |1|0| > |2|3| > |3|0| > +-+-+ > >>> df = df.replace(float('nan'), 5) > >>> df.show() > +-+-+ > |index|value| > +-+-+ > |1|5| > |2|3| > |3|5| > +-+-+ > >>> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30083) visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking
[ https://issues.apache.org/jira/browse/SPARK-30083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30083: --- Assignee: Kent Yao > visitArithmeticUnary should wrap PLUS case with UnaryPositive for type > checking > --- > > Key: SPARK-30083 > URL: https://issues.apache.org/jira/browse/SPARK-30083 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > For PLUS case, visitArithmeticUnary do not wrap the expr with UnaryPositive, > so it escapes from type checking -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30083) visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking
[ https://issues.apache.org/jira/browse/SPARK-30083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30083. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26716 [https://github.com/apache/spark/pull/26716] > visitArithmeticUnary should wrap PLUS case with UnaryPositive for type > checking > --- > > Key: SPARK-30083 > URL: https://issues.apache.org/jira/browse/SPARK-30083 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > For PLUS case, visitArithmeticUnary do not wrap the expr with UnaryPositive, > so it escapes from type checking -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30048. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26681 [https://github.com/apache/spark/pull/26681] > Enable aggregates with interval type values for RelationalGroupedDataset > - > > Key: SPARK-30048 > URL: https://issues.apache.org/jira/browse/SPARK-30048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Now the min/max/sum/avg are support for intervals, we should also enable it > in RelationalGroupedDataset -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30048: --- Assignee: Kent Yao > Enable aggregates with interval type values for RelationalGroupedDataset > - > > Key: SPARK-30048 > URL: https://issues.apache.org/jira/browse/SPARK-30048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > Now the min/max/sum/avg are support for intervals, we should also enable it > in RelationalGroupedDataset -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986859#comment-16986859 ] Jungtaek Lim commented on SPARK-30101: -- I'm not aware of how the configuration page is constructed, but it already points readers to how to retrieve the list of Spark SQL configurations; it doesn't enumerate these configs in the global configuration page because Spark SQL has too many configurations of its own. [https://spark.apache.org/docs/latest/configuration.html#spark-sql] There's already a doc describing `spark.sql.shuffle.partitions`, though it is introduced under "other configuration options". We may deal with it if we strongly agree about the need to prioritize this. [http://spark.apache.org/docs/latest/sql-performance-tuning.html] Btw, `coalesce` may cover the need for an optional numPartitions parameter, but I guess some may not find this convenient. > spark.sql.shuffle.partitions is not in Configuration docs, but a very > critical parameter > > > Key: SPARK-30101 > URL: https://issues.apache.org/jira/browse/SPARK-30101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.4 >Reporter: sam >Priority: Major > > I'm creating a `SparkSession` like this: > ``` > SparkSession > .builder().appName("foo").master("local") > .config("spark.default.parallelism", 2).getOrCreate() > ``` > when I run > ``` > ((1 to 10) ++ (1 to 10)).toDS().distinct().count() > ``` > I get 200 partitions > ``` > 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks > ... > 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) > in 46 ms on localhost (executor driver) (1/200) > ``` > It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives > `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. > `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work > correctly. > Finally I notice that the good old `RDD` interface has a `distinct` that > accepts `numPartitions` partitions, while `Dataset` does not. > ... > According to below comments, it uses spark.sql.shuffle.partitions, which > needs documenting in configuration. > > Default number of partitions in RDDs returned by transformations like join, > > reduceByKey, and parallelize when not set by user. > in https://spark.apache.org/docs/latest/configuration.html should say > > Default number of partitions in RDDs, but not DS/DF (see > > spark.sql.shuffle.partitions) returned by transformations like join, > > reduceByKey, and parallelize when not set by user. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
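For reference, the setting under discussion is a SQL conf; a sketch of setting it next to spark.default.parallelism, following the reporter's local example:
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("foo").master("local")
         .config("spark.default.parallelism", 2)      # plain RDD transformations
         .config("spark.sql.shuffle.partitions", 2)   # Dataset/DataFrame shuffles, e.g. distinct
         .getOrCreate())

ds = spark.createDataFrame([(i,) for i in list(range(1, 11)) * 2], ["v"]).distinct()
print(ds.rdd.getNumPartitions())   # 2 instead of the default 200
{code}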
[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986817#comment-16986817 ] Ruben Berenguel commented on SPARK-30063: - Wow, this looks bad for now (since grouped_aggs are one of the "super cool things you can do now in Arrow/Pandas"), means we may need some kind of regression testing for Arrow "usage" in PySpark or some stronger version pinning. Tagging more knowledgeable people in this area of Spark so they are aware too: [~hyukjin.kwon] [~bryanc] Thanks [~tkellogg] this definitely seems to be enough to get it forward (way more than enough, big thanks!) > Failure when returning a value from multiple Pandas UDFs > > > Key: SPARK-30063 > URL: https://issues.apache.org/jira/browse/SPARK-30063 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3, 2.4.4 > Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on > both 2.4.3 and 2.4.4 >Reporter: Tim Kellogg >Priority: Major > Attachments: spark-debug.txt, variety-of-schemas.ipynb > > > I have 20 Pandas UDFs that I'm trying to evaluate all at the same time. > * PandasUDFType.GROUPED_AGG > * 3 columns in the input data frame being serialized over Arrow to Python > worker. See below for clarification. > * All functions take 2 parameters, some combination of the 3 received as > Arrow input. > * Varying return types, see details below. > _*I get an IllegalArgumentException on the Scala side of the worker when > deserializing from Python.*_ > h2. Exception & Stack Trace > {code:java} > 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5) > java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at > org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) > at > org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) > at > org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, > localhost, executor driver): java.lang.IllegalArgumentException > at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) > at > org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) > at >
[jira] [Created] (SPARK-30110) Support type judgment for ArrayData
jiaan.geng created SPARK-30110: -- Summary: Support type judgment for ArrayData Key: SPARK-30110 URL: https://issues.apache.org/jira/browse/SPARK-30110 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: jiaan.geng Fix For: 3.0.0 ArrayData only exposes interfaces for getting data, such as getBoolean and getByte. When the element type of an array is unknown, I want to determine the element type first. I have a PR in progress: [https://github.com/beliefer/spark/commit/5787c6f062795aa7931c58ac2302ba607d3a97aa] The array function ArrayNDims is used to get the dimension of an array. If ArrayData could support interfaces like isBoolean, isByte and isArray, my job would be easier. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30109) PCA use BLAS.gemv with sparse vector
zhengruifeng created SPARK-30109: Summary: PCA use BLAS.gemv with sparse vector Key: SPARK-30109 URL: https://issues.apache.org/jira/browse/SPARK-30109 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng When PCA was first implemented in [SPARK-5521|https://issues.apache.org/jira/browse/SPARK-5521], Matrix.multiply (BLAS.gemv internally) did not support sparse vectors, so it worked around this with a more complex matrix multiplication. Since [SPARK-7681|https://issues.apache.org/jira/browse/SPARK-7681], BLAS.gemv has supported sparse vectors, so we can use Matrix.multiply directly now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30108) Add robust accumulator for observable metrics
Herman van Hövell created SPARK-30108: - Summary: Add robust accumulator for observable metrics Key: SPARK-30108 URL: https://issues.apache.org/jira/browse/SPARK-30108 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Herman van Hövell -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29348) Add observable metrics
[ https://issues.apache.org/jira/browse/SPARK-29348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-29348. --- Fix Version/s: 3.0.0 Resolution: Fixed > Add observable metrics > -- > > Key: SPARK-29348 > URL: https://issues.apache.org/jira/browse/SPARK-29348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30107) Expose nested schema pruning to all V2 sources
[ https://issues.apache.org/jira/browse/SPARK-30107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986756#comment-16986756 ] Anton Okolnychyi commented on SPARK-30107: -- I'll submit a PR > Expose nested schema pruning to all V2 sources > -- > > Key: SPARK-30107 > URL: https://issues.apache.org/jira/browse/SPARK-30107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > I think it would be great to expose the existing logic for nested schema > pruning to all V2 sources, which is in line with the description of > {{SupportsPushDownRequiredColumns}} . That way, all sources that are capable > of pruning nested columns will benefit from this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30107) Expose nested schema pruning to all V2 sources
[ https://issues.apache.org/jira/browse/SPARK-30107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-30107: - Description: I think it would be great to expose the existing logic for nested schema pruning to all V2 sources, which is in line with the description of {{SupportsPushDownRequiredColumns}} . That way, all sources that are capable of pruning nested columns will benefit from this. (was: I think it would be great to expose the existing logic for nested schema pruning to all V2 sources, which is in line with the description of `SupportsPushDownRequiredColumns` . That way, all sources that are capable of pruning nested columns will benefit from this.) > Expose nested schema pruning to all V2 sources > -- > > Key: SPARK-30107 > URL: https://issues.apache.org/jira/browse/SPARK-30107 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Anton Okolnychyi >Priority: Major > > I think it would be great to expose the existing logic for nested schema > pruning to all V2 sources, which is in line with the description of > {{SupportsPushDownRequiredColumns}} . That way, all sources that are capable > of pruning nested columns will benefit from this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30107) Expose nested schema pruning to all V2 sources
Anton Okolnychyi created SPARK-30107: Summary: Expose nested schema pruning to all V2 sources Key: SPARK-30107 URL: https://issues.apache.org/jira/browse/SPARK-30107 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Anton Okolnychyi I think it would be great to expose the existing logic for nested schema pruning to all V2 sources, which is in line with the description of `SupportsPushDownRequiredColumns` . That way, all sources that are capable of pruning nested columns will benefit from this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
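For illustration, nested schema pruning is what lets a query that touches only one field of a struct avoid reading the whole struct from storage. A sketch of the user-visible effect for file sources today (the path and schema are made up; the optimizer flag is spark.sql.optimizer.nestedSchemaPruning.enabled):
{code:python}
from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
         .appName("nested-pruning-sketch")
         .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
         .getOrCreate())

# Write a tiny Parquet file with a nested struct column, then read it back.
spark.createDataFrame([Row(name="a", address=Row(city="x", zip="94720"))]) \
     .write.mode("overwrite").parquet("/tmp/people")

# Only address.city should be read when pruning applies; the ReadSchema in the
# physical plan should show the pruned struct.
spark.read.parquet("/tmp/people").select("name", "address.city").explain()
{code}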
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986743#comment-16986743 ] huangtianhua commented on SPARK-29106: -- [~shaneknapp], now we don't have to install leveldbjni-all manually, the pr has been merged [https://github.com/apache/spark/pull/26636], please add profile -Paarch64 when running maven commands for arm testing jobs, thanks. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, > SparkR-and-pyspark36-testing.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have took test it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan test on other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the infos to shane knapp, thanks > shane to add the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > we can see there is no arm64 supporting. So we build an arm64 supporting > release of leveldbjni see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modified the spark pom.xml directly with something like > 'property'/'profile' to choose correct jar package on arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, the packages > depend on leveldbjni-all-1.8 too, unless hadoop release with new arm > supporting leveldbjni jar. Now we download the leveldbjni-al-1.8 of > openlabtesting and 'mvn install' to use it when arm testing for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29537) throw exception when user defined a wrong base path
[ https://issues.apache.org/jira/browse/SPARK-29537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29537: --- Assignee: wuyi > throw exception when user defined a wrong base path > --- > > Key: SPARK-29537 > URL: https://issues.apache.org/jira/browse/SPARK-29537 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > > When user gives base path which is not an ancestor directory for the input > paths, we should throw exception to let user know. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29537) throw exception when user defined a wrong base path
[ https://issues.apache.org/jira/browse/SPARK-29537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29537. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26195 [https://github.com/apache/spark/pull/26195] > throw exception when user defined a wrong base path > --- > > Key: SPARK-29537 > URL: https://issues.apache.org/jira/browse/SPARK-29537 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > > When user gives base path which is not an ancestor directory for the input > paths, we should throw exception to let user know. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
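For context, the base path referred to above is the basePath option that points partition discovery at an ancestor directory; a sketch with made-up paths, noting that after this change a basePath that is not an ancestor of the input path should raise an error instead of being silently accepted:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("base-path-sketch").getOrCreate()

# Write one partition of a partitioned dataset, then point partition discovery
# at the ancestor directory so `year` comes back as a partition column.
spark.range(3).write.mode("overwrite").parquet("/tmp/events/year=2019")
df = (spark.read
      .option("basePath", "/tmp/events")
      .parquet("/tmp/events/year=2019"))
df.printSchema()   # includes the partition column `year`

# Invalid after this change: basePath is not an ancestor of the input path,
# so Spark should throw instead of quietly ignoring it.
# spark.read.option("basePath", "/tmp/other").parquet("/tmp/events/year=2019")
{code}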
[jira] [Updated] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-30101:
Description:
I'm creating a `SparkSession` like this:
```
SparkSession
  .builder().appName("foo").master("local")
  .config("spark.default.parallelism", 2).getOrCreate()
```
When I run
```
((1 to 10) ++ (1 to 10)).toDS().distinct().count()
```
I get 200 partitions:
```
19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
...
19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
```
It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
Finally, I notice that the good old `RDD` interface has a `distinct` that accepts a `numPartitions` parameter, while `Dataset` does not.
...
According to the comments below, `distinct` uses spark.sql.shuffle.partitions, which needs documenting in configuration. The entry
> Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
in https://spark.apache.org/docs/latest/configuration.html should say
> Default number of partitions in RDDs, but not DS/DF (see spark.sql.shuffle.partitions), returned by transformations like join, reduceByKey, and parallelize when not set by user.

was:
I'm creating a `SparkSession` like this:
```
SparkSession
  .builder().appName("foo").master("local")
  .config("spark.default.parallelism", 2).getOrCreate()
```
When I run
```
((1 to 10) ++ (1 to 10)).toDS().distinct().count()
```
I get 200 partitions:
```
19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
...
19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
```
It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
Finally, I notice that the good old `RDD` interface has a `distinct` that accepts a `numPartitions` parameter, while `Dataset` does not.

> spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> When I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions:
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
> Finally, I notice that the good old `RDD` interface has a `distinct` that accepts a `numPartitions` parameter, while `Dataset` does not.
> ...
> According to the comments below, `distinct` uses spark.sql.shuffle.partitions, which needs documenting in configuration. The entry
> > Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
> in https://spark.apache.org/docs/latest/configuration.html should say
> > Default number of partitions in RDDs, but not DS/DF (see spark.sql.shuffle.partitions), returned by transformations like join, reduceByKey, and parallelize when not set by user.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
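For readers following this report, here is a minimal sketch (not taken from the ticket) of how the two settings interact, assuming Spark 2.4.x in local mode. The extra `spark.sql.shuffle.partitions` config line is a workaround, not part of the reporter's original code, and the expected counts mirror the numbers reported above.
```
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    // spark.default.parallelism only governs RDD-level defaults; Dataset/DataFrame
    // shuffles (distinct, joins, aggregations) take their width from
    // spark.sql.shuffle.partitions, which defaults to 200.
    val spark = SparkSession
      .builder().appName("foo").master("local")
      .config("spark.default.parallelism", 2)
      .config("spark.sql.shuffle.partitions", 2) // workaround: align the SQL shuffle width
      .getOrCreate()
    import spark.implicits._

    val ds = ((1 to 10) ++ (1 to 10)).toDS()

    // The RDD view honours spark.default.parallelism.
    println(ds.rdd.getNumPartitions)            // 2, as reported
    // Dataset.distinct shuffles with spark.sql.shuffle.partitions.
    println(ds.distinct().rdd.getNumPartitions) // 2 here; 200 without the extra config line

    spark.stop()
  }
}
```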
[jira] [Updated] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-30101:
Summary: spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
(was: Dataset distinct does not respect spark.default.parallelism)

> spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> When I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions:
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
> Finally, I notice that the good old `RDD` interface has a `distinct` that accepts a `numPartitions` parameter, while `Dataset` does not.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30101) Dataset distinct does not respect spark.default.parallelism
[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam reopened SPARK-30101:
-
What is expected is what is documented.

> Dataset distinct does not respect spark.default.parallelism
> ------------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> When I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions:
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
> Finally, I notice that the good old `RDD` interface has a `distinct` that accepts a `numPartitions` parameter, while `Dataset` does not.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30101) Dataset distinct does not respect spark.default.parallelism
[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986722#comment-16986722 ] sam commented on SPARK-30101:
-
[~cloud_fan] [~kabhwan] Well, this is at least a documentation error, since `spark.sql.shuffle.partitions` isn't even in the configuration documentation: https://spark.apache.org/docs/latest/configuration.html
Also, "Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates." in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset should be updated to say "... and repartitions this Dataset into `spark.sql.shuffle.partitions` partitions".
Do you agree we need a feature ticket to add `numPartitions` as an optional parameter to `distinct`, since most shuffle operations have one?

> Dataset distinct does not respect spark.default.parallelism
> ------------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> When I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions:
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
> Finally, I notice that the good old `RDD` interface has a `distinct` that accepts a `numPartitions` parameter, while `Dataset` does not.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
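As context for the `numPartitions` question above, here is a short sketch (assuming Spark 2.4.x in a local spark-shell; the `spark` and `ds` values are illustrative, not from the ticket) contrasting `RDD.distinct(numPartitions)` with the Dataset side, where today the partition count can only be steered through the SQL conf or by shrinking the result afterwards.
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foo").master("local")
  .config("spark.default.parallelism", 2).getOrCreate()
import spark.implicits._

val ds = ((1 to 10) ++ (1 to 10)).toDS()

// RDD API: the partition count is part of the distinct signature.
println(ds.rdd.distinct(numPartitions = 2).getNumPartitions)   // 2

// Dataset API: no numPartitions parameter today. Two workarounds:
// (a) shrink after the shuffle with coalesce ...
println(ds.distinct().coalesce(2).rdd.getNumPartitions)        // 2 (the shuffle itself still ran with 200 tasks)
// (b) ... or set the SQL conf before calling distinct.
spark.conf.set("spark.sql.shuffle.partitions", "2")
println(ds.distinct().rdd.getNumPartitions)                    // 2

spark.stop()
```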