[jira] [Updated] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31851: - Description: Currently, the PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is poorly written compared to other projects. See Koalas https://koalas.readthedocs.io/en/latest/ as an example. PySpark is becoming more and more important in Spark, and we should improve this documentation so that people can easily follow it. Reference: - https://koalas.readthedocs.io/en/latest/ was: Currently, the PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is poorly written compared to other projects. See pandas https://pandas.pydata.org/docs/ as an example. PySpark is becoming more and more important in Spark, and we should improve this documentation so that people can easily follow it. Reference: - https://pandas.pydata.org/docs/ - https://koalas.readthedocs.io/en/latest/ > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, the PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is poorly written > compared to other projects. > See Koalas https://koalas.readthedocs.io/en/latest/ as an > example. > PySpark is becoming more and more important in Spark, and we should improve this > documentation so that people can easily follow it. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31851: - Target Version/s: 3.1.0 > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see pandas https://pandas.pydata.org/docs/ > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://pandas.pydata.org/docs/ > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31852) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31852. -- Resolution: Duplicate > Redesign PySpark documentation > -- > > Key: SPARK-31852 > URL: https://issues.apache.org/jira/browse/SPARK-31852 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see pandas https://pandas.pydata.org/docs/ > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://pandas.pydata.org/docs/ > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31852) Redesign PySpark documentation
Hyukjin Kwon created SPARK-31852: Summary: Redesign PySpark documentation Key: SPARK-31852 URL: https://issues.apache.org/jira/browse/SPARK-31852 Project: Spark Issue Type: Umbrella Components: ML, PySpark, Spark Core, SQL, Structured Streaming Affects Versions: 3.1.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see pandas https://pandas.pydata.org/docs/ PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://pandas.pydata.org/docs/ - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31851) Redesign PySpark documentation
Hyukjin Kwon created SPARK-31851: Summary: Redesign PySpark documentation Key: SPARK-31851 URL: https://issues.apache.org/jira/browse/SPARK-31851 Project: Spark Issue Type: Umbrella Components: ML, PySpark, Spark Core, SQL, Structured Streaming Affects Versions: 3.1.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see pandas https://pandas.pydata.org/docs/ PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://pandas.pydata.org/docs/ - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31850) DetermineTableStats rules computes stats multiple time for same table
[ https://issues.apache.org/jira/browse/SPARK-31850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31850: Assignee: Apache Spark > DetermineTableStats rules computes stats multiple time for same table > - > > Key: SPARK-31850 > URL: https://issues.apache.org/jira/browse/SPARK-31850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karuppayya >Assignee: Apache Spark >Priority: Major > > DetermineTableStats rules computes stats multiple time for same table. > There are few rules which invoke > org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext > * The above method cause the logical plan to go through the analysis phase > multiple times > * The above method is invoked as a part of rule(s) which is part of a batch > which run till *Fixed point* which implies that that the analysis phase can > run multiple time > * example: > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 > > |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] > * This is cause of concern especially in *DetermineTableStats* rule, where > the computation of stats can be expensive for a large table(And also it > happens multiple time due to the fixed point nature of batch from which > analysis is triggered) > > Repro steps > {code:java} > > spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") > val df = spark.sql("select count(id) id from c group by name order by id " ) > df.queryExecution.analyzed > {code} > Note: > * There is no log line in DetermineTableStats to indicate that stats compute > happened. Need to add a log line or use a debugger > * The above can be repro-ed with first query on a created table. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31850) DetermineTableStats rules computes stats multiple time for same table
[ https://issues.apache.org/jira/browse/SPARK-31850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31850: Assignee: (was: Apache Spark) > DetermineTableStats rules computes stats multiple time for same table > - > > Key: SPARK-31850 > URL: https://issues.apache.org/jira/browse/SPARK-31850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karuppayya >Priority: Major > > DetermineTableStats rules computes stats multiple time for same table. > There are few rules which invoke > org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext > * The above method cause the logical plan to go through the analysis phase > multiple times > * The above method is invoked as a part of rule(s) which is part of a batch > which run till *Fixed point* which implies that that the analysis phase can > run multiple time > * example: > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 > > |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] > * This is cause of concern especially in *DetermineTableStats* rule, where > the computation of stats can be expensive for a large table(And also it > happens multiple time due to the fixed point nature of batch from which > analysis is triggered) > > Repro steps > {code:java} > > spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") > val df = spark.sql("select count(id) id from c group by name order by id " ) > df.queryExecution.analyzed > {code} > Note: > * There is no log line in DetermineTableStats to indicate that stats compute > happened. Need to add a log line or use a debugger > * The above can be repro-ed with first query on a created table. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31850) DetermineTableStats rules computes stats multiple time for same table
[ https://issues.apache.org/jira/browse/SPARK-31850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118346#comment-17118346 ] Apache Spark commented on SPARK-31850: -- User 'karuppayya' has created a pull request for this issue: https://github.com/apache/spark/pull/28662 > DetermineTableStats rules computes stats multiple time for same table > - > > Key: SPARK-31850 > URL: https://issues.apache.org/jira/browse/SPARK-31850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karuppayya >Priority: Major > > DetermineTableStats rules computes stats multiple time for same table. > There are few rules which invoke > org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext > * The above method cause the logical plan to go through the analysis phase > multiple times > * The above method is invoked as a part of rule(s) which is part of a batch > which run till *Fixed point* which implies that that the analysis phase can > run multiple time > * example: > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 > > |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] > * This is cause of concern especially in *DetermineTableStats* rule, where > the computation of stats can be expensive for a large table(And also it > happens multiple time due to the fixed point nature of batch from which > analysis is triggered) > > Repro steps > {code:java} > > spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") > val df = spark.sql("select count(id) id from c group by name order by id " ) > df.queryExecution.analyzed > {code} > Note: > * There is no log line in DetermineTableStats to indicate that stats compute > happened. Need to add a log line or use a debugger > * The above can be repro-ed with first query on a created table. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31850) DetermineTableStats rules computes stats multiple time for same table
[ https://issues.apache.org/jira/browse/SPARK-31850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118348#comment-17118348 ] Apache Spark commented on SPARK-31850: -- User 'karuppayya' has created a pull request for this issue: https://github.com/apache/spark/pull/28662 > DetermineTableStats rules computes stats multiple time for same table > - > > Key: SPARK-31850 > URL: https://issues.apache.org/jira/browse/SPARK-31850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karuppayya >Priority: Major > > DetermineTableStats rules computes stats multiple time for same table. > There are few rules which invoke > org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext > * The above method cause the logical plan to go through the analysis phase > multiple times > * The above method is invoked as a part of rule(s) which is part of a batch > which run till *Fixed point* which implies that that the analysis phase can > run multiple time > * example: > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 > > |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] > * This is cause of concern especially in *DetermineTableStats* rule, where > the computation of stats can be expensive for a large table(And also it > happens multiple time due to the fixed point nature of batch from which > analysis is triggered) > > Repro steps > {code:java} > > spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") > val df = spark.sql("select count(id) id from c group by name order by id " ) > df.queryExecution.analyzed > {code} > Note: > * There is no log line in DetermineTableStats to indicate that stats compute > happened. Need to add a log line or use a debugger > * The above can be repro-ed with first query on a created table. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31850) DetermineTableStats rules computes stats multiple time for same table
[ https://issues.apache.org/jira/browse/SPARK-31850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karuppayya updated SPARK-31850: --- Description: DetermineTableStats rules computes stats multiple time for same table. There are few rules which invoke org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext * The above method cause the logical plan to go through the analysis phase multiple times * The above method is invoked as a part of rule(s) which is part of a batch which run till *Fixed point* which implies that that the analysis phase can run multiple time * example: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] * This is cause of concern especially in *DetermineTableStats* rule, where the computation of stats can be expensive for a large table(And also it happens multiple time due to the fixed point nature of batch from which analysis is triggered) Repro steps {code:java} spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") val df = spark.sql("select count(id) id from c group by name order by id " ) df.queryExecution.analyzed {code} Note: * There is no log line in DetermineTableStats to indicate that stats compute happened. Need to add a log line or use a debugger * The above can be repro-ed with first query on a created table. was: DetermineTableStats rules computes stats multiple time. There are few rules which invoke org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext * The above method cause the logical plan to go through the analysis phase multiple times * The above method is invoked as a part of rule(s) which is part of a batch which run till *Fixed point* which implies that that the analysis phase can run multiple time * example: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] * This is cause of concern especially in *DetermineTableStats* rule, where the computation of stats can be expensive for a large table(And also it happens multiple time due to the fixed point nature of batch from which analysis is triggered) Repro steps {code:java} spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") val df = spark.sql("select count(id) id from c group by name order by id " ) df.queryExecution.analyzed {code} Note: * There is no log line in DetermineTableStats to indicate that stats compute happened. Need to add a log line or use a debugger * The above can be repro-ed with first query on a created table. Summary: DetermineTableStats rules computes stats multiple time for same table (was: DetermineTableStats rules computes stats multiple time) > DetermineTableStats rules computes stats multiple time for same table > - > > Key: SPARK-31850 > URL: https://issues.apache.org/jira/browse/SPARK-31850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karuppayya >Priority: Major > > DetermineTableStats rules computes stats multiple time for same table. 
> There are few rules which invoke > org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext > * The above method cause the logical plan to go through the analysis phase > multiple times > * The above method is invoked as a part of rule(s) which is part of a batch > which run till *Fixed point* which implies that that the analysis phase can > run multiple time > * example: > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 > > |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] > * This is cause of concern especially in *DetermineTableStats* rule, where > the computation of stats can be expensive for a large table(And also it > happens multiple time due to the fixed point nature of batch from which > analysis is triggered) > > Repro steps > {code:java} > > spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") > val df = spark.sql("select count(id) id from c group by name order by id " ) > df.queryExecution.analyzed > {code} > Note: > * There is no log line in DetermineTableStats to indicate that stats compute > happened. Need to add a log line or use a debugger > * The above can be repro-ed with first query on a created table. > >
[jira] [Created] (SPARK-31850) DetermineTableStats rules computes stats multiple time
Karuppayya created SPARK-31850: -- Summary: DetermineTableStats rules computes stats multiple time Key: SPARK-31850 URL: https://issues.apache.org/jira/browse/SPARK-31850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Karuppayya The DetermineTableStats rule computes stats multiple times. There are a few rules which invoke org.apache.spark.sql.catalyst.analysis.Analyzer#executeSameContext * The above method causes the logical plan to go through the analysis phase multiple times * The above method is invoked as part of rule(s) belonging to a batch which runs until *Fixed point*, which implies that the analysis phase can run multiple times * example: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138 |https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2138] * This is a cause for concern especially in the *DetermineTableStats* rule, where the computation of stats can be expensive for a large table (and it also happens multiple times due to the fixed-point nature of the batch from which analysis is triggered) Repro steps {code:java} spark.sql("create table c(id INT, name STRING) STORED AS PARQUET") val df = spark.sql("select count(id) id from c group by name order by id") df.queryExecution.analyzed {code} Note: * There is no log line in DetermineTableStats to indicate that stats computation happened. Need to add a log line or use a debugger to observe it * The above can be reproduced with the first query on a newly created table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
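A minimal sketch of the memoization idea behind SPARK-31850, independent of Spark's actual DetermineTableStats code: cache the expensive per-table size computation so that re-running the analyzer to a fixed point does not pay for it again. All names here (TableSizeMemo, sizeOf, computeSize) are hypothetical and only illustrate the pattern.

{code:java}
// Hypothetical sketch, not Spark's DetermineTableStats implementation: memoize the
// expensive per-table size computation so repeated analyzer passes reuse the result.
import scala.collection.concurrent.TrieMap

object TableSizeMemo {
  // qualified table name -> size in bytes
  private val cache = TrieMap.empty[String, BigInt]

  // `computeSize` stands in for the filesystem scan performed when a table has no
  // stats; with the memo it is evaluated at most once per table, no matter how many
  // times the fixed-point batch re-triggers analysis.
  def sizeOf(table: String)(computeSize: => BigInt): BigInt =
    cache.getOrElseUpdate(table, computeSize)
}
{code}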
[jira] [Assigned] (SPARK-31849) Improve Python exception messages to be more Pythonic
[ https://issues.apache.org/jira/browse/SPARK-31849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31849: Assignee: (was: Apache Spark) > Improve Python exception messages to be more Pythonic > - > > Key: SPARK-31849 > URL: https://issues.apache.org/jira/browse/SPARK-31849 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Current PySpark exceptions are pretty ugly, and probably the most frequently > reported compliant from PySpark users. > For example, simple udf that throws a {{ZeroDivisionError}}: > {code} > from pyspark.sql.functions import udf > @udf > def divide_by_zero(v): > raise v / 0 > spark.range(1).select(divide_by_zero("id")).show() > {code} > shows a bunch of JVM stacktrace which is very hard for Python users to > understand: > {code} > Traceback (most recent call last): > File "", line 1, in > File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line > 380, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 > in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 > (TID 11, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 377, in main > process() > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 372, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 352, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 142, in dump_stream > for obj in iterator: > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 341, in _batched > for item in iterator: > File "", line 1, in > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 85, in > return lambda *a: f(*a) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line > 99, in wrapper > return f(*args, **kwargs) > File "", line 3, in divide_by_zero > ZeroDivisionError: division by zero > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at > org.apache.spark.rdd.MapPartitions
[jira] [Assigned] (SPARK-31849) Improve Python exception messages to be more Pythonic
[ https://issues.apache.org/jira/browse/SPARK-31849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31849: Assignee: Apache Spark > Improve Python exception messages to be more Pythonic > - > > Key: SPARK-31849 > URL: https://issues.apache.org/jira/browse/SPARK-31849 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Current PySpark exceptions are pretty ugly, and probably the most frequently > reported compliant from PySpark users. > For example, simple udf that throws a {{ZeroDivisionError}}: > {code} > from pyspark.sql.functions import udf > @udf > def divide_by_zero(v): > raise v / 0 > spark.range(1).select(divide_by_zero("id")).show() > {code} > shows a bunch of JVM stacktrace which is very hard for Python users to > understand: > {code} > Traceback (most recent call last): > File "", line 1, in > File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line > 380, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 > in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 > (TID 11, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 377, in main > process() > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 372, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 352, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 142, in dump_stream > for obj in iterator: > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 341, in _batched > for item in iterator: > File "", line 1, in > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 85, in > return lambda *a: f(*a) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line > 99, in wrapper > return f(*args, **kwargs) > File "", line 3, in divide_by_zero > ZeroDivisionError: division by zero > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) 
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at > org.apach
[jira] [Commented] (SPARK-31849) Improve Python exception messages to be more Pythonic
[ https://issues.apache.org/jira/browse/SPARK-31849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118319#comment-17118319 ] Apache Spark commented on SPARK-31849: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28661 > Improve Python exception messages to be more Pythonic > - > > Key: SPARK-31849 > URL: https://issues.apache.org/jira/browse/SPARK-31849 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Current PySpark exceptions are pretty ugly, and probably the most frequently > reported compliant from PySpark users. > For example, simple udf that throws a {{ZeroDivisionError}}: > {code} > from pyspark.sql.functions import udf > @udf > def divide_by_zero(v): > raise v / 0 > spark.range(1).select(divide_by_zero("id")).show() > {code} > shows a bunch of JVM stacktrace which is very hard for Python users to > understand: > {code} > Traceback (most recent call last): > File "", line 1, in > File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line > 380, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 > in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 > (TID 11, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 377, in main > process() > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 372, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 352, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 142, in dump_stream > for obj in iterator: > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", > line 341, in _batched > for item in iterator: > File "", line 1, in > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", > line 85, in > return lambda *a: f(*a) > File > "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line > 99, in wrapper > return f(*args, **kwargs) > File "", line 3, in divide_by_zero > ZeroDivisionError: division by zero > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.s
[jira] [Created] (SPARK-31849) Improve Python exception messages to be more Pythonic
Hyukjin Kwon created SPARK-31849: Summary: Improve Python exception messages to be more Pythonic Key: SPARK-31849 URL: https://issues.apache.org/jira/browse/SPARK-31849 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Current PySpark exceptions are pretty ugly, and probably the most frequently reported compliant from PySpark users. For example, simple udf that throws a {{ZeroDivisionError}}: {code} from pyspark.sql.functions import udf @udf def divide_by_zero(v): raise v / 0 spark.range(1).select(divide_by_zero("id")).show() {code} shows a bunch of JVM stacktrace which is very hard for Python users to understand: {code} Traceback (most recent call last): File "", line 1, in File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 380, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/.../spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o84.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 (TID 11, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main process() File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 352, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 142, in dump_stream for obj in iterator: File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 341, in _batched for item in iterator: File "", line 1, in File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 85, in return lambda *a: f(*a) File "/.../spark-2.4.5-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper return f(*args, **kwargs) File "", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.
[jira] [Commented] (SPARK-31841) Dataset.repartition leverage adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-31841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118287#comment-17118287 ] Yuming Wang commented on SPARK-31841: - I tried to fix this issue before: https://github.com/apache/spark/pull/27986 > Dataset.repartition leverage adaptive execution > --- > > Key: SPARK-31841 > URL: https://issues.apache.org/jira/browse/SPARK-31841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 > Environment: spark branch-3.0 from may 1 this year >Reporter: koert kuipers >Priority: Minor > > Hello, > We are very happy users of adaptive query execution. It's a great feature to > not have to think about and tune the number of partitions in a shuffle anymore. > I noticed that Dataset.groupBy consistently uses adaptive execution when it's > enabled (e.g. I don't see the default 200 partitions), but when I do > Dataset.repartition it seems I am back to a hardcoded number of partitions. > Should adaptive execution also be used for repartition? It would be nice to > be able to repartition without having to think about the optimal number of > partitions. > An example: > {code:java} > $ spark-shell --conf spark.sql.adaptive.enabled=true --conf > spark.sql.adaptive.advisoryPartitionSizeInBytes=10 > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val x = (1 to 100).toDF > x: org.apache.spark.sql.DataFrame = [value: int] > scala> x.rdd.getNumPartitions > res0: Int = 2 > scala> x.repartition($"value").rdd.getNumPartitions > res1: Int = 200 > scala> x.groupBy("value").count.rdd.getNumPartitions > res2: Int = 67 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
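Until repartition itself is AQE-aware, one rough workaround is to size the shuffle manually from the optimizer's estimate. The sketch below assumes the spark-shell session quoted above (the DataFrame x and the 10-byte advisory size); it is not a committed Spark feature, just a way to approximate what AQE does for the shuffles it controls.

{code:java}
// Workaround sketch, assuming the `x` DataFrame and the 10-byte advisory size above:
// derive a partition count from the optimizer's size estimate instead of relying on
// the hardcoded spark.sql.shuffle.partitions default.
val targetBytesPerPartition = 10L  // mirrors spark.sql.adaptive.advisoryPartitionSizeInBytes above
val estimatedBytes = x.queryExecution.optimizedPlan.stats.sizeInBytes
val numPartitions = math.max(1, (estimatedBytes / targetBytesPerPartition).toInt)
x.repartition(numPartitions, $"value").rdd.getNumPartitions
{code}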
[jira] [Created] (SPARK-31847) DAGSchedulerSuite: Rewrite the test framework to cover most of the existing major features of the Spark Scheduler, mock the necessary part wisely, and make the test fram
jiaan.geng created SPARK-31847: -- Summary: DAGSchedulerSuite: Rewrite the test framework to cover most of the existing major features of the Spark Scheduler, mock the necessary part wisely, and make the test framework better extendable. Key: SPARK-31847 URL: https://issues.apache.org/jira/browse/SPARK-31847 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31848) Break down the very huge test files, each test suite should focus on one or several major features, but not all the related behaviors
jiaan.geng created SPARK-31848: -- Summary: Break down the very huge test files, each test suite should focus on one or several major features, but not all the related behaviors Key: SPARK-31848 URL: https://issues.apache.org/jira/browse/SPARK-31848 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31846) DAGSchedulerSuite: For the pattern of cancel + assert, extract the general method
jiaan.geng created SPARK-31846: -- Summary: DAGSchedulerSuite: For the pattern of cancel + assert, extract the general method Key: SPARK-31846 URL: https://issues.apache.org/jira/browse/SPARK-31846 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng DAGSchedulerSuite contains many test cases (e.g. trivial job cancellation, job cancellation no-kill backend) that follow the pattern of cancel + assert, such as: {code:java} test("trivial job cancellation") { val rdd = new MyRDD(sc, 1, Nil) val jobId = submit(rdd, Array(0)) cancel(jobId) assert(failure.getMessage === s"Job $jobId cancelled ") assert(sparkListener.failedStages === Seq(0)) assertDataStructuresEmpty() } {code} We should extract a general method like cancelAndCheck(jobId: Int, predicate), as sketched below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
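A hedged sketch of the proposed helper. It leans on the existing DAGSchedulerSuite members shown in the quoted test (submit, cancel, failure, sparkListener, assertDataStructuresEmpty, MyRDD); the exact signature is only illustrative, not the final API.

{code:java}
// Illustrative only; relies on DAGSchedulerSuite's existing members quoted above.
private def cancelAndCheck(jobId: Int, expectedFailedStages: Seq[Int]): Unit = {
  cancel(jobId)
  assert(failure.getMessage === s"Job $jobId cancelled ")
  assert(sparkListener.failedStages === expectedFailedStages)
  assertDataStructuresEmpty()
}

// The quoted test then collapses to:
test("trivial job cancellation") {
  val jobId = submit(new MyRDD(sc, 1, Nil), Array(0))
  cancelAndCheck(jobId, expectedFailedStages = Seq(0))
}
{code}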
[jira] [Created] (SPARK-31845) DAGSchedulerSuite: Improve and reuse completeNextStageWithFetchFailure
jiaan.geng created SPARK-31845: -- Summary: DAGSchedulerSuite: Improve and reuse completeNextStageWithFetchFailure Key: SPARK-31845 URL: https://issues.apache.org/jira/browse/SPARK-31845 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng DAGSchedulerSuite provides completeNextStageWithFetchFailure to make the next stage fail with a fetch failure. But many test cases use complete directly, as follows: {code:java} complete(taskSets(2), Seq( (FetchFailed(BlockManagerId("hostA-exec2", "hostA", 12345), firstShuffleId, 0L, 0, 0, "ignored"), null) )) {code} We need to improve completeNextStageWithFetchFailure and reuse it; a sketch follows below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
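A hedged sketch of one reusable variant; the name, parameter list, and default host are assumptions rather than the suite's actual signature, and it reuses complete, FetchFailed, and makeBlockManagerId as they appear in these tickets.

{code:java}
// Illustrative sketch: fail every task in the given task set with a FetchFailed for
// one shuffle, so tests stop hand-building the Seq((FetchFailed(...), null)) tuples.
private def completeWithFetchFailure(
    taskSet: TaskSet,
    shuffleId: Int,
    host: String = "hostA"): Unit = {
  complete(taskSet, taskSet.tasks.indices.map { reduceId =>
    (FetchFailed(makeBlockManagerId(host), shuffleId, 0L, 0, reduceId, "ignored"), null)
  })
}
{code}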
[jira] [Created] (SPARK-31844) DAGSchedulerSuite: For the pattern of failed + assert, extract the general method
jiaan.geng created SPARK-31844: -- Summary: DAGSchedulerSuite: For the pattern of failed + assert, extract the general method Key: SPARK-31844 URL: https://issues.apache.org/jira/browse/SPARK-31844 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng DAGSchedulerSuite contains many test cases (e.g. trivial job failure, run shuffle with map stage failure) that follow the pattern of failed + assert, such as: {code:java} test("trivial job failure") { submit(new MyRDD(sc, 1, Nil), Array(0)) failed(taskSets(0), "some failure") assert(failure.getMessage === "Job aborted due to stage failure: some failure") assert(sparkListener.failedStages === Seq(0)) assertDataStructuresEmpty() } {code} We should extract a general method like failedAndCheck(taskSet: TaskSet, message: String, predicate), as sketched below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
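A hedged sketch of the proposed helper, again relying on the suite members used in the quoted test (failed, failure, sparkListener, assertDataStructuresEmpty); the signature is illustrative.

{code:java}
// Illustrative only; relies on the suite's existing failed/failure/sparkListener/
// assertDataStructuresEmpty members shown in the quoted test.
private def failedAndCheck(
    taskSet: TaskSet,
    reason: String,
    expectedMessage: String,
    expectedFailedStages: Seq[Int]): Unit = {
  failed(taskSet, reason)
  assert(failure.getMessage === expectedMessage)
  assert(sparkListener.failedStages === expectedFailedStages)
  assertDataStructuresEmpty()
}

// e.g. the body of the quoted test becomes:
failedAndCheck(taskSets(0), "some failure",
  expectedMessage = "Job aborted due to stage failure: some failure",
  expectedFailedStages = Seq(0))
{code}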
[jira] [Created] (SPARK-31843) DAGSchedulerSuite: For the pattern of complete + assert, extract the general method
jiaan.geng created SPARK-31843: -- Summary: DAGSchedulerSuite: For the pattern of complete + assert, extract the general method Key: SPARK-31843 URL: https://issues.apache.org/jira/browse/SPARK-31843 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng DAGSchedulerSuite contains many test cases (e.g. run trivial job, run trivial job w/ dependency, cache location preferences w/ dependency) that follow the pattern of complete + assert, such as: {code:java} test("run trivial job") { submit(new MyRDD(sc, 1, Nil), Array(0)) complete(taskSets(0), List((Success, 42))) assert(results === Map(0 -> 42)) assertDataStructuresEmpty() } {code} We should extract a general method like checkAnswer(job, expected), as sketched below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
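A hedged sketch of the checkAnswer(job, expected) idea; it assumes the suite's complete, results, and assertDataStructuresEmpty members from the quoted test, and the helper name is only a placeholder.

{code:java}
// Illustrative only; relies on the suite's complete/results/assertDataStructuresEmpty
// members shown in the quoted test.
private def completeAndCheckAnswer(
    taskSet: TaskSet,
    taskResults: Seq[(TaskEndReason, Any)],
    expected: Map[Int, Any]): Unit = {
  complete(taskSet, taskResults)
  assert(results === expected)
  assertDataStructuresEmpty()
}

// The quoted test then becomes:
test("run trivial job") {
  submit(new MyRDD(sc, 1, Nil), Array(0))
  completeAndCheckAnswer(taskSets(0), List((Success, 42)), Map(0 -> 42))
}
{code}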
[jira] [Created] (SPARK-31842) DAGSchedulerSuite: For the pattern of runevent + assert, extract the general method
jiaan.geng created SPARK-31842: -- Summary: DAGSchedulerSuite: For the pattern of runevent + assert, extract the general method Key: SPARK-31842 URL: https://issues.apache.org/jira/browse/SPARK-31842 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng DAGSchedulerSuite contains many test cases (e.g. late fetch failures, task events always posted, ignore late map task completions) that follow the pattern of runEvent + assert, such as: {code:java} runEvent(makeCompletionEvent( taskSets(1).tasks(0), FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)) assert(sparkListener.failedStages.contains(1)) {code} We should extract a general method like runEventAndCheck(events*, predicate), as sketched below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
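A hedged sketch of the proposed helper; runEvent, makeCompletionEvent, makeBlockManagerId, and sparkListener are the suite members used in the quoted snippet, while the helper name and the DAGSchedulerEvent parameter type are assumptions about the final shape.

{code:java}
// Illustrative only; the helper name and event type are placeholders.
private def runEventsAndCheck(events: DAGSchedulerEvent*)(check: => Unit): Unit = {
  events.foreach(runEvent)
  check
}

// The quoted snippet then becomes:
runEventsAndCheck(
  makeCompletionEvent(
    taskSets(1).tasks(0),
    FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"),
    null)
) {
  assert(sparkListener.failedStages.contains(1))
}
{code}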
[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118248#comment-17118248 ] Apache Spark commented on SPARK-31764: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/28660 > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
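The class of bug described in SPARK-31764 above (a field that is read back but never written) is easy to catch with a write/read round trip. Below is a minimal, self-contained json4s sketch of that idea; it is not Spark's JsonProtocol code, and MiniRddInfo plus the field names are made up for illustration.

{code:java}
// Hypothetical round-trip sketch (not JsonProtocol itself): if toJson omits a field
// that fromJson reads, the value silently falls back to its default after replay.
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

case class MiniRddInfo(id: Int, name: String, isBarrier: Boolean)

def toJson(info: MiniRddInfo): JValue =
  ("RDD ID" -> info.id) ~ ("Name" -> info.name) ~ ("Barrier" -> info.isBarrier)

def fromJson(json: JValue): MiniRddInfo = {
  implicit val formats: Formats = DefaultFormats
  MiniRddInfo(
    (json \ "RDD ID").extract[Int],
    (json \ "Name").extract[String],
    (json \ "Barrier").extractOrElse[Boolean](false)) // default kicks in if never written
}

// Round trip should preserve isBarrier; dropping the "Barrier" entry in toJson reproduces the bug.
val roundTripped = fromJson(parse(compact(render(toJson(MiniRddInfo(1, "rdd", isBarrier = true))))))
assert(roundTripped.isBarrier)
{code}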
[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118247#comment-17118247 ] Apache Spark commented on SPARK-31764: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/28660 > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118244#comment-17118244 ] Jungtaek Lim commented on SPARK-31764: -- Thanks for confirming. :) > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118242#comment-17118242 ] Kousuke Saruta edited comment on SPARK-31764 at 5/28/20, 2:34 AM: -- As you say, it's more proper to mark the label "bug" rather than "improvement" so I think it's better to go to 3.0. (I don't remember why I marked the label "bug". Maybe, it's a mistake.) I'll make a backporting PR. was (Author: sarutak): As you say, it's more proper to mark the label "bug" rather than "improvement" so I think it's better to go to 3.0. (I don't remember why I mark the label "bug". Maybe, it's a mistake.) I'll make a backporting PR. > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118242#comment-17118242 ] Kousuke Saruta edited comment on SPARK-31764 at 5/28/20, 2:34 AM: -- As you say, it's more proper to mark the label "bug" rather than "improvement" so I think it's better to go to 3.0. (I don't remember why I mark the label "bug". Maybe, it's a mistake.) I'll make a backporting PR. was (Author: sarutak): As you say, it's more proper to mark the label "bug" rather than "improvement" so I think it's better to go to 3.0. I'll make a backporting PR. > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-31764: --- Issue Type: Bug (was: Improvement) > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118242#comment-17118242 ] Kousuke Saruta commented on SPARK-31764: As you say, it's more proper to mark the label "bug" rather than "improvement" so I think it's better to go to 3.0. > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118242#comment-17118242 ] Kousuke Saruta edited comment on SPARK-31764 at 5/28/20, 2:33 AM: -- As you say, it's more proper to mark the label "bug" rather than "improvement" so I think it's better to go to 3.0. I'll make a backporting PR. was (Author: sarutak): As you say, it's more proper to mark the label "bug" rather than "improvement" so I think it's better to go to 3.0. > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28879) Kubernetes node selector should be configurable
[ https://issues.apache.org/jira/browse/SPARK-28879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28879: Assignee: Apache Spark > Kubernetes node selector should be configurable > --- > > Key: SPARK-28879 > URL: https://issues.apache.org/jira/browse/SPARK-28879 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Franco >Assignee: Apache Spark >Priority: Major > > Similar to SPARK-25220. > > Having to create pod templates is not an answer because: > a) It's not available until Spark 3.0, which is unlikely to be deployed to > platforms like EMR for a while, as there will be major impacts on dependent > libraries. > b) It is overkill for something as fundamental as selecting different > nodes for your workers than for your drivers, e.g. for Spot instances. > > I would request that we add this feature in the 2.4 timeframe. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28879) Kubernetes node selector should be configurable
[ https://issues.apache.org/jira/browse/SPARK-28879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28879: Assignee: (was: Apache Spark) > Kubernetes node selector should be configurable > --- > > Key: SPARK-28879 > URL: https://issues.apache.org/jira/browse/SPARK-28879 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Franco >Priority: Major > > Similar to SPARK-25220. > > Having to create pod templates is not an answer because: > a) It's not available until Spark 3.0, which is unlikely to be deployed to > platforms like EMR for a while, as there will be major impacts on dependent > libraries. > b) It is overkill for something as fundamental as selecting different > nodes for your workers than for your drivers, e.g. for Spot instances. > > I would request that we add this feature in the 2.4 timeframe. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118203#comment-17118203 ] Xingbo Jiang commented on SPARK-31764: -- The issue is reported as "Improvement" and the affected version is "3.1.0", so it's merged to master only. [~sarutak] Do you think this should go to 3.0? > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118201#comment-17118201 ] Jungtaek Lim commented on SPARK-31764: -- For me this looks to be a bug - the description of PR states the same. Is there a reason this issue is marked as an "improvement" and the fix only landed to master branch? The bug is also placed in branch-3.0: https://github.com/apache/spark/blob/branch-3.0/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31839) delete duplicate code
[ https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31839: Assignee: philipse > delete duplicate code > -- > > Key: SPARK-31839 > URL: https://issues.apache.org/jira/browse/SPARK-31839 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.5 >Reporter: philipse >Assignee: philipse >Priority: Minor > > there is duplicate code; we can remove it to improve test quality -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31839) delete duplicate code
[ https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31839. -- Fix Version/s: 3.0.0 2.4.6 Resolution: Fixed Issue resolved by pull request 28655 [https://github.com/apache/spark/pull/28655] > delete duplicate code > -- > > Key: SPARK-31839 > URL: https://issues.apache.org/jira/browse/SPARK-31839 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.5 >Reporter: philipse >Assignee: philipse >Priority: Minor > Fix For: 2.4.6, 3.0.0 > > > there is duplicate code; we can remove it to improve test quality -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118191#comment-17118191 ] Apache Spark commented on SPARK-25351: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28659 > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Assignee: Jalpan Randeri >Priority: Major > Labels: bulk-closed > Fix For: 3.1.0 > > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118189#comment-17118189 ] Apache Spark commented on SPARK-25351: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28659 > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Assignee: Jalpan Randeri >Priority: Major > Labels: bulk-closed > Fix For: 3.1.0 > > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31763) DataFrame.inputFiles() not Available
[ https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31763. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28652 [https://github.com/apache/spark/pull/28652] > DataFrame.inputFiles() not Available > > > Key: SPARK-31763 > URL: https://issues.apache.org/jira/browse/SPARK-31763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Assignee: Rakesh Raushan >Priority: Major > Fix For: 3.1.0 > > > I have been trying to list the input files that compose my DataSet using > *PySpark*: > spark_session.read > .format(sourceFileFormat) > .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix) > *.inputFiles();* > but I get an exception saying the inputFiles attribute is not present. I was > able to get this functionality with the Spark Java API. > *So is this something missing in PySpark?* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
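For illustration, a minimal PySpark sketch of how the API reads once this fix is in place (the source path and format below are hypothetical; {{inputFiles()}} is assumed to be available in PySpark from the 3.1.0 fix version above):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source, mirroring the reporter's use case.
df = spark.read.format("parquet").load("s3a://some-bucket/some-prefix/")

# PySpark 3.1.0 exposes the same best-effort file listing that the
# Scala/Java Dataset API already had.
for path in df.inputFiles():
    print(path)
{code}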
[jira] [Assigned] (SPARK-31763) DataFrame.inputFiles() not Available
[ https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31763: Assignee: Rakesh Raushan > DataFrame.inputFiles() not Available > > > Key: SPARK-31763 > URL: https://issues.apache.org/jira/browse/SPARK-31763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Assignee: Rakesh Raushan >Priority: Major > > I have been trying to list inputFiles that compose my DataSet by using > *PySpark* > spark_session.read > .format(sourceFileFormat) > .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix) > *.inputFiles();* > but I get an exception saying inputFiles attribute not present. But I was > able to get this functionality with Spark Java. > *So is this something missing in PySpark?* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31730) Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite
[ https://issues.apache.org/jira/browse/SPARK-31730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118183#comment-17118183 ] Apache Spark commented on SPARK-31730: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/28658 > Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite > -- > > Key: SPARK-31730 > URL: https://issues.apache.org/jira/browse/SPARK-31730 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.0.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122655/testReport/ > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/668/testReport/ > BarrierTaskContextSuite.support multiple barrier() call within a single task > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1031 > was not less than or equal to 1000 > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > at > org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$15(BarrierTaskContextSuite.scala:157) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:58) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:58) > at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381) > at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458) > at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite.run(Suite.scala:1124) > at org.scalatest.Suite.run$(Suite.scala:1106) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:518) > at 
org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233) > at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:58) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.ut
[jira] [Resolved] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-25351. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26585 [https://github.com/apache/spark/pull/26585] > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Assignee: Jalpan Randeri >Priority: Major > Labels: bulk-closed > Fix For: 3.1.0 > > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
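For illustration, a small PySpark sketch (not from the ticket) of the usual pre-fix workaround: materialize the pandas category column as a plain dtype before calling {{createDataFrame}}, so Arrow never sees a dictionary type. Column names and the config name follow the example in the description.
{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", True)

pdf = pd.DataFrame({"A": ["a", "b", "c", "a"]})
pdf["B"] = pdf["A"].astype("category")

# Workaround before SPARK-25351: convert the category column to its
# underlying values so the Arrow path sees a plain string column.
pdf["B"] = pdf["B"].astype(str)

df = spark.createDataFrame(pdf)
df.printSchema()  # column B arrives as string, matching the non-Arrow behaviour
{code}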
[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-25351: Assignee: Jalpan Randeri > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Assignee: Jalpan Randeri >Priority: Major > Labels: bulk-closed > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31730) Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite
[ https://issues.apache.org/jira/browse/SPARK-31730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-31730. -- Fix Version/s: 3.0.0 Assignee: Xingbo Jiang Resolution: Fixed Fixed by https://github.com/apache/spark/pull/28584 > Flaky test: org.apache.spark.scheduler.BarrierTaskContextSuite > -- > > Key: SPARK-31730 > URL: https://issues.apache.org/jira/browse/SPARK-31730 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.0.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122655/testReport/ > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/668/testReport/ > BarrierTaskContextSuite.support multiple barrier() call within a single task > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 1031 > was not less than or equal to 1000 > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > at > org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$15(BarrierTaskContextSuite.scala:157) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:58) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:58) > at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381) > at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458) > at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite.run(Suite.scala:1124) > at org.scalatest.Suite.run$(Suite.scala:1106) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:518) > at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233) > at 
org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:58) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolEx
[jira] [Updated] (SPARK-31838) The streaming output mode validator didn't treat union well
[ https://issues.apache.org/jira/browse/SPARK-31838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingcan Cui updated SPARK-31838: Description: {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union operator well (it only assumes a linear query plan). To temporarily fix the broken semantics before we have a complete improvement plan for the output mode, the validator should do the following. 1. Allow multiple aggregations from different branches of a union operator. 2. For complete output mode, disallow a union on streams with and without aggregations. was: {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union operator well (it only assumes a linear query plan). To temporarily fix the broken semantics before we have a complete improvement plan for the output mode, the validator should do the following. 1. Allow multiple aggregations from different branches of the last union operator. 2. For complete output mode, disallow a union on streams with and without aggregations. > The streaming output mode validator didn't treat union well > --- > > Key: SPARK-31838 > URL: https://issues.apache.org/jira/browse/SPARK-31838 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Xingcan Cui >Priority: Major > > {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union > operator well (it only assumes a linear query plan). To temporarily fix the > broken semantics before we have a complete improvement plan for the output > mode, the validator should do the following. > 1. Allow multiple aggregations from different branches of a union operator. > 2. For complete output mode, disallow a union on streams with and without > aggregations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118140#comment-17118140 ] James Boylan commented on SPARK-31800: -- Good catch. I'll be back in the office tomorrow and I will test this scenario. Thank you for pointing that out. I'll update and close this ticket if that resolves it. > Unable to disable Kerberos when submitting jobs to Kubernetes > - > > Key: SPARK-31800 > URL: https://issues.apache.org/jira/browse/SPARK-31800 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: James Boylan >Priority: Major > > When you attempt to submit a process to Kubernetes using spark-submit through > --master, it returns the exception: > {code:java} > 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. > Exception in thread "main" org.apache.spark.SparkException: Please specify > spark.kubernetes.file.upload.path property. > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:89) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58) > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) > at > 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called > 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory > /private/var/folders/p1/y24myg413wx1l1l52bsdn2hrgq/T/spark-c94db9c5-b8a8-414d-b01d-f6369d31c9b8 > {code} > No changes in settings appear to be able to disable Kerberos. This is when > running a simple execution of the SparkPi on our lab cluster. The command > being used is > {code:java} > ./bin/spark-submit --master k8s://https://{api_hostname} --deploy-mode > cluster --name spark-test --class org.apache.spark.examples.SparkPi --conf
[jira] [Updated] (SPARK-31838) The streaming output mode validator didn't treat union well
[ https://issues.apache.org/jira/browse/SPARK-31838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingcan Cui updated SPARK-31838: Description: {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union operator well (it only assumes a linear query plan). To temporarily fix the broken semantics before we have a complete improvement plan for the output mode, the validator should do the following. 1. Allow multiple aggregations from different branches of the last union operator. 2. For complete output mode, disallow a union on streams with and without aggregations. was: {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union operator well (it only assumes a linear query plan). To temporarily fix the broken semantics before we have a complete improvement plan for the output mode, the validator should do the following. 1. Allow multiple (two for now) aggregations from different branches of the last union operator. 2. For complete output mode, disallow a union on streams with and without aggregations. > The streaming output mode validator didn't treat union well > --- > > Key: SPARK-31838 > URL: https://issues.apache.org/jira/browse/SPARK-31838 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Xingcan Cui >Priority: Major > > {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union > operator well (it only assumes a linear query plan). To temporarily fix the > broken semantics before we have a complete improvement plan for the output > mode, the validator should do the following. > 1. Allow multiple aggregations from different branches of the last union > operator. > 2. For complete output mode, disallow a union on streams with and without > aggregations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
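To make item 1 concrete, a hedged PySpark sketch (not from the ticket) of the query shape the validator should accept: one aggregation per branch of a union, which current releases reject as multiple streaming aggregations. The rate source and memory sink are used purely for illustration.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Two independent streaming branches, each with its own aggregation.
left = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        .groupBy((col("value") % 10).alias("bucket")).count())
right = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
         .groupBy((col("value") % 5).alias("bucket")).count())

# The aggregations live on different branches of the union rather than
# being stacked on top of each other in a linear plan.
unioned = left.union(right)

query = (unioned.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("union_agg_counts")
         .start())
{code}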
[jira] [Commented] (SPARK-31841) Dataset.repartition leverage adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-31841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118120#comment-17118120 ] koert kuipers commented on SPARK-31841: --- that is right. its a feature request i believe (unless i misunderstood whats happening). should i delete it here? > Dataset.repartition leverage adaptive execution > --- > > Key: SPARK-31841 > URL: https://issues.apache.org/jira/browse/SPARK-31841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 > Environment: spark branch-3.0 from may 1 this year >Reporter: koert kuipers >Priority: Minor > > hello, > we are very happy users of adaptive query execution. its a great feature to > now have to think about and tune the number of partitions anymore in a > shuffle. > i noticed that Dataset.groupBy consistently uses adaptive execution when its > enabled (e.g. i don't see the default 200 partitions) but when i do > Dataset.repartition it seems i am back to a hardcoded number of partitions. > Should adaptive execution also be used for repartition? It would be nice to > be able to repartition without having to think about optimal number of > partitions. > An example: > {code:java} > $ spark-shell --conf spark.sql.adaptive.enabled=true --conf > spark.sql.adaptive.advisoryPartitionSizeInBytes=10 > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val x = (1 to 100).toDF > x: org.apache.spark.sql.DataFrame = [value: int] > scala> x.rdd.getNumPartitions > res0: Int = 2scala> x.repartition($"value").rdd.getNumPartitions > res1: Int = 200 > scala> x.groupBy("value").count.rdd.getNumPartitions > res2: Int = 67 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang updated SPARK-31764: - Fix Version/s: 3.1.0 > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-31764. -- Resolution: Fixed > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31764) JsonProtocol doesn't write RDDInfo#isBarrier
[ https://issues.apache.org/jira/browse/SPARK-31764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118118#comment-17118118 ] Xingbo Jiang commented on SPARK-31764: -- Resolved by https://github.com/apache/spark/pull/28583 > JsonProtocol doesn't write RDDInfo#isBarrier > > > Key: SPARK-31764 > URL: https://issues.apache.org/jira/browse/SPARK-31764 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > JsonProtocol read RDDInfo#isBarrier but doesn't write it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31841) Dataset.repartition leverage adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-31841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118114#comment-17118114 ] Jungtaek Lim commented on SPARK-31841: -- It sounds like a question/feature request which is better to be moved to user@ or dev@ mailing list. > Dataset.repartition leverage adaptive execution > --- > > Key: SPARK-31841 > URL: https://issues.apache.org/jira/browse/SPARK-31841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 > Environment: spark branch-3.0 from may 1 this year >Reporter: koert kuipers >Priority: Minor > > hello, > we are very happy users of adaptive query execution. its a great feature to > now have to think about and tune the number of partitions anymore in a > shuffle. > i noticed that Dataset.groupBy consistently uses adaptive execution when its > enabled (e.g. i don't see the default 200 partitions) but when i do > Dataset.repartition it seems i am back to a hardcoded number of partitions. > Should adaptive execution also be used for repartition? It would be nice to > be able to repartition without having to think about optimal number of > partitions. > An example: > {code:java} > $ spark-shell --conf spark.sql.adaptive.enabled=true --conf > spark.sql.adaptive.advisoryPartitionSizeInBytes=10 > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val x = (1 to 100).toDF > x: org.apache.spark.sql.DataFrame = [value: int] > scala> x.rdd.getNumPartitions > res0: Int = 2scala> x.repartition($"value").rdd.getNumPartitions > res1: Int = 200 > scala> x.groupBy("value").count.rdd.getNumPartitions > res2: Int = 67 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31841) Dataset.repartition leverage adaptive execution
koert kuipers created SPARK-31841: - Summary: Dataset.repartition leverage adaptive execution Key: SPARK-31841 URL: https://issues.apache.org/jira/browse/SPARK-31841 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Environment: spark branch-3.0 from may 1 this year Reporter: koert kuipers Hello, we are very happy users of adaptive query execution. It's a great feature to not have to think about and tune the number of partitions anymore in a shuffle. I noticed that Dataset.groupBy consistently uses adaptive execution when it's enabled (e.g. I don't see the default 200 partitions), but when I do Dataset.repartition it seems I am back to a hardcoded number of partitions. Should adaptive execution also be used for repartition? It would be nice to be able to repartition without having to think about the optimal number of partitions. An example: {code:java} $ spark-shell --conf spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=10 Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252) Type in expressions to have them evaluated. Type :help for more information. scala> val x = (1 to 100).toDF x: org.apache.spark.sql.DataFrame = [value: int] scala> x.rdd.getNumPartitions res0: Int = 2 scala> x.repartition($"value").rdd.getNumPartitions res1: Int = 200 scala> x.groupBy("value").count.rdd.getNumPartitions res2: Int = 67 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
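A PySpark rendering of the same observation, plus the workaround of passing an explicit partition count to {{repartition}} (a sketch, not a fix; the numbers in the comments are the defaults quoted above):
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "10")
         .getOrCreate())

df = spark.range(100).toDF("value")

# The aggregation goes through AQE partition coalescing...
print(df.groupBy("value").count().rdd.getNumPartitions())

# ...while repartition by column falls back to spark.sql.shuffle.partitions (200).
print(df.repartition("value").rdd.getNumPartitions())

# Workaround until repartition is AQE-aware: pass an explicit target.
print(df.repartition(8, "value").rdd.getNumPartitions())  # 8
{code}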
[jira] [Resolved] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
[ https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31797. -- Fix Version/s: 3.1.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28534 > Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions > --- > > Key: SPARK-31797 > URL: https://issues.apache.org/jira/browse/SPARK-31797 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 > Environment: [PR link|https://github.com/apache/spark/pull/28534] >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > Fix For: 3.1.0 > > > 1.Add and register three new functions: {{TIMESTAMP_SECONDS}}, > {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}} > Reference: > [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds] > 2.People will have convenient way to get timestamps from seconds,milliseconds > and microseconds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
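For illustration, a short PySpark sketch (not from the PR) of the three functions; the session time zone is pinned to UTC so that the expected value in the comment, taken from the BigQuery documentation the ticket references, is unambiguous.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Seconds / milliseconds / microseconds since the Unix epoch -> timestamp.
spark.sql("""
  SELECT TIMESTAMP_SECONDS(1230219000)      AS from_seconds,
         TIMESTAMP_MILLIS(1230219000123)    AS from_millis,
         TIMESTAMP_MICROS(1230219000123456) AS from_micros
""").show(truncate=False)
# from_seconds is 2008-12-25 15:30:00 in UTC, as in the referenced BigQuery example
{code}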
[jira] [Issue Comment Deleted] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
[ https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-31797: - Comment: was deleted (was: Don't set Fix/Target version) > Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions > --- > > Key: SPARK-31797 > URL: https://issues.apache.org/jira/browse/SPARK-31797 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 > Environment: [PR link|https://github.com/apache/spark/pull/28534] >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > > 1.Add and register three new functions: {{TIMESTAMP_SECONDS}}, > {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}} > Reference: > [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds] > 2.People will have convenient way to get timestamps from seconds,milliseconds > and microseconds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31827) fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
[ https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31827: Summary: fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form (was: better error message for the JDK bug of stand-alone form) > fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form > - > > Key: SPARK-31827 > URL: https://issues.apache.org/jira/browse/SPARK-31827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31797) Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
[ https://issues.apache.org/jira/browse/SPARK-31797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-31797: - Fix Version/s: (was: 3.1.0) Flags: (was: Patch) Target Version/s: (was: 3.1.0) Labels: (was: features) Don't set Fix/Target version > Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions > --- > > Key: SPARK-31797 > URL: https://issues.apache.org/jira/browse/SPARK-31797 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 > Environment: [PR link|https://github.com/apache/spark/pull/28534] >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > > 1.Add and register three new functions: {{TIMESTAMP_SECONDS}}, > {{TIMESTAMP_MILLIS}} and {{TIMESTAMP_MICROS}} > Reference: > [BigQuery|https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds] > 2.People will have convenient way to get timestamps from seconds,milliseconds > and microseconds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table
[ https://issues.apache.org/jira/browse/SPARK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117992#comment-17117992 ] Ankit Dsouza commented on SPARK-25185: -- Hey [~zhenhuawang], can this be prioritized and fixed (if required) this year? Using spark 2.3 > CBO rowcount statistics doesn't work for partitioned parquet external table > --- > > Key: SPARK-25185 > URL: https://issues.apache.org/jira/browse/SPARK-25185 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.1, 2.3.0 > Environment: > Tried on Ubuntu, FreBSD and windows, running spark-shell in local mode > reading data from local file system >Reporter: Amit >Priority: Major > > Created a dummy partitioned data with partition column on string type col1=a > and col1=b > added csv data-> read through spark -> created partitioned external table-> > msck repair table to load partition. Did analyze on all columns and partition > column as well. > ~println(spark.sql("select * from test_p where > e='1a'").queryExecution.toStringWithStats)~ > ~val op = spark.sql("select * from test_p where > e='1a'").queryExecution.optimizedPlan~ > // e is the partitioned column > ~val stat = op.stats(spark.sessionState.conf)~ > ~print(stat.rowCount)~ > > Created the same way in parquet the rowcount comes up correctly in case of > csv but in parquet it shows as None. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
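A sketch of how one might re-check the reported behaviour; the table and partition column names ({{test_p}}, {{e}}) come from the description, and {{explain(mode="cost")}} assumes Spark 3.0+ (on 2.x the optimized plan's stats have to be inspected as in the Scala snippet above).
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Recompute table- and column-level statistics for the partitioned table.
spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS FOR COLUMNS e")

# With CBO enabled, the cost-annotated plan should show sizeInBytes and,
# when column stats are usable, a rowCount estimate.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("SELECT * FROM test_p WHERE e = '1a'").explain(mode="cost")
{code}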
[jira] [Comment Edited] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110729#comment-17110729 ] Sudharshann D. edited comment on SPARK-31579 at 5/27/20, 5:59 PM: -- Please see my proof of concept [https://github.com/Sudhar287/spark/pull/1/files|https://github.com/Sudhar287/spark/pull/1/files] was (Author: suddhuasf): Please see my proof of concept [https://github.com/Sudhar287/spark/pull/1/files|http://example.com] > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
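The replacement is only safe when the dividend is an exact multiple of MILLIS_PER_DAY, because {{Math.floorDiv}} rounds toward negative infinity while Java's {{/}} truncates toward zero. A small Python sketch (mimicking the Java semantics) of where the two differ and where they agree:
{code:python}
MILLIS_PER_DAY = 86_400_000

def floor_div(a, b):
    # Math.floorDiv semantics: round toward negative infinity.
    return a // b

def truncating_div(a, b):
    # Java integer '/' semantics: round toward zero.
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

# The two disagree for negative values that are not exact multiples...
millis = -MILLIS_PER_DAY - 3_600_000           # one hour before -1 day
print(floor_div(millis, MILLIS_PER_DAY))       # -2
print(truncating_div(millis, MILLIS_PER_DAY))  # -1

# ...but agree whenever millis % MILLIS_PER_DAY == 0, which is exactly the
# hypothesis the ticket asks to verify before swapping floorDiv for '/'.
millis = -42 * MILLIS_PER_DAY
print(floor_div(millis, MILLIS_PER_DAY))       # -42
print(truncating_div(millis, MILLIS_PER_DAY))  # -42
{code}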
[jira] [Assigned] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite
[ https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31835: --- Assignee: Kent Yao > Add zoneId to codegen related test in DateExpressionsSuite > -- > > Key: SPARK-31835 > URL: https://issues.apache.org/jira/browse/SPARK-31835 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > the formatter will fail earlier before the codegen check happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite
[ https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31835. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28653 [https://github.com/apache/spark/pull/28653] > Add zoneId to codegen related test in DateExpressionsSuite > -- > > Key: SPARK-31835 > URL: https://issues.apache.org/jira/browse/SPARK-31835 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > the formatter will fail earlier before the codegen check happen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31840) Add instance weight support in LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-31840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31840: Assignee: (was: Apache Spark) > Add instance weight support in LogisticRegressionSummary > > > Key: SPARK-31840 > URL: https://issues.apache.org/jira/browse/SPARK-31840 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add instance weight support in LogisticRegressionSummary -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31840) Add instance weight support in LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-31840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31840: Assignee: Apache Spark > Add instance weight support in LogisticRegressionSummary > > > Key: SPARK-31840 > URL: https://issues.apache.org/jira/browse/SPARK-31840 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > Add instance weight support in LogisticRegressionSummary -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31840) Add instance weight support in LogisticRegressionSummary
[ https://issues.apache.org/jira/browse/SPARK-31840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117923#comment-17117923 ] Apache Spark commented on SPARK-31840: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28657 > Add instance weight support in LogisticRegressionSummary > > > Key: SPARK-31840 > URL: https://issues.apache.org/jira/browse/SPARK-31840 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add instance weight support in LogisticRegressionSummary -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31840) Add instance weight support in LogisticRegressionSummary
Huaxin Gao created SPARK-31840: -- Summary: Add instance weight support in LogisticRegressionSummary Key: SPARK-31840 URL: https://issues.apache.org/jira/browse/SPARK-31840 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: Huaxin Gao Add instance weight support in LogisticRegressionSummary -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
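For context, a minimal PySpark sketch (data and column names are made up) of what this enables: {{weightCol}} already drives the fit, and with this change the training summary's metrics are expected to respect the same per-row weights.
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Toy data with a per-row weight column (illustrative values only).
train = spark.createDataFrame([
    (1.0, 2.0, Vectors.dense(1.0, 0.5)),
    (0.0, 1.0, Vectors.dense(0.2, 1.5)),
    (1.0, 3.0, Vectors.dense(0.9, 0.3)),
    (0.0, 1.0, Vectors.dense(0.1, 1.1)),
], ["label", "weight", "features"])

lr = LogisticRegression(weightCol="weight")
model = lr.fit(train)

# With SPARK-31840, summary metrics such as accuracy and weightedRecall
# are expected to be computed using the instance weights as well.
summary = model.summary
print(summary.accuracy, summary.weightedRecall)
{code}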
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117887#comment-17117887 ] Edgar Klerks commented on SPARK-23443: -- Working on this. Can I make work-in-progress PRs? > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, the Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging the > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31719) Refactor JoinSelection
[ https://issues.apache.org/jira/browse/SPARK-31719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31719. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28540 [https://github.com/apache/spark/pull/28540] > Refactor JoinSelection > -- > > Key: SPARK-31719 > URL: https://issues.apache.org/jira/browse/SPARK-31719 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Major > Fix For: 3.1.0 > > > This PR extracts the logic for selecting the planned join type out of the > `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst. This > change both cleans up the code in `JoinSelection` and allows the logic to be > in one place and be used from other rules that need to make decisions based on > the join type before planning time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31719) Refactor JoinSelection
[ https://issues.apache.org/jira/browse/SPARK-31719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31719: --- Assignee: Ali Afroozeh > Refactor JoinSelection > -- > > Key: SPARK-31719 > URL: https://issues.apache.org/jira/browse/SPARK-31719 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Major > > This PR extracts the logic for selecting the planned join type out of the > `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst. This > change both cleans up the code in `JoinSelection` and allows the logic to be > in one place and be used from other rules that need to make decisions based on > the join type before planning time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31839) delete duplicate code
[ https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117852#comment-17117852 ] Apache Spark commented on SPARK-31839: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28655 > delete duplicate code > -- > > Key: SPARK-31839 > URL: https://issues.apache.org/jira/browse/SPARK-31839 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.5 >Reporter: philipse >Priority: Minor > > there is duplicate code; we can remove it to improve test quality -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31839) delete duplicate code
[ https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31839: Assignee: (was: Apache Spark) > delete duplicate code > -- > > Key: SPARK-31839 > URL: https://issues.apache.org/jira/browse/SPARK-31839 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.5 >Reporter: philipse >Priority: Minor > > there is duplicate code; we can remove it to improve test quality -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31839) delete duplicate code
[ https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117851#comment-17117851 ] Apache Spark commented on SPARK-31839: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28655 > delete duplicate code > -- > > Key: SPARK-31839 > URL: https://issues.apache.org/jira/browse/SPARK-31839 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.5 >Reporter: philipse >Priority: Minor > > there is duplicate code; we can remove it to improve test quality -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31839) delete duplicate code
[ https://issues.apache.org/jira/browse/SPARK-31839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31839: Assignee: Apache Spark > delete duplicate code > -- > > Key: SPARK-31839 > URL: https://issues.apache.org/jira/browse/SPARK-31839 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.5 >Reporter: philipse >Assignee: Apache Spark >Priority: Minor > > there is duplicate code; we can remove it to improve test quality -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality
[ https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117847#comment-17117847 ] Apache Spark commented on SPARK-31837: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28656 > Shift to the new highest locality level if there is when recomputeLocality > -- > > Key: SPARK-31837 > URL: https://issues.apache.org/jira/browse/SPARK-31837 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > In the new version of delay scheduling, if a task set is submitted before any > executors are added to TaskScheduler, the task set will always schedule tasks > at ANY locality level. We should improve this corner case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality
[ https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31837: Assignee: (was: Apache Spark) > Shift to the new highest locality level if there is when recomputeLocality > -- > > Key: SPARK-31837 > URL: https://issues.apache.org/jira/browse/SPARK-31837 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > In the new version of delay scheduling, if a task set is submitted before any > executors are added to TaskScheduler, the task set will always schedule tasks > at ANY locality level. We should improve this corner case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality
[ https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31837: Assignee: Apache Spark > Shift to the new highest locality level if there is when recomputeLocality > -- > > Key: SPARK-31837 > URL: https://issues.apache.org/jira/browse/SPARK-31837 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > In the new version of delay scheduling, if a task set is submitted before any > executors are added to TaskScheduler, the task set will always schedule tasks > at ANY locality level. We should improve this corner case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31839) delete duplicate code
philipse created SPARK-31839: Summary: delete duplicate code Key: SPARK-31839 URL: https://issues.apache.org/jira/browse/SPARK-31839 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.4.5 Reporter: philipse there is duplicate code; we can remove it to improve test quality -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality
[ https://issues.apache.org/jira/browse/SPARK-31837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117848#comment-17117848 ] Apache Spark commented on SPARK-31837: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28656 > Shift to the new highest locality level if there is when recomputeLocality > -- > > Key: SPARK-31837 > URL: https://issues.apache.org/jira/browse/SPARK-31837 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > In the new version of delay scheduling, if a task set is submitted before any > executors are added to TaskScheduler, the task set will always schedule tasks > at ANY locality level. We should improve this corner case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31838) The streaming output mode validator didn't treat union well
[ https://issues.apache.org/jira/browse/SPARK-31838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingcan Cui updated SPARK-31838: Parent: SPARK-31724 Issue Type: Sub-task (was: Bug) > The streaming output mode validator didn't treat union well > --- > > Key: SPARK-31838 > URL: https://issues.apache.org/jira/browse/SPARK-31838 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Xingcan Cui >Priority: Major > > {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union > operator well (it only assumes a linear query plan). To temporarily fix the > broken semantics before we have a complete improvement plan for the output > mode, the validator should do the following. > 1. Allow multiple (two for now) aggregations from different branches of the > last union operator. > 2. For complete output mode, disallow a union on streams with and without > aggregations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31838) The streaming output mode validator didn't treat union well
Xingcan Cui created SPARK-31838: --- Summary: The streaming output mode validator didn't treat union well Key: SPARK-31838 URL: https://issues.apache.org/jira/browse/SPARK-31838 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Xingcan Cui {{UnsupportedOperationChecker.checkForStreaming}} didn't treat the union operator well (it only assumes a linear query plan). To temporarily fix the broken semantics before we have a complete improvement plan for the output mode, the validator should do the following. 1. Allow multiple (two for now) aggregations from different branches of the last union operator. 2. For complete output mode, disallow a union on streams with and without aggregations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
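To make the union corner case above concrete, here is a hypothetical PySpark shape of the query being discussed: two independently aggregated streams combined with a union. The rate source and one-minute windows are placeholders; today checkForStreaming reasons about this plan as if it were linear, which is what points 1 and 2 in the description address.
{code:python}
# Hypothetical repro shape: a union of two aggregated streaming DataFrames.
# Source and window sizes are made up; only the plan shape matters here.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

left = (spark.readStream.format("rate").load()
        .groupBy(F.window("timestamp", "1 minute")).count())
right = (spark.readStream.format("rate").load()
         .groupBy(F.window("timestamp", "1 minute")).count())

unioned = left.union(right)  # one aggregation per union branch

# The output-mode validator runs when the query starts; with its current
# linear-plan assumption it may reject or mis-handle this shape.
query = (unioned.writeStream.outputMode("complete")
         .format("memory").queryName("union_agg").start())
{code}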
[jira] [Created] (SPARK-31837) Shift to the new highest locality level if there is when recomputeLocality
wuyi created SPARK-31837: Summary: Shift to the new highest locality level if there is when recomputeLocality Key: SPARK-31837 URL: https://issues.apache.org/jira/browse/SPARK-31837 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: wuyi In the new version of delay scheduling, if a task set is submitted before any executors are added to TaskScheduler, the task set will always schedule tasks at ANY locality level. We should improve this corner case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
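For context on the delay-scheduling behaviour referenced above, locality relaxation is driven by the spark.locality.wait settings; a small configuration sketch with illustrative values follows. SPARK-31837 concerns recomputing the best locality level once executors register, rather than staying at ANY.
{code:python}
# Context sketch: the knobs that govern delay scheduling. Values are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delay-scheduling-example")
    .config("spark.locality.wait", "3s")          # base wait before relaxing locality
    .config("spark.locality.wait.process", "3s")  # PROCESS_LOCAL -> NODE_LOCAL
    .config("spark.locality.wait.node", "3s")     # NODE_LOCAL -> RACK_LOCAL
    .config("spark.locality.wait.rack", "3s")     # RACK_LOCAL -> ANY
    .getOrCreate()
)
{code}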
[jira] [Resolved] (SPARK-31795) Stream Data with API to ServiceNow
[ https://issues.apache.org/jira/browse/SPARK-31795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Wetenkamp resolved SPARK-31795. --- Resolution: Workaround > Stream Data with API to ServiceNow > -- > > Key: SPARK-31795 > URL: https://issues.apache.org/jira/browse/SPARK-31795 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.4.5 >Reporter: Dominic Wetenkamp >Priority: Major > > 1) Create class > 2) Instantiate class > 3) Setup stream > 4) Write stream. Here do I get a pickeling error, I really don't know how to > get it work without error. > > > class CMDB: > #Public Properties > @property > def streamDF(self): > return spark.readStream.table(self.__source_table) > > #Constructor > def __init__(self, destination_table, source_table): > self.__destination_table = destination_table > self.__source_table = source_table > #Private Methodes > def __processRow(self, row): > #API connection info > url = 'https://foo.service-now.com/api/now/table/' + > self.__destination_table + '?sysparm_display_value=true' > user = 'username' > password = 'psw' > > headers = \{"Content-Type":"application/json","Accept":"application/json"} > response = requests.post(url, auth=(user, password), headers=headers, data = > json.dumps(row.asDict())) > > return response > #Public Methodes > def uploadStreamDF(self, df): > return df.writeStream.foreach(self.__processRow).trigger(once=True).start() > > > > cmdb = CMDB('destination_table_name','source_table_name') > streamDF = (cmdb.streamDF > .withColumn('object_id',col('source_column_id')) > .withColumn('name',col('source_column_name')) > ).select('object_id','name') > #set stream works, able to display data > cmdb.uploadStreamDF(streamDF) > #cmdb.uploadStreamDF(streamDF) fails with error: PicklingError: Could not > serialize object: Exception: It appears that you are attempting to reference > SparkContext from a broadcast variable, action, or transformation. > SparkContext can only be used on the driver, not in code that it run on > workers. For more information, see SPARK-5063. See exception below: > ''' > Exception Traceback (most recent call last) > /databricks/spark/python/pyspark/serializers.py in dumps(self, obj) > 704 try: > --> 705 return cloudpickle.dumps(obj, 2) > 706 except pickle.PickleError: > /databricks/spark/python/pyspark/cloudpickle.py in dumps(obj, protocol) > 862 cp = CloudPickler(file,protocol) > --> 863 cp.dump(obj) > 864 return file.getvalue() > /databricks/spark/python/pyspark/cloudpickle.py in dump(self, obj) > 259 try: > --> 260 return Pickler.dump(self, obj) > 261 except RuntimeError as e: > /databricks/python/lib/python3.7/pickle.py in dump(self, obj) > 436 self.framer.start_framing() > --> 437 self.save(obj) > 438 self.write(STOP) > ''' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
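Not part of the report itself, but since the issue was closed as a workaround: the usual way around this PicklingError is to keep the foreach callback a plain module-level function that references no driver-side objects (no SparkSession, and no class instance that holds one). A hedged sketch follows; the endpoint, credentials, table and column names mirror the placeholders in the report.
{code:python}
# Hedged workaround sketch, assuming the placeholders from the report above.
# The foreach callback only touches plain Python objects, so pickling it does
# not drag in the SparkSession.
import json
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

DEST_TABLE = "destination_table_name"  # placeholder

def post_row(row):
    url = ("https://foo.service-now.com/api/now/table/"
           + DEST_TABLE + "?sysparm_display_value=true")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return requests.post(url, auth=("username", "psw"), headers=headers,
                         data=json.dumps(row.asDict()))

spark = SparkSession.builder.getOrCreate()

stream_df = (spark.readStream.table("source_table_name")  # as in the report
             .withColumn("object_id", col("source_column_id"))
             .withColumn("name", col("source_column_name"))
             .select("object_id", "name"))

query = stream_df.writeStream.foreach(post_row).trigger(once=True).start()
{code}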
[jira] [Resolved] (SPARK-31803) Make sure instance weight is not negative
[ https://issues.apache.org/jira/browse/SPARK-31803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31803. -- Fix Version/s: 3.1.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28621 > Make sure instance weight is not negative > - > > Key: SPARK-31803 > URL: https://issues.apache.org/jira/browse/SPARK-31803 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > add checks to make sure instance weight is not negative. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31638) Clean code for pagination for all pages
[ https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31638. -- Fix Version/s: 3.1.0 Assignee: Rakesh Raushan Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28448 > Clean code for pagination for all pages > --- > > Key: SPARK-31638 > URL: https://issues.apache.org/jira/browse/SPARK-31638 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.1.0 > > > Clean code for pagination for different pages of spark webUI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages
[ https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-31638: - Issue Type: Improvement (was: Bug) > Clean code for pagination for all pages > --- > > Key: SPARK-31638 > URL: https://issues.apache.org/jira/browse/SPARK-31638 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Minor > > Clean code for pagination for different pages of spark webUI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions
[ https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117771#comment-17117771 ] Yuming Wang commented on SPARK-31809: - {noformat} hive> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; OK STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: $hdt$_0:t1 Fetch Operator limit: -1 Alias -> Map Local Operator Tree: $hdt$_0:t1 TableScan alias: t1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: COALESCE(c1,c2) is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: c1 (type: string), c2 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE HashTable Sink Operator keys: 0 COALESCE(_col0,_col1) (type: string) 1 _col0 (type: string) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: t2 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: c1 is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: c1 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 COALESCE(_col0,_col1) (type: string) 1 _col0 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Time taken: 1.281 seconds, Fetched: 67 row(s) {noformat} > Infer IsNotNull for all children of NullIntolerant expressions > -- > > Key: SPARK-31809 > URL: https://issues.apache.org/jira/browse/SPARK-31809 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: default.png, infer.png > > > We should infer {{IsNotNull}} for all children of {{NullIntolerant}} > expressions. 
For example: > {code:sql} > CREATE TABLE t1(c1 string, c2 string); > CREATE TABLE t2(c1 string, c2 string); > EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; > {code} > {noformat} > == Physical Plan == > *(4) Project [c1#5, c2#6] > +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner >:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33] >: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, > c2#6], Statistics(sizeInBytes=8.0 EiB) >+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#7, 200), true, [id=#46] > +- *(2) Filter isnotnull(c1#7) > +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], > Statistics(sizeInBytes=8.0 EiB) > {noformat} > We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query > performance: > {noformat} > == Physical Plan == > *(5) Project [c1#23, c2#24] > +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner >:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c
[jira] [Commented] (SPARK-28054) Unable to insert partitioned table dynamically when partition name is upper case
[ https://issues.apache.org/jira/browse/SPARK-28054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117759#comment-17117759 ] Sandeep Katta commented on SPARK-28054: --- [~hyukjin.kwon] is there any reason why this PR is not backported to branch2.4 ? > Unable to insert partitioned table dynamically when partition name is upper > case > > > Key: SPARK-28054 > URL: https://issues.apache.org/jira/browse/SPARK-28054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: ChenKai >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > {code:java} > -- create sql and column name is upper case > CREATE TABLE src (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING) > -- insert sql > INSERT INTO TABLE src PARTITION(ds) SELECT 'k' key, 'v' value, '1' ds > {code} > The error is: > {code:java} > Error in query: > org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: > Partition spec {ds=, DS=1} contains non-partition columns; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage
[ https://issues.apache.org/jira/browse/SPARK-31836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117726#comment-17117726 ] Wesley Hildebrandt commented on SPARK-31836: One last note .. with only four (or fewer) files this doesn't happen, I suspect because PySpark uses four executors so each only takes a single input file. > input_file_name() gives wrong value following Python UDF usage > -- > > Key: SPARK-31836 > URL: https://issues.apache.org/jira/browse/SPARK-31836 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wesley Hildebrandt >Priority: Major > > I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8. > The following commands demonstrate that the input_file_name() function > sometimes returns the wrong filename following usage of a Python UDF: > $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done > $ pyspark > >>> import pyspark.sql.functions as F > >>> spark.readStream.text('file:///tmp/test-file-*', > >>> wholetext=True).withColumn('file1', > >>> F.input_file_name()).withColumn('udf', F.udf(lambda > >>> x:x)('value')).withColumn('file2', > >>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda > >>> df,_: df.select('file1','file2').show(truncate=False, > >>> vertical=True)).start().awaitTermination() > A few notes about this bug: > * It happens with many different files, so it's not related to the file > contents > * It also happens loading files from HDFS, so storage location is not a > factor > * It also happens using .csv() to read the files instead of .text(), so > input format is not a factor > * I have not been able to cause the error without using readStream, so it > seems to be related to streaming > * The bug also happens using spark-submit to send a job to my cluster > * I haven't tested an older version, but it's possible that Spark pulls > 24958 and 25321([https://github.com/apache/spark/pull/24958], > [https://github.com/apache/spark/pull/25321]) to fix issue 28153 > (https://issues.apache.org/jira/browse/SPARK-28153) introduced this bug? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage
Wesley Hildebrandt created SPARK-31836: -- Summary: input_file_name() gives wrong value following Python UDF usage Key: SPARK-31836 URL: https://issues.apache.org/jira/browse/SPARK-31836 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Wesley Hildebrandt I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8. The following commands demonstrate that the input_file_name() function sometimes returns the wrong filename following usage of a Python UDF: $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done $ pyspark >>> import pyspark.sql.functions as F >>> spark.readStream.text('file:///tmp/test-file-*', >>> wholetext=True).withColumn('file1', F.input_file_name()).withColumn('udf', >>> F.udf(lambda x:x)('value')).withColumn('file2', >>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda >>> df,_: df.select('file1','file2').show(truncate=False, >>> vertical=True)).start().awaitTermination() A few notes about this bug: * It happens with many different files, so it's not related to the file contents * It also happens loading files from HDFS, so storage location is not a factor * It also happens using .csv() to read the files instead of .text(), so input format is not a factor * I have not been able to cause the error without using readStream, so it seems to be related to streaming * The bug also happens using spark-submit to send a job to my cluster * I haven't tested an older version, but it's possible that Spark pulls 24958 and 25321([https://github.com/apache/spark/pull/24958], [https://github.com/apache/spark/pull/25321]) to fix issue 28153 (https://issues.apache.org/jira/browse/SPARK-28153) introduced this bug? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
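The one-line reproduction above is hard to read in email form; the same steps reformatted for readability only (paths and logic unchanged, run in a pyspark shell where spark is predefined):
{code:python}
# The reporter's reproduction, reformatted only; logic and paths are unchanged.
import pyspark.sql.functions as F

def show_files(df, _):
    df.select("file1", "file2").show(truncate=False, vertical=True)

(spark.readStream.text("file:///tmp/test-file-*", wholetext=True)
    .withColumn("file1", F.input_file_name())
    .withColumn("udf", F.udf(lambda x: x)("value"))
    .withColumn("file2", F.input_file_name())
    .writeStream.trigger(once=True)
    .foreachBatch(show_files)
    .start()
    .awaitTermination())
{code}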
[jira] [Commented] (SPARK-31834) Improve error message for incompatible data types
[ https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117689#comment-17117689 ] Apache Spark commented on SPARK-31834: -- User 'lipzhu' has created a pull request for this issue: https://github.com/apache/spark/pull/28654 > Improve error message for incompatible data types > - > > Key: SPARK-31834 > URL: https://issues.apache.org/jira/browse/SPARK-31834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Zhu, Lipeng >Priority: Major > > {code:java} > spark-sql> create table SPARK_31834(a int) using parquet; > spark-sql> insert into SPARK_31834 select '1'; > Error in query: Cannot write incompatible data to table > '`default`.`spark_31834`': > - Cannot safely cast 'a': StringType to IntegerType; > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31834) Improve error message for incompatible data types
[ https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31834: Assignee: Zhu, Lipeng (was: Apache Spark) > Improve error message for incompatible data types > - > > Key: SPARK-31834 > URL: https://issues.apache.org/jira/browse/SPARK-31834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Zhu, Lipeng >Priority: Major > > {code:java} > spark-sql> create table SPARK_31834(a int) using parquet; > spark-sql> insert into SPARK_31834 select '1'; > Error in query: Cannot write incompatible data to table > '`default`.`spark_31834`': > - Cannot safely cast 'a': StringType to IntegerType; > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31834) Improve error message for incompatible data types
[ https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31834: Assignee: Apache Spark (was: Zhu, Lipeng) > Improve error message for incompatible data types > - > > Key: SPARK-31834 > URL: https://issues.apache.org/jira/browse/SPARK-31834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:java} > spark-sql> create table SPARK_31834(a int) using parquet; > spark-sql> insert into SPARK_31834 select '1'; > Error in query: Cannot write incompatible data to table > '`default`.`spark_31834`': > - Cannot safely cast 'a': StringType to IntegerType; > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31834) Improve error message for incompatible data types
[ https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117688#comment-17117688 ] Apache Spark commented on SPARK-31834: -- User 'lipzhu' has created a pull request for this issue: https://github.com/apache/spark/pull/28654 > Improve error message for incompatible data types > - > > Key: SPARK-31834 > URL: https://issues.apache.org/jira/browse/SPARK-31834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Zhu, Lipeng >Priority: Major > > {code:java} > spark-sql> create table SPARK_31834(a int) using parquet; > spark-sql> insert into SPARK_31834 select '1'; > Error in query: Cannot write incompatible data to table > '`default`.`spark_31834`': > - Cannot safely cast 'a': StringType to IntegerType; > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117613#comment-17117613 ] Gabor Somogyi commented on SPARK-31815: --- [~agrim] why do you think an external kerberos connector for hive is needed? Could you mention at least one use-case? > Support Hive Kerberos login in JDBC connector > - > > Key: SPARK-31815 > URL: https://issues.apache.org/jira/browse/SPARK-31815 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31834) Improve error message for incompatible data types
[ https://issues.apache.org/jira/browse/SPARK-31834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-31834: --- Assignee: Zhu, Lipeng > Improve error message for incompatible data types > - > > Key: SPARK-31834 > URL: https://issues.apache.org/jira/browse/SPARK-31834 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Zhu, Lipeng >Priority: Major > > {code:java} > spark-sql> create table SPARK_31834(a int) using parquet; > spark-sql> insert into SPARK_31834 select '1'; > Error in query: Cannot write incompatible data to table > '`default`.`spark_31834`': > - Cannot safely cast 'a': StringType to IntegerType; > spark-sql> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117569#comment-17117569 ] Puviarasu commented on SPARK-31754: --- Hello [~kabhwan], Please find below our updates for testing with Spark 2.4.5 and Spark 3.0.0preview2. * *Spark 2.4.5:* The application fails with the same java.lang.NullPointerException as we were getting in Spark 2.4.0 * *Spark 3.0.0preview2:* The application fails with below exception. Somewhat related to https://issues.apache.org/jira/browse/SPARK-27780. Full exception stack along with Logical Plan : [^Excpetion-3.0.0Preview2.txt] {code:java} org.apache.spark.shuffle.FetchFailedException: java.lang.IllegalArgumentException: Unknown message type: 10 {code} Regarding the input and checkpoint, it is production data actually. Sharing them as such is very difficult. We are looking for options to anonymize the data before sharing. Even then we require approvals from stake holders. Thank you. > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, > Logical-Plan.txt > > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. 
> *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. > Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util
[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puviarasu updated SPARK-31754: -- Attachment: Excpetion-3.0.0Preview2.txt > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, > Logical-Plan.txt > > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. > *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) >
[jira] [Commented] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite
[ https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117557#comment-17117557 ] Apache Spark commented on SPARK-31835: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28653 > Add zoneId to codegen related test in DateExpressionsSuite > -- > > Key: SPARK-31835 > URL: https://issues.apache.org/jira/browse/SPARK-31835 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Minor > > the formatter will fail earlier, before the codegen check happens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite
[ https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117555#comment-17117555 ] Apache Spark commented on SPARK-31835: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28653 > Add zoneId to codegen related test in DateExpressionsSuite > -- > > Key: SPARK-31835 > URL: https://issues.apache.org/jira/browse/SPARK-31835 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Minor > > the formatter will fail earlier, before the codegen check happens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31835) Add zoneId to codegen related test in DateExpressionsSuite
[ https://issues.apache.org/jira/browse/SPARK-31835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31835: Assignee: Apache Spark > Add zoneId to codegen related test in DateExpressionsSuite > -- > > Key: SPARK-31835 > URL: https://issues.apache.org/jira/browse/SPARK-31835 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Minor > > the formatter will fail earlier, before the codegen check happens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org