[jira] [Commented] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383792#comment-17383792
 ] 

Apache Spark commented on SPARK-32709:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/33432

> Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
> --
>
> Key: SPARK-32709
> URL: https://issues.apache.org/jira/browse/SPARK-32709
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> The Hive ORC/Parquet write code path is the same as the data source v1 code path 
> (FileFormatWriter). This JIRA is to add support for writing Hive ORC/Parquet 
> bucketed tables with hivehash. The change is to customize `bucketIdExpression` to 
> use hivehash when the table is a Hive bucketed table and the Hive version is 
> 1.x.y or 2.x.y.
>  
> This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 
> and 2.
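
A minimal sketch of the idea (illustrative only, not the actual patch; the helper and its signature are hypothetical) showing how the bucket id expression could switch to hivehash for Hive bucketed tables, using Catalyst's existing `HiveHash` and `Pmod` expressions:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, HiveHash, Literal, Pmod}
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

// Hypothetical helper (not the merged change): pick the bucket id expression
// depending on whether the target table uses Hive (1.x/2.x) bucketing.
def bucketIdExpression(
    bucketColumns: Seq[Expression],
    numBuckets: Int,
    isHiveBucketedTable: Boolean): Expression = {
  if (isHiveBucketedTable) {
    // Hive bucketing: bucket id = hivehash(bucket columns) pmod numBuckets,
    // so the written files stay readable by Hive and Presto.
    Pmod(HiveHash(bucketColumns), Literal(numBuckets))
  } else {
    // Spark-native bucketing keeps the Murmur3-based HashPartitioning id.
    HashPartitioning(bucketColumns, numBuckets).partitionIdExpression
  }
}
{code}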



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35815) Allow delayThreshold for watermark to be represented as ANSI day-time/year-month interval literals

2021-07-19 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383781#comment-17383781
 ] 

Kousuke Saruta commented on SPARK-35815:


I don't think so, and I'll work on this soon.

> Allow delayThreshold for watermark to be represented as ANSI 
> day-time/year-month interval literals
> --
>
> Key: SPARK-35815
> URL: https://issues.apache.org/jira/browse/SPARK-35815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Priority: Major
>
> The delayThreshold parameter of DataFrame.withWatermark should handle ANSI 
> day-time/year-month interval literals.
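
For illustration, a hedged sketch of what this would enable (assuming a local SparkSession; today the delay is given as a CalendarInterval-style string such as "10 seconds", and the last line below only works once this issue is resolved):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("watermark-ansi").getOrCreate()

// The "rate" source provides a timestamp column we can watermark on.
val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTime")

// With this change, an ANSI day-time interval literal should be accepted
// as the delayThreshold in addition to the existing interval strings.
val withAnsiDelay = events.withWatermark("eventTime", "INTERVAL '10' SECOND")
{code}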



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35815) Allow delayThreshold for watermark to be represented as ANSI day-time/year-month interval literals

2021-07-19 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383780#comment-17383780
 ] 

Wenchen Fan commented on SPARK-35815:
-

[~sarutak] do we have any more blockers for this one?

> Allow delayThreshold for watermark to be represented as ANSI 
> day-time/year-month interval literals
> --
>
> Key: SPARK-35815
> URL: https://issues.apache.org/jira/browse/SPARK-35815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Priority: Major
>
> The delayThreshold parameter of DataFrame.withWatermark should handle ANSI 
> day-time/year-month interval literals.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35228) Add expression ToHiveString to keep hive/spark format consistent in df.show and transform

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-35228:

Parent: (was: SPARK-27790)
Issue Type: Improvement  (was: Sub-task)

> Add expression ToHiveString to keep hive/spark format consistent in df.show 
> and transform
> --
>
> Key: SPARK-35228
> URL: https://issues.apache.org/jira/browse/SPARK-35228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> According to 
> [https://github.com/apache/spark/pull/32335#discussion_r620027850] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36221) Make sure CustomShuffleReaderExec has at least one partition

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383751#comment-17383751
 ] 

Apache Spark commented on SPARK-36221:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/33431

> Make sure CustomShuffleReaderExec has at least one partition
> 
>
> Key: SPARK-36221
> URL: https://issues.apache.org/jira/browse/SPARK-36221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Minor
>
> Since SPARK-32083, AQE coalescing always returns at least one partition, so 
> adding a non-empty check in `CustomShuffleReaderExec` would make it more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36221) Make sure CustomShuffleReaderExec has at least one partition

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36221:


Assignee: (was: Apache Spark)

> Make sure CustomShuffleReaderExec has at least one partition
> 
>
> Key: SPARK-36221
> URL: https://issues.apache.org/jira/browse/SPARK-36221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Minor
>
> Since SPARK-32083, AQE coalescing always returns at least one partition, so 
> adding a non-empty check in `CustomShuffleReaderExec` would make it more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36221) Make sure CustomShuffleReaderExec has at least one partition

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36221:


Assignee: Apache Spark

> Make sure CustomShuffleReaderExec has at least one partition
> 
>
> Key: SPARK-36221
> URL: https://issues.apache.org/jira/browse/SPARK-36221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>
> Since SPARK-32083, AQE coalescing always returns at least one partition, so 
> adding a non-empty check in `CustomShuffleReaderExec` would make it more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36196) Spark FetchFailedException Stream is corrupted Error

2021-07-19 Thread Arghya Saha (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arghya Saha updated SPARK-36196:

Component/s: PySpark
 Kubernetes

> Spark FetchFailedException Stream is corrupted Error
> 
>
> Key: SPARK-36196
> URL: https://issues.apache.org/jira/browse/SPARK-36196
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark, Spark Core
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark on K8s
>Reporter: Arghya Saha
>Priority: Major
>
> I am running Spark on K8s. Thousands of jobs run every day, but a few fail 
> each day (not the same jobs) with the exception below. They succeed on retry. 
> I have read about this error in multiple JIRAs and saw it was resolved in 
> Spark 3.0.0, but I am still getting the error with a higher version.
> {code:java}
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
> org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)
>  at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
>  at java.base/java.io.BufferedInputStream.fill(Unknown Source) at 
> java.base/java.io.BufferedInputStream.read1(Unknown Source) at 
> java.base/java.io.BufferedInputStream.read(Unknown Source) at 
> java.base/java.io.DataInputStream.read(Unknown Source) at 
> org.sparkproject.guava.io.ByteStreams.read(ByteStreams.java:899) at 
> org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:733) at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) 
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.sort_addToSorter_0$(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>  at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:817)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.<init>(SortMergeJoinExec.scala:687)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.$anonfun$doExecute$1(SortMergeJoinExec.scala:197)
>  at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:131) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source) at java.base/java.lang.Thread.run(Unknown Source)Caused by: 
> java.io.IOException: Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:250) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
>  ... 38 moreCaused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 
> 8785 of input buffer at 
> 

[jira] [Assigned] (SPARK-36093) The result is incorrect if the partition path case is inconsistent

2021-07-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-36093:
---

Assignee: angerszhu  (was: Apache Spark)

> The result is incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> +----+----------+
> scala> sql("SELECT * FROM t2 ").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   1|2021-06-27|
> |   1|2021-06-28|
> +----+----------+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   2|2021-06-29|
> |   2|2021-06-30|
> +----+----------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36046) Support new functions make_timestamp_ntz and make_timestamp_ltz

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383738#comment-17383738
 ] 

Apache Spark commented on SPARK-36046:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33430

> Support new functions make_timestamp_ntz and make_timestamp_ltz
> ---
>
> Key: SPARK-36046
> URL: https://issues.apache.org/jira/browse/SPARK-36046
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Syntax:
> make_timestamp_ntz(year, month, day, hour, min, sec)  Create local date-time 
> from year, month, day, hour, min, sec fields
> make_timestamp_ltz(year, month, day, hour, min, sec, [timezone])  Create 
> current timestamp with local time zone from year, month, day, hour, min, sec 
> and timezone fields
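
For illustration, a usage sketch of the syntax above (assuming a SparkSession named `spark` on a build that already includes these functions; the sample values are arbitrary):

{code:scala}
// TIMESTAMP_NTZ: a local date-time with no time zone attached.
spark.sql("SELECT make_timestamp_ntz(2021, 7, 19, 10, 30, 45.887)").show(false)

// TIMESTAMP_LTZ: the same fields interpreted in an explicit zone ('CET'),
// then rendered in the session time zone.
spark.sql("SELECT make_timestamp_ltz(2021, 7, 19, 10, 30, 45.887, 'CET')").show(false)
{code}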



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36216.
--
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 33427
[https://github.com/apache/spark/pull/33427]

> Increase timeout for 
> StreamingLinearRegressionWithTests.test_parameter_convergence
> --
>
> Key: SPARK-36216
> URL: https://issues.apache.org/jira/browse/SPARK-36216
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>
> Test is flaky (https://github.com/apache/spark/runs/3109815586):
> {code}
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 391, in test_parameter_convergence
> eventually(condition, catch_assertions=True)
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
> eventually
> raise lastValue
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
> eventually
> lastValue = condition()
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 387, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 9 != 10
> {code}
> Should probably increase timeout



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36216:


Assignee: Hyukjin Kwon

> Increase timeout for 
> StreamingLinearRegressionWithTests.test_parameter_convergence
> --
>
> Key: SPARK-36216
> URL: https://issues.apache.org/jira/browse/SPARK-36216
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Test is flaky (https://github.com/apache/spark/runs/3109815586):
> {code}
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 391, in test_parameter_convergence
> eventually(condition, catch_assertions=True)
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
> eventually
> raise lastValue
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
> eventually
> lastValue = condition()
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 387, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 9 != 10
> {code}
> Should probably increase timeout



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36046) Support new functions make_timestamp_ntz and make_timestamp_ltz

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383736#comment-17383736
 ] 

Apache Spark commented on SPARK-36046:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33430

> Support new functions make_timestamp_ntz and make_timestamp_ltz
> ---
>
> Key: SPARK-36046
> URL: https://issues.apache.org/jira/browse/SPARK-36046
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Syntax:
> make_timestamp_ntz(year, month, day, hour, min, sec)  Create local date-time 
> from year, month, day, hour, min, sec fields
> make_timestamp_ltz(year, month, day, hour, min, sec, [timezone])  Create 
> current timestamp with local time zone from year, month, day, hour, min, sec 
> and timezone fields



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36221) Make sure CustomShuffleReaderExec has at least one partition

2021-07-19 Thread XiDuo You (Jira)
XiDuo You created SPARK-36221:
-

 Summary: Make sure CustomShuffleReaderExec has at least one 
partition
 Key: SPARK-36221
 URL: https://issues.apache.org/jira/browse/SPARK-36221
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


Since SPARK-32083, AQE coalescing always returns at least one partition, so adding 
a non-empty check in `CustomShuffleReaderExec` would make it more robust.
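
As a rough illustration of the kind of guard being proposed (a sketch only, not the actual patch; the helper name is made up):

{code:scala}
// Fail fast if AQE ever hands the shuffle reader an empty partition list,
// which should no longer happen after SPARK-32083.
def assertNonEmpty[T](partitionSpecs: Seq[T]): Seq[T] = {
  assert(partitionSpecs.nonEmpty,
    "CustomShuffleReaderExec requires at least one partition spec")
  partitionSpecs
}
{code}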




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36217) Rename CustomShuffleReader and OptimizeLocalShuffleReader

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36217:


Assignee: Apache Spark

> Rename CustomShuffleReader and OptimizeLocalShuffleReader
> -
>
> Key: SPARK-36217
> URL: https://issues.apache.org/jira/browse/SPARK-36217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
> should just be something like {{AQEShuffleRead}}.
> {{OptimizeLocalShuffleReader}} is also the name of a rule, but it reads as a 
> reader, which is odd. We should also rename this to something else.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36217) Rename CustomShuffleReader and OptimizeLocalShuffleReader

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383731#comment-17383731
 ] 

Apache Spark commented on SPARK-36217:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33429

> Rename CustomShuffleReader and OptimizeLocalShuffleReader
> -
>
> Key: SPARK-36217
> URL: https://issues.apache.org/jira/browse/SPARK-36217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
> should just be something like {{AQEShuffleRead}}.
> {{OptimizeLocalShuffleReader}} is also the name of a rule, but it reads as a 
> reader, which is odd. We should also rename this to something else.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36217) Rename CustomShuffleReader and OptimizeLocalShuffleReader

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383732#comment-17383732
 ] 

Apache Spark commented on SPARK-36217:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33429

> Rename CustomShuffleReader and OptimizeLocalShuffleReader
> -
>
> Key: SPARK-36217
> URL: https://issues.apache.org/jira/browse/SPARK-36217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
> should just be something like {{AQEShuffleRead}}.
> {{OptimizeLocalShuffleReader}} is also the name of a rule, but it reads as a 
> reader, which is odd. We should also rename this to something else.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36217) Rename CustomShuffleReader and OptimizeLocalShuffleReader

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36217:


Assignee: (was: Apache Spark)

> Rename CustomShuffleReader and OptimizeLocalShuffleReader
> -
>
> Key: SPARK-36217
> URL: https://issues.apache.org/jira/browse/SPARK-36217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
> should just be something like {{AQEShuffleRead}}.
> {{OptimizeLocalShuffleReader}} is also the name of a rule, but it reads as a 
> reader, which is odd. We should also rename this to something else.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36220) Incorrect pyspark.sql.types.Row __new__ and __init__ type annotations

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383727#comment-17383727
 ] 

Apache Spark commented on SPARK-36220:
--

User 'tobiasedwards' has created a pull request for this issue:
https://github.com/apache/spark/pull/33428

> Incorrect pyspark.sql.types.Row __new__ and __init__ type annotations
> -
>
> Key: SPARK-36220
> URL: https://issues.apache.org/jira/browse/SPARK-36220
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Tobias Edwards
>Priority: Minor
>
> This bug involves incorrect type annotations for {{pyspark.sql.types.Row}}'s 
> {{\_\_new\_\_}} and {{\_\_init\_\_}} methods when invoked without keyword 
> arguments (_i.e._, {{\*args}} rather than {{\*\*kwargs}}).
> When creating a 
> [Row|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]
>  with unnamed fields which are not of type {{str}} (_e.g._, {{row1 = 
> Row("Alice", 11)}} appears in the {{Row}} documentation) type checkers 
> produce an error.
> The implementation doesn't assume the arguments are of type {{str}}, and in 
> fact the documentation includes an example where non-{{str}} types are 
> provided in this way (see [the final example 
> here|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]).
> An example of the type error produced by 
> [pyright|https://github.com/microsoft/pyright] is
> {code}
> error: No overloads for "__init__" match the provided arguments
>     Argument types: (Literal['Alice'], Literal[11]) (reportGeneralTypeIssues)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36220) Incorrect pyspark.sql.types.Row __new__ and __init__ type annotations

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36220:


Assignee: Apache Spark

> Incorrect pyspark.sql.types.Row __new__ and __init__ type annotations
> -
>
> Key: SPARK-36220
> URL: https://issues.apache.org/jira/browse/SPARK-36220
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Tobias Edwards
>Assignee: Apache Spark
>Priority: Minor
>
> This bug involves incorrect type annotations for {{pyspark.sql.types.Row}}'s 
> {{\_\_new\_\_}} and {{\_\_init\_\_}} methods when invoked without keyword 
> arguments (_i.e._, {{\*args}} rather than {{\*\*kwargs}}).
> When creating a 
> [Row|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]
>  with unnamed fields which are not of type {{str}} (_e.g._, {{row1 = 
> Row("Alice", 11)}} appears in the {{Row}} documentation) type checkers 
> produce an error.
> The implementation doesn't assume the arguments are of type {{str}}, and in 
> fact the documentation includes an example where non-{{str}} types are 
> provided in this way (see [the final example 
> here|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]).
> An example of the type error produced by 
> [pyright|https://github.com/microsoft/pyright] is
> {code}
> error: No overloads for "__init__" match the provided arguments
>     Argument types: (Literal['Alice'], Literal[11]) (reportGeneralTypeIssues)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36220) Incorrect pyspark.sql.types.Row __new__ and __init__ type annotations

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36220:


Assignee: (was: Apache Spark)

> Incorrect pyspark.sql.types.Row __new__ and __init__ type annotations
> -
>
> Key: SPARK-36220
> URL: https://issues.apache.org/jira/browse/SPARK-36220
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Tobias Edwards
>Priority: Minor
>
> This bug involves incorrect type annotations for {{pyspark.sql.types.Row}}'s 
> {{\_\_new\_\_}} and {{\_\_init\_\_}} methods when invoked without keyword 
> arguments (_i.e._, {{\*args}} rather than {{\*\*kwargs}}).
> When creating a 
> [Row|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]
>  with unnamed fields which are not of type {{str}} (_e.g._, {{row1 = 
> Row("Alice", 11)}} appears in the {{Row}} documentation) type checkers 
> produce an error.
> The implementation doesn't assume the arguments are of type {{str}}, and in 
> fact the documentation includes an example where non-{{str}} types are 
> provided in this way (see [the final example 
> here|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]).
> An example of the type error produced by 
> [pyright|https://github.com/microsoft/pyright] is
> {code}
> error: No overloads for "__init__" match the provided arguments
>     Argument types: (Literal['Alice'], Literal[11]) (reportGeneralTypeIssues)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36217) Rename CustomShuffleReader and OptimizeLocalShuffleReader

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36217:
-
Description: 
The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
should just be something like {{AQEShuffleRead}}.
{{OptimizeLocalShuffleReader}} is also the name of a rule, but it reads as a 
reader, which is odd. We should also rename this to something else.

  was:
The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
should just be {{AQEShuffleReader}}.

{{OptimizeLocalShuffleReader}} is a name of a rule but it's named it as a 
reader. We should also rename this to something else


> Rename CustomShuffleReader and OptimizeLocalShuffleReader
> -
>
> Key: SPARK-36217
> URL: https://issues.apache.org/jira/browse/SPARK-36217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
> should just be something like {{AQEShuffleRead}}.
> {{OptimizeLocalShuffleReader}} is also the name of a rule, but it reads as a 
> reader, which is odd. We should also rename this to something else.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36220) Incorrect pyspark.sql.types.Row __new__ and __init__ type annotations

2021-07-19 Thread Tobias Edwards (Jira)
Tobias Edwards created SPARK-36220:
--

 Summary: Incorrect pyspark.sql.types.Row __new__ and __init__ type 
annotations
 Key: SPARK-36220
 URL: https://issues.apache.org/jira/browse/SPARK-36220
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.2
Reporter: Tobias Edwards


This bug involves incorrect type annotations for {{pyspark.sql.types.Row}}'s 
{{\_\_new\_\_}} and {{\_\_init\_\_}} methods when invoked without keyword 
arguments (_i.e._, {{\*args}} rather than {{\*\*kwargs}}).

When creating a 
[Row|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]
 with unnamed fields which are not of type {{str}} (_e.g._, {{row1 = 
Row("Alice", 11)}} appears in the {{Row}} documentation) type checkers produce 
an error.

The implementation doesn't assume the arguments are of type {{str}}, and in 
fact the documentation includes an example where non-{{str}} types are provided 
in this way (see [the final example 
here|https://hyukjin-spark.readthedocs.io/en/latest/reference/api/pyspark.sql.types.Row.html]).

An example of the type error produced by 
[pyright|https://github.com/microsoft/pyright] is

{code}
error: No overloads for "__init__" match the provided arguments
    Argument types: (Literal['Alice'], Literal[11]) (reportGeneralTypeIssues)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36219) Add flag to allow Driver to request for OPPORTUNISTIC containers

2021-07-19 Thread chaosju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaosju updated SPARK-36219:

Affects Version/s: (was: 3.1.2)
   3.3.0

> Add flag to allow Driver to request for OPPORTUNISTIC containers
> 
>
> Key: SPARK-36219
> URL: https://issues.apache.org/jira/browse/SPARK-36219
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: chaosju
>Priority: Major
>
> YARN-2882 and YARN-4335 introduce the concept of container ExecutionTypes 
> and specifically OPPORTUNISTIC containers.
> The default ExecutionType is GUARANTEED. This JIRA proposes to allow users to 
> provide hints via config to the Spark framework as to the number of 
> containers it would like to schedule as OPPORTUNISTIC,
> similar to MAPREDUCE-6703.
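
Purely as an illustration of the kind of hint being proposed (the config key below is hypothetical and does not exist today; the final name would be decided in the PR):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "50")
  // Hypothetical knob, analogous to the MapReduce one from MAPREDUCE-6703:
  // how many of the requested executor containers may be OPPORTUNISTIC.
  .set("spark.yarn.executor.opportunistic.maxCount", "20")
{code}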



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36219) Add flag to allow Driver to request for OPPORTUNISTIC containers

2021-07-19 Thread chaosju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaosju updated SPARK-36219:

Description: 
YARN-2882 and YARN-4335 introduce the concept of container ExecutionTypes and 
specifically OPPORTUNISTIC containers.
The default ExecutionType is GUARANTEED. This JIRA proposes to allow users to 
provide hints via config to the Spark framework as to the number of containers 
it would like to schedule as OPPORTUNISTIC.

> Add flag to allow Driver to request for OPPORTUNISTIC containers
> 
>
> Key: SPARK-36219
> URL: https://issues.apache.org/jira/browse/SPARK-36219
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.2
>Reporter: chaosju
>Priority: Major
>
> YARN-2882 and YARN-4335 introduce the concept of container ExecutionTypes 
> and specifically OPPORTUNISTIC containers.
> The default ExecutionType is GUARANTEED. This JIRA proposes to allow users to 
> provide hints via config to the Spark framework as to the number of 
> containers it would like to schedule as OPPORTUNISTIC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36219) Add flag to allow Driver to request for OPPORTUNISTIC containers

2021-07-19 Thread chaosju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chaosju updated SPARK-36219:

Description: 
YARN-2882 and YARN-4335 introduce the concept of container ExecutionTypes and 
specifically OPPORTUNISTIC containers.
The default ExecutionType is GUARANTEED. This JIRA proposes to allow users to 
provide hints via config to the Spark framework as to the number of containers 
it would like to schedule as OPPORTUNISTIC,


similar to MAPREDUCE-6703.

  was:
YARN-2882 and YARN-4335 introduces the concept of container ExecutionTypes and 
specifically OPPORTUNISTIC containers.
The default ExecutionType is GUARANTEED. This JIRA proposes to allow users to 
provide hints via config to the SPARK framework as to the number of containers 
it would like to schedule as OPPORTUNISTIC.


> Add flag to allow Driver to request for OPPORTUNISTIC containers
> 
>
> Key: SPARK-36219
> URL: https://issues.apache.org/jira/browse/SPARK-36219
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.1.2
>Reporter: chaosju
>Priority: Major
>
> YARN-2882 and YARN-4335 introduce the concept of container ExecutionTypes 
> and specifically OPPORTUNISTIC containers.
> The default ExecutionType is GUARANTEED. This JIRA proposes to allow users to 
> provide hints via config to the Spark framework as to the number of 
> containers it would like to schedule as OPPORTUNISTIC,
> similar to MAPREDUCE-6703.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36219) Add flag to allow Driver to request for OPPORTUNISTIC containers

2021-07-19 Thread chaosju (Jira)
chaosju created SPARK-36219:
---

 Summary: Add flag to allow Driver to request for OPPORTUNISTIC 
containers
 Key: SPARK-36219
 URL: https://issues.apache.org/jira/browse/SPARK-36219
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.1.2
Reporter: chaosju






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36093) The result is incorrect if the partition path case is inconsistent

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-36093:

Fix Version/s: 3.0.4

> The result is incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> +----+----------+
> scala> sql("SELECT * FROM t2 ").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   1|2021-06-27|
> |   1|2021-06-28|
> +----+----------+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   2|2021-06-29|
> |   2|2021-06-30|
> +----+----------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383707#comment-17383707
 ] 

Hyukjin Kwon edited comment on SPARK-36218 at 7/20/21, 2:44 AM:


cc [~maropu], [~cloud_fan], [~dongjoon] FYI.

Actually, I faced this issue in our internal repo a while ago, and just added a 
hacky fix by adding an explicit GC:

{code}
  if (tpcdsDataPath.nonEmpty) {
tpcdsQueries
  .foreach { name =>
  val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(name) {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v1_4", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
tpcdsQueriesV2_7_0
  .foreach { name =>
  val queryString = resourceToString(s"tpcds-v2.7.0/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(s"$name-v2.7") {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v2_7", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
  } else {
ignore("skipped because env `SPARK_TPCDS_DATA` is not set") {}
  }
}
{code}

Oh wait, let me take this back. TPC-DS became flaky now in our internal repo 
even with the fix ^.


was (Author: hyukjin.kwon):
cc [~maropu], [~cloud_fan], [~dongjoon] FYI.

Actually, I faced this issue in our internal repo a while ago, and just added a 
hacky fix by adding an explicit GC:

{code}
  if (tpcdsDataPath.nonEmpty) {
tpcdsQueries
  .foreach { name =>
  val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(name) {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v1_4", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
tpcdsQueriesV2_7_0
  .foreach { name =>
  val queryString = resourceToString(s"tpcds-v2.7.0/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(s"$name-v2.7") {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v2_7", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
  } else {
ignore("skipped because env `SPARK_TPCDS_DATA` is not set") {}
  }
}
{code}

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36218:
-
Comment: was deleted

(was: If we're not sure, I think we can land the same hacky fix for now ... 
[~maropu] do you have any idea about this?)

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36218:
-
Comment: was deleted

(was: Let me create a Pr for now as a temporary workaround ...)

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383709#comment-17383709
 ] 

Hyukjin Kwon commented on SPARK-36218:
--

Let me create a Pr for now as a temporary workaround ...

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383708#comment-17383708
 ] 

Hyukjin Kwon commented on SPARK-36218:
--

If we're not sure, I think we can land the same hacky fix for now ... 
[~maropu] do you have any idea about this?

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383707#comment-17383707
 ] 

Hyukjin Kwon edited comment on SPARK-36218 at 7/20/21, 2:41 AM:


cc [~maropu], [~cloud_fan], [~dongjoon] FYI.

Actually, I faced this issue in our internal repo a while ago, and just added a 
hacky fix by adding an explicit GC:

{code}
  if (tpcdsDataPath.nonEmpty) {
tpcdsQueries
  .foreach { name =>
  val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(name) {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v1_4", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
tpcdsQueriesV2_7_0
  .foreach { name =>
  val queryString = resourceToString(s"tpcds-v2.7.0/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(s"$name-v2.7") {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v2_7", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
  } else {
ignore("skipped because env `SPARK_TPCDS_DATA` is not set") {}
  }
}
{code}


was (Author: hyukjin.kwon):
cc [~maropu], [~cloud_fan], [~dongjoon] FYI.

Actually, I faced this issue in our internal repo a while ago, and just added a 
hacky fix by adding an explicit GC:

{code}
  if (tpcdsDataPath.nonEmpty) {
tpcdsQueries
  .filter(_ != "q95") // TODO(SC-75125)
  .filter(_ != "q75") // TODO(SC-75127)
  .filter(_ != "q64") // TODO(SC-75126)
  .foreach { name =>
  val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(name) {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v1_4", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
tpcdsQueriesV2_7_0
  .filter(_ != "q95") // TODO(SC-75125)
  .filter(_ != "q75") // TODO(SC-75127)
  .filter(_ != "q64") // TODO(SC-75126)
  .foreach { name =>
  val queryString = resourceToString(s"tpcds-v2.7.0/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(s"$name-v2.7") {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v2_7", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
  } else {
ignore("skipped because env `SPARK_TPCDS_DATA` is not set") {}
  }
}
{code}

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383707#comment-17383707
 ] 

Hyukjin Kwon commented on SPARK-36218:
--

cc [~maropu], [~cloud_fan], [~dongjoon] FYI.

Actually, I faced this issue in our internal repo a while ago, and just added a 
hacky fix by adding an explicit GC:

{code}
  if (tpcdsDataPath.nonEmpty) {
tpcdsQueries
  .filter(_ != "q95") // TODO(SC-75125)
  .filter(_ != "q75") // TODO(SC-75127)
  .filter(_ != "q64") // TODO(SC-75126)
  .foreach { name =>
  val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(name) {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v1_4", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
tpcdsQueriesV2_7_0
  .filter(_ != "q95") // TODO(SC-75125)
  .filter(_ != "q75") // TODO(SC-75127)
  .filter(_ != "q64") // TODO(SC-75126)
  .foreach { name =>
  val queryString = resourceToString(s"tpcds-v2.7.0/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
  test(s"$name-v2.7") {
+ // SPARK-36218: workaround to prevent unexpected failure related to 
resource usage.
+ System.gc()
val goldenFile = new File(s"$baseResourcePath/v2_7", s"$name.sql.out")
runQuery(queryString, goldenFile)
  }
}
  } else {
ignore("skipped because env `SPARK_TPCDS_DATA` is not set") {}
  }
}
{code}

> Flaky Test: TPC-DS in PR builder
> 
>
> Key: SPARK-36218
> URL: https://issues.apache.org/jira/browse/SPARK-36218
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> [info] - q1 (9 seconds, 603 milliseconds)
> [info] - q2 (5 seconds, 860 milliseconds)
> [info] - q3 (1 second, 777 milliseconds)
> [info] - q4 (31 seconds, 951 milliseconds)
> [info] - q5 (4 seconds, 561 milliseconds)
> [info] - q7 (2 seconds, 471 milliseconds)
> [info] - q8 (2 seconds, 74 milliseconds)
> [info] - q9 (4 seconds, 402 milliseconds)
> [info] - q10 (4 seconds, 618 milliseconds)
> /home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 
> Killed  "$@"
> Error: Process completed with exit code 137.
> {code}
> It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36218) Flaky Test: TPC-DS in PR builder

2021-07-19 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36218:


 Summary: Flaky Test: TPC-DS in PR builder
 Key: SPARK-36218
 URL: https://issues.apache.org/jira/browse/SPARK-36218
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.1.2, 3.0.3, 3.2.0, 3.3.0
Reporter: Hyukjin Kwon


{code}
[info] - q1 (9 seconds, 603 milliseconds)
[info] - q2 (5 seconds, 860 milliseconds)
[info] - q3 (1 second, 777 milliseconds)
[info] - q4 (31 seconds, 951 milliseconds)
[info] - q5 (4 seconds, 561 milliseconds)
[info] - q7 (2 seconds, 471 milliseconds)
[info] - q8 (2 seconds, 74 milliseconds)
[info] - q9 (4 seconds, 402 milliseconds)
[info] - q10 (4 seconds, 618 milliseconds)
/home/runner/work/spark/spark/build/sbt-launch-lib.bash: line 77:  1659 Killed  
"$@"
Error: Process completed with exit code 137.
{code}

It dies in the middle: https://github.com/apache/spark/runs/3109502701



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35807) Deprecate the `num_files` argument

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35807:
-
Fix Version/s: 3.2.0

> Deprecate the `num_files` argument
> --
>
> Key: SPARK-35807
> URL: https://issues.apache.org/jira/browse/SPARK-35807
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> We should deprecate the num_files argument in [DataFrame.to_csv 
> |https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]and
>  
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html].
> Because num_files does not actually specify the number of files; it specifies 
> the number of partitions.
> So we should encourage users to use 
> [DataFrame.spark.repartition|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html]
>  instead in the warning message.
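
As a minimal sketch of the suggested migration (assuming the pyspark.pandas 
module naming in Spark 3.2; the output path is a placeholder):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})

# Deprecated pattern: num_files does not really control the number of output
# files, it controls the number of partitions.
# psdf.to_csv("/tmp/out", num_files=1)

# Suggested replacement: repartition explicitly, then write.
psdf.spark.repartition(1).to_csv("/tmp/out")
{code}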



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35807) Deprecate the `num_files` argument

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35807:


Assignee: Haejoon Lee

> Deprecate the `num_files` argument
> --
>
> Key: SPARK-35807
> URL: https://issues.apache.org/jira/browse/SPARK-35807
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should deprecate the num_files argument in [DataFrame.to_csv 
> |https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]and
>  
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html].
> Because num_files does not actually specify the number of files; it specifies 
> the number of partitions.
> So we should encourage users to use 
> [DataFrame.spark.repartition|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html]
>  instead in the warning message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36216:


Assignee: Apache Spark

> Increase timeout for 
> StreamingLinearRegressionWithTests.test_parameter_convergence
> --
>
> Key: SPARK-36216
> URL: https://issues.apache.org/jira/browse/SPARK-36216
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Test is flaky (https://github.com/apache/spark/runs/3109815586):
> {code}
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 391, in test_parameter_convergence
> eventually(condition, catch_assertions=True)
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
> eventually
> raise lastValue
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
> eventually
> lastValue = condition()
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 387, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 9 != 10
> {code}
> Should probably increase timeout



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383701#comment-17383701
 ] 

Apache Spark commented on SPARK-36216:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33427

> Increase timeout for 
> StreamingLinearRegressionWithTests.test_parameter_convergence
> --
>
> Key: SPARK-36216
> URL: https://issues.apache.org/jira/browse/SPARK-36216
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Test is flaky (https://github.com/apache/spark/runs/3109815586):
> {code}
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 391, in test_parameter_convergence
> eventually(condition, catch_assertions=True)
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
> eventually
> raise lastValue
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
> eventually
> lastValue = condition()
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 387, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 9 != 10
> {code}
> Should probably increase timeout



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36217) Rename CustomShuffleReader and OptimizeLocalShuffleReader

2021-07-19 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36217:


 Summary: Rename CustomShuffleReader and OptimizeLocalShuffleReader
 Key: SPARK-36217
 URL: https://issues.apache.org/jira/browse/SPARK-36217
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


The name {{CustomShuffleReader}} is confusing and sounds like an API. This 
should just be {{AQEShuffleReader}}.

{{OptimizeLocalShuffleReader}} is the name of a rule, but it is named as a 
reader. We should also rename this to something else.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36216:


Assignee: (was: Apache Spark)

> Increase timeout for 
> StreamingLinearRegressionWithTests.test_parameter_convergence
> --
>
> Key: SPARK-36216
> URL: https://issues.apache.org/jira/browse/SPARK-36216
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Test is flaky (https://github.com/apache/spark/runs/3109815586):
> {code}
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 391, in test_parameter_convergence
> eventually(condition, catch_assertions=True)
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
> eventually
> raise lastValue
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
> eventually
> lastValue = condition()
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 387, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 9 != 10
> {code}
> Should probably increase timeout



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36216:
-
Description: 
Test is flaky (https://github.com/apache/spark/runs/3109815586):

{code}
Traceback (most recent call last):
  File 
"/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
line 391, in test_parameter_convergence
eventually(condition, catch_assertions=True)
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
eventually
raise lastValue
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
eventually
lastValue = condition()
  File 
"/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
line 387, in condition
self.assertEqual(len(model_weights), len(batches))
AssertionError: 9 != 10
{code}

Should probably increase timeout
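
As a sketch (the condition below is only a stand-in for the real assertion in 
the test, and the timeout keyword is assumed from pyspark/testing/utils.py):

{code:python}
from pyspark.testing.utils import eventually

def condition():
    # placeholder for the real check, e.g. assertEqual(len(model_weights), len(batches))
    return True

# Pass a larger timeout (in seconds) than the default so slow CI runs can catch up.
eventually(condition, timeout=180.0, catch_assertions=True)
{code}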

> Increase timeout for 
> StreamingLinearRegressionWithTests.test_parameter_convergence
> --
>
> Key: SPARK-36216
> URL: https://issues.apache.org/jira/browse/SPARK-36216
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Test is flaky (https://github.com/apache/spark/runs/3109815586):
> {code}
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 391, in test_parameter_convergence
> eventually(condition, catch_assertions=True)
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
> eventually
> raise lastValue
>   File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
> eventually
> lastValue = condition()
>   File 
> "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
> line 387, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 9 != 10
> {code}
> Should probably increase timeout



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36216:


 Summary: Increase timeout for 
StreamingLinearRegressionWithTests.test_parameter_convergence
 Key: SPARK-36216
 URL: https://issues.apache.org/jira/browse/SPARK-36216
 Project: Spark
  Issue Type: Test
  Components: PySpark, Tests
Affects Versions: 3.1.2, 3.0.3, 3.2.0, 3.3.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36216) Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36216:
-
Docs Text:   (was: Test is flaky 
(https://github.com/apache/spark/runs/3109815586):

{code}
Traceback (most recent call last):
  File 
"/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
line 391, in test_parameter_convergence
eventually(condition, catch_assertions=True)
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in 
eventually
raise lastValue
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in 
eventually
lastValue = condition()
  File 
"/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", 
line 387, in condition
self.assertEqual(len(model_weights), len(batches))
AssertionError: 9 != 10
{code}

Should probably increase timeout)

> Increase timeout for 
> StreamingLinearRegressionWithTests.test_parameter_convergence
> --
>
> Key: SPARK-36216
> URL: https://issues.apache.org/jira/browse/SPARK-36216
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35807) Deprecate the `num_files` argument

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383699#comment-17383699
 ] 

Apache Spark commented on SPARK-35807:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33379

> Deprecate the `num_files` argument
> --
>
> Key: SPARK-35807
> URL: https://issues.apache.org/jira/browse/SPARK-35807
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should deprecate the num_files argument in [DataFrame.to_csv 
> |https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]and
>  
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html].
> Because num_files does not actually specify the number of files; it specifies 
> the number of partitions.
> So we should encourage users to use 
> [DataFrame.spark.repartition|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html]
>  instead in the warning message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35807) Deprecate the `num_files` argument

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383698#comment-17383698
 ] 

Apache Spark commented on SPARK-35807:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33379

> Deprecate the `num_files` argument
> --
>
> Key: SPARK-35807
> URL: https://issues.apache.org/jira/browse/SPARK-35807
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should deprecate the num_files argument in [DataFrame.to_csv 
> |https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]and
>  
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html].
> Because num_files does not actually specify the number of files; it specifies 
> the number of partitions.
> So we should encourage users to use 
> [DataFrame.spark.repartition|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html]
>  instead in the warning message.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36215) Add logging for slow fetches to diagnose external shuffle service issues

2021-07-19 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-36215:
---

 Summary: Add logging for slow fetches to diagnose external shuffle 
service issues
 Key: SPARK-36215
 URL: https://issues.apache.org/jira/browse/SPARK-36215
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.2.0
Reporter: Shardul Mahadik


Currently we can see from the metrics that a task or stage has slow fetches, 
and the logs indicate _all_ of the shuffle servers those tasks were fetching 
from, but often this is a big set (dozens or even hundreds) and narrowing down 
which one caused issues can be very difficult. We should add some logging when 
a fetch is "slow" as determined by some preconfigured thresholds.
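
Illustratively (not Spark's shuffle code; the names and the threshold below are 
hypothetical), the intended check is along these lines:

{code:python}
import logging
import time

SLOW_FETCH_THRESHOLD_MS = 5000  # hypothetical; the real value would come from a Spark conf

def fetch_block(host, do_fetch):
    """Fetch a shuffle block and warn if the remote shuffle service was slow."""
    start = time.monotonic()
    data = do_fetch()
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > SLOW_FETCH_THRESHOLD_MS:
        logging.warning("Slow shuffle fetch from %s: %.0f ms", host, elapsed_ms)
    return data
{code}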
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35809) Add `index_col` argument for ps.sql.

2021-07-19 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35809:

Description: The current behavior of [ps.sql 
|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] always
loses the index, so we should add the `index_col` argument for this API so 
that we can preserve the index.  (was: The current behavior of [ps.sql 
|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] always
loses the index, so we should add the `indxe_col` argument for this API so 
that we can preserve the index.)

> Add `index_col` argument for ps.sql.
> 
>
> Key: SPARK-35809
> URL: https://issues.apache.org/jira/browse/SPARK-35809
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The current behavior of [ps.sql 
> |https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] always
> loses the index, so we should add the `index_col` argument for this API so 
> that we can preserve the index.
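
A minimal sketch of the intended usage (index_col is the proposed argument, so 
this only works once the change lands):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"id": [1, 2, 3], "v": [10, 20, 30]})
psdf.to_spark().createOrReplaceTempView("t")

# Today the result comes back with a freshly generated default index; with the
# proposed argument the "id" column would be kept as the index instead.
result = ps.sql("SELECT id, v FROM t", index_col="id")
{code}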



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36179.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33393
[https://github.com/apache/spark/pull/33393]

> Support TimestampNTZType in SparkGetColumnsOperation
> 
>
> Key: SPARK-36179
> URL: https://issues.apache.org/jira/browse/SPARK-36179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> TimestampNTZType is unhandled in SparkGetColumnsOperation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36179) Support TimestampNTZType in SparkGetColumnsOperation

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36179:


Assignee: Kent Yao

> Support TimestampNTZType in SparkGetColumnsOperation
> 
>
> Key: SPARK-36179
> URL: https://issues.apache.org/jira/browse/SPARK-36179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> TimestampNTZType is unhandled in SparkGetColumnsOperation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35809) Add `index_col` argument for ps.sql.

2021-07-19 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383679#comment-17383679
 ] 

Haejoon Lee commented on SPARK-35809:
-

I'm working on this

> Add `index_col` argument for ps.sql.
> 
>
> Key: SPARK-35809
> URL: https://issues.apache.org/jira/browse/SPARK-35809
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The current behavior of [ps.sql 
> |https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.sql.html] always
> loses the index, so we should add the `index_col` argument for this API so 
> that we can preserve the index.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36176) Expose tableExists in pyspark.sql.catalog

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36176.
--
Fix Version/s: 3.2.0
 Assignee: Dominik Gehl
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/33388

> Expose tableExists in pyspark.sql.catalog
> -
>
> Key: SPARK-36176
> URL: https://issues.apache.org/jira/browse/SPARK-36176
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Dominik Gehl
>Priority: Minor
> Fix For: 3.2.0
>
>
> Expose in PySpark the tableExists method that is already part of the Scala Catalog implementation.
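
A small sketch of the Python usage this enables (assuming the method mirrors 
the Scala Catalog.tableExists signature):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.catalog.tableExists("some_table")         # False if it does not exist yet
spark.range(1).write.saveAsTable("some_table")
spark.catalog.tableExists("some_table")         # True
{code}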



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383669#comment-17383669
 ] 

Apache Spark commented on SPARK-32920:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/33426

> Add support in Spark driver to coordinate the finalization of the push/merge 
> phase in push-based shuffle for a given shuffle and the initiation of the 
> reduce stage
> ---
>
> Key: SPARK-32920
> URL: https://issues.apache.org/jira/browse/SPARK-32920
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.2.0
>
>
> With push-based shuffle, we are currently decoupling map task executions from 
> the shuffle block push process. Thus, when all map tasks finish, we might 
> want to wait for some small extra time to allow more shuffle blocks to get 
> pushed and merged. This requires some extra coordination in the Spark driver 
> when it transitions from a shuffle map stage to the corresponding reduce 
> stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36000) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-07-19 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383667#comment-17383667
 ] 

Xinrong Meng commented on SPARK-36000:
--

We might want to support spark.createDataFrame(data=[decimal.Decimal('NaN')], 
schema='decimal') first.
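
A minimal sketch of that suggested first step (per this ticket, the call below 
is the case to make work, or fail cleanly, before fixing ps.Series):

{code:python}
import decimal

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Creating a DataFrame directly from Decimal('NaN') with a decimal schema.
df = spark.createDataFrame(data=[decimal.Decimal("NaN")], schema="decimal")
df.show()
{code}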

> Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
> 
>
> Key: SPARK-36000
> URL: https://issues.apache.org/jira/browse/SPARK-36000
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> >>> import decimal as d
> >>> import pyspark.pandas as ps
> >>> import numpy as np
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  True)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 0   1
> 1   2
> 2None
> dtype: object
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  False)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51)
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
> ...
> {code}
> As shown in the code above, we cannot create a Series with `Decimal('NaN')` 
> when Arrow is disabled. We ought to fix that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383663#comment-17383663
 ] 

Apache Spark commented on SPARK-32919:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/33425

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.1.0
>
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this 
> would be behind a pluggable interface so that we can potentially leverage 
> information tracked outside of a Spark application for better load balancing 
> or for a disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383664#comment-17383664
 ] 

Apache Spark commented on SPARK-32919:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/33425

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.1.0
>
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this 
> would be behind a pluggable interface so that we can potentially leverage 
> information tracked outside of a Spark application for better load balancing 
> or for a disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36214) Add add_categories to CategoricalAccessor and CategoricalIndex.

2021-07-19 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383641#comment-17383641
 ] 

Takuya Ueshin commented on SPARK-36214:
---

I'm working on this.

> Add add_categories to CategoricalAccessor and CategoricalIndex.
> ---
>
> Key: SPARK-36214
> URL: https://issues.apache.org/jira/browse/SPARK-36214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36214) Add add_categories to CategoricalAccessor and CategoricalIndex.

2021-07-19 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36214:
-

 Summary: Add add_categories to CategoricalAccessor and 
CategoricalIndex.
 Key: SPARK-36214
 URL: https://issues.apache.org/jira/browse/SPARK-36214
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin
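
For context, a small sketch of the pandas behavior the new accessor method 
would mirror (plain pandas shown here; the pandas-on-Spark call would look the 
same on a ps.Series):

{code:python}
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# pandas behavior to mirror: append a new, unused category to the dtype.
s2 = s.cat.add_categories(["c"])
print(s2.cat.categories)  # Index(['a', 'b', 'c'], dtype='object')
{code}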






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35997) Implement comparison operators for CategoricalDtype in pandas API on Spark

2021-07-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-35997.
--
Resolution: Done

> Implement comparison operators for CategoricalDtype in pandas API on Spark
> --
>
> Key: SPARK-35997
> URL: https://issues.apache.org/jira/browse/SPARK-35997
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> In pandas API on Spark, "<, <=, >, >=" have not been implemented for 
> CategoricalDtype.
> We ought to match pandas' behavior.
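
For reference, the pandas behavior to match (plain pandas; ordered categoricals 
support these operators):

{code:python}
import pandas as pd

cat = pd.CategoricalDtype(categories=["low", "mid", "high"], ordered=True)
s = pd.Series(["low", "high", "mid"], dtype=cat)

print(s > "low")   # ordered comparison against a scalar category
print(s <= s)      # element-wise comparison between two categorical series
{code}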



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36127) Support comparison between a Categorical and a scalar

2021-07-19 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36127.
---
Fix Version/s: 3.2.0
 Assignee: Xinrong Meng  (was: Apache Spark)
   Resolution: Fixed

Issue resolved by pull request 33373
https://github.com/apache/spark/pull/33373

> Support comparison between a Categorical and a scalar
> -
>
> Key: SPARK-36127
> URL: https://issues.apache.org/jira/browse/SPARK-36127
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-07-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383529#comment-17383529
 ] 

Thomas Graves commented on SPARK-25075:
---

Just wanted to check the plans for scala 2.13 in 3.2.  It looks like scala 2.12 
will still be the default, correct?

Are we planning on releasing the Spark tgz artifacts for 2.13 and 2.12 or only 
2.12?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36213:


Assignee: (was: Apache Spark)

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> !image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383485#comment-17383485
 ] 

Apache Spark commented on SPARK-36213:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/33424

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> !image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36213:


Assignee: Apache Spark

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> !image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Kent Yao (Jira)
Kent Yao created SPARK-36213:


 Summary: Normalize PartitionSpec for DescTable with PartitionSpec
 Key: SPARK-36213
 URL: https://issues.apache.org/jira/browse/SPARK-36213
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2, 3.0.3, 2.4.8, 3.2.0
Reporter: Kent Yao


!image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36212) Add exception for Kafka readstream when decryption fails

2021-07-19 Thread Jon LaFlamme (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon LaFlamme updated SPARK-36212:
-
Fix Version/s: (was: 3.1.0)
   3.0.0

> Add exception for Kafka readstream when decryption fails
> 
>
> Key: SPARK-36212
> URL: https://issues.apache.org/jira/browse/SPARK-36212
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
>Reporter: Jon LaFlamme
>Priority: Minor
>  Labels: exceptions, warnings
> Fix For: 3.0.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> A silent failure is possible when reading from a Kafka broker under the 
> following circumstances:
> SDF.isStreaming = True
> SDF.printSchema() => returns expected schema
> Query results are empty.
> Issue: TLS decryption has failed, but there is no exception or warning.
> Request: Add a warning or throw an exception when decryption fails, so that 
> developers can efficiently diagnose the readstream problem.
>  
> This is my first ticket submitted. Please notify me if I should change 
> anything in this ticket to make it more conformant to community standards. 
> I'm still a beginner with Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36212) Add exception for Kafka readstream when decryption fails

2021-07-19 Thread Jon LaFlamme (Jira)
Jon LaFlamme created SPARK-36212:


 Summary: Add exception for Kafka readstream when decryption fails
 Key: SPARK-36212
 URL: https://issues.apache.org/jira/browse/SPARK-36212
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.0.0
 Environment: Spark 3.0.0
Reporter: Jon LaFlamme
 Fix For: 3.1.0


A silent failure is possible when reading from a Kafka broker under the 
following circumstances:

SDF.isStreaming = True

SDF.printSchema() => returns expected schema

Query results are empty.

Issue: TLS decryption has failed, but there is no exception or warning.

Request: Add a warning or throw an exception when decryption fails, so that 
developers can efficiently diagnose the readstream problem.

 

This is my first ticket submitted. Please notify me if I should change anything 
in this ticket to make it more conformant to community standards. I'm still a 
beginner with Spark.
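
A sketch of the setup in which the silent failure shows up (this assumes the 
spark-sql-kafka package is on the classpath; the broker address, topic and SSL 
options are placeholders):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9093")
       .option("kafka.security.protocol", "SSL")
       .option("subscribe", "some_topic")
       .load())

sdf.isStreaming     # True
sdf.printSchema()   # prints the expected Kafka schema
# ...yet every micro-batch is empty, because TLS decryption is failing underneath.
{code}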



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383415#comment-17383415
 ] 

Apache Spark commented on SPARK-36210:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/33423

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features. So it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36210:


Assignee: (was: Apache Spark)

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features. So it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383414#comment-17383414
 ] 

Apache Spark commented on SPARK-36210:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/33423

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features. So it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36210:


Assignee: Apache Spark

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features. So it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36093) The result incorrect if the partition path case is inconsistent

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36093.
-
Fix Version/s: 3.1.3
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 33417
[https://github.com/apache/spark/pull/33417]

> The result incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3
>
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> +----+------+
> |FLAG|CAL_DT|
> +----+------+
> +----+------+
> scala> sql("SELECT * FROM t2 ").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   1|2021-06-27|
> |   1|2021-06-28|
> +----+----------+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   2|2021-06-29|
> |   2|2021-06-30|
> +----+----------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34806) Helper class for batch Dataset.observe()

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383394#comment-17383394
 ] 

Apache Spark commented on SPARK-34806:
--

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/33422

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
> Fix For: 3.3.0
>
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows to collect aggregate metrics over data of a Dataset while they are 
> being processed during an action.
> These metrics are collected in a separate thread after registering 
> {{QueryExecutionListener}} for batch datasets and {{StreamingQueryListener}} 
> for stream datasets, respectively. While in a streaming context it makes 
> perfect sense to process incremental metrics in an event-based fashion, for 
> simple batch dataset processing, a single result should be retrievable 
> without the need to register listeners or handle threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.
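
To make the threading point concrete, here is a toy sketch (plain Python, not 
Spark code) of the wait-for-listener pattern the proposed helper hides:

{code:python}
import threading

class Observation:
    """Toy model: metrics arrive on a listener thread; readers block until then."""

    def __init__(self):
        self._done = threading.Event()
        self._value = None

    def _on_metrics(self, value):
        # In Spark this would run inside the QueryExecutionListener callback.
        self._value = value
        self._done.set()

    def waitCompleted(self, timeout=None):
        return self._done.wait(timeout)

    def get(self):
        self._done.wait()  # blocks until the listener has delivered the metrics
        return self._value

obs = Observation()
threading.Timer(0.1, obs._on_metrics, args=({"rows": 42},)).start()
print(obs.get())  # {'rows': 42}
{code}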



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36211:


Assignee: (was: Apache Spark)

> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383385#comment-17383385
 ] 

Apache Spark commented on SPARK-36211:
--

User 'luranhe' has created a pull request for this issue:
https://github.com/apache/spark/pull/33399

> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36211:


Assignee: Apache Spark

> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Luran He (Jira)
Luran He created SPARK-36211:


 Summary: type check fails for `F.udf(...).asNonDeterministic()
 Key: SPARK-36211
 URL: https://issues.apache.org/jira/browse/SPARK-36211
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.2
Reporter: Luran He


The following code should type-check, but doesn't:

{{import uuid}}

{{import pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the `udf` signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Luran He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luran He updated SPARK-36211:
-
Description: 
The following code should type-check, but doesn't:


{{import uuid}}

{{import pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}


In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong

  was:
The following code should type-check, but doesn't:

{{import uuid}}

{{import pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong


> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Luran He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luran He updated SPARK-36211:
-
Description: 
The following code should type-check, but doesn't:

{{import uuid}}

{{pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong

  was:
The following code should type-check, but doesn't:

{{import uuid}}

{{import pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the `udf` signature is wrong


> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread koert kuipers (Jira)
koert kuipers created SPARK-36210:
-

 Summary: Preserve column insertion order in Dataset.withColumns
 Key: SPARK-36210
 URL: https://issues.apache.org/jira/browse/SPARK-36210
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: koert kuipers


Dataset.withColumns uses a Map (columnMap) to store the mapping of column name 
to column. However, this loses the order of the columns. Also, none of the 
operations used on the Map (find and filter) benefit from the map's lookup 
features, so it seems simpler to use a Seq instead, which also preserves the 
insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383355#comment-17383355
 ] 

Apache Spark commented on SPARK-36209:
--

User 'dominikgehl' has created a pull request for this issue:
https://github.com/apache/spark/pull/33420

> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36209:


Assignee: Apache Spark

> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Assignee: Apache Spark
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36209:


Assignee: (was: Apache Spark)

> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383354#comment-17383354
 ] 

Apache Spark commented on SPARK-36166:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33421

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Dominik Gehl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Gehl updated SPARK-36209:
-
Description: 
On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
to the python doc points to 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
 which returns a "Not found"


> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Dominik Gehl (Jira)
Dominik Gehl created SPARK-36209:


 Summary: 
https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
invalid link to Python doc
 Key: SPARK-36209
 URL: https://issues.apache.org/jira/browse/SPARK-36209
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.1.2
 Environment: On 
https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
the python doc points to 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
 which returns a "Not found"
Reporter: Dominik Gehl






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383330#comment-17383330
 ] 

Apache Spark commented on SPARK-36208:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33419

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.
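
As a rough sketch of the target behavior (illustrative only: the inline data, column names and the use of {{cat}} are assumptions, and whether interval columns are accepted end-to-end is exactly what this ticket tracks), a script transform over ANSI interval columns would look like:

{code:python}
# Hedged sketch of the desired usage, not a test from the patch. Requires a
# Unix-like environment where the 'cat' command is available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sql("""
  SELECT TRANSFORM(dt, ym)
    USING 'cat' AS (dt STRING, ym STRING)
  FROM VALUES
    (INTERVAL '1 02:03:04' DAY TO SECOND, INTERVAL '1-2' YEAR TO MONTH) AS t(dt, ym)
""")
df.show(truncate=False)
{code}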



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36208:


Assignee: Kousuke Saruta  (was: Apache Spark)

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383329#comment-17383329
 ] 

Apache Spark commented on SPARK-36208:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33419

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36208:


Assignee: Apache Spark  (was: Kousuke Saruta)

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36208:
---
Summary: SparkScriptTransformation should support ANSI interval types  
(was: SparkScriptTransformation )

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383323#comment-17383323
 ] 

Apache Spark commented on SPARK-36207:
--

User 'dominikgehl' has created a pull request for this issue:
https://github.com/apache/spark/pull/33416

> Export databaseExists in pyspark.sql.catalog
> 
>
> Key: SPARK-36207
> URL: https://issues.apache.org/jira/browse/SPARK-36207
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Priority: Minor
>
> Expose databaseExists in PySpark; it is already part of the Scala implementation.
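
For context, a hedged sketch of what the Python side could do, simply delegating to the JVM catalog that already implements the check (the helper name and the use of the internal {{_jcatalog}} handle are assumptions for illustration, not the merged change):

{code:python}
# Hedged sketch, not the actual PySpark patch: the Scala Catalog already
# exposes databaseExists, so the Python side can delegate to the JVM catalog.
from pyspark.sql import SparkSession

def database_exists(spark: SparkSession, db_name: str) -> bool:
    # spark.catalog._jcatalog wraps org.apache.spark.sql.catalog.Catalog
    return spark.catalog._jcatalog.databaseExists(db_name)

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    print(database_exists(spark, "default"))  # the default database always exists
{code}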



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36208) SparkScriptTransformation

2021-07-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36208:
---
Parent: SPARK-27790
Issue Type: Sub-task  (was: Bug)

> SparkScriptTransformation 
> --
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36207:


Assignee: (was: Apache Spark)

> Export databaseExists in pyspark.sql.catalog
> 
>
> Key: SPARK-36207
> URL: https://issues.apache.org/jira/browse/SPARK-36207
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Priority: Minor
>
> Expose databaseExists in PySpark; it is already part of the Scala implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36093) The result incorrect if the partition path case is inconsistent

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383322#comment-17383322
 ] 

Apache Spark commented on SPARK-36093:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33418

> The result incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> +----+------+
> |FLAG|CAL_DT|
> +----+------+
> +----+------+
> scala> sql("SELECT * FROM t2 ").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   1|2021-06-27|
> |   1|2021-06-28|
> +----+----------+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   2|2021-06-29|
> |   2|2021-06-30|
> +----+----------+
> {noformat}
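
To pin down the mismatch on a real HDFS deployment, a hedged inspection snippet (table and column names taken from the repro above; not part of any fix) can compare the table location and the partitions Spark knows about with the directory names actually written (cal_dt=... versus CAL_DT=...):

{code:python}
# Hedged helper for diagnosis only: print the table location and the known
# partitions, then compare with the directory casing under that location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

location = (
    spark.sql("DESCRIBE FORMATTED t2")
    .where("col_name = 'Location'")
    .first()["data_type"]
)
print("table location:", location)

spark.sql("SHOW PARTITIONS t2").show(truncate=False)
{code}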



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36207:


Assignee: Apache Spark

> Export databaseExists in pyspark.sql.catalog
> 
>
> Key: SPARK-36207
> URL: https://issues.apache.org/jira/browse/SPARK-36207
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Apache Spark
>Priority: Minor
>
> Expose databaseExists in PySpark; it is already part of the Scala implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36093) The result incorrect if the partition path case is inconsistent

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383321#comment-17383321
 ] 

Apache Spark commented on SPARK-36093:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33417

> The result incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> +----+------+
> |FLAG|CAL_DT|
> +----+------+
> +----+------+
> scala> sql("SELECT * FROM t2 ").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   1|2021-06-27|
> |   1|2021-06-28|
> +----+----------+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> +----+----------+
> |FLAG|    CAL_DT|
> +----+----------+
> |   2|2021-06-29|
> |   2|2021-06-30|
> +----+----------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36208) SparkScriptTransformation

2021-07-19 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36208:
--

 Summary: SparkScriptTransformation 
 Key: SPARK-36208
 URL: https://issues.apache.org/jira/browse/SPARK-36208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


SparkScriptTransformation supports CalendarIntervalType so it's better to 
support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Dominik Gehl (Jira)
Dominik Gehl created SPARK-36207:


 Summary: Export databaseExists in pyspark.sql.catalog
 Key: SPARK-36207
 URL: https://issues.apache.org/jira/browse/SPARK-36207
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.2
Reporter: Dominik Gehl


Expose databaseExists in PySpark; it is already part of the Scala implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


