[jira] [Commented] (SPARK-38102) Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487937#comment-17487937
 ] 

Apache Spark commented on SPARK-38102:
--

User 'ocworld' has created a pull request for this issue:
https://github.com/apache/spark/pull/35417

> Supporting custom commitProtocolClass when using saveAsNewAPIHadoopDataset
> --
>
> Key: SPARK-38102
> URL: https://issues.apache.org/jira/browse/SPARK-38102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Keunhyun Oh
>Priority: Major
>
> There is currently no way to apply spark-hadoop-cloud's commitProtocolClass when 
> using saveAsNewAPIHadoopDataset, which is intended to avoid consistency problems 
> with object stores.
> [https://spark.apache.org/docs/latest/cloud-integration.html]
>  
> A custom commitProtocolClass should be supported for saveAsNewAPIHadoopDataset 
> via an option. For example,
> {code:java}
> spark.hadoop.mapreduce.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol{code}
>  
>  
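Below is a minimal sketch of how the requested option could be used end to end. It is
only illustrative: the option name is the one proposed above, PathOutputCommitProtocol
comes from the spark-hadoop-cloud module, and the s3a output path and TextOutputFormat
setup are placeholder choices, not part of the issue.

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: assumes the proposed option name above would be honored by
// saveAsNewAPIHadoopDataset. Output path and format are placeholders.
val spark = SparkSession.builder()
  .appName("custom-commit-protocol-sketch")
  .config("spark.hadoop.mapreduce.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .getOrCreate()

// A small pair RDD to write out through the new Hadoop API.
val records = spark.sparkContext
  .parallelize(Seq("a" -> 1, "b" -> 2))
  .map { case (k, v) => (NullWritable.get(), new Text(s"$k,$v")) }

val job = Job.getInstance(spark.sparkContext.hadoopConfiguration)
job.setOutputKeyClass(classOf[NullWritable])
job.setOutputValueClass(classOf[Text])
job.setOutputFormatClass(classOf[TextOutputFormat[NullWritable, Text]])
FileOutputFormat.setOutputPath(job, new Path("s3a://my-bucket/output"))

// The request is that this call pick up the commit protocol configured above
// instead of the default commit protocol.
records.saveAsNewAPIHadoopDataset(job.getConfiguration)
{code}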



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38127) Fix bug of EnumTypeSetBenchmark and update benchmark result

2022-02-07 Thread Yang Jie (Jira)
Yang Jie created SPARK-38127:


 Summary: Fix bug of EnumTypeSetBenchmark and update benchmark 
result
 Key: SPARK-38127
 URL: https://issues.apache.org/jira/browse/SPARK-38127
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


The comparison cases in the benchmark do not run the same number of iterations.
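A minimal sketch of the kind of mismatch being fixed (hypothetical code, not the actual
EnumTypeSetBenchmark source): every compared case must run the same number of
iterations, otherwise the reported timings cannot be compared.

{code:scala}
import java.util.concurrent.TimeUnit

// Hypothetical micro-benchmark skeleton, only to illustrate the point above:
// both cases are timed over exactly the same iteration count.
object FairComparisonSketch {
  def time(iterations: Int)(body: => Unit): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) { body; i += 1 }
    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    val iterations = 1000000 // shared by every case under comparison
    val scalaSet = Set("READ", "WRITE", "EXECUTE")
    val javaSet = new java.util.HashSet[String](
      java.util.Arrays.asList("READ", "WRITE", "EXECUTE"))

    val scalaMs = time(iterations) { scalaSet.contains("WRITE") }
    val javaMs = time(iterations) { javaSet.contains("WRITE") }
    println(s"scala.Set: $scalaMs ms, java.util.HashSet: $javaMs ms " +
      s"($iterations iterations each)")
  }
}
{code}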



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38124) Revive HashClusteredDistribution and apply to all stateful operators

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487956#comment-17487956
 ] 

Apache Spark commented on SPARK-38124:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35419

> Revive HashClusteredDistribution and apply to all stateful operators
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works well for non-stateful operators, we still need a separate 
> distribution requirement for stateful operators, because the requirement of 
> ClusteredDistribution is too relaxed while the requirement on the physical 
> partitioning of stateful operators is quite strict.
> In most cases, stateful operators must require the child distribution to be 
> HashClusteredDistribution, under the following major assumptions:
>  # HashClusteredDistribution creates HashPartitioning, and we will never 
> change that in the future.
>  # We will never change the implementation of {{partitionIdExpression}} in 
> HashPartitioning in the future, so that the Partitioner behaves consistently 
> across Spark versions.
>  # No partitioning other than HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it to something 
> specific to stateful operators) and apply the distribution to all stateful 
> operators.
> SPARK-35703 only touched stream-stream join, which means the other stateful 
> operators were already using ClusteredDistribution and hence have been broken 
> for a long time.
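To make assumption 2 above concrete, here is a hedged sketch (a hypothetical helper,
not HashPartitioning's actual partitionIdExpression code) of why the key-to-partition
mapping must stay stable across Spark versions: streaming state is stored per
partition, so a key must keep resolving to the same partition id after a restart or
an upgrade.

{code:scala}
// Hypothetical helper mirroring how a hash-based partitioner assigns partition ids;
// not the actual Spark implementation.
def partitionIdFor(keyHash: Int, numPartitions: Int): Int = {
  val mod = keyHash % numPartitions
  if (mod < 0) mod + numPartitions else mod // non-negative modulo
}

// State written for a key under N shuffle partitions must be found again later,
// so the mapping below must never change across Spark versions.
val before = partitionIdFor("user-42".hashCode, 200)
val after = partitionIdFor("user-42".hashCode, 200)
assert(before == after)
{code}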



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38124) Revive HashClusteredDistribution and apply to all stateful operators

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38124:


Assignee: (was: Apache Spark)

> Revive HashClusteredDistribution and apply to all stateful operators
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works well for non-stateful operators, we still need a separate 
> distribution requirement for stateful operators, because the requirement of 
> ClusteredDistribution is too relaxed while the requirement on the physical 
> partitioning of stateful operators is quite strict.
> In most cases, stateful operators must require the child distribution to be 
> HashClusteredDistribution, under the following major assumptions:
>  # HashClusteredDistribution creates HashPartitioning, and we will never 
> change that in the future.
>  # We will never change the implementation of {{partitionIdExpression}} in 
> HashPartitioning in the future, so that the Partitioner behaves consistently 
> across Spark versions.
>  # No partitioning other than HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it to something 
> specific to stateful operators) and apply the distribution to all stateful 
> operators.
> SPARK-35703 only touched stream-stream join, which means the other stateful 
> operators were already using ClusteredDistribution and hence have been broken 
> for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38124) Revive HashClusteredDistribution and apply to all stateful operators

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38124:


Assignee: Apache Spark

> Revive HashClusteredDistribution and apply to all stateful operators
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works well for non-stateful operators, we still need a separate 
> distribution requirement for stateful operators, because the requirement of 
> ClusteredDistribution is too relaxed while the requirement on the physical 
> partitioning of stateful operators is quite strict.
> In most cases, stateful operators must require the child distribution to be 
> HashClusteredDistribution, under the following major assumptions:
>  # HashClusteredDistribution creates HashPartitioning, and we will never 
> change that in the future.
>  # We will never change the implementation of {{partitionIdExpression}} in 
> HashPartitioning in the future, so that the Partitioner behaves consistently 
> across Spark versions.
>  # No partitioning other than HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it to something 
> specific to stateful operators) and apply the distribution to all stateful 
> operators.
> SPARK-35703 only touched stream-stream join, which means the other stateful 
> operators were already using ClusteredDistribution and hence have been broken 
> for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38127) Fix bug of EnumTypeSetBenchmark and update benchmark result

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38127:


Assignee: Apache Spark

> Fix bug of EnumTypeSetBenchmark and update benchmark result
> ---
>
> Key: SPARK-38127
> URL: https://issues.apache.org/jira/browse/SPARK-38127
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> The comparison cases in the benchmark do not run the same number of iterations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38127) Fix bug of EnumTypeSetBenchmark and update benchmark result

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487958#comment-17487958
 ] 

Apache Spark commented on SPARK-38127:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35418

> Fix bug of EnumTypeSetBenchmark and update benchmark result
> ---
>
> Key: SPARK-38127
> URL: https://issues.apache.org/jira/browse/SPARK-38127
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> The comparison cases in the benchmark do not run the same number of iterations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38127) Fix bug of EnumTypeSetBenchmark and update benchmark result

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38127:


Assignee: (was: Apache Spark)

> Fix bug of EnumTypeSetBenchmark and update benchmark result
> ---
>
> Key: SPARK-38127
> URL: https://issues.apache.org/jira/browse/SPARK-38127
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> The comparison cases in the benchmark do not run the same number of iterations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37413:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/ml/tree.py
> ---
>
> Key: SPARK-37413
> URL: https://issues.apache.org/jira/browse/SPARK-37413
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/tree.pyi to 
> python/pyspark/ml/tree.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487979#comment-17487979
 ] 

Apache Spark commented on SPARK-37413:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35420

> Inline type hints for python/pyspark/ml/tree.py
> ---
>
> Key: SPARK-37413
> URL: https://issues.apache.org/jira/browse/SPARK-37413
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/tree.pyi to 
> python/pyspark/ml/tree.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37413:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/tree.py
> ---
>
> Key: SPARK-37413
> URL: https://issues.apache.org/jira/browse/SPARK-37413
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/tree.pyi to 
> python/pyspark/ml/tree.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487983#comment-17487983
 ] 

Apache Spark commented on SPARK-37413:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35420

> Inline type hints for python/pyspark/ml/tree.py
> ---
>
> Key: SPARK-37413
> URL: https://issues.apache.org/jira/browse/SPARK-37413
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/tree.pyi to 
> python/pyspark/ml/tree.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38046) Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing

2022-02-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-38046.
--
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35343
[https://github.com/apache/spark/pull/35343]

> Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing
> --
>
> Key: SPARK-38046
> URL: https://issues.apache.org/jira/browse/SPARK-38046
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Boyang Jerry Peng
>Assignee: Boyang Jerry Peng
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> There is a test called "compositeReadLimit"
>  
> [https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L460]
>  
> that is flaky.  The problem is that the Kafka connector always reads the 
> actual system time instead of advancing it manually, which leaves room for 
> non-deterministic behavior, especially since the source determines whether 
> "maxTriggerDelayMs" is satisfied by comparing the last trigger time with the 
> current system time.  One can simply "sleep" at points in the test to 
> generate different outcomes.
>  
> Example output when test fails:
>  
> {code:java}
> - compositeReadLimit *** FAILED *** (7 seconds, 862 milliseconds)
>   == Results ==
>   !== Correct Answer - 0 ==   == Spark Answer - 14 ==
>   !struct<>                   struct
>   !                           [112]
>   !                           [113]
>   !                           [114]
>   !                           [115]
>   !                           [116]
>   !                           [117]
>   !                           [118]
>   !                           [119]
>   !                           [120]
>   !                           [16]
>   !                           [17]
>   !                           [18]
>   !                           [19]
>   !                           [20]
>       
>   
>   == Progress ==
>      
> StartStream(ProcessingTimeTrigger(100),org.apache.spark.sql.streaming.util.StreamManualClock@30075210,Map(),null)
>      AssertOnQuery(, )
>      CheckAnswer: 
> [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[12],[13],[14],[15]
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>   => CheckNewAnswer: 
>      Assert(, )
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>      CheckAnswer: 
> [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[12],[117],[118],[119],[120],[121],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24]
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>      CheckNewAnswer: 
>      Assert(, )
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>      CheckAnswer: 
> [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[117],[118],[119],[12],[120],[121],[122],[123],[124],[125],[126],[127],[128],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30]
>   
>   == Stream ==
>   Output Mode: Append
>   Stream state: {KafkaSourceV1[Subscribe[topic-41]]: 
> {"topic-41":{"2":1,"1":11,"0":21}}}
>   Thread state: alive
>   Thread stack trace: java.lang.Object.wait(Native Method)
>   org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:67)
>   
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
>   
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:76)
>   
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:222)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:350)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution$$Lambda$3081/1859014229.apply$mcV$sp(Unknown
>  Source)
>   scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:857)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:252)
>   
>   
>   == Sink ==
>   0: [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] [110] [111] 
> [1] [10] [11] [12] [13] [14] [15]
>   1: [112] [113] [114] [115] [116] [117] [118] [119] [120] [16] [17] [18] 
> [19] [20]
>   
>   
>   == Plan ==
>   == Parsed Logical Plan ==

[jira] [Assigned] (SPARK-38046) Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing

2022-02-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-38046:


Assignee: Boyang Jerry Peng

> Fix KafkaSource/KafkaMicroBatch flaky test due to non-deterministic timing
> --
>
> Key: SPARK-38046
> URL: https://issues.apache.org/jira/browse/SPARK-38046
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Boyang Jerry Peng
>Assignee: Boyang Jerry Peng
>Priority: Major
>
> There is a test called "compositeReadLimit"
>  
> [https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L460]
>  
> that is flaky.  The problem is that the Kafka connector always reads the 
> actual system time instead of advancing it manually, which leaves room for 
> non-deterministic behavior, especially since the source determines whether 
> "maxTriggerDelayMs" is satisfied by comparing the last trigger time with the 
> current system time.  One can simply "sleep" at points in the test to 
> generate different outcomes.
>  
> Example output when test fails:
>  
> {code:java}
> - compositeReadLimit *** FAILED *** (7 seconds, 862 milliseconds)
>   == Results ==
>   !== Correct Answer - 0 ==   == Spark Answer - 14 ==
>   !struct<>                   struct
>   !                           [112]
>   !                           [113]
>   !                           [114]
>   !                           [115]
>   !                           [116]
>   !                           [117]
>   !                           [118]
>   !                           [119]
>   !                           [120]
>   !                           [16]
>   !                           [17]
>   !                           [18]
>   !                           [19]
>   !                           [20]
>       
>   
>   == Progress ==
>      
> StartStream(ProcessingTimeTrigger(100),org.apache.spark.sql.streaming.util.StreamManualClock@30075210,Map(),null)
>      AssertOnQuery(, )
>      CheckAnswer: 
> [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[12],[13],[14],[15]
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>   => CheckNewAnswer: 
>      Assert(, )
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>      CheckAnswer: 
> [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[12],[117],[118],[119],[120],[121],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24]
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>      CheckNewAnswer: 
>      Assert(, )
>      AdvanceManualClock(100)
>      AssertOnQuery(, )
>      CheckAnswer: 
> [1],[10],[100],[101],[102],[103],[104],[105],[106],[107],[11],[108],[109],[110],[111],[112],[113],[114],[115],[116],[117],[118],[119],[12],[120],[121],[122],[123],[124],[125],[126],[127],[128],[13],[14],[15],[16],[17],[18],[19],[2],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30]
>   
>   == Stream ==
>   Output Mode: Append
>   Stream state: {KafkaSourceV1[Subscribe[topic-41]]: 
> {"topic-41":{"2":1,"1":11,"0":21}}}
>   Thread state: alive
>   Thread stack trace: java.lang.Object.wait(Native Method)
>   org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:67)
>   
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
>   
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:76)
>   
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:222)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:350)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution$$Lambda$3081/1859014229.apply$mcV$sp(Unknown
>  Source)
>   scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:857)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
>   
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:252)
>   
>   
>   == Sink ==
>   0: [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] [110] [111] 
> [1] [10] [11] [12] [13] [14] [15]
>   1: [112] [113] [114] [115] [116] [117] [118] [119] [120] [16] [17] [18] 
> [19] [20]
>   
>   
>   == Plan ==
>   == Parsed Logical Plan ==
>   WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWrite@48d73c40
>   +- SerializeFromObject [input[0, int, false] AS value

[jira] [Commented] (SPARK-36061) Add Volcano feature step

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487996#comment-17487996
 ] 

Apache Spark commented on SPARK-36061:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35422

> Add Volcano feature step
> 
>
> Key: SPARK-36061
> URL: https://issues.apache.org/jira/browse/SPARK-36061
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> Create a PodGroup with the user-specified minimum required resources.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38128) Show full stacktrace in tests by default in PySpark tests

2022-02-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38128:


 Summary: Show full stacktrace in tests by default in PySpark tests
 Key: SPARK-38128
 URL: https://issues.apache.org/jira/browse/SPARK-38128
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


SPARK-33407 and SPARK-31849 hide the Java stacktrace and the internal Python 
worker-side traceback by default, but that makes it a bit harder to debug test 
failures. We should probably show the full stacktrace by default in tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38128) Show full stacktrace in tests by default in PySpark tests

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38128:


Assignee: (was: Apache Spark)

> Show full stacktrace in tests by default in PySpark tests
> -
>
> Key: SPARK-38128
> URL: https://issues.apache.org/jira/browse/SPARK-38128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-33407 and SPARK-31849 hide the Java stacktrace and the internal Python 
> worker-side traceback by default, but that makes it a bit harder to debug test 
> failures. We should probably show the full stacktrace by default in tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38128) Show full stacktrace in tests by default in PySpark tests

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38128:


Assignee: Apache Spark

> Show full stacktrace in tests by default in PySpark tests
> -
>
> Key: SPARK-38128
> URL: https://issues.apache.org/jira/browse/SPARK-38128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-33407 and SPARK-31849 hide the Java stacktrace and the internal Python 
> worker-side traceback by default, but that makes it a bit harder to debug test 
> failures. We should probably show the full stacktrace by default in tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38128) Show full stacktrace in tests by default in PySpark tests

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488017#comment-17488017
 ] 

Apache Spark commented on SPARK-38128:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35423

> Show full stacktrace in tests by default in PySpark tests
> -
>
> Key: SPARK-38128
> URL: https://issues.apache.org/jira/browse/SPARK-38128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-33407 and SPARK-31849 hide the Java stacktrace and the internal Python 
> worker-side traceback by default, but that makes it a bit harder to debug test 
> failures. We should probably show the full stacktrace by default in tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38128) Show full stacktrace in tests by default in PySpark tests

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488018#comment-17488018
 ] 

Apache Spark commented on SPARK-38128:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35423

> Show full stacktrace in tests by default in PySpark tests
> -
>
> Key: SPARK-38128
> URL: https://issues.apache.org/jira/browse/SPARK-38128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-33407 and SPARK-31849 hide the Java stacktrace and the internal Python 
> worker-side traceback by default, but that makes it a bit harder to debug test 
> failures. We should probably show the full stacktrace by default in tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38116:


Assignee: Apache Spark

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, all JDBC connections on the executor side always work with the 
> auto-commit option set to true.
> However, there are cases where this mode makes JdbcRelationProvider hard to 
> use at all, e.g. reading huge datasets from Postgres (the whole result set is 
> collected regardless of the fetch size when autocommit is set to true; see 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> The proposal is as follows:
>  # Add a boolean option "autocommit" to the JDBC source, allowing a user to 
> turn off auto-commit mode for read-only operations.
>  # Add guards which prevent using this option in DML operations.
>  
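For context, a hedged sketch of the relevant JDBC read options: {{fetchsize}} is an
existing option (and is what currently triggers the Postgres dialect to disable
auto-commit), while the {{autocommit}} option on the last line is the one proposed in
this issue and does not exist yet; the URL, table, and credentials are placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read-sketch").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder
  .option("dbtable", "public.big_table")                // placeholder
  .option("user", "reader")
  .option("password", "secret")
  // Existing option: with the Postgres driver, cursor-based (streamed) fetching
  // needs auto-commit off plus a positive fetch size.
  .option("fetchsize", "10000")
  // Proposed option from this issue (not yet available in Spark).
  .option("autocommit", "false")
  .load()

println(df.count())
{code}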



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38116:


Assignee: (was: Apache Spark)

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Priority: Minor
>
> Currently, all JDBC connections on the executor side always work with the 
> auto-commit option set to true.
> However, there are cases where this mode makes JdbcRelationProvider hard to 
> use at all, e.g. reading huge datasets from Postgres (the whole result set is 
> collected regardless of the fetch size when autocommit is set to true; see 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> The proposal is as follows:
>  # Add a boolean option "autocommit" to the JDBC source, allowing a user to 
> turn off auto-commit mode for read-only operations.
>  # Add guards which prevent using this option in DML operations.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38129) Adaptive enable timeout for BroadcastQueryStageExec

2022-02-07 Thread weixiuli (Jira)
weixiuli created SPARK-38129:


 Summary: Adaptive enable timeout for BroadcastQueryStageExec
 Key: SPARK-38129
 URL: https://issues.apache.org/jira/browse/SPARK-38129
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.2.0
Reporter: weixiuli


We should disable the timeout for BroadcastQueryStageExec when it comes from 
shuffle query stages, whose runtime statistics are usually correct in AQE, but 
enable the timeout when it comes from other stages, whose statistics may be 
incorrect, keeping the behavior the same as non-AQE.
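For reference, a hedged sketch of the user-facing knobs involved today: AQE is toggled
by {{spark.sql.adaptive.enabled}}, and the broadcast wait timeout is controlled by
{{spark.sql.broadcastTimeout}}. The adaptive enable/disable described above would be
internal behavior rather than a new option; the values below are illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-timeout-sketch").getOrCreate()

// Existing configurations (values are illustrative).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.broadcastTimeout", "300") // seconds
{code}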



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488036#comment-17488036
 ] 

Apache Spark commented on SPARK-38116:
--

User 'yoda-mon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35424

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Priority: Minor
>
> Currently, all JDBC connections on the executor side always work with the 
> auto-commit option set to true.
> However, there are cases where this mode makes JdbcRelationProvider hard to 
> use at all, e.g. reading huge datasets from Postgres (the whole result set is 
> collected regardless of the fetch size when autocommit is set to true; see 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> The proposal is as follows:
>  # Add a boolean option "autocommit" to the JDBC source, allowing a user to 
> turn off auto-commit mode for read-only operations.
>  # Add guards which prevent using this option in DML operations.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38129) Adaptively enable timeout for BroadcastQueryStageExec

2022-02-07 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-38129:
-
Summary: Adaptively enable timeout for BroadcastQueryStageExec  (was: 
Adaptive enable timeout for BroadcastQueryStageExec)

> Adaptively enable timeout for BroadcastQueryStageExec
> -
>
> Key: SPARK-38129
> URL: https://issues.apache.org/jira/browse/SPARK-38129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
>
> We should disable the timeout for BroadcastQueryStageExec when it comes from 
> shuffle query stages, whose runtime statistics are usually correct in AQE, but 
> enable the timeout when it comes from other stages, whose statistics may be 
> incorrect, keeping the behavior the same as non-AQE.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37411) Inline type hints for python/pyspark/ml/regression.py

2022-02-07 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488042#comment-17488042
 ] 

Maciej Szymkiewicz commented on SPARK-37411:


I'll handle this one.

> Inline type hints for python/pyspark/ml/regression.py
> -
>
> Key: SPARK-37411
> URL: https://issues.apache.org/jira/browse/SPARK-37411
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/regression.pyi to 
> python/pyspark/ml/regression.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-07 Thread Leona Yoda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488047#comment-17488047
 ] 

Leona Yoda commented on SPARK-38116:


I posted a sample PR; however, the intended behavior might already be achievable if 
the user sets fetchSize > 0.

(cf.  
https://github.com/apache/spark/blob/v3.2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L136)
 # If the user sets fetchsize > 0, autocommit is set to false for read 
operations in the original code.
 # fetchSize is set to 0 by default, so even if users disable autocommit, the 
jdbc driver will try to fetch all rows.

With the PR, users would be able to choose the combination of autocommit=true 
and fetchSize > 0 ... but the documentation says that won't work.

 

So I think auto commit should be disabled for read operations in any case. 
Removing the if condition in {{beforeFetch}} is worth considering.

 

 

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Priority: Minor
>
> Currently, all JDBC connections on the executor side always work with the 
> auto-commit option set to true.
> However, there are cases where this mode makes JdbcRelationProvider hard to 
> use at all, e.g. reading huge datasets from Postgres (the whole result set is 
> collected regardless of the fetch size when autocommit is set to true; see 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> The proposal is as follows:
>  # Add a boolean option "autocommit" to the JDBC source, allowing a user to 
> turn off auto-commit mode for read-only operations.
>  # Add guards which prevent using this option in DML operations.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-07 Thread Leona Yoda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488047#comment-17488047
 ] 

Leona Yoda edited comment on SPARK-38116 at 2/7/22, 11:25 AM:
--

I posted a sample PR; however, the intended behavior might already be achievable if 
the user sets fetchSize > 0.

(cf.  
[https://github.com/apache/spark/blob/v3.2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L136])
 # If the user sets fetchsize > 0, autocommit is set to false for read 
operations in the original code.
 # fetchSize is set to 0 by default, so even if users disable autocommit, the 
jdbc driver will try to fetch all rows.

With the PR, users would be able to choose the combination of autocommit=true 
and fetchSize > 0 ... but the documentation says that won't work.

 

So I think auto commit should be disabled for read operations in any case. 
Removing the if condition in {{beforeFetch}} is worth considering.


was (Author: yoda-mon):
I post sample PR, however, the intended behavior might be realized already if 
user set fetchSize > 0.

(cf.  
https://github.com/apache/spark/blob/v3.2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L136)
 # If user set fetchsize > 0, autocommit will be set to false when reading 
operation in the original code.
 # fetchSize is set to 0 by default, so if users disable autocommit the jdbc 
driver will try to get all rows.

By the PR users will be able to choose the case autocommit true and fetchSize > 
0 ... but the document says it won't work.

 

Then, I think, in any case for reading operation, auto commit should be 
disabled. Removing the if condition on {{beforeFetch}} is considerable.

 

 

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Priority: Minor
>
> Currently, all JDBC connections on the executor side always work with the 
> auto-commit option set to true.
> However, there are cases where this mode makes JdbcRelationProvider hard to 
> use at all, e.g. reading huge datasets from Postgres (the whole result set is 
> collected regardless of the fetch size when autocommit is set to true; see 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> The proposal is as follows:
>  # Add a boolean option "autocommit" to the JDBC source, allowing a user to 
> turn off auto-commit mode for read-only operations.
>  # Add guards which prevent using this option in DML operations.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38129) Adaptively enable timeout for BroadcastQueryStageExec

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38129:


Assignee: Apache Spark

> Adaptively enable timeout for BroadcastQueryStageExec
> -
>
> Key: SPARK-38129
> URL: https://issues.apache.org/jira/browse/SPARK-38129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Assignee: Apache Spark
>Priority: Major
>
> We should disable the timeout for BroadcastQueryStageExec when it comes from 
> shuffle query stages, whose runtime statistics are usually correct in AQE, but 
> enable the timeout when it comes from other stages, whose statistics may be 
> incorrect, keeping the behavior the same as non-AQE.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38129) Adaptively enable timeout for BroadcastQueryStageExec

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38129:


Assignee: (was: Apache Spark)

> Adaptively enable timeout for BroadcastQueryStageExec
> -
>
> Key: SPARK-38129
> URL: https://issues.apache.org/jira/browse/SPARK-38129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
>
> We should disable the timeout for BroadcastQueryStageExec when it comes from 
> shuffle query stages, whose runtime statistics are usually correct in AQE, but 
> enable the timeout when it comes from other stages, whose statistics may be 
> incorrect, keeping the behavior the same as non-AQE.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38129) Adaptively enable timeout for BroadcastQueryStageExec

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488049#comment-17488049
 ] 

Apache Spark commented on SPARK-38129:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/35425

> Adaptively enable timeout for BroadcastQueryStageExec
> -
>
> Key: SPARK-38129
> URL: https://issues.apache.org/jira/browse/SPARK-38129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
>
> We should disable the timeout for BroadcastQueryStageExec when it comes from 
> shuffle query stages, whose runtime statistics are usually correct in AQE, but 
> enable the timeout when it comes from other stages, whose statistics may be 
> incorrect, keeping the behavior the same as non-AQE.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38128) Show full stacktrace in tests by default in PySpark tests

2022-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38128:


Assignee: Hyukjin Kwon

> Show full stacktrace in tests by default in PySpark tests
> -
>
> Key: SPARK-38128
> URL: https://issues.apache.org/jira/browse/SPARK-38128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> SPARK-33407 and SPARK-31849 hide the Java stacktrace and the internal Python 
> worker-side traceback by default, but that makes it a bit harder to debug test 
> failures. We should probably show the full stacktrace by default in tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38128) Show full stacktrace in tests by default in PySpark tests

2022-02-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38128.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35423
[https://github.com/apache/spark/pull/35423]

> Show full stacktrace in tests by default in PySpark tests
> -
>
> Key: SPARK-38128
> URL: https://issues.apache.org/jira/browse/SPARK-38128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> SPARK-33407 and SPARK-31849 hide the Java stacktrace and the internal Python 
> worker-side traceback by default, but that makes it a bit harder to debug test 
> failures. We should probably show the full stacktrace by default in tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38130) array_sort does not allow non-orderable datatypes

2022-02-07 Thread Steven Aerts (Jira)
Steven Aerts created SPARK-38130:


 Summary: array_sort does not allow non-orderable datatypes
 Key: SPARK-38130
 URL: https://issues.apache.org/jira/browse/SPARK-38130
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
 Environment:  
Reporter: Steven Aerts


 {{array_sort}} has a check to verify that the entries it has to sort are orderable.

I think this check should be removed, because even entries which are not 
orderable can still be sorted when a lambda comparator provides the ordering.


{code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))")org.apache.spark.sql.AnalysisException: cannot resolve 
> 'array_sort(`a`, lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}
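Until the analysis check is relaxed, one workaround is to do the comparator-based sort
in a Scala UDF instead of {{array_sort}}. This is a hedged sketch (the UDF and column
names are illustrative), using the same example data as above:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("array-sort-workaround").getOrCreate()
import spark.implicits._

val df = Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", "b")

// Sort the (non-orderable) map elements by cardinality inside a UDF.
val sortByCardinality = udf((maps: Seq[Map[String, Int]]) => maps.sortBy(_.size))

df.select(sortByCardinality(col("a")).as("sorted")).show(false)
{code}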



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38130) array_sort does not allow non-orderable datatypes

2022-02-07 Thread Steven Aerts (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Aerts updated SPARK-38130:
-
Description: 
 {{array_sort}} has a check to verify that the entries it has to sort are orderable.

I think this check should be removed, because even entries which are not 
orderable can still be sorted when a lambda comparator provides the ordering.
{code:java}
Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
"b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - cardinality(y))"){code}
fails with:
{code:java}
org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, 
lambdafunction((cardinality(namedlambdavariable()) - 
cardinality(namedlambdavariable())), namedlambdavariable(), 
namedlambdavariable()))' due to data type mismatch: array_sort does not support 
sorting array of type map which is not orderable {code}
Meanwhile, the case where this check is actually relevant fails with a different 
error, which is triggered earlier in the code path:
{code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a)"){code}
Fails with:
{code:java}
org.apache.spark.sql.AnalysisException: cannot resolve '(namedlambdavariable() 
< namedlambdavariable())' due to data type mismatch: LessThan does not support 
ordering on type map; line 1 pos 0;
{code}

  was:
 {{array_sort}} has check to see if the entries it has to sort are orderable.

I think this check should be removed.  Because even entries which are not 
orderable can have a lambda function which makes them orderable.


{code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))")org.apache.spark.sql.AnalysisException: cannot resolve 
> 'array_sort(`a`, lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}


> array_sort does not allow non-orderable datatypes
> -
>
> Key: SPARK-38130
> URL: https://issues.apache.org/jira/browse/SPARK-38130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment:  
>Reporter: Steven Aerts
>Priority: Major
>
>  {{array_sort}} has a check to verify that the entries it has to sort are orderable.
> I think this check should be removed, because even entries which are not 
> orderable can still be sorted when a lambda comparator provides the ordering.
> {code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))"){code}
> fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, 
> lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}
> Meanwhile, the case where this check is actually relevant fails with a different 
> error, which is triggered earlier in the code path:
> {code:java}
> > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> > "b").selectExpr("array_sort(a)"){code}
> Fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: 
> LessThan does not support ordering on type map; line 1 pos 0;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38130) array_sort does not allow non-orderable datatypes

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488100#comment-17488100
 ] 

Apache Spark commented on SPARK-38130:
--

User 'steven-aerts' has created a pull request for this issue:
https://github.com/apache/spark/pull/35426

> array_sort does not allow non-orderable datatypes
> -
>
> Key: SPARK-38130
> URL: https://issues.apache.org/jira/browse/SPARK-38130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment:  
>Reporter: Steven Aerts
>Priority: Major
>
>  {{array_sort}} has a check to verify that the entries it has to sort are orderable.
> I think this check should be removed, because even entries which are not 
> orderable can still be sorted when a lambda comparator provides the ordering.
> {code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))"){code}
> fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, 
> lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}
> Meanwhile, the case where this check is actually relevant fails with a different 
> error, which is triggered earlier in the code path:
> {code:java}
> > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> > "b").selectExpr("array_sort(a)"){code}
> Fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: 
> LessThan does not support ordering on type map; line 1 pos 0;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38130) array_sort does not allow non-orderable datatypes

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38130:


Assignee: (was: Apache Spark)

> array_sort does not allow non-orderable datatypes
> -
>
> Key: SPARK-38130
> URL: https://issues.apache.org/jira/browse/SPARK-38130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment:  
>Reporter: Steven Aerts
>Priority: Major
>
>  {{array_sort}} has a check to see if the entries it has to sort are orderable.
> I think this check should be removed, because even entries that are not 
> orderable can be sorted with a lambda comparator that makes them orderable.
> {code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))"){code}
> fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, 
> lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}
> Meanwhile, the case where this check is relevant fails with a different error 
> which is triggered earlier in the code path:
> {code:java}
> > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> > "b").selectExpr("array_sort(a)"){code}
> Fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: 
> LessThan does not support ordering on type map; line 1 pos 0;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38130) array_sort does not allow non-orderable datatypes

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38130:


Assignee: Apache Spark

> array_sort does not allow non-orderable datatypes
> -
>
> Key: SPARK-38130
> URL: https://issues.apache.org/jira/browse/SPARK-38130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment:  
>Reporter: Steven Aerts
>Assignee: Apache Spark
>Priority: Major
>
>  {{array_sort}} has a check to see if the entries it has to sort are orderable.
> I think this check should be removed, because even entries that are not 
> orderable can be sorted with a lambda comparator that makes them orderable.
> {code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))"){code}
> fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, 
> lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}
> Meanwhile, the case where this check is relevant fails with a different error 
> which is triggered earlier in the code path:
> {code:java}
> > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> > "b").selectExpr("array_sort(a)"){code}
> Fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: 
> LessThan does not support ordering on type map; line 1 pos 0;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38129) Adaptively enable timeout for BroadcastQueryStageExec

2022-02-07 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-38129:
-
Fix Version/s: 3.2.1
   3.2.0

> Adaptively enable timeout for BroadcastQueryStageExec
> -
>
> Key: SPARK-38129
> URL: https://issues.apache.org/jira/browse/SPARK-38129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
> Fix For: 3.2.0, 3.2.1
>
>
> We should disable the timeout for BroadcastQueryStageExec when it comes from 
> shuffle query stages, whose runtime statistics are usually correct in AQE, but 
> enable the timeout when it comes from other stages, whose statistics may be 
> incorrect, keeping the behavior the same as in non-AQE mode.
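
For reference, the timeout discussed above is the one controlled by the 
spark.sql.broadcastTimeout SQL config (300 seconds by default). A minimal sketch of 
inspecting and adjusting it, assuming an active SparkSession named {{spark}} (context 
only, not the proposed patch):

{code:scala}
// Sketch only: shows the existing broadcast-timeout knob this proposal would
// toggle adaptively; assumes an active SparkSession named `spark`.
val current = spark.conf.get("spark.sql.broadcastTimeout") // "300" seconds by default

// Raising it to a very large value effectively disables the timeout for
// broadcasts built from shuffle query stages; restoring the default re-enables it.
spark.conf.set("spark.sql.broadcastTimeout", Int.MaxValue.toString)
spark.conf.set("spark.sql.broadcastTimeout", current)
{code}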



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38129) Adaptively enable timeout for BroadcastQueryStageExec

2022-02-07 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-38129:
-
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> Adaptively enable timeout for BroadcastQueryStageExec
> -
>
> Key: SPARK-38129
> URL: https://issues.apache.org/jira/browse/SPARK-38129
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: weixiuli
>Priority: Major
> Fix For: 3.2.0, 3.2.1
>
>
> We should disable the timeout for BroadcastQueryStageExec when it comes from 
> shuffle query stages, whose runtime statistics are usually correct in AQE, but 
> enable the timeout when it comes from other stages, whose statistics may be 
> incorrect, keeping the behavior the same as in non-AQE mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37401) Inline type hints for python/pyspark/ml/clustering.py

2022-02-07 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488243#comment-17488243
 ] 

Maciej Szymkiewicz commented on SPARK-37401:


I'll handle this one.

> Inline type hints for python/pyspark/ml/clustering.py
> -
>
> Key: SPARK-37401
> URL: https://issues.apache.org/jira/browse/SPARK-37401
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/clustering.pyi to 
> python/pyspark/ml/clustering.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37410) Inline type hints for python/pyspark/ml/recommendation.py

2022-02-07 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488245#comment-17488245
 ] 

Maciej Szymkiewicz commented on SPARK-37410:


I'll handle this one.

> Inline type hints for python/pyspark/ml/recommendation.py
> -
>
> Key: SPARK-37410
> URL: https://issues.apache.org/jira/browse/SPARK-37410
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/recommendation.pyi to 
> python/pyspark/ml/recommendation.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38131) Keep only user-facing error classes

2022-02-07 Thread Max Gekk (Jira)
Max Gekk created SPARK-38131:


 Summary: Keep only user-facing error classes
 Key: SPARK-38131
 URL: https://issues.apache.org/jira/browse/SPARK-38131
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Revise the existing error classes/exceptions and remove the non-user-facing error 
classes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35781) Support Spark on Apple Silicon on macOS natively on Java 17

2022-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35781:
-

Assignee: Dongjoon Hyun

> Support Spark on Apple Silicon on macOS natively on Java 17
> ---
>
> Key: SPARK-35781
> URL: https://issues.apache.org/jira/browse/SPARK-35781
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: DB Tsai
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This is an umbrella JIRA tracking the progress of supporting Apple Silicon on 
> macOS natively.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37411) Inline type hints for python/pyspark/ml/regression.py

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37411:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/ml/regression.py
> -
>
> Key: SPARK-37411
> URL: https://issues.apache.org/jira/browse/SPARK-37411
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/regression.pyi to 
> python/pyspark/ml/regression.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37411) Inline type hints for python/pyspark/ml/regression.py

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488277#comment-17488277
 ] 

Apache Spark commented on SPARK-37411:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35427

> Inline type hints for python/pyspark/ml/regression.py
> -
>
> Key: SPARK-37411
> URL: https://issues.apache.org/jira/browse/SPARK-37411
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/regression.pyi to 
> python/pyspark/ml/regression.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37411) Inline type hints for python/pyspark/ml/regression.py

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37411:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/regression.py
> -
>
> Key: SPARK-37411
> URL: https://issues.apache.org/jira/browse/SPARK-37411
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/regression.pyi to 
> python/pyspark/ml/regression.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488393#comment-17488393
 ] 

Apache Spark commented on SPARK-36665:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/35428

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification}} should be able to do more simplifications for {{Not}} 
> operators by applying the following rules:
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}
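
As a sanity check of the three rules quoted above, here is a small self-contained 
Scala program (not Spark optimizer code) that models SQL's three-valued logic with 
Option[Boolean], where None stands for NULL, and verifies each equivalence over the 
whole domain:

{code:scala}
// Self-contained illustration (not Spark's optimizer code): model SQL's
// three-valued logic with Option[Boolean] and check the three rewrite rules.
object NotRulesCheck extends App {
  val domain: Seq[Option[Boolean]] = Seq(Some(true), Some(false), None)

  def sqlNot(x: Option[Boolean]): Option[Boolean] = x.map(!_)
  // SQL equality/inequality: NULL if either side is NULL
  def sqlEq(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
    for { x <- a; y <- b } yield x == y
  def sqlNeq(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
    for { x <- a; y <- b } yield x != y

  for (a <- domain; b <- domain) {
    assert(sqlNot(a).isEmpty == a.isEmpty)              // rule 1: Not(null) is null
    assert(sqlEq(sqlNot(a), b) == sqlEq(a, sqlNot(b)))  // rule 2: (Not(a) = b) == (a = Not(b))
    assert(sqlNeq(a, b) == sqlEq(a, sqlNot(b)))         // rule 3: (a != b) == (a = Not(b))
  }
  println("All three Not-rewrite rules hold under three-valued logic.")
}
{code}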



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-38132:
-

 Summary: Remove NotPropagation
 Key: SPARK-38132
 URL: https://issues.apache.org/jira/browse/SPARK-38132
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kazuyuki Tanimura


To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
optimization for now, until we find a better approach.

The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
it no longer properly rewrites the predicate to a NULL-aware left anti join.
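
For context, a minimal way to reproduce the plan shape involved (a NOT IN subquery, 
which RewritePredicateSubquery turns into a null-aware left anti join). The sketch 
below assumes an active SparkSession named {{spark}} and uses throwaway view names:

{code:scala}
// Minimal reproduction sketch (view names and data are made up for illustration).
import spark.implicits._

Seq(1, 2, 3).toDF("id").createOrReplaceTempView("t1")
Seq[Option[Int]](Some(2), None).toDF("id").createOrReplaceTempView("t2")

val q = spark.sql("SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2)")
q.explain(true) // the optimized plan should show a null-aware LeftAnti join
q.show()        // NOT IN against a set containing NULL must return no rows
{code}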



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488415#comment-17488415
 ] 

Apache Spark commented on SPARK-38132:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/35428

> Remove NotPropagation
> -
>
> Key: SPARK-38132
> URL: https://issues.apache.org/jira/browse/SPARK-38132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
> optimization for now, until we find a better approach.
> The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
> it no longer properly rewrites the predicate to a NULL-aware left anti join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38132:


Assignee: Apache Spark

> Remove NotPropagation
> -
>
> Key: SPARK-38132
> URL: https://issues.apache.org/jira/browse/SPARK-38132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Apache Spark
>Priority: Major
>
> To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
> optimization for now, until we find a better approach.
> The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
> it no longer properly rewrites the predicate to a NULL-aware left anti join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38132:


Assignee: (was: Apache Spark)

> Remove NotPropagation
> -
>
> Key: SPARK-38132
> URL: https://issues.apache.org/jira/browse/SPARK-38132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
> optimization for now, until we find a better approach.
> The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
> it no longer properly rewrites the predicate to a NULL-aware left anti join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488414#comment-17488414
 ] 

Apache Spark commented on SPARK-38132:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/35428

> Remove NotPropagation
> -
>
> Key: SPARK-38132
> URL: https://issues.apache.org/jira/browse/SPARK-38132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
> optimization for now, until we find a better approach.
> The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
> it no longer properly rewrites the predicate to a NULL-aware left anti join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-38133:
-

 Summary: Grouping by timestamp_ntz will sometimes corrupt the 
results
 Key: SPARK-38133
 URL: https://issues.apache.org/jira/browse/SPARK-38133
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Bruce Robbins


Assume this data:
{noformat}
create or replace temp view v1 as
select * from values
  (1, timestamp_ntz'2012-01-01 00:00:00', 1),
  (2, timestamp_ntz'2012-01-01 00:00:00', 2),
  (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
  (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
  (2, timestamp_ntz'2013-01-01 00:00:00', 3)
  as data(a, b, c);
{noformat}
Run the following query:
{noformat}
select *
from v1
pivot (
  sum(c)
  for a in (1, 2)
);
{noformat}
You get incorrect results for the group-by column:
{noformat}
2012-01-01 19:05:19.476736  15000   2
2013-01-01 19:05:19.476736  48000   3
Time taken: 2.65 seconds, Fetched 2 row(s)
{noformat}
Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
timestamp_ntz column, you get incorrect results:
{noformat}
set spark.sql.codegen.wholeStage=false;
select a, b, sum(c) from v1 group by a, b;
{noformat}
This query produces
{noformat}
2   2012-01-01 09:32:39.738368  2
1   2013-01-01 09:32:39.738368  48000
2   2013-01-01 09:32:39.738368  3
Time taken: 1.927 seconds, Fetched 4 row(s)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488441#comment-17488441
 ] 

Bruce Robbins commented on SPARK-38133:
---

I think I have a handle on what is causing this, and will make a PR shortly.

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-07 Thread kk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488445#comment-17488445
 ] 

kk commented on SPARK-38115:


Thanks [~hyukjin.kwon] for responding.

Basically I am trying to write data to s3 from a Spark dataframe, and Spark will 
use FileOutputCommitter for this.

[https://stackoverflow.com/questions/46665299/spark-avoid-creating-temporary-directory-in-s3]

My requirement is to either redirect the '{*}_temporary{*}' path to a different 
s3 bucket and copy the result back to the original bucket, via some spark conf or 
a parameter in the write step,

or 

stop creating *_temporary* when writing to s3 altogether. 

As we have a version-enabled bucket, the _temporary prefix is kept in the object 
versions even though it is no longer physically present.

Below is the write step:

df.coalesce(1).write.format('parquet').mode('overwrite').save('{*}s3a{*}://outpath')
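
One possible direction, sketched below under the assumption that the S3A committers 
from the hadoop-cloud module are on the classpath: instead of relocating _temporary, 
the rename-based commit can be avoided by switching to an S3A committer (the config 
keys are Hadoop S3A committer settings passed through spark.hadoop.*; additional 
Spark-side committer bindings from the cloud integration docs are typically needed 
as well):

{code:scala}
// Sketch only, assuming the hadoop-cloud / S3A committer jars are available.
// These are Hadoop S3A committer settings; they avoid the rename-based
// _temporary commit rather than relocating it.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-sketch")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")         // or "directory" / "partitioned"
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true") // needed for the magic committer
  .getOrCreate()

// then the same write as above:
// df.coalesce(1).write.format("parquet").mode("overwrite").save("s3a://outpath")
{code}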

> No spark conf to control the path of _temporary when writing to target 
> filesystem
> -
>
> Key: SPARK-38115
> URL: https://issues.apache.org/jira/browse/SPARK-38115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.1
>Reporter: kk
>Priority: Minor
>  Labels: spark, spark-conf, spark-sql, spark-submit
>
> No default spark conf or param to control the '_temporary' path when writing 
> to filesystem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37410) Inline type hints for python/pyspark/ml/recommendation.py

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488487#comment-17488487
 ] 

Apache Spark commented on SPARK-37410:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35429

> Inline type hints for python/pyspark/ml/recommendation.py
> -
>
> Key: SPARK-37410
> URL: https://issues.apache.org/jira/browse/SPARK-37410
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/recommendation.pyi to 
> python/pyspark/ml/recommendation.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37410) Inline type hints for python/pyspark/ml/recommendation.py

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37410:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/ml/recommendation.py
> -
>
> Key: SPARK-37410
> URL: https://issues.apache.org/jira/browse/SPARK-37410
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/recommendation.pyi to 
> python/pyspark/ml/recommendation.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37410) Inline type hints for python/pyspark/ml/recommendation.py

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37410:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/recommendation.py
> -
>
> Key: SPARK-37410
> URL: https://issues.apache.org/jira/browse/SPARK-37410
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/recommendation.pyi to 
> python/pyspark/ml/recommendation.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38124) Revive HashClusteredDistribution and apply to all stateful operators

2022-02-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38124:
-
Priority: Blocker  (was: Major)

> Revive HashClusteredDistribution and apply to all stateful operators
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works great for non stateful operators, we still need to have a 
> separate requirement of distribution for stateful operator, because the 
> requirement of ClusteredDistribution is too relaxed while the requirement of 
> physical partitioning on stateful operator is quite strict.
> In most cases, stateful operators must require child distribution as 
> HashClusteredDistribution, under the major assumptions below:
>  # HashClusteredDistribution creates HashPartitioning and we will never ever 
> change it for the future.
>  # We will never ever change the implementation of {{partitionIdExpression}} 
> in HashPartitioning for the future, so that Partitioner will behave 
> consistently across Spark versions.
>  # No partitioning except HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it to make it 
> specific to stateful operators) and apply that distribution to all stateful 
> operators.
> SPARK-35703 only touched stream-stream join, which means the other stateful 
> operators have already been using ClusteredDistribution and hence have been 
> broken for a long time.
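
To illustrate why assumptions 1-3 above matter, here is a plain-Scala sketch (not 
Spark's actual HashPartitioning) of a hash-based key-to-partition mapping. 
Checkpointed state is laid out by partition id, so if this mapping ever changed 
between versions, a restarted streaming query would read the wrong state:

{code:scala}
// Plain-Scala illustration, not Spark's HashPartitioning implementation.
import scala.util.hashing.MurmurHash3

def partitionId(key: String, numPartitions: Int): Int = {
  val h = MurmurHash3.stringHash(key)
  ((h % numPartitions) + numPartitions) % numPartitions // non-negative modulo
}

// State for this key is stored under this partition id in the checkpoint.
// If a future Spark version changed the hash function or the modulo scheme,
// the same key would map elsewhere and the operator would miss its state.
val key = "user-42"
println(s"'$key' -> partition ${partitionId(key, 200)} of 200")
{code}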



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38124) Revive HashClusteredDistribution and apply to all stateful operators

2022-02-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38124:
-
Labels: correctness  (was: )

> Revive HashClusteredDistribution and apply to all stateful operators
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works great for non stateful operators, we still need to have a 
> separate requirement of distribution for stateful operator, because the 
> requirement of ClusteredDistribution is too relaxed while the requirement of 
> physical partitioning on stateful operator is quite strict.
> In most cases, stateful operators must require child distribution as 
> HashClusteredDistribution, under the major assumptions below:
>  # HashClusteredDistribution creates HashPartitioning and we will never ever 
> change it for the future.
>  # We will never ever change the implementation of {{partitionIdExpression}} 
> in HashPartitioning for the future, so that Partitioner will behave 
> consistently across Spark versions.
>  # No partitioning except HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it to make it 
> specific to stateful operators) and apply that distribution to all stateful 
> operators.
> SPARK-35703 only touched stream-stream join, which means the other stateful 
> operators have already been using ClusteredDistribution and hence have been 
> broken for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38124) Revive HashClusteredDistribution and apply to all stateful operators

2022-02-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38124:
-
Issue Type: Bug  (was: Improvement)

> Revive HashClusteredDistribution and apply to all stateful operators
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works great for non stateful operators, we still need to have a 
> separate requirement of distribution for stateful operator, because the 
> requirement of ClusteredDistribution is too relaxed while the requirement of 
> physical partitioning on stateful operator is quite strict.
> In most cases, stateful operators must require child distribution as 
> HashClusteredDistribution, under the major assumptions below:
>  # HashClusteredDistribution creates HashPartitioning and we will never ever 
> change it for the future.
>  # We will never ever change the implementation of {{partitionIdExpression}} 
> in HashPartitioning for the future, so that Partitioner will behave 
> consistently across Spark versions.
>  # No partitioning except HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it to make it 
> specific to stateful operators) and apply that distribution to all stateful 
> operators.
> SPARK-35703 only touched stream-stream join, which means the other stateful 
> operators have already been using ClusteredDistribution and hence have been 
> broken for a long time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38132:
-

Assignee: Kazuyuki Tanimura

> Remove NotPropagation
> -
>
> Key: SPARK-38132
> URL: https://issues.apache.org/jira/browse/SPARK-38132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
>
> To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
> optimization for now, until we find a better approach.
> The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
> it no longer properly rewrites the predicate to a NULL-aware left anti join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38132.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35428
[https://github.com/apache/spark/pull/35428]

> Remove NotPropagation
> -
>
> Key: SPARK-38132
> URL: https://issues.apache.org/jira/browse/SPARK-38132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
>
> To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
> optimization for now, until we find a better approach.
> The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
> it no longer properly rewrites the predicate to a NULL-aware left anti join.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488504#comment-17488504
 ] 

Apache Spark commented on SPARK-38133:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/35430

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38133:


Assignee: (was: Apache Spark)

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38133:


Assignee: Apache Spark

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-07 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488506#comment-17488506
 ] 

Kazuyuki Tanimura commented on SPARK-36665:
---

[~aokolnychyi] issue resolved.

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification}} should be able to do more simplifications for {{Not}} 
> operators by applying the following rules:
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488510#comment-17488510
 ] 

Dongjoon Hyun commented on SPARK-38133:
---

Does this happen on master branch only, [~bersprockets]?

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488514#comment-17488514
 ] 

Bruce Robbins commented on SPARK-38133:
---

[~dongjoon] 

>Does this happen on master branch only

As far as I know, TIMESTAMP_NTZ exists only on master branch.

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-07 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488520#comment-17488520
 ] 

Anton Okolnychyi commented on SPARK-36665:
--

Thank you [~kazuyukitanimura] [~viirya]!

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification}} should be able to do more simplifications for {{Not}} 
> operators by applying the following rules:
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37409) Inline type hints for python/pyspark/ml/pipeline.py

2022-02-07 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37409.

Fix Version/s: 3.3.0
 Assignee: Maciej Szymkiewicz
   Resolution: Fixed

Issue resolved by pull request 35408
https://github.com/apache/spark/pull/35408

> Inline type hints for python/pyspark/ml/pipeline.py
> ---
>
> Key: SPARK-37409
> URL: https://issues.apache.org/jira/browse/SPARK-37409
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/pipeline.pyi to 
> python/pyspark/ml/pipeline.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36969) Inline type hints for SparkContext

2022-02-07 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488529#comment-17488529
 ] 

Haejoon Lee commented on SPARK-36969:
-

[~dchvn] Just noticed the previous PR was closed without merging. Are you still 
working on this?

> Inline type hints for SparkContext
> --
>
> Key: SPARK-36969
> URL: https://issues.apache.org/jira/browse/SPARK-36969
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> Many files can remove 
> {code:java}
> # type: ignore[attr-defined]
> {code}
> if this file has inline type hints



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37412) Inline type hints for python/pyspark/ml/stat.py

2022-02-07 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37412:
--

Assignee: Maciej Szymkiewicz

> Inline type hints for python/pyspark/ml/stat.py
> ---
>
> Key: SPARK-37412
> URL: https://issues.apache.org/jira/browse/SPARK-37412
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/stat.pyi to 
> python/pyspark/ml/stat.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37412) Inline type hints for python/pyspark/ml/stat.py

2022-02-07 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37412.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35401
[https://github.com/apache/spark/pull/35401]

> Inline type hints for python/pyspark/ml/stat.py
> ---
>
> Key: SPARK-37412
> URL: https://issues.apache.org/jira/browse/SPARK-37412
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/stat.pyi to 
> python/pyspark/ml/stat.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37993) Avoid multiple calls to configuration parameter values

2022-02-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37993:
-
Priority: Trivial  (was: Major)

This is not "Major"

> Avoid multiple calls to configuration parameter values
> --
>
> Key: SPARK-37993
> URL: https://issues.apache.org/jira/browse/SPARK-37993
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0, 3.0.2, 3.0.3, 3.1.0, 3.1.2, 3.2.0
>Reporter: weixiuli
>Priority: Trivial
>
> Avoid multiple calls to configuration parameter values
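To illustrate the general pattern (an editor's sketch, not the specific call sites 
touched by the patch): read a configuration value once and reuse it, rather than 
parsing it on every invocation.
{code:scala}
import org.apache.spark.SparkConf

// Before: the config value is looked up and parsed on every call.
def compressOnEveryCall(conf: SparkConf): Boolean =
  conf.getBoolean("spark.shuffle.compress", true)

// After: read the value once at construction time and reuse it.
class ShuffleConfig(conf: SparkConf) {
  val compress: Boolean = conf.getBoolean("spark.shuffle.compress", true)
}
{code}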



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-07 Thread Leona Yoda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488047#comment-17488047
 ] 

Leona Yoda edited comment on SPARK-38116 at 2/8/22, 2:29 AM:
-

I posted a sample PR; however, the intended behavior might already be achievable if 
the user sets fetchSize > 0.

(cf.  
[https://github.com/apache/spark/blob/v3.2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L136])
 # If the user sets fetchSize > 0, autocommit is already set to false for read 
operations in the original code.
 # fetchSize is 0 by default, so if users only disable autocommit, the JDBC driver 
will still try to fetch all rows.

With the PR, users would be able to choose the combination of autocommit = true and 
fetchSize > 0 ... but the documentation says that combination won't work.

Are there any other cases where autocommit = true and fetchSize > 0 would work well? 


was (Author: yoda-mon):
I post sample PR, however, the intended behavior might be realized already if 
user set fetchSize > 0.

(cf.  
[https://github.com/apache/spark/blob/v3.2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L136])
 # If user set fetchsize > 0, autocommit will be set to false when reading 
operation in the original code.
 # fetchSize is set to 0 by default, so if users disable autocommit the jdbc 
driver will try to get all rows.

By the PR users will be able to choose the case autocommit true and fetchSize > 
0 ... but the document says it won't work.

 

Then, I think, in any case for reading operation, auto commit should be 
disabled. Removing the if condition on {{beforeFetch}} is considerable.

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Priority: Minor
>
> Currently, all the jdbc connections on executors side work always with auto 
> commit option set to true.
> However, there are cases where this mode makes hard to use 
> JdbcRelationProvider at all, i.e. reading huge datasets from Postgres (a 
> whole result set is collected regardless of a fetch size when autocommit is 
> set to true 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> So the proposal is following:
>  # Add a boolean option "autocommit" to JDBC Source allowing a user to turn 
> off autocommit mode for read only operations.
>  # Add guards which prevent using this option in DML operations.  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38116) Ability to turn off auto commit in JDBC source for read only operations

2022-02-07 Thread Leona Yoda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488047#comment-17488047
 ] 

Leona Yoda edited comment on SPARK-38116 at 2/8/22, 2:30 AM:
-

I posted a sample PR; however, the intended behavior might already be achievable if 
the user sets fetchSize > 0.

(cf.  
[https://github.com/apache/spark/blob/v3.2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L136])
 # If the user sets fetchSize > 0, autocommit is already set to false for read 
operations in the original code.
 # fetchSize is 0 by default, so if users only disable autocommit, the JDBC driver 
will still try to fetch all rows.

With the PR, users would be able to choose the combination of autocommit = true and 
fetchSize > 0 ... but the documentation says that combination won't work.

Are there any other cases where autocommit = true and fetchSize > 0 would work well? 
[~Koraseg] 


was (Author: yoda-mon):
I post sample PR, however, the intended behavior might be realized already if 
user set fetchSize > 0.

(cf.  
[https://github.com/apache/spark/blob/v3.2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L136])
 # If user set fetchsize > 0, autocommit will be set to false when reading 
operation in the original code.
 # fetchSize is set to 0 by default, so if users disable autocommit the jdbc 
driver will try to get all rows.

By the PR users will be able to choose the case autocommit true and fetchSize > 
0 ... but the document says it won't work.

Is there any other cases that autocommit true and fetchSize > 0 will work well 
? 

> Ability to turn off auto commit in JDBC source for read only operations
> ---
>
> Key: SPARK-38116
> URL: https://issues.apache.org/jira/browse/SPARK-38116
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Artem Kupchinskiy
>Priority: Minor
>
> Currently, all the jdbc connections on executors side work always with auto 
> commit option set to true.
> However, there are cases where this mode makes hard to use 
> JdbcRelationProvider at all, i.e. reading huge datasets from Postgres (a 
> whole result set is collected regardless of a fetch size when autocommit is 
> set to true 
> https://jdbc.postgresql.org/documentation/91/query.html#query-with-cursor )
> So the proposal is following:
>  # Add a boolean option "autocommit" to JDBC Source allowing a user to turn 
> off autocommit mode for read only operations.
>  # Add guards which prevent using this option in DML operations.  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent

2022-02-07 Thread LeeeeLiu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LLiu updated SPARK-38033:
-
Affects Version/s: 3.0.3

> The structured streaming processing cannot be started because the commitId 
> and offsetId are inconsistent
> 
>
> Key: SPARK-38033
> URL: https://issues.apache.org/jira/browse/SPARK-38033
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.6, 3.0.3
>Reporter: LLiu
>Priority: Major
>
> The streaming job could not be restarted after an unexpected machine shutdown.
> The exception is as follows
>  
> {code:java}
> ERROR 22/01/12 02:48:36 MicroBatchExecution: Query 
> streaming_4a026335eafd4bb498ee51752b49f7fb [id = 
> 647ba9e4-16d2-4972-9824-6f9179588806, runId = 
> 92385d5b-f31f-40d0-9ac7-bb7d9796d774] terminated with error
> java.lang.IllegalStateException: batch 113258 doesn't exist
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>         at scala.Option.getOrElse(Option.scala:121)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:255)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:169)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>         at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>         at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
>         at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> {code}
> I checked the checkpoint files on HDFS and found that the latest offset is 113259, 
> but the latest commit is 113257, as shown below:
> {code:java}
> commits
> /tmp/streaming_/commits/113253
> /tmp/streaming_/commits/113254
> /tmp/streaming_/commits/113255
> /tmp/streaming_/commits/113256
> /tmp/streaming_/commits/113257
> offset
> /tmp/streaming_/offsets/113253
> /tmp/streaming_/offsets/113254
> /tmp/streaming_/offsets/113255
> /tmp/streaming_/offsets/113256
> /tmp/streaming_/offsets/113257
> /tmp/streaming_/offsets/113259{code}
> Finally, I deleted the offset file “/tmp/streaming_/offsets/113259” and the 
> program started normally. I think there is a problem here, and we should either 
> handle this exception or give some guidance about the resolution in the log.
>  
>  
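As an editor's sketch of how to confirm the inconsistency (a manual inspection aid only, 
using the placeholder checkpoint path from this report), the batch ids in the offsets and 
commits directories can be compared to spot the gap:
{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

val checkpoint = "/tmp/streaming_"  // placeholder checkpoint root from the report
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// List the numeric batch ids stored under a checkpoint subdirectory.
def batchIds(dir: String): Seq[Long] =
  fs.listStatus(new Path(s"$checkpoint/$dir"))
    .map(_.getPath.getName)
    .filter(name => name.nonEmpty && name.forall(_.isDigit))
    .map(_.toLong)
    .sorted

val offsets = batchIds("offsets")  // e.g. ..., 113256, 113257, 113259
val commits = batchIds("commits")  // e.g. ..., 113256, 113257

// Offsets normally lead commits by at most one batch and are contiguous; a missing id
// (113258 here) is what makes the restart fail with "batch ... doesn't exist".
val gaps = offsets.sliding(2).collect { case Seq(a, b) if b != a + 1 => (a, b) }.toSeq
{code}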



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38133:
-

Assignee: Bruce Robbins

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

2022-02-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38133.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35430
[https://github.com/apache/spark/pull/35430]

> Grouping by timestamp_ntz will sometimes corrupt the results
> 
>
> Key: SPARK-38133
> URL: https://issues.apache.org/jira/browse/SPARK-38133
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.0
>
>
> Assume this data:
> {noformat}
> create or replace temp view v1 as
> select * from values
>   (1, timestamp_ntz'2012-01-01 00:00:00', 1),
>   (2, timestamp_ntz'2012-01-01 00:00:00', 2),
>   (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
>   (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
>   (2, timestamp_ntz'2013-01-01 00:00:00', 3)
>   as data(a, b, c);
> {noformat}
> Run the following query:
> {noformat}
> select *
> from v1
> pivot (
>   sum(c)
>   for a in (1, 2)
> );
> {noformat}
> You get incorrect results for the group-by column:
> {noformat}
> 2012-01-01 19:05:19.47673615000   2
> 2013-01-01 19:05:19.47673648000   3
> Time taken: 2.65 seconds, Fetched 2 row(s)
> {noformat}
> Actually, _whenever_ the TungstenAggregationIterator is used to group by a 
> timestamp_ntz column, you get incorrect results:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select a, b, sum(c) from v1 group by a, b;
> {noformat}
> This query produces
> {noformat}
> 2 2012-01-01 09:32:39.738368  2
> 1 2013-01-01 09:32:39.738368  48000
> 2 2013-01-01 09:32:39.738368  3
> Time taken: 1.927 seconds, Fetched 4 row(s)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38097) Improve the error for pivoting of unsupported value types

2022-02-07 Thread Yuto Akutsu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488576#comment-17488576
 ] 

Yuto Akutsu commented on SPARK-38097:
-

[~maxgekk] I'll work on this.

> Improve the error for pivoting of unsupported value types
> -
>
> Key: SPARK-38097
> URL: https://issues.apache.org/jira/browse/SPARK-38097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The error message from:
> {code:scala}
>   test("Improve the error for pivoting of unsupported value types") {
> trainingSales
>   .groupBy($"sales.year")
>   .pivot(struct(lower($"sales.course"), $"training"))
>   .agg(sum($"sales.earnings"))
>   .show(false)
>   }
> {code}
> can confuse users:
> {code:java}
> The feature is not supported: literal for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal 
> for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:245)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.$anonfun$pivot$2(RelationalGroupedDataset.scala:455)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> {code}
> Need to improve the error message and make it more precise.
> See https://github.com/apache/spark/pull/35302#discussion_r793629370



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36969) Inline type hints for SparkContext

2022-02-07 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488577#comment-17488577
 ] 

dch nguyen commented on SPARK-36969:


[~itholic] this was resolved by 
https://issues.apache.org/jira/browse/SPARK-37152. Thanks!

> Inline type hints for SparkContext
> --
>
> Key: SPARK-36969
> URL: https://issues.apache.org/jira/browse/SPARK-36969
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> Many files can remove 
> {code:java}
> # type: ignore[attr-defined]
> {code}
> once this file has inline type hints



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36444) Remove OptimizeSubqueries from batch of PartitionPruning

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488582#comment-17488582
 ] 

Apache Spark commented on SPARK-36444:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/35431

> Remove OptimizeSubqueries from batch of PartitionPruning
> 
>
> Key: SPARK-36444
> URL: https://issues.apache.org/jira/browse/SPARK-36444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> To support this case:
> {code:scala}
> sql(
> """
>   |SELECT date_id, product_id FROM fact_sk f
>   |JOIN (select store_id + 3 as new_store_id from dim_store where 
> country = 'US') s
>   |ON f.store_id = s.new_store_id
> """.stripMargin)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36969) Inline type hints for SparkContext

2022-02-07 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488594#comment-17488594
 ] 

Haejoon Lee commented on SPARK-36969:
-

Great!! Thanks, [~dchvn] :)

> Inline type hints for SparkContext
> --
>
> Key: SPARK-36969
> URL: https://issues.apache.org/jira/browse/SPARK-36969
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> Many files can remove 
> {code:java}
> # type: ignore[attr-defined]
> {code}
> once this file has inline type hints



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38134) Upgrade Arrow to 7.0.0

2022-02-07 Thread Yang Jie (Jira)
Yang Jie created SPARK-38134:


 Summary: Upgrade Arrow to 7.0.0
 Key: SPARK-38134
 URL: https://issues.apache.org/jira/browse/SPARK-38134
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38134) Upgrade Arrow to 7.0.0

2022-02-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-38134:
-
Description: https://arrow.apache.org/release/7.0.0.html

> Upgrade Arrow to 7.0.0
> --
>
> Key: SPARK-38134
> URL: https://issues.apache.org/jira/browse/SPARK-38134
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://arrow.apache.org/release/7.0.0.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37585) DSV2 InputMetrics are not getting update in corner case

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37585:


Assignee: Apache Spark

> DSV2 InputMetrics are not getting update in corner case
> ---
>
> Key: SPARK-37585
> URL: https://issues.apache.org/jira/browse/SPARK-37585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Assignee: Apache Spark
>Priority: Major
>
> In some corner cases, DSV2 is not updating the input metrics.
>  
> This is a very special case where the number of records read is less than 1000 
> and *hasNext* is not called for the last element (because input.hasNext returns false, 
> so MetricsIterator.hasNext is not called).
>  
> hasNext implementation of MetricsIterator:
>  
> {code:java}
> override def hasNext: Boolean = {
>   if (iter.hasNext) {
>     true
>   } else {
>     metricsHandler.updateMetrics(0, force = true)
>     false
>   }
> }
> {code}
>  
> You can reproduce this issue easily in spark-shell by running the code below:
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
> spark.conf.set("spark.sql.sources.useV1SourceList", "")
> val dir = "Users/tmp1"
> spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
> val df = spark.read.format("parquet").load(dir)
> val bytesReads = new mutable.ArrayBuffer[Long]()
> val recordsRead = new mutable.ArrayBuffer[Long]()
> val bytesReadListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
>     recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
>   }
> }
> spark.sparkContext.addSparkListener(bytesReadListener)
> try {
> df.limit(10).collect()
> assert(recordsRead.sum > 0)
> assert(bytesReads.sum > 0)
> } finally {
> spark.sparkContext.removeSparkListener(bytesReadListener)
> } {code}
> This code generally fails at *assert(bytesReads.sum > 0)* which confirms that 
> updateMetrics API is not called
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37585) DSV2 InputMetrics are not getting update in corner case

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37585:


Assignee: (was: Apache Spark)

> DSV2 InputMetrics are not getting update in corner case
> ---
>
> Key: SPARK-37585
> URL: https://issues.apache.org/jira/browse/SPARK-37585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Priority: Major
>
> In some corner cases, DSV2 is not updating the input metrics.
>  
> This is a very special case where the number of records read is less than 1000 
> and *hasNext* is not called for the last element (because input.hasNext returns false, 
> so MetricsIterator.hasNext is not called).
>  
> hasNext implementation of MetricsIterator:
>  
> {code:java}
> override def hasNext: Boolean = {
>   if (iter.hasNext) {
>     true
>   } else {
>     metricsHandler.updateMetrics(0, force = true)
>     false
>   }
> }
> {code}
>  
> You can reproduce this issue easily in spark-shell by running the code below:
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
> spark.conf.set("spark.sql.sources.useV1SourceList", "")
> val dir = "Users/tmp1"
> spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
> val df = spark.read.format("parquet").load(dir)
> val bytesReads = new mutable.ArrayBuffer[Long]()
> val recordsRead = new mutable.ArrayBuffer[Long]()
> val bytesReadListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
>     recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
>   }
> }
> spark.sparkContext.addSparkListener(bytesReadListener)
> try {
> df.limit(10).collect()
> assert(recordsRead.sum > 0)
> assert(bytesReads.sum > 0)
> } finally {
> spark.sparkContext.removeSparkListener(bytesReadListener)
> } {code}
> This code generally fails at *assert(bytesReads.sum > 0)* which confirms that 
> updateMetrics API is not called
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37585) DSV2 InputMetrics are not getting update in corner case

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488596#comment-17488596
 ] 

Apache Spark commented on SPARK-37585:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/35432

> DSV2 InputMetrics are not getting update in corner case
> ---
>
> Key: SPARK-37585
> URL: https://issues.apache.org/jira/browse/SPARK-37585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Priority: Major
>
> In some corner cases, DSV2 is not updating the input metrics.
>  
> This is a very special case where the number of records read is less than 1000 
> and *hasNext* is not called for the last element (because input.hasNext returns false, 
> so MetricsIterator.hasNext is not called).
>  
> hasNext implementation of MetricsIterator:
>  
> {code:java}
> override def hasNext: Boolean = {
>   if (iter.hasNext) {
>     true
>   } else {
>     metricsHandler.updateMetrics(0, force = true)
>     false
>   }
> }
> {code}
>  
> You can reproduce this issue easily in spark-shell by running the code below:
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
> spark.conf.set("spark.sql.sources.useV1SourceList", "")
> val dir = "Users/tmp1"
> spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
> val df = spark.read.format("parquet").load(dir)
> val bytesReads = new mutable.ArrayBuffer[Long]()
> val recordsRead = new mutable.ArrayBuffer[Long]()
> val bytesReadListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
>     recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
>   }
> }
> spark.sparkContext.addSparkListener(bytesReadListener)
> try {
> df.limit(10).collect()
> assert(recordsRead.sum > 0)
> assert(bytesReads.sum > 0)
> } finally {
> spark.sparkContext.removeSparkListener(bytesReadListener)
> } {code}
> This code generally fails at *assert(bytesReads.sum > 0)* which confirms that 
> updateMetrics API is not called
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37585) DSV2 InputMetrics are not getting update in corner case

2022-02-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488597#comment-17488597
 ] 

Apache Spark commented on SPARK-37585:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/35432

> DSV2 InputMetrics are not getting update in corner case
> ---
>
> Key: SPARK-37585
> URL: https://issues.apache.org/jira/browse/SPARK-37585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Priority: Major
>
> In some corner cases, DSV2 is not updating the input metrics.
>  
> This is a very special case where the number of records read is less than 1000 
> and *hasNext* is not called for the last element (because input.hasNext returns false, 
> so MetricsIterator.hasNext is not called).
>  
> hasNext implementation of MetricsIterator:
>  
> {code:java}
> override def hasNext: Boolean = {
>   if (iter.hasNext) {
>     true
>   } else {
>     metricsHandler.updateMetrics(0, force = true)
>     false
>   }
> }
> {code}
>  
> You can reproduce this issue easily in spark-shell by running the code below:
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
> spark.conf.set("spark.sql.sources.useV1SourceList", "")
> val dir = "Users/tmp1"
> spark.range(0, 100).write.format("parquet").mode("overwrite").save(dir)
> val df = spark.read.format("parquet").load(dir)
> val bytesReads = new mutable.ArrayBuffer[Long]()
> val recordsRead = new mutable.ArrayBuffer[Long]()
> val bytesReadListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
>     recordsRead += taskEnd.taskMetrics.inputMetrics.recordsRead
>   }
> }
> spark.sparkContext.addSparkListener(bytesReadListener)
> try {
> df.limit(10).collect()
> assert(recordsRead.sum > 0)
> assert(bytesReads.sum > 0)
> } finally {
> spark.sparkContext.removeSparkListener(bytesReadListener)
> } {code}
> This code generally fails at *assert(bytesReads.sum > 0)* which confirms that 
> updateMetrics API is not called
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38134) Upgrade Arrow to 7.0.0

2022-02-07 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488601#comment-17488601
 ] 

Yang Jie commented on SPARK-38134:
--

Waiting for the release artifacts to become downloadable from the Maven Central repository.

 

> Upgrade Arrow to 7.0.0
> --
>
> Key: SPARK-38134
> URL: https://issues.apache.org/jira/browse/SPARK-38134
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://arrow.apache.org/release/7.0.0.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-02-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38030:
---

Assignee: Shardul Mahadik

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -
>
> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Assignee: Shardul Mahadik
>Priority: Major
>
> One of our user queries failed in Spark 3.1.1 when using AQE with the 
> stacktrace shown below (some parts of the plan have been 
> redacted, but the structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling 
> [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].
> The query contains a cast over a column with non-nullable struct fields. 
> Canonicalization [removes nullability 
> information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
>  from the child {{AttributeReference}} of the Cast, however it does not 
> remove nullability information from the Cast's target dataType. This causes 
> the 
> [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
>  to return false because the child is now nullable and cast target data type 
> is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>:- Project [cast(columnA#30) as struct<...>]
>:  +- BatchScan[columnA#30] hive.tbl 
>+- Project [cast(columnA#35) as struct<...>]
>   +- BatchScan[columnA#35] hive.tbl
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
>   at org.apac

[jira] [Resolved] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1

2022-02-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38030.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35332
[https://github.com/apache/spark/pull/35332]

> Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
> -
>
> Key: SPARK-38030
> URL: https://issues.apache.org/jira/browse/SPARK-38030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Shardul Mahadik
>Assignee: Shardul Mahadik
>Priority: Major
> Fix For: 3.3.0
>
>
> One of our user queries failed in Spark 3.1.1 when using AQE with the 
> stacktrace shown below (some parts of the plan have been 
> redacted, but the structure is preserved).
> Debugging this issue, we found that the failure was within AQE calling 
> [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402].
> The query contains a cast over a column with non-nullable struct fields. 
> Canonicalization [removes nullability 
> information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45]
>  from the child {{AttributeReference}} of the Cast, however it does not 
> remove nullability information from the Cast's target dataType. This causes 
> the 
> [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290]
>  to return false because the child is now nullable and cast target data type 
> is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}.
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree:
> Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232]
> +- Union
>:- Project [cast(columnA#30) as struct<...>]
>:  +- BatchScan[columnA#30] hive.tbl 
>+- Project [cast(columnA#35) as struct<...>]
>   +- BatchScan[columnA#35] hive.tbl
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87)
>   at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179)
>   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.execut

[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema

2022-02-07 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488610#comment-17488610
 ] 

Wenchen Fan commented on SPARK-35531:
-

[~angerszhuuu] can you help to backport it?

> Can not insert into hive bucket table if create table with upper case schema
> 
>
> Key: SPARK-35531
> URL: https://issues.apache.org/jira/browse/SPARK-35531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Hongyi Zhang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
>  
>  
> create table TEST1(
>  V1 BIGINT,
>  S1 INT)
>  partitioned by (PK BIGINT)
>  clustered by (V1)
>  sorted by (S1)
>  into 200 buckets
>  STORED AS PARQUET;
>  
> insert into test1
>  select
>  * from values(1,1,1);
>  
>  
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30661) KMeans blockify input vectors

2022-02-07 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488611#comment-17488611
 ] 

zhengruifeng commented on SPARK-30661:
--

Since the input datasets for k-means are likely dense, I tend to add a dense 
impl as an alternative.

 

I think we can do it this way:

step1: move the existing impl to the .ml side. I think we should keep the existing 
impl to avoid possible regressions;

step2: make .mllib.kmeans call .ml.kmeans internally. We also need to 
support initialization with an existing model on the .ml side, since .mllib.kmeans 
supports this;

step3: add the new dense impl. Make .ml.kmeans extend HasSolver so that it 
supports three options: row-based (default), block-based, and auto. If the end user 
sets it to auto, the impl will check the sparsity and choose the underlying impl 
(a rough sketch of such a check is at the end of this comment);

step4: sync the change to the Python side.

 

cc [~srowen] 
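A rough sketch of the kind of sparsity check step 3 describes (editor's illustration; 
{{sparsity}}, {{chooseSolver}}, and the threshold are made-up names for this sketch, not 
an actual Spark API):
{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Fraction of zero entries across the whole dataset.
def sparsity(data: RDD[Vector]): Double = {
  val (nnz, total) = data
    .map(v => (v.numNonzeros.toLong, v.size.toLong))
    .reduce { case ((n1, t1), (n2, t2)) => (n1 + n2, t1 + t2) }
  1.0 - nnz.toDouble / total
}

// "auto" mode: use the block-based (dense) impl for mostly dense data,
// otherwise keep the existing row-based impl.
def chooseSolver(data: RDD[Vector], denseThreshold: Double = 0.5): String =
  if (sparsity(data) <= denseThreshold) "block-based" else "row-based"
{code}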

> KMeans blockify input vectors
> -
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Attachments: blockify_kmeans.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38097) Improve the error for pivoting of unsupported value types

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38097:


Assignee: Apache Spark

> Improve the error for pivoting of unsupported value types
> -
>
> Key: SPARK-38097
> URL: https://issues.apache.org/jira/browse/SPARK-38097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The error message from:
> {code:scala}
>   test("Improve the error for pivoting of unsupported value types") {
> trainingSales
>   .groupBy($"sales.year")
>   .pivot(struct(lower($"sales.course"), $"training"))
>   .agg(sum($"sales.earnings"))
>   .show(false)
>   }
> {code}
> can confuse users:
> {code:java}
> The feature is not supported: literal for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal 
> for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:245)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.$anonfun$pivot$2(RelationalGroupedDataset.scala:455)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> {code}
> Need to improve the error message and make it more precise.
> See https://github.com/apache/spark/pull/35302#discussion_r793629370



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38097) Improve the error for pivoting of unsupported value types

2022-02-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38097:


Assignee: (was: Apache Spark)

> Improve the error for pivoting of unsupported value types
> -
>
> Key: SPARK-38097
> URL: https://issues.apache.org/jira/browse/SPARK-38097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The error message from:
> {code:scala}
>   test("Improve the error for pivoting of unsupported value types") {
> trainingSales
>   .groupBy($"sales.year")
>   .pivot(struct(lower($"sales.course"), $"training"))
>   .agg(sum($"sales.earnings"))
>   .show(false)
>   }
> {code}
> can confuse users:
> {code:java}
> The feature is not supported: literal for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal 
> for '[dotnet,Dummies]' of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:245)
>   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.$anonfun$pivot$2(RelationalGroupedDataset.scala:455)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> {code}
> Need to improve the error message and make it more precise.
> See https://github.com/apache/spark/pull/35302#discussion_r793629370



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


