[jira] [Commented] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
[ https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529233#comment-17529233 ] Apache Spark commented on SPARK-39047: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36390 > Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE > > > Key: SPARK-39047 > URL: https://issues.apache.org/jira/browse/SPARK-39047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and > remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackie Zhang resolved SPARK-39013. -- Resolution: Won't Fix > Parser changes to enforce `()` for creating table without any columns > - > > Key: SPARK-39013 > URL: https://issues.apache.org/jira/browse/SPARK-39013 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jackie Zhang >Priority: Major > > We would like to enforce the `()` for `CREATE TABLE` queries to explicitly > indicate that a table without any columns will be created. > E.g. `CREATE TABLE table () USING DELTA`. > The existing behavior of CTAS and CREATE external table at a location is not > affected. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
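For context, a minimal sketch of the syntax this ticket proposed (the issue was closed as Won't Fix, so none of this was merged); the table names and the `parquet` source below are illustrative assumptions, not from the ticket, whose own example used `USING DELTA`:

{code:scala}
// Proposed (never implemented): an explicit empty column list for a column-less table.
spark.sql("CREATE TABLE empty_tbl () USING parquet")

// Explicitly out of scope per the ticket: CTAS and CREATE external table at a location
// keep their existing behavior of deriving the schema.
spark.sql("CREATE TABLE ctas_tbl USING parquet AS SELECT 1 AS id")
spark.sql("CREATE TABLE ext_tbl USING parquet LOCATION '/tmp/existing_data'")
{code}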
[jira] [Assigned] (SPARK-39052) Support Char in Literal.create
[ https://issues.apache.org/jira/browse/SPARK-39052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39052: Assignee: Hyukjin Kwon > Support Char in Literal.create > -- > > Key: SPARK-39052 > URL: https://issues.apache.org/jira/browse/SPARK-39052 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 > added support for Char in Literal. Literal.create should work with > this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39052) Support Char in Literal.create
[ https://issues.apache.org/jira/browse/SPARK-39052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39052. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36389 [https://github.com/apache/spark/pull/36389] > Support Char in Literal.create > -- > > Key: SPARK-39052 > URL: https://issues.apache.org/jira/browse/SPARK-39052 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 > added support for Char in Literal. Literal.create should work with > this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
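As a rough illustration of what the fix enables — a minimal sketch against the internal Catalyst API; the exact overload behavior shown is an assumption based on the issue text and the linked commit, not a verified description of the merged change:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StringType

// Literal.apply already maps a Scala Char to a string literal (commit 54fcaafb):
val fromApply = Literal('x')

// With this fix, Literal.create is expected to handle Char the same way
// instead of failing to derive a Catalyst type for it:
val fromCreate = Literal.create('x')

assert(fromApply.dataType == StringType)
assert(fromCreate.dataType == StringType)
{code}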
[jira] [Assigned] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
[ https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39047: Assignee: Max Gekk > Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE > > > Key: SPARK-39047 > URL: https://issues.apache.org/jira/browse/SPARK-39047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and > remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
[ https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39047. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36380 [https://github.com/apache/spark/pull/36380] > Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE > > > Key: SPARK-39047 > URL: https://issues.apache.org/jira/browse/SPARK-39047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and > remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39046) Return an empty context string if TreeNode.origin is wrongly set
[ https://issues.apache.org/jira/browse/SPARK-39046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-39046. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36379 [https://github.com/apache/spark/pull/36379] > Return an empty context string if TreeNode.origin is wrongly set > > > Key: SPARK-39046 > URL: https://issues.apache.org/jira/browse/SPARK-39046 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39052) Support Char in Literal.create
[ https://issues.apache.org/jira/browse/SPARK-39052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39052: Assignee: (was: Apache Spark) > Support Char in Literal.create > -- > > Key: SPARK-39052 > URL: https://issues.apache.org/jira/browse/SPARK-39052 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 > added support for Char in Literal. Literal.create should work with > this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39052) Support Char in Literal.create
[ https://issues.apache.org/jira/browse/SPARK-39052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529147#comment-17529147 ] Apache Spark commented on SPARK-39052: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36389 > Support Char in Literal.create > -- > > Key: SPARK-39052 > URL: https://issues.apache.org/jira/browse/SPARK-39052 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 > added support for Char in Literal. Literal.create should work with > this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39052) Support Char in Literal.create
[ https://issues.apache.org/jira/browse/SPARK-39052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39052: Assignee: Apache Spark > Support Char in Literal.create > -- > > Key: SPARK-39052 > URL: https://issues.apache.org/jira/browse/SPARK-39052 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 > added support for Char in Literal. Literal.create should work with > this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39052) Support Char in Literal.create
[ https://issues.apache.org/jira/browse/SPARK-39052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39052: - Affects Version/s: 3.4.0 (was: 3.3.0) > Support Char in Literal.create > -- > > Key: SPARK-39052 > URL: https://issues.apache.org/jira/browse/SPARK-39052 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 > added support for Char in Literal. Literal.create should work with > this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39052) Support Char in Literal.create
[ https://issues.apache.org/jira/browse/SPARK-39052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39052: - Priority: Minor (was: Major) > Support Char in Literal.create > -- > > Key: SPARK-39052 > URL: https://issues.apache.org/jira/browse/SPARK-39052 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 > added support for Char in Literal. Literal.create should work with > this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39052) Support Char in Literal.create
Hyukjin Kwon created SPARK-39052: Summary: Support Char in Literal.create Key: SPARK-39052 URL: https://issues.apache.org/jira/browse/SPARK-39052 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon https://github.com/apache/spark/commit/54fcaafb094e299f21c18370fddb4a727c88d875 added support for Char in Literal. Literal.create should work with this too. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39036) Support Alter Table/Partition Concatenate command
[ https://issues.apache.org/jira/browse/SPARK-39036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39036: - Summary: Support Alter Table/Partition Concatenate command (was: support Alter Table/Partition Concatenate command) > Support Alter Table/Partition Concatenate command > - > > Key: SPARK-39036 > URL: https://issues.apache.org/jira/browse/SPARK-39036 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: gabrywu >Priority: Major > > Hi, folks, > In Hive, we can use the following command to merge small files; however, Spark does > not have a corresponding command. > I believe it's useful, and using AQE alone is not enough. Is anyone working > on merging small files? If not, I'd like to create a PR to implement it. > > {code:java} > ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, > ...])] CONCATENATE;{code} > > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate] > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529144#comment-17529144 ] Hyukjin Kwon commented on SPARK-39044: -- [~rshkv] it would be much easier to debug if there were a self-contained reproducer > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}}) and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeEx
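For readers unfamiliar with the code path in the trace above, here is a minimal sketch of how `observe()` is typically invoked. It uses a built-in aggregate and a hypothetical output path; the custom TypedImperativeAggregate from the report relies on internal APIs and is not reproduced here:

{code:scala}
import org.apache.spark.sql.functions._

// Observed metrics are collected via an AggregatingAccumulator under the hood;
// the NullPointerException above is thrown while serializing that accumulator's buffer.
val df = spark.range(0, 1000).toDF("id")
val observed = df.observe("stats", count(lit(1)).as("rows"), sum(col("id")).as("idSum"))

// Any action exercises the accumulator; the write below is only an example action.
observed.write.mode("overwrite").parquet("/tmp/observe_example")  // hypothetical path
{code}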
[jira] [Assigned] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38988: Assignee: Xinrong Meng > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Xinrong Meng >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I add a file and a notebook with the info msg I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38988. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36367 [https://github.com/apache/spark/pull/36367] > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > Attachments: Untitled.html, info.txt, warning printed.txt > > > I add a file and a notebook with the info msg I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39049) Remove unneeded pass
[ https://issues.apache.org/jira/browse/SPARK-39049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39049: Assignee: Bjørn Jørgensen > Remove unneeded pass > > > Key: SPARK-39049 > URL: https://issues.apache.org/jira/browse/SPARK-39049 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > Remove unneeded pass -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39049) Remove unneeded pass
[ https://issues.apache.org/jira/browse/SPARK-39049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39049. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36383 [https://github.com/apache/spark/pull/36383] > Remove unneeded pass > > > Key: SPARK-39049 > URL: https://issues.apache.org/jira/browse/SPARK-39049 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.3.0 > > > Remove unneeded pass -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
[ https://issues.apache.org/jira/browse/SPARK-39051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39051: Assignee: Xinrong Meng > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` > -- > > Key: SPARK-39051 > URL: https://issues.apache.org/jira/browse/SPARK-39051 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
[ https://issues.apache.org/jira/browse/SPARK-39051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39051. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36384 [https://github.com/apache/spark/pull/36384] > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` > -- > > Key: SPARK-39051 > URL: https://issues.apache.org/jira/browse/SPARK-39051 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39038) Skip reporting test results if triggering workflow was skipped
[ https://issues.apache.org/jira/browse/SPARK-39038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39038: Assignee: Enrico Minack > Skip reporting test results if triggering workflow was skipped > -- > > Key: SPARK-39038 > URL: https://issues.apache.org/jira/browse/SPARK-39038 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Assignee: Enrico Minack >Priority: Major > > The `"Report test results"` workflow is triggered when either `"Build and > test"` or `"Build and test (ANSI)"` completes. On fork repositories, the workflow > `"Build and test (ANSI)"` is always skipped. > The triggered `"Report test results"` workflow downloads artifacts from the > triggering workflow and fails because there are no artifacts. > Therefore, the `"Report test results"` workflow should be skipped when the > triggering workflow completed with conclusion `'skipped'`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39038) Skip reporting test results if triggering workflow was skipped
[ https://issues.apache.org/jira/browse/SPARK-39038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39038. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36371 [https://github.com/apache/spark/pull/36371] > Skip reporting test results if triggering workflow was skipped > -- > > Key: SPARK-39038 > URL: https://issues.apache.org/jira/browse/SPARK-39038 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Assignee: Enrico Minack >Priority: Major > Fix For: 3.4.0 > > > The `"Report test results"` workflow is triggered when either `"Build and > test"` or `"Build and test (ANSI)"` completes. On fork repositories, the workflow > `"Build and test (ANSI)"` is always skipped. > The triggered `"Report test results"` workflow downloads artifacts from the > triggering workflow and fails because there are no artifacts. > Therefore, the `"Report test results"` workflow should be skipped when the > triggering workflow completed with conclusion `'skipped'`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38918) Nested column pruning should filter out attributes that do not belong to the current relation
[ https://issues.apache.org/jira/browse/SPARK-38918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529116#comment-17529116 ] Apache Spark commented on SPARK-38918: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/36388 > Nested column pruning should filter out attributes that do not belong to the > current relation > - > > Key: SPARK-38918 > URL: https://issues.apache.org/jira/browse/SPARK-38918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > `SchemaPruning` currently does not check if the root field of a nested column > belongs to the current relation. This can happen when the filter contains > correlated subqueries, where the children field can contain attributes from > both the inner and the outer query. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
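As a rough illustration only — the tables and nested columns below are hypothetical, not from the ticket — this is the shape of query the issue describes: a filter containing a correlated subquery whose predicate references nested fields of both the inner and the outer relation, which schema pruning previously mixed up:

{code:scala}
// Hypothetical schemas: orders(item STRUCT<id: LONG, owner_id: LONG>),
// events(payload STRUCT<user_id: LONG, data: STRING>).
spark.sql("""
  SELECT o.item.id
  FROM   orders o
  WHERE  EXISTS (
    SELECT 1
    FROM   events e
    WHERE  e.payload.user_id = o.item.owner_id  -- nested fields from inner and outer query
  )
""")
{code}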
[jira] [Commented] (SPARK-38918) Nested column pruning should filter out attributes that do not belong to the current relation
[ https://issues.apache.org/jira/browse/SPARK-38918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529114#comment-17529114 ] Apache Spark commented on SPARK-38918: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/36387 > Nested column pruning should filter out attributes that do not belong to the > current relation > - > > Key: SPARK-38918 > URL: https://issues.apache.org/jira/browse/SPARK-38918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > `SchemaPruning` currently does not check if the root field of a nested column > belongs to the current relation. This can happen when the filter contains > correlated subqueries, where the children field can contain attributes from > both the inner and the outer query. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38918) Nested column pruning should filter out attributes that do not belong to the current relation
[ https://issues.apache.org/jira/browse/SPARK-38918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529113#comment-17529113 ] Apache Spark commented on SPARK-38918: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/36386 > Nested column pruning should filter out attributes that do not belong to the > current relation > - > > Key: SPARK-38918 > URL: https://issues.apache.org/jira/browse/SPARK-38918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > `SchemaPruning` currently does not check if the root field of a nested column > belongs to the current relation. This can happen when the filter contains > correlated subqueries, where the children field can contain attributes from > both the inner and the outer query. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
[ https://issues.apache.org/jira/browse/SPARK-39051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39051: Assignee: (was: Apache Spark) > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` > -- > > Key: SPARK-39051 > URL: https://issues.apache.org/jira/browse/SPARK-39051 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
[ https://issues.apache.org/jira/browse/SPARK-39051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39051: Assignee: Apache Spark > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` > -- > > Key: SPARK-39051 > URL: https://issues.apache.org/jira/browse/SPARK-39051 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
[ https://issues.apache.org/jira/browse/SPARK-39051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529109#comment-17529109 ] Apache Spark commented on SPARK-39051: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36384 > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` > -- > > Key: SPARK-39051 > URL: https://issues.apache.org/jira/browse/SPARK-39051 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Minor refactoring of `python/pyspark/sql/pandas/conversion.py` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39051) Minor refactoring of `python/pyspark/sql/pandas/conversion.py`
Xinrong Meng created SPARK-39051: Summary: Minor refactoring of `python/pyspark/sql/pandas/conversion.py` Key: SPARK-39051 URL: https://issues.apache.org/jira/browse/SPARK-39051 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Minor refactoring of `python/pyspark/sql/pandas/conversion.py` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38388) Repartition + Stage retries could lead to incorrect data
[ https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529097#comment-17529097 ] Jason Xu commented on SPARK-38388: -- Hi [~cloud_fan], could you assign this ticket to me? I have bandwidth to work on it in May. Another possible solution: Since the root cause is related to non-deterministic data in shuffling, is it possible to let the driver keep checksums of all shuffle blocks? If a map task re-attempt generates a shuffle block with a different checksum, Spark can detect it on the fly and rerun all reduce tasks to avoid the correctness issue. I feel this could be a better solution because it is transparent to users; it doesn't require them to explicitly mark their data as non-deterministic. There are challenges with the other solution: 1. It wouldn't be easy to educate regular Spark users about the issue; they might not see the advice or understand the importance of marking DeterministicLevel. 2. Even if they understand, it's hard for users to always remember to mark their data as non-deterministic. What do you think? > Repartition + Stage retries could lead to incorrect data > - > > Key: SPARK-38388 > URL: https://issues.apache.org/jira/browse/SPARK-38388 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.1.1 > Environment: Spark 2.4 and 3.1 >Reporter: Jason Xu >Priority: Major > Labels: correctness, data-loss > > Spark repartition uses RoundRobinPartitioning; the generated result is > non-deterministic when the data has some randomness and stage/task retries happen. > The bug can be triggered when upstream data has some randomness, a > repartition is called on it, and a result stage (possibly more stages) follows. > As the pattern shows below: > upstream stage (data with randomness) -> (repartition shuffle) -> result stage > When one executor goes down at the result stage, some tasks of that stage might > have finished while others fail; shuffle files on that executor also get > lost, and some tasks from the previous stage (upstream data generation, repartition) > need to rerun to regenerate the dependent shuffle data files. > Because the data has some randomness, the data regenerated by the retried upstream tasks > is slightly different, repartition then produces an inconsistent ordering, and the > retried tasks at the result stage generate different data. > This is similar to but different from > https://issues.apache.org/jira/browse/SPARK-23207; the fix there uses an extra > local sort to make the row ordering deterministic, and the sorting algorithm it > uses simply compares row/record hashes. But in this case the upstream data has > some randomness, the sort doesn't keep the order stable, and thus > RoundRobinPartitioning introduces a non-deterministic result. > The following code returns 986415, instead of 1000000: > {code:java} > import scala.sys.process._ > import org.apache.spark.TaskContext > case class TestObject(id: Long, value: Double) > val ds = spark.range(0, 1000 * 1000, 1).repartition(100, > $"id").withColumn("val", rand()).repartition(100).map { > row => if (TaskContext.get.stageAttemptNumber == 0 && > TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) { > throw new Exception("pkill -f java".!!) > } > TestObject(row.getLong(0), row.getDouble(1)) > } > ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table") > spark.sql("select count(distinct id) from tmp.test_table").show{code} > Command: > {code:java} > spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false > --conf spark.shuffle.service.enabled=false){code} > To simulate the issue, disabling the external shuffle service is needed (if it's > also enabled by default in your environment); this is to trigger shuffle > file loss and previous-stage retries. > In our production environment, we have the external shuffle service enabled; this data > correctness issue happened when there were node losses. > Although there's some non-deterministic factor in the upstream data, users > wouldn't expect to see an incorrect result. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
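Not from the ticket or either of the proposed fixes — purely an illustrative, user-side mitigation sketch for the pattern above: partitioning on a deterministic key instead of round-robin keeps each row's partition assignment stable across task re-attempts, which avoids lost or duplicated rows even though the rand() values themselves still differ on retry:

{code:scala}
import org.apache.spark.sql.functions._

// Hash-partition on the deterministic "id" column rather than round-robin repartition(100).
// Re-attempted upstream tasks then route every id to the same partition again.
val stable = spark.range(0, 1000 * 1000)
  .withColumn("val", rand())
  .repartition(100, col("id"))
{code}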
[jira] [Commented] (SPARK-39050) Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE
[ https://issues.apache.org/jira/browse/SPARK-39050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529093#comment-17529093 ] Apache Spark commented on SPARK-39050: -- User 'srielau' has created a pull request for this issue: https://github.com/apache/spark/pull/36385 > Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE > > > Key: SPARK-39050 > URL: https://issues.apache.org/jira/browse/SPARK-39050 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Serge Rielau >Priority: Major > > UNSUPPORTED_OPERATION is very similar to UNSUPPORTED_FEATURE. > We can just roll them together -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39050) Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE
[ https://issues.apache.org/jira/browse/SPARK-39050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39050: Assignee: Apache Spark > Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE > > > Key: SPARK-39050 > URL: https://issues.apache.org/jira/browse/SPARK-39050 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Serge Rielau >Assignee: Apache Spark >Priority: Major > > UNSUPPORTED_OPERATION is very similar to UNSUPPORTED_FEATURE. > We can just roll them together -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39050) Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE
[ https://issues.apache.org/jira/browse/SPARK-39050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39050: Assignee: (was: Apache Spark) > Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE > > > Key: SPARK-39050 > URL: https://issues.apache.org/jira/browse/SPARK-39050 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Serge Rielau >Priority: Major > > UNSUPPORTED_OPERATION is very similar to UNSUPPORTED_FEATURE. > We can just roll them together -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39050) Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE
Serge Rielau created SPARK-39050: Summary: Convert UNSUPPORTED_OPERATION to UNSUPPORTED_FEATURE Key: SPARK-39050 URL: https://issues.apache.org/jira/browse/SPARK-39050 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Serge Rielau UNSUPPORTED_OPERATION is very similar to UNSUPPORTED_FEATURE. We can just roll them together -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39049) Remove unneeded pass
[ https://issues.apache.org/jira/browse/SPARK-39049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39049: Assignee: (was: Apache Spark) > Remove unneeded pass > > > Key: SPARK-39049 > URL: https://issues.apache.org/jira/browse/SPARK-39049 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Remove unneeded pass -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39049) Remove unneeded pass
[ https://issues.apache.org/jira/browse/SPARK-39049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528941#comment-17528941 ] Apache Spark commented on SPARK-39049: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/36383 > Remove unneeded pass > > > Key: SPARK-39049 > URL: https://issues.apache.org/jira/browse/SPARK-39049 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Remove unneeded pass -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39049) Remove unneeded pass
[ https://issues.apache.org/jira/browse/SPARK-39049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39049: Assignee: Apache Spark > Remove unneeded pass > > > Key: SPARK-39049 > URL: https://issues.apache.org/jira/browse/SPARK-39049 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > Remove unneeded pass -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39049) Remove unneeded pass
Bjørn Jørgensen created SPARK-39049: --- Summary: Remove unneeded pass Key: SPARK-39049 URL: https://issues.apache.org/jira/browse/SPARK-39049 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Bjørn Jørgensen Remove unneeded pass -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38988: Assignee: (was: Apache Spark) > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I add a file and a notebook with the info msg I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38988: Assignee: Apache Spark > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I add a file and a notebook with the info msg I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528933#comment-17528933 ] Apache Spark commented on SPARK-38988: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36367 > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I add a file and a notebook with the info msg I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39048: - Summary: Refactor `GroupBy._reduce_for_stat_function` on accepted data types (was: Refactor GroupBy._reduce_for_stat_function on accepted data types ) > Refactor `GroupBy._reduce_for_stat_function` on accepted data types > > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > `Groupby._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with pandas API, we may also have to introduce > `str_as_numeric` for `sum` for example. > Instead of introducing parameters designated for each Spark type, the PR is > proposed to introduce a parameter `accepted_spark_types` to specify accepted > types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39048) Refactor GroupBy._reduce_for_stat_function on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528928#comment-17528928 ] Apache Spark commented on SPARK-39048: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36382 > Refactor GroupBy._reduce_for_stat_function on accepted data types > -- > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > `Groupby._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with pandas API, we may also have to introduce > `str_as_numeric` for `sum` for example. > Instead of introducing parameters designated for each Spark type, the PR is > proposed to introduce a parameter `accepted_spark_types` to specify accepted > types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39048) Refactor GroupBy._reduce_for_stat_function on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39048: Assignee: Apache Spark > Refactor GroupBy._reduce_for_stat_function on accepted data types > -- > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > `Groupby._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with pandas API, we may also have to introduce > `str_as_numeric` for `sum` for example. > Instead of introducing parameters designated for each Spark type, the PR is > proposed to introduce a parameter `accepted_spark_types` to specify accepted > types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39048) Refactor GroupBy._reduce_for_stat_function on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39048: Assignee: (was: Apache Spark) > Refactor GroupBy._reduce_for_stat_function on accepted data types > -- > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > `Groupby._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with pandas API, we may also have to introduce > `str_as_numeric` for `sum` for example. > Instead of introducing parameters designated for each Spark type, the PR is > proposed to introduce a parameter `accepted_spark_types` to specify accepted > types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39048) Refactor GroupBy._reduce_for_stat_function on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528927#comment-17528927 ] Apache Spark commented on SPARK-39048: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36382 > Refactor GroupBy._reduce_for_stat_function on accepted data types > -- > > Key: SPARK-39048 > URL: https://issues.apache.org/jira/browse/SPARK-39048 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > `Groupby._reduce_for_stat_function` is a common helper function leveraged by > multiple statistical functions of GroupBy objects. > It defines parameters `only_numeric` and `bool_as_numeric` to control > accepted Spark types. > To be consistent with pandas API, we may also have to introduce > `str_as_numeric` for `sum` for example. > Instead of introducing parameters designated for each Spark type, the PR is > proposed to introduce a parameter `accepted_spark_types` to specify accepted > types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39048) Refactor GroupBy._reduce_for_stat_function on accepted data types
Xinrong Meng created SPARK-39048: Summary: Refactor GroupBy._reduce_for_stat_function on accepted data types Key: SPARK-39048 URL: https://issues.apache.org/jira/browse/SPARK-39048 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng `Groupby._reduce_for_stat_function` is a common helper function leveraged by multiple statistical functions of GroupBy objects. It defines parameters `only_numeric` and `bool_as_numeric` to control accepted Spark types. To be consistent with pandas API, we may also have to introduce `str_as_numeric` for `sum` for example. Instead of introducing parameters designated for each Spark type, the PR is proposed to introduce a parameter `accepted_spark_types` to specify accepted types of Spark columns to be aggregated. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
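The ticket names the consolidated parameter `accepted_spark_types`; as a rough sketch only (the surrounding signature and the example type tuples below are assumptions, not the actual pandas-on-Spark internals), the refactor could look roughly like this:

{code:python}
from typing import Optional, Tuple, Type
from pyspark.sql.types import BooleanType, DataType, NumericType, StringType

def _reduce_for_stat_function(
    sfun,  # the per-column Spark aggregate to apply (sum, mean, ...)
    accepted_spark_types: Optional[Tuple[Type[DataType], ...]] = None,
):
    """Apply `sfun` to every grouped column whose Spark data type is an instance of one of
    `accepted_spark_types`; columns of other types are skipped (None means accept all types)."""
    ...

# A single parameter then replaces only_numeric/bool_as_numeric and covers future cases:
SUM_TYPES = (NumericType, BooleanType, StringType)   # e.g. if sum gains str_as_numeric behaviour
MEAN_TYPES = (NumericType, BooleanType)
{code}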
[jira] [Assigned] (SPARK-37942) Use error classes in the compilation errors of properties
[ https://issues.apache.org/jira/browse/SPARK-37942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37942: Assignee: Apache Spark > Use error classes in the compilation errors of properties > - > > Key: SPARK-37942 > URL: https://issues.apache.org/jira/browse/SPARK-37942 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * cannotReadCorruptedTablePropertyError > * cannotCreateJDBCNamespaceWithPropertyError > * cannotSetJDBCNamespaceWithPropertyError > * cannotUnsetJDBCNamespaceWithPropertyError > * alterTableSerDePropertiesNotSupportedForV2TablesError > * unsetNonExistentPropertyError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37942) Use error classes in the compilation errors of properties
[ https://issues.apache.org/jira/browse/SPARK-37942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528896#comment-17528896 ] Apache Spark commented on SPARK-37942: -- User 'jerqi' has created a pull request for this issue: https://github.com/apache/spark/pull/36381 > Use error classes in the compilation errors of properties > - > > Key: SPARK-37942 > URL: https://issues.apache.org/jira/browse/SPARK-37942 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * cannotReadCorruptedTablePropertyError > * cannotCreateJDBCNamespaceWithPropertyError > * cannotSetJDBCNamespaceWithPropertyError > * cannotUnsetJDBCNamespaceWithPropertyError > * alterTableSerDePropertiesNotSupportedForV2TablesError > * unsetNonExistentPropertyError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37942) Use error classes in the compilation errors of properties
[ https://issues.apache.org/jira/browse/SPARK-37942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37942: Assignee: (was: Apache Spark) > Use error classes in the compilation errors of properties > - > > Key: SPARK-37942 > URL: https://issues.apache.org/jira/browse/SPARK-37942 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * cannotReadCorruptedTablePropertyError > * cannotCreateJDBCNamespaceWithPropertyError > * cannotSetJDBCNamespaceWithPropertyError > * cannotUnsetJDBCNamespaceWithPropertyError > * alterTableSerDePropertiesNotSupportedForV2TablesError > * unsetNonExistentPropertyError > onto use error classes. Throw an implementation of SparkThrowable. Also write > a test per every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
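As a hedged illustration of one error from the list above surfacing to end users (the triggering statement is my assumption, and the post-migration error class name is not specified in the ticket, so nothing is asserted about it here):

{code:python}
from pyspark.sql.utils import AnalysisException

spark.sql("CREATE TABLE t_props (id INT) USING parquet")
try:
    # unsetting a property that was never set should go through unsetNonExistentPropertyError
    spark.sql("ALTER TABLE t_props UNSET TBLPROPERTIES ('no_such_property')")
except AnalysisException as e:
    # after the migration, this exception is expected to carry an error class (SparkThrowable)
    print(e)
{code}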
[jira] [Commented] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
[ https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1752#comment-1752 ] Apache Spark commented on SPARK-39047: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36380 > Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE > > > Key: SPARK-39047 > URL: https://issues.apache.org/jira/browse/SPARK-39047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and > remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
[ https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39047: Assignee: Apache Spark > Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE > > > Key: SPARK-39047 > URL: https://issues.apache.org/jira/browse/SPARK-39047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and > remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
[ https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39047: Assignee: (was: Apache Spark) > Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE > > > Key: SPARK-39047 > URL: https://issues.apache.org/jira/browse/SPARK-39047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and > remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
[ https://issues.apache.org/jira/browse/SPARK-39047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528886#comment-17528886 ] Apache Spark commented on SPARK-39047: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36380 > Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE > > > Key: SPARK-39047 > URL: https://issues.apache.org/jira/browse/SPARK-39047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and > remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38997) DS V2 aggregate push-down supports group by expressions
[ https://issues.apache.org/jira/browse/SPARK-38997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38997. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36325 [https://github.com/apache/spark/pull/36325] > DS V2 aggregate push-down supports group by expressions > --- > > Key: SPARK-38997 > URL: https://issues.apache.org/jira/browse/SPARK-38997 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > Currently, Spark DS V2 aggregate push-down only supports group by column. > But the SQL show below is very useful and common. > SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" > ELSE 0.00 END AS key, SUM("SALARY") FROM "test"."employee" GROUP BY key -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
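A hedged sketch of issuing that kind of query from PySpark against a JDBC source with aggregate push-down enabled; the connection details and salary thresholds below are placeholders rather than values from the ticket:

{code:python}
employees = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/test")  # placeholder connection
    .option("dbtable", "test.employee")
    .option("pushDownAggregate", "true")                    # opt in to DS V2 aggregate push-down
    .load())
employees.createOrReplaceTempView("employee")

spark.sql("""
    SELECT CASE WHEN salary > 8000.00 AND salary < 10000.00 THEN salary ELSE 0.00 END AS key,
           SUM(salary)
    FROM employee
    GROUP BY key
""").show()
{code}

With this change, the aggregate including the CASE WHEN grouping expression can be evaluated by the remote database instead of being pulled back into Spark, provided the connector's dialect supports the expression.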
[jira] [Assigned] (SPARK-38997) DS V2 aggregate push-down supports group by expressions
[ https://issues.apache.org/jira/browse/SPARK-38997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38997: --- Assignee: jiaan.geng > DS V2 aggregate push-down supports group by expressions > --- > > Key: SPARK-38997 > URL: https://issues.apache.org/jira/browse/SPARK-38997 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Currently, Spark DS V2 aggregate push-down only supports group by column. > But the SQL show below is very useful and common. > SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 1.00) THEN "SALARY" > ELSE 0.00 END AS key, SUM("SALARY") FROM "test"."employee" GROUP BY key -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39047) Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE
Max Gekk created SPARK-39047: Summary: Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE Key: SPARK-39047 URL: https://issues.apache.org/jira/browse/SPARK-39047 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Use the INVALID_PARAMETER_VALUE error class instead of ILLEGAL_SUBSTRING and remove the last one because it duplicates INVALID_PARAMETER_VALUE. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
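Purely as a hedged illustration of the kind of call site involved (the ticket does not list the affected functions; picking split_part with an empty delimiter here is my assumption): after this change such a parameter-validation failure is expected to report INVALID_PARAMETER_VALUE instead of ILLEGAL_SUBSTRING.

{code:python}
try:
    # empty delimiter: an invalid parameter value (assumed trigger, see note above)
    spark.sql("SELECT split_part('a,b,c', '', 1)").collect()
except Exception as e:
    print(e)
{code}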
[jira] [Comment Edited] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528877#comment-17528877 ] Vinod KC edited comment on SPARK-25177 at 4/27/22 4:29 PM: --- In case, if anyone looking for a workaround to convert 0 in scientific notation to plaintext, this code snippet may help. {code:java} import org.apache.spark.sql.types.Decimal val handleBigDecZeroUDF = udf((decimalVal:Decimal) => { if (decimalVal.scale > 6) { decimalVal.toBigDecimal.bigDecimal.toPlainString() } else { decimalVal.toString() } }) spark.sql("create table testBigDec (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into testBigDec values(0, 0,0)") spark.sql("insert into testBigDec values(1, 1, 1)") val df = spark.table("testBigDec") df.show(false) // this will show scientific notation // use custom UDF `handleBigDecZeroUDF` to convert zero into plainText notation df.select(handleBigDecZeroUDF(col("a")).as("a"),col("b"),handleBigDecZeroUDF(col("c")).as("c")).show(false) // Result of df.show(false) +-++--+ |a |b |c | +-++--+ |0E-7 |0.00|0E-8 | |1.000|1.00|1.| +-++--+ // Result using handleBigDecZeroUDF +-++--+ |a |b |c | +-++--+ |0.000|0.00|0.| |1.000|1.00|1.| +-++--+ {code} was (Author: vinodkc): In case, if anyone looking for a workaround to convert 0 in scientific notation to plaintext, this code snippet may help. {code:java} import org.apache.spark.sql.types.Decimal val handleBigDecZeroUDF = udf((decimalVal:Decimal) => { if (decimalVal.scale > 6) { decimalVal.toBigDecimal.bigDecimal.toPlainString() } else { decimalVal.toString() } }) spark.sql("create table testBigDec (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into testBigDec values(0, 0,0)") spark.sql("insert into testBigDec values(1, 1, 1)") val df = spark.table("testBigDec") df.show(false) // this will show scientific notation // use custom UDF `handleBigDecZeroUDF` to convert zero into plainText notation df.select(handleBigDecZeroUDF(col("a")).as("a"),md5(handleBigDecZeroUDF(col("a"))).as("a-md5"),col("b"),handleBigDecZeroUDF(col("c")).as("c")).show(false) {code} > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > Labels: bulk-closed > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | b | c > ---++--- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. 
> We can make Spark SQL results consistent with other databases like PostgreSQL. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528877#comment-17528877 ] Vinod KC commented on SPARK-25177: -- In case, if anyone looking for a workaround to convert 0 in scientific notation to plaintext, this code snippet may help. {code:java} import org.apache.spark.sql.types.Decimal val handleBigDecZeroUDF = udf((decimalVal:Decimal) => { if (decimalVal.scale > 6) { decimalVal.toBigDecimal.bigDecimal.toPlainString() } else { decimalVal.toString() } }) spark.sql("create table testBigDec (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into testBigDec values(0, 0,0)") spark.sql("insert into testBigDec values(1, 1, 1)") val df = spark.table("testBigDec") df.show(false) // this will show scientific notation // use custom UDF `handleBigDecZeroUDF` to convert zero into plainText notation df.select(handleBigDecZeroUDF(col("a")).as("a"),md5(handleBigDecZeroUDF(col("a"))).as("a-md5"),col("b"),handleBigDecZeroUDF(col("c")).as("c")).show(false) {code} > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > Labels: bulk-closed > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | b | c > ---++--- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. > We can make spark SQL result consistent with other Databases like Postgresql > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
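For PySpark users, a hedged equivalent of the Scala workaround above (the function name is mine, and it assumes DecimalType values reach the UDF as Python decimal.Decimal, which is the standard conversion):

{code:python}
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def handle_big_dec_zero(dec):
    if dec is None:
        return None
    # the Decimal's scale is the negated exponent; render plainly when it exceeds 6
    if -dec.as_tuple().exponent > 6:
        return format(dec, "f")  # plain, non-scientific notation
    return str(dec)

df = spark.table("testBigDec")
df.select(handle_big_dec_zero("a").alias("a"), "b",
          handle_big_dec_zero("c").alias("c")).show(truncate=False)
{code}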
[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528863#comment-17528863 ] Lukas Grasmann commented on SPARK-39022: This is my first contribution. How do I assign this issue to myself? > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? 
[sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpected/undesired results (see above). >
[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528862#comment-17528862 ] Xiangrui Meng commented on SPARK-38648: --- I think it is beneficial to both Spark and DL frameworks if Spark has state-of-the-art DL capabilities. We did some work in the past to make Spark work better with DL frameworks, e.g., iterator Scalar Pandas UDF, barrier mode, and GPU scheduling. But most of them are low level APIs for developers, not end users. Our Spark user guide contains little about DL and AI. The dependency on DL frameworks might create issues. One idea is to develop in the Spark repo and Spark namespace but publish to PyPI independently. For example, in order to use DL features, users need to explicitly install `pyspark-dl` and then use the features under `pyspark.dl` namespace. Putting development inside Spark and publishing under the spark namespace would help drive both development and adoption. > SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. > Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications. You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1. Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
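For readers unfamiliar with the status quo the SPIP wants to simplify, a minimal hedged sketch of today's pattern using the iterator-based scalar pandas UDF mentioned above; the model is a trivial stand-in for a real TensorFlow/PyTorch/Huggingface model:

{code:python}
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def predict(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # a real model would be loaded here, once per task, then reused for every Arrow batch
    def model(s: pd.Series) -> pd.Series:  # stand-in for model.predict(...)
        return s * 2.0
    for batch in batches:
        yield model(batch)

df = spark.range(10).selectExpr("CAST(id AS DOUBLE) AS feature")
df.withColumn("prediction", predict("feature")).show()
{code}

The SPIP's point is that users currently hand-roll this wrapper (plus model loading, batching, and GPU placement) for every framework, whereas the proposed integration layer would derive it from a saved model.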
[jira] [Assigned] (SPARK-38914) Allow user to insert specified columns into insertable view
[ https://issues.apache.org/jira/browse/SPARK-38914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-38914: -- Assignee: morvenhuang > Allow user to insert specified columns into insertable view > --- > > Key: SPARK-38914 > URL: https://issues.apache.org/jira/browse/SPARK-38914 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: morvenhuang >Assignee: morvenhuang >Priority: Minor > > The option `spark.sql.defaultColumn.useNullsForMissingDefautValues` allows us > to insert specified columns into table (SPARK-38795), but currently this > option does not work for insertable view, > Below INSERT INTO will result in AnalysisException even when the > useNullsForMissingDefautValues option is true, > {code:java} > spark.sql("CREATE TEMPORARY VIEW v1 (c1 int, c2 string) USING > org.apache.spark.sql.json.DefaultSource OPTIONS ( path 'json_dir')"); > spark.sql("INSERT INTO v1(c1) VALUES(100)"); > org.apache.spark.sql.AnalysisException: unknown requires that the data to be > inserted have the same number of columns as the target table: target table > has 2 column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > {code} > > I can provide a fix for this issue. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38914) Allow user to insert specified columns into insertable view
[ https://issues.apache.org/jira/browse/SPARK-38914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-38914. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36212 [https://github.com/apache/spark/pull/36212] > Allow user to insert specified columns into insertable view > --- > > Key: SPARK-38914 > URL: https://issues.apache.org/jira/browse/SPARK-38914 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: morvenhuang >Assignee: morvenhuang >Priority: Minor > Fix For: 3.4.0 > > > The option `spark.sql.defaultColumn.useNullsForMissingDefautValues` allows us > to insert specified columns into table (SPARK-38795), but currently this > option does not work for insertable view, > Below INSERT INTO will result in AnalysisException even when the > useNullsForMissingDefautValues option is true, > {code:java} > spark.sql("CREATE TEMPORARY VIEW v1 (c1 int, c2 string) USING > org.apache.spark.sql.json.DefaultSource OPTIONS ( path 'json_dir')"); > spark.sql("INSERT INTO v1(c1) VALUES(100)"); > org.apache.spark.sql.AnalysisException: unknown requires that the data to be > inserted have the same number of columns as the target table: target table > has 2 column(s) but the inserted data has 1 column(s), including 0 partition > column(s) having constant value(s). > {code} > > I can provide a fix for this issue. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39046) Return an empty context string if TreeNode.origin is wrongly set
[ https://issues.apache.org/jira/browse/SPARK-39046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39046: Assignee: Gengliang Wang (was: Apache Spark) > Return an empty context string if TreeNode.origin is wrongly set > > > Key: SPARK-39046 > URL: https://issues.apache.org/jira/browse/SPARK-39046 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39046) Return an empty context string if TreeNode.origin is wrongly set
[ https://issues.apache.org/jira/browse/SPARK-39046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528859#comment-17528859 ] Apache Spark commented on SPARK-39046: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/36379 > Return an empty context string if TreeNode.origin is wrongly set > > > Key: SPARK-39046 > URL: https://issues.apache.org/jira/browse/SPARK-39046 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39046) Return an empty context string if TreeNode.origin is wrongly set
[ https://issues.apache.org/jira/browse/SPARK-39046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39046: Assignee: Apache Spark (was: Gengliang Wang) > Return an empty context string if TreeNode.origin is wrongly set > > > Key: SPARK-39046 > URL: https://issues.apache.org/jira/browse/SPARK-39046 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528856#comment-17528856 ] Apache Spark commented on SPARK-39022: -- User 'Lukas-Grasmann' has created a pull request for this issue: https://github.com/apache/spark/pull/36378 > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? 
[sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpecte
[jira] [Assigned] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39022: Assignee: (was: Apache Spark) > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. 
Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpected/undesired results (see above). > * Moving the projection might reduce performance for {{Sort}} since the > input is p
[jira] [Assigned] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39022: Assignee: Apache Spark > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Assignee: Apache Spark >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? 
[sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpected/undesired results (see above). > * Moving the projection might reduce performance for {{Sort}
[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528855#comment-17528855 ] Apache Spark commented on SPARK-39022: -- User 'Lukas-Grasmann' has created a pull request for this issue: https://github.com/apache/spark/pull/36378 > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? 
[sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpecte
[jira] [Updated] (SPARK-39046) Return an empty context string if TreeNode.origin is wrongly set
[ https://issues.apache.org/jira/browse/SPARK-39046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39046: --- Parent: SPARK-38615 Issue Type: Sub-task (was: Improvement) > Return an empty context string if TreeNode.origin is wrongly set > > > Key: SPARK-39046 > URL: https://issues.apache.org/jira/browse/SPARK-39046 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39046) Return an empty context string if TreeNode.origin is wrongly set
Gengliang Wang created SPARK-39046: -- Summary: Return an empty context string if TreeNode.origin is wrongly set Key: SPARK-39046 URL: https://issues.apache.org/jira/browse/SPARK-39046 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39045) INTERNAL_ERROR for "all" internal errors
Serge Rielau created SPARK-39045: Summary: INTERNAL_ERROR for "all" internal errors Key: SPARK-39045 URL: https://issues.apache.org/jira/browse/SPARK-39045 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Serge Rielau We should be able to inject the [SYSTEM_ERROR] class for most cases without waiting to label the long tail of user-facing error classes. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38979) Improve error log readability in OrcUtils.requestedColumnIds
[ https://issues.apache.org/jira/browse/SPARK-38979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38979: Assignee: dzcxzl > Improve error log readability in OrcUtils.requestedColumnIds > > > Key: SPARK-38979 > URL: https://issues.apache.org/jira/browse/SPARK-38979 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > > In OrcUtils#requestedColumnIds sometimes it fails because > orcFieldNames.length > dataSchema.length, the log is not very clear. > {code:java} > java.lang.AssertionError: assertion failed: The given data schema > struct has less fields than the actual ORC physical schema, no > idea which columns were dropped, fail to read. {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38979) Improve error log readability in OrcUtils.requestedColumnIds
[ https://issues.apache.org/jira/browse/SPARK-38979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38979. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36296 [https://github.com/apache/spark/pull/36296] > Improve error log readability in OrcUtils.requestedColumnIds > > > Key: SPARK-38979 > URL: https://issues.apache.org/jira/browse/SPARK-38979 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > In OrcUtils#requestedColumnIds sometimes it fails because > orcFieldNames.length > dataSchema.length, the log is not very clear. > {code:java} > java.lang.AssertionError: assertion failed: The given data schema > struct has less fields than the actual ORC physical schema, no > idea which columns were dropped, fail to read. {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39015: - Fix Version/s: 3.3.0 > SparkRuntimeException when trying to get non-existent key in a map > -- > > Key: SPARK-39015 > URL: https://issues.apache.org/jira/browse/SPARK-39015 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Raza Jafri >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > [~maxgekk] submitted a > [commit|https://github.com/apache/spark/commit/bc8c264851457d8ef59f5b332c79296651ec5d1e] > that tries to convert the key to SQL but that part of the code is blowing > up. > {code:java} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.types.StringType > import org.apache.spark.sql.types.DataTypes > val arrayStructureData = Seq( > Row(Map("hair"->"black", "eye"->"brown")), > Row(Map("hair"->"blond", "eye"->"blue")), > Row(Map())) > val mapType = DataTypes.createMapType(StringType,StringType) > val arrayStructureSchema = new StructType() > .add("properties", mapType) > val mapTypeDF = spark.createDataFrame( > spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema) > mapTypeDF.selectExpr("element_at(properties, 'hair')").show > // Exiting paste mode, now interpreting. > ++ > |element_at(properties, hair)| > ++ > | black| > | blond| > |null| > ++ > scala> spark.conf.set("spark.sql.ansi.enabled", true) > scala> mapTypeDF.selectExpr("element_at(properties, 'hair')").show > 22/04/25 18:26:01 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) > org.apache.spark.SparkRuntimeException: The feature is not supported: literal > for 'hair' of class org.apache.spark.unsafe.types.UTF8String. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:240) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:101) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:44) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue$(QueryErrorsBase.scala:43) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryExecutionErrors$.toSQLValue(QueryExecutionErrors.scala:69) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > {code} > Seems like it's trying to convert UTF8String to a sql literal -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528751#comment-17528751 ] Willi Raschkowski commented on SPARK-39044: --- This worked on Spark 3.0. [~beliefer], given we're hitting this in {{withBufferSerialized}}, I think this might be related to SPARK-37203. > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at
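For readers unfamiliar with the accumulator mentioned above: AggregatingAccumulator is what backs Dataset.observe. A minimal sketch of that pattern is shown below with a built-in aggregate for brevity; the report instead plugs in a custom TypedImperativeAggregate, and the output path here is a placeholder:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("observe-sketch").getOrCreate()

// observe() registers an AggregatingAccumulator; its buffer is serialized with every
// task result, which is the point where the NullPointerException above is raised.
val observed = spark.range(100).observe("metrics", count(lit(1)).as("rows"))
observed.write.mode("overwrite").parquet("/tmp/observe-sketch")  // placeholder output path
{code}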
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Description: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) ... Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at 
scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 11 more {code} was: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(Ta
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Description: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ... 1 more Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 
11 more {code} was: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get this error below: {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.ru
[jira] [Created] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
Willi Raschkowski created SPARK-39044: - Summary: AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException Key: SPARK-39044 URL: https://issues.apache.org/jira/browse/SPARK-39044 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Willi Raschkowski We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get this error below: {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ... 1 more Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 
11 more {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39043) Hive client should not gather statistic by default.
[ https://issues.apache.org/jira/browse/SPARK-39043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39043: Assignee: (was: Apache Spark) > Hive client should not gather statistic by default. > --- > > Key: SPARK-39043 > URL: https://issues.apache.org/jira/browse/SPARK-39043 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: angerszhu >Priority: Major > > When use `InsertIntoHiveTable`, when insert overwrite partition, it will call > Hive.loadPartition(), in this method, when `hive.stats.autogather` is > true(default is true) > > {code:java} > // Some comments here > public String getFoo() > if (oldPart == null) { > newTPart.getTPartition().setParameters(new HashMap()); > if (this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) > { > > StatsSetupConst.setBasicStatsStateForCreateTable(newTPart.getParameters(), > StatsSetupConst.TRUE); > } > public static void setBasicStatsStateForCreateTable(Map > params, String setting) { > if (TRUE.equals(setting)) { > for (String stat : StatsSetupConst.supportedStats) { > params.put(stat, "0"); > } > } > setBasicStatsState(params, setting); > } > public static final String[] supportedStats = > {NUM_FILES,ROW_COUNT,TOTAL_SIZE,RAW_DATA_SIZE}; > {code} > Then it set default rowNum as 0, but since spark will update numFiles and > rawSize, so rowNum remain 0. > This impact other system like presto's CBO. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39043) Hive client should not gather statistic by default.
[ https://issues.apache.org/jira/browse/SPARK-39043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39043: Assignee: Apache Spark > Hive client should not gather statistic by default. > --- > > Key: SPARK-39043 > URL: https://issues.apache.org/jira/browse/SPARK-39043 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > When use `InsertIntoHiveTable`, when insert overwrite partition, it will call > Hive.loadPartition(), in this method, when `hive.stats.autogather` is > true(default is true) > > {code:java} > // Some comments here > public String getFoo() > if (oldPart == null) { > newTPart.getTPartition().setParameters(new HashMap()); > if (this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) > { > > StatsSetupConst.setBasicStatsStateForCreateTable(newTPart.getParameters(), > StatsSetupConst.TRUE); > } > public static void setBasicStatsStateForCreateTable(Map > params, String setting) { > if (TRUE.equals(setting)) { > for (String stat : StatsSetupConst.supportedStats) { > params.put(stat, "0"); > } > } > setBasicStatsState(params, setting); > } > public static final String[] supportedStats = > {NUM_FILES,ROW_COUNT,TOTAL_SIZE,RAW_DATA_SIZE}; > {code} > Then it set default rowNum as 0, but since spark will update numFiles and > rawSize, so rowNum remain 0. > This impact other system like presto's CBO. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39043) Hive client should not gather statistic by default.
[ https://issues.apache.org/jira/browse/SPARK-39043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528734#comment-17528734 ] Apache Spark commented on SPARK-39043: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/36377 > Hive client should not gather statistic by default. > --- > > Key: SPARK-39043 > URL: https://issues.apache.org/jira/browse/SPARK-39043 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: angerszhu >Priority: Major > > When use `InsertIntoHiveTable`, when insert overwrite partition, it will call > Hive.loadPartition(), in this method, when `hive.stats.autogather` is > true(default is true) > > {code:java} > // Some comments here > public String getFoo() > if (oldPart == null) { > newTPart.getTPartition().setParameters(new HashMap()); > if (this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) > { > > StatsSetupConst.setBasicStatsStateForCreateTable(newTPart.getParameters(), > StatsSetupConst.TRUE); > } > public static void setBasicStatsStateForCreateTable(Map > params, String setting) { > if (TRUE.equals(setting)) { > for (String stat : StatsSetupConst.supportedStats) { > params.put(stat, "0"); > } > } > setBasicStatsState(params, setting); > } > public static final String[] supportedStats = > {NUM_FILES,ROW_COUNT,TOTAL_SIZE,RAW_DATA_SIZE}; > {code} > Then it set default rowNum as 0, but since spark will update numFiles and > rawSize, so rowNum remain 0. > This impact other system like presto's CBO. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39040) Respect NaNvl in EquivalentExpressions for expression elimination
[ https://issues.apache.org/jira/browse/SPARK-39040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528725#comment-17528725 ] Apache Spark commented on SPARK-39040: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/36376 > Respect NaNvl in EquivalentExpressions for expression elimination > - > > Key: SPARK-39040 > URL: https://issues.apache.org/jira/browse/SPARK-39040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > For example the query will fail: > {code:java} > set spark.sql.ansi.enabled=true; > set > spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding; > SELECT nanvl(1, 1/0 + 1/0); {code} > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 > (TID 4) (10.221.98.68 executor driver): > org.apache.spark.SparkArithmeticException: divide by zero. To return NULL > instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false > (except for ANSI interval type) to bypass this error. > == SQL(line 1, position 17) == > select nanvl(1 , 1/0 + 1/0) > ^^^ at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151) > {code} > We should respect the ordering of conditional expression that always evaluate > the predicate branch first, so the query above should not fail. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
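To make the intended behaviour concrete: common-subexpression elimination normally recurses only into the children of a conditional expression that are always evaluated (for example the predicate of If). The free-standing sketch below extends that idea to NaNvl; the helper name is assumed for illustration and this is not the actual patch:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression, If, NaNvl}

// Illustrative sketch, not the merged change: only unconditionally-evaluated children
// are considered for elimination, so the lazily evaluated right branch of NaNvl
// (e.g. 1/0 + 1/0 in the query above) is never hoisted out and pre-evaluated.
def alwaysEvaluatedChildren(expr: Expression): Seq[Expression] = expr match {
  case i: If       => i.predicate :: Nil        // the condition is always evaluated
  case c: CaseWhen => c.children.head :: Nil    // only the first condition is unconditional
  case n: NaNvl    => n.left :: Nil             // right is evaluated only when left is NaN
  case other       => other.children
}
{code}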
[jira] [Assigned] (SPARK-39040) Respect NaNvl in EquivalentExpressions for expression elimination
[ https://issues.apache.org/jira/browse/SPARK-39040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39040: Assignee: (was: Apache Spark) > Respect NaNvl in EquivalentExpressions for expression elimination > - > > Key: SPARK-39040 > URL: https://issues.apache.org/jira/browse/SPARK-39040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > For example the query will fail: > {code:java} > set spark.sql.ansi.enabled=true; > set > spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding; > SELECT nanvl(1, 1/0 + 1/0); {code} > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 > (TID 4) (10.221.98.68 executor driver): > org.apache.spark.SparkArithmeticException: divide by zero. To return NULL > instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false > (except for ANSI interval type) to bypass this error. > == SQL(line 1, position 17) == > select nanvl(1 , 1/0 + 1/0) > ^^^ at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151) > {code} > We should respect the ordering of conditional expression that always evaluate > the predicate branch first, so the query above should not fail. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39040) Respect NaNvl in EquivalentExpressions for expression elimination
[ https://issues.apache.org/jira/browse/SPARK-39040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528723#comment-17528723 ] Apache Spark commented on SPARK-39040: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/36376 > Respect NaNvl in EquivalentExpressions for expression elimination > - > > Key: SPARK-39040 > URL: https://issues.apache.org/jira/browse/SPARK-39040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > For example the query will fail: > {code:java} > set spark.sql.ansi.enabled=true; > set > spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding; > SELECT nanvl(1, 1/0 + 1/0); {code} > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 > (TID 4) (10.221.98.68 executor driver): > org.apache.spark.SparkArithmeticException: divide by zero. To return NULL > instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false > (except for ANSI interval type) to bypass this error. > == SQL(line 1, position 17) == > select nanvl(1 , 1/0 + 1/0) > ^^^ at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151) > {code} > We should respect the ordering of conditional expression that always evaluate > the predicate branch first, so the query above should not fail. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39040) Respect NaNvl in EquivalentExpressions for expression elimination
[ https://issues.apache.org/jira/browse/SPARK-39040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39040: Assignee: Apache Spark > Respect NaNvl in EquivalentExpressions for expression elimination > - > > Key: SPARK-39040 > URL: https://issues.apache.org/jira/browse/SPARK-39040 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > For example the query will fail: > {code:java} > set spark.sql.ansi.enabled=true; > set > spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding; > SELECT nanvl(1, 1/0 + 1/0); {code} > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 > (TID 4) (10.221.98.68 executor driver): > org.apache.spark.SparkArithmeticException: divide by zero. To return NULL > instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false > (except for ANSI interval type) to bypass this error. > == SQL(line 1, position 17) == > select nanvl(1 , 1/0 + 1/0) > ^^^ at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151) > {code} > We should respect the ordering of conditional expression that always evaluate > the predicate branch first, so the query above should not fail. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39043) Hive client should not gather statistic by default.
angerszhu created SPARK-39043: - Summary: Hive client should not gather statistic by default. Key: SPARK-39043 URL: https://issues.apache.org/jira/browse/SPARK-39043 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.3.0 Reporter: angerszhu When `InsertIntoHiveTable` performs an insert overwrite of a partition, it calls Hive.loadPartition(); in this method, when `hive.stats.autogather` is true (the default), the following Hive code runs: {code:java} if (oldPart == null) { newTPart.getTPartition().setParameters(new HashMap()); if (this.getConf().getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) { StatsSetupConst.setBasicStatsStateForCreateTable(newTPart.getParameters(), StatsSetupConst.TRUE); } public static void setBasicStatsStateForCreateTable(Map<String, String> params, String setting) { if (TRUE.equals(setting)) { for (String stat : StatsSetupConst.supportedStats) { params.put(stat, "0"); } } setBasicStatsState(params, setting); } public static final String[] supportedStats = {NUM_FILES,ROW_COUNT,TOTAL_SIZE,RAW_DATA_SIZE}; {code} Hive thus seeds the row count statistic with 0; Spark later updates only numFiles and rawSize, so the row count stays 0. This misleads other systems, such as Presto's CBO. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
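If the stale row count is causing problems before a fix lands, one possible mitigation is to disable autogather explicitly. This is an assumption, not part of the report: it relies on the embedded Hive client honouring Hadoop configuration passed through spark.hadoop.*, so verify it against your deployment:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged mitigation sketch: pass hive.stats.autogather=false to the Hive client so
// loadPartition() does not seed ROW_COUNT with 0 on insert overwrite.
val spark = SparkSession.builder()
  .appName("insert-overwrite-without-autogather")
  .config("spark.hadoop.hive.stats.autogather", "false")
  .enableHiveSupport()
  .getOrCreate()
{code}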
[jira] [Commented] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528717#comment-17528717 ] Apache Spark commented on SPARK-39015: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36375 > SparkRuntimeException when trying to get non-existent key in a map > -- > > Key: SPARK-39015 > URL: https://issues.apache.org/jira/browse/SPARK-39015 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Raza Jafri >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > [~maxgekk] submitted a > [commit|https://github.com/apache/spark/commit/bc8c264851457d8ef59f5b332c79296651ec5d1e] > that tries to convert the key to SQL but that part of the code is blowing > up. > {code:java} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.types.StringType > import org.apache.spark.sql.types.DataTypes > val arrayStructureData = Seq( > Row(Map("hair"->"black", "eye"->"brown")), > Row(Map("hair"->"blond", "eye"->"blue")), > Row(Map())) > val mapType = DataTypes.createMapType(StringType,StringType) > val arrayStructureSchema = new StructType() > .add("properties", mapType) > val mapTypeDF = spark.createDataFrame( > spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema) > mapTypeDF.selectExpr("element_at(properties, 'hair')").show > // Exiting paste mode, now interpreting. > ++ > |element_at(properties, hair)| > ++ > | black| > | blond| > |null| > ++ > scala> spark.conf.set("spark.sql.ansi.enabled", true) > scala> mapTypeDF.selectExpr("element_at(properties, 'hair')").show > 22/04/25 18:26:01 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) > org.apache.spark.SparkRuntimeException: The feature is not supported: literal > for 'hair' of class org.apache.spark.unsafe.types.UTF8String. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:240) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:101) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:44) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue$(QueryErrorsBase.scala:43) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryExecutionErrors$.toSQLValue(QueryExecutionErrors.scala:69) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > {code} > Seems like it's trying to convert UTF8String to a sql literal -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528716#comment-17528716 ] Apache Spark commented on SPARK-39015: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36375 > SparkRuntimeException when trying to get non-existent key in a map > -- > > Key: SPARK-39015 > URL: https://issues.apache.org/jira/browse/SPARK-39015 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Raza Jafri >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > [~maxgekk] submitted a > [commit|https://github.com/apache/spark/commit/bc8c264851457d8ef59f5b332c79296651ec5d1e] > that tries to convert the key to SQL but that part of the code is blowing > up. > {code:java} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.types.StringType > import org.apache.spark.sql.types.DataTypes > val arrayStructureData = Seq( > Row(Map("hair"->"black", "eye"->"brown")), > Row(Map("hair"->"blond", "eye"->"blue")), > Row(Map())) > val mapType = DataTypes.createMapType(StringType,StringType) > val arrayStructureSchema = new StructType() > .add("properties", mapType) > val mapTypeDF = spark.createDataFrame( > spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema) > mapTypeDF.selectExpr("element_at(properties, 'hair')").show > // Exiting paste mode, now interpreting. > ++ > |element_at(properties, hair)| > ++ > | black| > | blond| > |null| > ++ > scala> spark.conf.set("spark.sql.ansi.enabled", true) > scala> mapTypeDF.selectExpr("element_at(properties, 'hair')").show > 22/04/25 18:26:01 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) > org.apache.spark.SparkRuntimeException: The feature is not supported: literal > for 'hair' of class org.apache.spark.unsafe.types.UTF8String. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:240) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:101) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:44) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue$(QueryErrorsBase.scala:43) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.errors.QueryExecutionErrors$.toSQLValue(QueryExecutionErrors.scala:69) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > {code} > Seems like it's trying to convert UTF8String to a sql literal -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39006) Show a directional error message for PVC Dynamic Allocation Failure
[ https://issues.apache.org/jira/browse/SPARK-39006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528683#comment-17528683 ] Apache Spark commented on SPARK-39006: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/36374 > Show a directional error message for PVC Dynamic Allocation Failure > --- > > Key: SPARK-39006 > URL: https://issues.apache.org/jira/browse/SPARK-39006 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: qian >Priority: Major > > When spark application requires multiple executors and not set pvc claimName > with onDemand or SPARK_EXECUTOR_ID, it always create executor pods. Because > pvc has be created by first executor pod. > {noformat} > 22/04/22 08:55:47 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: > https://kubernetes.default.svc/api/v1/namespaces/default/persistentvolumeclaims. > Message: persistentvolumeclaims "test-1" already exists. Received status: > Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, > kind=persistentvolumeclaims, name=test-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=persistentvolumeclaims > "test-1" already exists, metadata=ListMeta(_continue=null, > remainingItemCount=null, resourceVersion=null, selfLink=null, > additionalProperties={}), reason=AlreadyExists, status=Failure, > additionalProperties={}). > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:697) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:676) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:629) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:566) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:527) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:315) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:651) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:91) > ~[kubernetes-client-5.10.1.jar:?] > at > io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) > ~[kubernetes-client-5.10.1.jar:?] > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$3(ExecutorPodsAllocator.scala:415) > ~[spark-kubernetes_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at scala.collection.immutable.List.foreach(List.scala:431) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:408) > ~[spark-kubernetes_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > ~[scala-library-2.12.15.jar:?] 
> at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:385) > ~[spark-kubernetes_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$35(ExecutorPodsAllocator.scala:349) > ~[spark-kubernetes_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$35$adapted(ExecutorPodsAllocator.scala:342) > ~[spark-kubernetes_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > ~[scala-library-2.12.15.jar:?] > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > ~[scala-library-2.12.15.jar:?] > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:342) > ~[spark-kubernetes_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.scheduler.cluster.k8s.Exec
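For reference, the dynamically allocated PVC path referred to above is driven by the executor volume options. A configuration sketch follows; the option names are as documented for Spark on Kubernetes, while the volume name "data", storage class, size and mount path are placeholders, and the usual Kubernetes master/image settings are omitted:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of per-executor on-demand PVCs: with claimName=OnDemand each executor gets its
// own claim, instead of every pod trying to create the same statically named PVC.
val spark = SparkSession.builder()
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "OnDemand")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass", "standard")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit", "10Gi")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/data")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly", "false")
  .getOrCreate()
{code}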
[jira] [Commented] (SPARK-39041) Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly
[ https://issues.apache.org/jira/browse/SPARK-39041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528678#comment-17528678 ] Apache Spark commented on SPARK-39041: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/36373 > Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly > - > > Key: SPARK-39041 > URL: https://issues.apache.org/jira/browse/SPARK-39041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39041) Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly
[ https://issues.apache.org/jira/browse/SPARK-39041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528679#comment-17528679 ] Apache Spark commented on SPARK-39041: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/36373 > Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly > - > > Key: SPARK-39041 > URL: https://issues.apache.org/jira/browse/SPARK-39041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39041) Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly
[ https://issues.apache.org/jira/browse/SPARK-39041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39041: Assignee: (was: Apache Spark) > Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly > - > > Key: SPARK-39041 > URL: https://issues.apache.org/jira/browse/SPARK-39041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39041) Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly
[ https://issues.apache.org/jira/browse/SPARK-39041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39041: Assignee: Apache Spark > Mapping Spark Query ResultSet/Schema to TRowSet/TTableSchema directly > - > > Key: SPARK-39041 > URL: https://issues.apache.org/jira/browse/SPARK-39041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org