[jira] [Commented] (SPARK-41266) Spark does not parse timestamp strings when using the IN operator

2022-12-05 Thread huldar chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643706#comment-17643706
 ] 

huldar chen commented on SPARK-41266:
-

You can try enabling ANSI compliance:
{code:java}
spark.sql.ansi.enabled=true {code}
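
A minimal sketch of how that could be applied to the report's example (assuming 
the {{singleCol}} Dataset from the description below; behavior not verified here):
{code:scala}
// With ANSI mode enabled, the string literal in the IN list should be coerced
// to a timestamp instead of the timestamp column being cast to string.
spark.conf.set("spark.sql.ansi.enabled", "true")
singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show()
{code}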

> Spark does not parse timestamp strings when using the IN operator
> -
>
> Key: SPARK-41266
> URL: https://issues.apache.org/jira/browse/SPARK-41266
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: Windows 10, Spark 3.2.1 with Java 11
>Reporter: Laurens Versluis
>Priority: Major
>
> Likely affects more versions, tested only with 3.2.1.
>  
> Summary:
> Spark will convert a timestamp string to a timestamp when using the equal 
> operator (=), yet won't do this when using the IN operator.
>  
> Details:
> While debugging a query that returned no results, we found that when the WHERE 
> clause uses the equal symbol `=` against a TimeStampType column, Spark 
> converts the string to a timestamp and filters on it.
> However, when using the IN operator (as our query does), Spark does not do 
> this and instead casts the column to string. We expected the behavior to be 
> similar, or at least that Spark would recognize that the IN clause operates on 
> a TimeStampType column and attempt to convert the values to timestamps before 
> falling back to string comparison.
>  
> *Minimal reproducible example:*
> Suppose we have a one-line dataset with the following contents and schema:
>  
> {noformat}
> +----------------------------+
> |starttime                   |
> +----------------------------+
> |2019-08-11 19:33:05         |
> +----------------------------+
> root
>  |-- starttime: timestamp (nullable = true){noformat}
> If we then run the following queries, the IN-clause query that uses a 
> timestamp string with timezone information returns no results:
>  
>  
> {code:java}
> // Works - Spark casts the argument to a string and the internal
> // representation of the time seems to match it...
> singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
> // Works
> singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
> // Works
> singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
> // Doesn't work
> singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
> // Works
> singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show();
> {code}
>  
> We can see from the output that a cast to string is taking place:
> {noformat}
> [...] isnotnull(starttime#59),(cast(starttime#59 as string) = 2019-08-11 19:33:05){noformat}
> Since the = operator does perform this conversion, it would be consistent for 
> the IN operator to behave in the same way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41001) Connection string support for Python client

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643696#comment-17643696
 ] 

Apache Spark commented on SPARK-41001:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38931

> Connection string support for Python client
> ---
>
> Key: SPARK-41001
> URL: https://issues.apache.org/jira/browse/SPARK-41001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41001) Connection string support for Python client

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643695#comment-17643695
 ] 

Apache Spark commented on SPARK-41001:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38931

> Connection string support for Python client
> ---
>
> Key: SPARK-41001
> URL: https://issues.apache.org/jira/browse/SPARK-41001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40801) Upgrade Apache Commons Text to 1.10

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643688#comment-17643688
 ] 

Apache Spark commented on SPARK-40801:
--

User 'cutiechi' has created a pull request for this issue:
https://github.com/apache/spark/pull/38930

> Upgrade Apache Commons Text to 1.10
> ---
>
> Key: SPARK-40801
> URL: https://issues.apache.org/jira/browse/SPARK-40801
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.2.3, 3.3.2, 3.4.0
>
>
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34987) AQE improve: change shuffle hash join to sort merge join when skewed shuffle hash join exists

2022-12-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34987.
-
Resolution: Not A Problem

> AQE improve: change shuffle hash join to sort merge join when skewed shuffle 
> hash join exists
> -
>
> Key: SPARK-34987
> URL: https://issues.apache.org/jira/browse/SPARK-34987
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.1, 3.2.0
>Reporter: exmy
>Priority: Minor
>
> In our production environment, `spark.sql.join.preferSortMergeJoin` is false 
> by default. AQE can currently only optimize skewed joins for sort merge join, 
> so it would be better if it could change a shuffle hash join into a sort merge 
> join when a skewed shuffle hash join exists.
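> A minimal configuration sketch of the scenario described above (these are 
> standard Spark SQL conf keys; the values are illustrative):
> {code:scala}
> // Shuffled hash join is preferred, so AQE's skew-join optimization, which
> // currently only rewrites sort merge joins, does not apply to the skewed join.
> spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
> {code}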



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41346) Implement asc and desc methods

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643686#comment-17643686
 ] 

Apache Spark commented on SPARK-41346:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38929

> Implement asc and desc methods
> --
>
> Key: SPARK-41346
> URL: https://issues.apache.org/jira/browse/SPARK-41346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41034) Connect DataFrame should require RemoteSparkSession

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643676#comment-17643676
 ] 

Apache Spark commented on SPARK-41034:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38928

> Connect DataFrame should require RemoteSparkSession
> ---
>
> Key: SPARK-41034
> URL: https://issues.apache.org/jira/browse/SPARK-41034
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41369) Refactor connect directory structure

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41369:


Assignee: Apache Spark

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Apache Spark
>Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service as well as the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.
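> One possible layout after such a split (illustrative only; the final module 
> names and structure may differ):
> {noformat}
> connector/connect/
>   common/   <- protobuf definitions and generated classes, reusable elsewhere
>   server/   <- the Spark Connect service implementation
> {noformat}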



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41369) Refactor connect directory structure

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41369:


Assignee: (was: Apache Spark)

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service as well as the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-41369) Refactor connect directory structure

2022-12-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-41369:
--
  Assignee: (was: Venkata Sai Akhil Gudesa)

Reverted at 
https://github.com/apache/spark/commit/324d0909623db5fd5abadcf5e8116a6ba1211ba2

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service as well as the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41369) Refactor connect directory structure

2022-12-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41369:
-
Fix Version/s: (was: 3.4.0)

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service as well as the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41244) Introducing a Protobuf serializer for UI data on KV store

2022-12-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41244.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38779
[https://github.com/apache/spark/pull/38779]

> Introducing a Protobuf serializer for UI data on KV store
> -
>
> Key: SPARK-41244
> URL: https://issues.apache.org/jira/browse/SPARK-41244
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Introduce a Protobuf serializer for the KV store, which is 3 times as fast as 
> the default serializer according to an end-to-end benchmark against RocksDB.
> To move fast and make review easier, the first PR will cover only the class 
> `JobDataWrapper`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41369) Refactor connect directory structure

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643667#comment-17643667
 ] 

Apache Spark commented on SPARK-41369:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38927

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service as well as the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41402) Override nodeName of StringDecode

2022-12-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-41402:


 Summary: Override nodeName of StringDecode
 Key: SPARK-41402
 URL: https://issues.apache.org/jira/browse/SPARK-41402
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Override nodeName of StringDecode for clarity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41401) spark2 stagedir can't be change

2022-12-05 Thread sinlang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sinlang updated SPARK-41401:

Description: 
I want to use a different staging directory when writing temporary data, but 
Spark 3 seems to only be able to write under the table path.

The spark.yarn.stagingDir parameter only works with Spark 2.

In the org.apache.spark.internal.io.FileCommitProtocol file:

  def getStagingDir(path: String, jobId: String): Path = {
    new Path(path, ".spark-staging-" + jobId)
  }

  was:
I want to use a different staging directory when writing temporary data, but 
Spark 3 seems to only be able to write under the table path.

The spark.yarn.stagingDir parameter only works with Spark 2.

In the FileCommitProtocol file:

  def getStagingDir(path: String, jobId: String): Path = {
    new Path(path, ".spark-staging-" + jobId)
  }


> spark2 stagedir can't be change 
> 
>
> Key: SPARK-41401
> URL: https://issues.apache.org/jira/browse/SPARK-41401
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.2.3
>Reporter: sinlang
>Priority: Major
>
> I want to use a different staging directory when writing temporary data, but 
> Spark 3 seems to only be able to write under the table path.
> The spark.yarn.stagingDir parameter only works with Spark 2.
>  
> In the org.apache.spark.internal.io.FileCommitProtocol file:
>   def getStagingDir(path: String, jobId: String): Path = {
>     new Path(path, ".spark-staging-" + jobId)
>   }
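> A hypothetical sketch of what a configurable staging directory could look like 
> (the extra parameter and the conf key are assumptions, not an existing Spark 
> API):
> {code:scala}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
>
> // Fall back to the table path when no explicit staging directory is configured.
> def getStagingDir(path: String, jobId: String, hadoopConf: Configuration): Path = {
>   val base = Option(hadoopConf.get("spark.sql.staging.dir")).getOrElse(path) // hypothetical key
>   new Path(base, ".spark-staging-" + jobId)
> }
> {code}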



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41401) spark2 stagedir can't be change

2022-12-05 Thread sinlang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sinlang updated SPARK-41401:

Description: 
I want to use a different staging directory when writing temporary data, but 
Spark 3 seems to only be able to write under the table path.

The spark.yarn.stagingDir parameter only works with Spark 2.

In the FileCommitProtocol file:

  def getStagingDir(path: String, jobId: String): Path = {
    new Path(path, ".spark-staging-" + jobId)
  }

  was:
I want to use a different staging directory when writing temporary data, but 
Spark 3 seems to only be able to write under the table path.

The spark.yarn.stagingDir parameter only works with Spark 2.

!image-2022-12-06-11-31-29-723.png!


> spark2 stagedir can't be change 
> 
>
> Key: SPARK-41401
> URL: https://issues.apache.org/jira/browse/SPARK-41401
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.2.3
>Reporter: sinlang
>Priority: Major
>
> I want to use a different staging directory when writing temporary data, but 
> Spark 3 seems to only be able to write under the table path.
> The spark.yarn.stagingDir parameter only works with Spark 2.
>  
> In the FileCommitProtocol file:
>   def getStagingDir(path: String, jobId: String): Path = {
>     new Path(path, ".spark-staging-" + jobId)
>   }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41401) spark2 stagedir can't be change

2022-12-05 Thread sinlang (Jira)
sinlang created SPARK-41401:
---

 Summary: spark2 stagedir can't be change 
 Key: SPARK-41401
 URL: https://issues.apache.org/jira/browse/SPARK-41401
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.2, 3.2.3
Reporter: sinlang


I want to use a different staging directory when writing temporary data, but 
Spark 3 seems to only be able to write under the table path.

The spark.yarn.stagingDir parameter only works with Spark 2.

!image-2022-12-06-11-31-29-723.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41247) Unify the protobuf versions in Spark connect and protobuf connector

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643638#comment-17643638
 ] 

Apache Spark commented on SPARK-41247:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38926

> Unify the protobuf versions in Spark connect and protobuf connector
> ---
>
> Key: SPARK-41247
> URL: https://issues.apache.org/jira/browse/SPARK-41247
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.4.0
>
>
> Make the two versions consistent. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41399.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38925
[https://github.com/apache/spark/pull/38925]

> Refactor column related tests to test_connect_column
> 
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41399:
--
Component/s: Tests

> Refactor column related tests to test_connect_column
> 
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41399:
-

Assignee: Rui Wang

> Refactor column related tests to test_connect_column
> 
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643614#comment-17643614
 ] 

Apache Spark commented on SPARK-41399:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38925

> Refactor column related tests to test_connect_column
> 
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643613#comment-17643613
 ] 

Apache Spark commented on SPARK-41399:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38925

> Refactor column related tests to test_connect_column
> 
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41399:


Assignee: (was: Apache Spark)

> Refactor column related tests to test_connect_column
> 
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41399:


Assignee: Apache Spark

> Refactor column related tests to test_connect_column
> 
>
> Key: SPARK-41399
> URL: https://issues.apache.org/jira/browse/SPARK-41399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41400) Split of API classes from Catalyst

2022-12-05 Thread Jira
Herman van Hövell created SPARK-41400:
-

 Summary: Split of API classes from Catalyst
 Key: SPARK-41400
 URL: https://issues.apache.org/jira/browse/SPARK-41400
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Herman van Hövell


For the Spark Connect Scala Client we need a couple of classes that currently 
reside in Catalyst to be moved to a new sql/api project.

Concretely, the following classes will be moved:
 * Row
 * DataType (the entire hierarchy)
 * Encoder



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41399) Refactor column related tests to test_connect_column

2022-12-05 Thread Rui Wang (Jira)
Rui Wang created SPARK-41399:


 Summary: Refactor column related tests to test_connect_column
 Key: SPARK-41399
 URL: https://issues.apache.org/jira/browse/SPARK-41399
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41369) Refactor connect directory structure

2022-12-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-41369:
-

Assignee: Venkata Sai Akhil Gudesa

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service as well as the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41369) Refactor connect directory structure

2022-12-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-41369.
---
Fix Version/s: 3.4.0
   Resolution: Resolved

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service as well as the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643611#comment-17643611
 ] 

Apache Spark commented on SPARK-41398:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/38924

> Relax constraints on Storage-Partitioned Join when partition keys after 
> runtime filtering do not match
> --
>
> Key: SPARK-41398
> URL: https://issues.apache.org/jira/browse/SPARK-41398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41398:


Assignee: Apache Spark

> Relax constraints on Storage-Partitioned Join when partition keys after 
> runtime filtering do not match
> --
>
> Key: SPARK-41398
> URL: https://issues.apache.org/jira/browse/SPARK-41398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41398:


Assignee: (was: Apache Spark)

> Relax constraints on Storage-Partitioned Join when partition keys after 
> runtime filtering do not match
> --
>
> Key: SPARK-41398
> URL: https://issues.apache.org/jira/browse/SPARK-41398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643610#comment-17643610
 ] 

Apache Spark commented on SPARK-41398:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/38924

> Relax constraints on Storage-Partitioned Join when partition keys after 
> runtime filtering do not match
> --
>
> Key: SPARK-41398
> URL: https://issues.apache.org/jira/browse/SPARK-41398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41398) Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match

2022-12-05 Thread Chao Sun (Jira)
Chao Sun created SPARK-41398:


 Summary: Relax constraints on Storage-Partitioned Join when 
partition keys after runtime filtering do not match
 Key: SPARK-41398
 URL: https://issues.apache.org/jira/browse/SPARK-41398
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.1
Reporter: Chao Sun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41397) Implement part of string/binary functions

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41397:


Assignee: Apache Spark

> Implement part of string/binary functions
> -
>
> Key: SPARK-41397
> URL: https://issues.apache.org/jira/browse/SPARK-41397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41397) Implement part of string/binary functions

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643594#comment-17643594
 ] 

Apache Spark commented on SPARK-41397:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38921

> Implement part of string/binary functions
> -
>
> Key: SPARK-41397
> URL: https://issues.apache.org/jira/browse/SPARK-41397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41397) Implement part of string/binary functions

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41397:


Assignee: (was: Apache Spark)

> Implement part of string/binary functions
> -
>
> Key: SPARK-41397
> URL: https://issues.apache.org/jira/browse/SPARK-41397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41397) Implement part of String/Binary functions

2022-12-05 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-41397:


 Summary: Implement part of String/Binary functions
 Key: SPARK-41397
 URL: https://issues.apache.org/jira/browse/SPARK-41397
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41397) Implement part of string/binary functions

2022-12-05 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-41397:
-
Summary: Implement part of string/binary functions  (was: Implement part of 
String/Binary functions)

> Implement part of string/binary functions
> -
>
> Key: SPARK-41397
> URL: https://issues.apache.org/jira/browse/SPARK-41397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643588#comment-17643588
 ] 

Apache Spark commented on SPARK-41395:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/38923

> InterpretedMutableProjection can corrupt unsafe buffer when used with decimal 
> data
> --
>
> Key: SPARK-41395
> URL: https://issues.apache.org/jira/browse/SPARK-41395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.3, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following returns the wrong answer:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> +---------+---------+
> |max(col1)|max(col2)|
> +---------+---------+
> |     null|   239.88|
> +---------+---------+
> {noformat}
> This is because {{InterpretedMutableProjection}} inappropriately uses 
> {{InternalRow#setNullAt}} to set null for decimal types with precision > 
> {{Decimal.MAX_LONG_DIGITS}}.
> The path to corruption goes like this:
> Unsafe buffer at start:
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300 1800 2800
> {noformat}
> When processing the first incoming row ([null, null]), 
> {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. 
> As a result, the pointers to the storage areas for the two decimals in the 
> variable length region get zeroed out.
> Buffer after projecting first row (null, null):
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300
> {noformat}
> When it's time to project the second row into the buffer, 
> UnsafeRow#setDecimal uses the zero offsets, which causes 
> {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal 
> data:
> {noformat}
>         null-tracking
>         bit area
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   5db4  0200
> {noformat}
> The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
> 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
> turns off the null-tracking bit associated with the field at index 1.
> In addition, the decimal at field index 0 is now null because of the 
> corruption of the null-tracking bit set.
> When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
> {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
> than call {{setNullAt}} (see.)
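> A minimal sketch of that idea (not the actual patch; the variable names are 
> placeholders for whatever the projection uses internally):
> {code:scala}
> // Uses org.apache.spark.sql.types.{Decimal, DecimalType}.
> // Writing an explicit null Decimal goes through setDecimal, which keeps the
> // offset/length word for the wide decimal intact; setNullAt leaves it zeroed
> // and exposed to the corruption described above.
> dataType match {
>   case d: DecimalType if d.precision > Decimal.MAX_LONG_DIGITS =>
>     mutableRow.setDecimal(ordinal, null, d.precision)
>   case _ =>
>     mutableRow.setNullAt(ordinal)
> }
> {code}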
> This bug could get exercised during codegen fallback. Take for example this 
> case where I forced codegen to fail for the {{Greatest}} expression:
> {noformat}
> spark-sql> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
>   at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>   at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
>   at org.codehaus.janino.Parser.read(Parser.java:3787)
> ...
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
> to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: ';' expected instead of 'boolea

[jira] [Assigned] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41395:


Assignee: (was: Apache Spark)

> InterpretedMutableProjection can corrupt unsafe buffer when used with decimal 
> data
> --
>
> Key: SPARK-41395
> URL: https://issues.apache.org/jira/browse/SPARK-41395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.3, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following returns the wrong answer:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> +---------+---------+
> |max(col1)|max(col2)|
> +---------+---------+
> |     null|   239.88|
> +---------+---------+
> {noformat}
> This is because {{InterpretedMutableProjection}} inappropriately uses 
> {{InternalRow#setNullAt}} to set null for decimal types with precision > 
> {{Decimal.MAX_LONG_DIGITS}}.
> The path to corruption goes like this:
> Unsafe buffer at start:
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300 1800 2800
> {noformat}
> When processing the first incoming row ([null, null]), 
> {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. 
> As a result, the pointers to the storage areas for the two decimals in the 
> variable length region get zeroed out.
> Buffer after projecting first row (null, null):
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300
> {noformat}
> When it's time to project the second row into the buffer, 
> UnsafeRow#setDecimal uses the zero offsets, which causes 
> {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal 
> data:
> {noformat}
>         null-tracking
>         bit area
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   5db4  0200
> {noformat}
> The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
> 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
> turns off the null-tracking bit associated with the field at index 1.
> In addition, the decimal at field index 0 is now null because of the 
> corruption of the null-tracking bit set.
> When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
> {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
> than call {{setNullAt}} (see.)
> This bug could get exercised during codegen fallback. Take for example this 
> case where I forced codegen to fail for the {{Greatest}} expression:
> {noformat}
> spark-sql> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
>   at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>   at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
>   at org.codehaus.janino.Parser.read(Parser.java:3787)
> ...
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
> to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: ';' expected instead of 'boolean'
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> 

[jira] [Commented] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643587#comment-17643587
 ] 

Apache Spark commented on SPARK-41395:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/38923

> InterpretedMutableProjection can corrupt unsafe buffer when used with decimal 
> data
> --
>
> Key: SPARK-41395
> URL: https://issues.apache.org/jira/browse/SPARK-41395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.3, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following returns the wrong answer:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> +---------+---------+
> |max(col1)|max(col2)|
> +---------+---------+
> |     null|   239.88|
> +---------+---------+
> {noformat}
> This is because {{InterpretedMutableProjection}} inappropriately uses 
> {{InternalRow#setNullAt}} to set null for decimal types with precision > 
> {{Decimal.MAX_LONG_DIGITS}}.
> The path to corruption goes like this:
> Unsafe buffer at start:
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300 1800 2800
> {noformat}
> When processing the first incoming row ([null, null]), 
> {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. 
> As a result, the pointers to the storage areas for the two decimals in the 
> variable length region get zeroed out.
> Buffer after projecting first row (null, null):
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300
> {noformat}
> When it's time to project the second row into the buffer, 
> UnsafeRow#setDecimal uses the zero offsets, which causes 
> {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal 
> data:
> {noformat}
>         null-tracking
>         bit area
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   5db4  0200
> {noformat}
> The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
> 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
> turns off the null-tracking bit associated with the field at index 1.
> In addition, the decimal at field index 0 is now null because of the 
> corruption of the null-tracking bit set.
> When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
> {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
> than call {{setNullAt}} (see.)
> This bug could get exercised during codegen fallback. Take for example this 
> case where I forced codegen to fail for the {{Greatest}} expression:
> {noformat}
> spark-sql> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
>   at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>   at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
>   at org.codehaus.janino.Parser.read(Parser.java:3787)
> ...
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
> to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: ';' expected instead of 'boolea

[jira] [Assigned] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41395:


Assignee: Apache Spark

> InterpretedMutableProjection can corrupt unsafe buffer when used with decimal 
> data
> --
>
> Key: SPARK-41395
> URL: https://issues.apache.org/jira/browse/SPARK-41395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.3, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> The following returns the wrong answer:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> +---------+---------+
> |max(col1)|max(col2)|
> +---------+---------+
> |     null|   239.88|
> +---------+---------+
> {noformat}
> This is because {{InterpretedMutableProjection}} inappropriately uses 
> {{InternalRow#setNullAt}} to set null for decimal types with precision > 
> {{Decimal.MAX_LONG_DIGITS}}.
> The path to corruption goes like this:
> Unsafe buffer at start:
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300 1800 2800
> {noformat}
> When processing the first incoming row ([null, null]), 
> {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. 
> As a result, the pointers to the storage areas for the two decimals in the 
> variable length region get zeroed out.
> Buffer after projecting first row (null, null):
> {noformat}
>                         offset/len for   offset/len for
>                         1st decimal      2nd decimal
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   0300
> {noformat}
> When it's time to project the second row into the buffer, 
> UnsafeRow#setDecimal uses the zero offsets, which causes 
> {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal 
> data:
> {noformat}
>         null-tracking
>         bit area
> offset: 0       8       16 (0x10)        24 (0x18)        32 (0x20)
> data:   5db4  0200
> {noformat}
> The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
> 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
> turns off the null-tracking bit associated with the field at index 1.
> In addition, the decimal at field index 0 is now null because of the 
> corruption of the null-tracking bit set.
> When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
> {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
> than call {{setNullAt}} (see.)
> This bug could get exercised during codegen fallback. Take for example this 
> case where I forced codegen to fail for the {{Greatest}} expression:
> {noformat}
> spark-sql> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
>   at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>   at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
>   at org.codehaus.janino.Parser.read(Parser.java:3787)
> ...
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
> to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: ';' expected instead of 'boolean'
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture

[jira] [Resolved] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed

2022-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41394.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38920
[https://github.com/apache/spark/pull/38920]

> Skip MemoryProfilerTests when pandas is not installed
> -
>
> Key: SPARK-41394
> URL: https://issues.apache.org/jira/browse/SPARK-41394
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed

2022-12-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41394:
-

Assignee: Dongjoon Hyun

> Skip MemoryProfilerTests when pandas is not installed
> -
>
> Key: SPARK-41394
> URL: https://issues.apache.org/jira/browse/SPARK-41394
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-41395:
--
Description: 
The following returns the wrong answer:

{noformat}
set spark.sql.codegen.wholeStage=false;
set spark.sql.codegen.factoryMode=NO_CODEGEN;

select max(col1), max(col2) from values
(cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
(cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
as data(col1, col2);

+-+-+
|max(col1)|max(col2)|
+-+-+
|null |239.88   |
+-+-+
{noformat}
This is because {{InterpretedMutableProjection}} inappropriately uses 
{{InternalRow#setNullAt}} to set null for decimal types with precision > 
{{Decimal.MAX_LONG_DIGITS}}.

The path to corruption goes like this:

Unsafe buffer at start:

{noformat}
  offset/len for   offset/len for
  1st decimal  2nd decimal

offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   0300 1800 2800  
  
{noformat}

When processing the first incoming row ([null, null]), 
{{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. As 
a result, the pointers to the storage areas for the two decimals in the 
variable length region get zeroed out.

Buffer after projecting first row (null, null):
{noformat}
  offset/len for   offset/len for
  1st decimal  2nd decimal

offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   0300    
  
{noformat}

When it's time to project the second row into the buffer, UnsafeRow#setDecimal 
uses the zero offsets, which causes {{UnsafeRow#setDecimal}} to overwrite the 
null-tracking bit set with decimal data:

{noformat}
null-tracking
bit area
offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   5db4  0200  
   
{noformat}
The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
turns off the null-tracking bit associated with the field at index 1.


In addition, the decimal at field index 0 is now null because of the corruption 
of the null-tracking bit set.


When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
{{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
than call {{setNullAt}} (see.)
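
For illustration, a minimal sketch (not the actual Spark source; the helper name 
is made up) of the null-handling branch described above, assuming access to 
{{InternalRow#setDecimal}} and {{Decimal.MAX_LONG_DIGITS}}:

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{Decimal, DecimalType}

// Hypothetical helper: write a decimal into a mutable row without corrupting
// the unsafe buffer. For wide decimals (precision > Decimal.MAX_LONG_DIGITS),
// a null must go through the typed setter so the fixed-length offset/len word
// in the buffer is preserved; setNullAt would zero it out.
def writeDecimal(row: InternalRow, ordinal: Int, dt: DecimalType, value: Decimal): Unit = {
  if (value == null && dt.precision > Decimal.MAX_LONG_DIGITS) {
    row.setDecimal(ordinal, null, dt.precision)   // marks the field null, keeps the pointer intact
  } else if (value == null) {
    row.setNullAt(ordinal)                        // safe for fixed-length fields
  } else {
    row.setDecimal(ordinal, value, dt.precision)
  }
}
{code}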

This bug could get exercised during codegen fallback. Take for example this 
case where I forced codegen to fail for the {{Greatest}} expression:

{noformat}
spark-sql> select max(col1), max(col2) from values
(cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
(cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
as data(col1, col2);

22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, 
Column 1: ';' expected instead of 'if'
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, 
Column 1: ';' expected instead of 'if'
at 
org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
at org.codehaus.janino.Parser.read(Parser.java:3787)
...
22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
to interpreter mode
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, 
Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 43, Column 1: ';' expected instead of 'boolean'
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1583)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1580)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 36 more
...

NULL239.88   <== incorrect result, should be (77.77, 245.00)
Time taken: 6.132 seconds, Fetched 1 row(s)
spark-sql>
{nofor

[jira] [Commented] (SPARK-41396) Oneof field support and recursive fields

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643568#comment-17643568
 ] 

Apache Spark commented on SPARK-41396:
--

User 'SandishKumarHN' has created a pull request for this issue:
https://github.com/apache/spark/pull/38922

> Oneof field support and recursive fields
> 
>
> Key: SPARK-41396
> URL: https://issues.apache.org/jira/browse/SPARK-41396
> Project: Spark
>  Issue Type: Task
>  Components: Protobuf
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
>
> We should add support for protobuf OneOf fields to Spark-Protobuf. This will 
> involve implementing logic to detect when a protobuf message contains a OneOf 
> field and to handle it appropriately when using from_protobuf and 
> to_protobuf.
> We should also add unit tests to ensure that the implementation of protobuf 
> OneOf field support is correct.
> With this, users can use protobuf OneOf fields with Spark-Protobuf, making it 
> more complete and useful for processing protobuf data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41396) Oneof field support and recursive fields

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41396:


Assignee: Apache Spark

> Oneof field support and recursive fields
> 
>
> Key: SPARK-41396
> URL: https://issues.apache.org/jira/browse/SPARK-41396
> Project: Spark
>  Issue Type: Task
>  Components: Protobuf
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Assignee: Apache Spark
>Priority: Major
>
> We should add support for protobuf OneOf fields to Spark-Protobuf. This will 
> involve implementing logic to detect when a protobuf message contains a OneOf 
> field and to handle it appropriately when using from_protobuf and 
> to_protobuf.
> We should also add unit tests to ensure that the implementation of protobuf 
> OneOf field support is correct.
> With this, users can use protobuf OneOf fields with Spark-Protobuf, making it 
> more complete and useful for processing protobuf data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41396) Oneof field support and recursive fields

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643567#comment-17643567
 ] 

Apache Spark commented on SPARK-41396:
--

User 'SandishKumarHN' has created a pull request for this issue:
https://github.com/apache/spark/pull/38922

> Oneof field support and recursive fields
> 
>
> Key: SPARK-41396
> URL: https://issues.apache.org/jira/browse/SPARK-41396
> Project: Spark
>  Issue Type: Task
>  Components: Protobuf
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
>
> We should add support for protobuf OneOf fields to Spark-Protobuf. This will 
> involve implementing logic to detect when a protobuf message contains a OneOf 
> field and to handle it appropriately when using from_protobuf and 
> to_protobuf.
> We should also add unit tests to ensure that the implementation of protobuf 
> OneOf field support is correct.
> With this, users can use protobuf OneOf fields with Spark-Protobuf, making it 
> more complete and useful for processing protobuf data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41396) Oneof field support and recursive fields

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41396:


Assignee: (was: Apache Spark)

> Oneof field support and recursive fields
> 
>
> Key: SPARK-41396
> URL: https://issues.apache.org/jira/browse/SPARK-41396
> Project: Spark
>  Issue Type: Task
>  Components: Protobuf
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
>
> We should add support for protobuf OneOf fields to Spark-Protobuf. This will 
> involve implementing logic to detect when a protobuf message contains a OneOf 
> field and to handle it appropriately when using from_protobuf and 
> to_protobuf.
> We should also add unit tests to ensure that the implementation of protobuf 
> OneOf field support is correct.
> With this, users can use protobuf OneOf fields with Spark-Protobuf, making it 
> more complete and useful for processing protobuf data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41396) Oneof field support and recursive fields

2022-12-05 Thread Sandish Kumar HN (Jira)
Sandish Kumar HN created SPARK-41396:


 Summary: Oneof field support and recursive fields
 Key: SPARK-41396
 URL: https://issues.apache.org/jira/browse/SPARK-41396
 Project: Spark
  Issue Type: Task
  Components: Protobuf
Affects Versions: 2.3.0
Reporter: Sandish Kumar HN


We should add support for protobuf OneOf fields to Spark-Protobuf. This will 
involve implementing logic to detect when a protobuf message contains a OneOf 
field and to handle it appropriately when using from_protobuf and to_protobuf.

We should also add unit tests to ensure that the implementation of protobuf 
OneOf field support is correct.

With this, users can use protobuf OneOf fields with Spark-Protobuf, making it 
more complete and useful for processing protobuf data.
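
For illustration, a hypothetical usage sketch of the kind of OneOf round trip 
this task targets, assuming the existing {{from_protobuf}} / {{to_protobuf}} 
functions in the spark-protobuf module; the message, descriptor path, input 
path and column names are made up:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.{from_protobuf, to_protobuf}

// Hypothetical message compiled into the descriptor set below:
//   message Event {
//     oneof payload {          // only one branch is ever set
//       string text  = 1;
//       int64  count = 2;
//     }
//   }
val spark = SparkSession.builder().getOrCreate()
val descPath = "/tmp/event.desc"   // made-up path to the compiled descriptor set

// Any DataFrame with a binary column "value" holding serialized Event messages.
val binaryDf = spark.read.format("parquet").load("/tmp/events.parquet")

// With OneOf support, the resulting struct should expose whichever branch was
// set, the other branch being null.
val parsed = binaryDf.select(from_protobuf(col("value"), "Event", descPath).as("event"))

// And to_protobuf should round-trip the struct back to binary.
val roundTripped = parsed.select(to_protobuf(col("event"), "Event", descPath).as("value"))
{code}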



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-41395:
--
Affects Version/s: 3.3.1

> InterpretedMutableProjection can corrupt unsafe buffer when used with decimal 
> data
> --
>
> Key: SPARK-41395
> URL: https://issues.apache.org/jira/browse/SPARK-41395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.3, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following returns the wrong answer:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> +-+-+
> |max(col1)|max(col2)|
> +-+-+
> |null |239.88   |
> +-+-+
> {noformat}
> This is because {{InterpretedMutableProjection}} inappropriately uses 
> {{InternalRow#setNullAt}} to set null for decimal types with precision > 
> {{Decimal.MAX_LONG_DIGITS}}.
> The path to corruption goes like this:
> Unsafe buffer at start:
> {noformat}
>   offset/len for   offset/len for
>   1st decimal  2nd decimal
> offset: 0816 (0x10)24 (0x18)
> 32 (0x20)
> data:   0300 1800 2800  
>   
> {noformat}
> When processing the first incoming row ([null, null]), 
> {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. 
> As a result, the pointers to the storage areas for the two decimals in the 
> variable length region get zeroed out.
> Buffer after projecting first row (null, null):
> {noformat}
>   offset/len for   offset/len for
>   1st decimal  2nd decimal
> offset: 0816 (0x10)24 (0x18)
> 32 (0x20)
> data:   0300    
>   
> {noformat}
> When it's time to project the second row into the buffer, 
> UnsafeRow#setDecimal uses the zero offsets, which causes 
> {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal 
> data:
> {noformat}
> null-tracking
> bit area
> offset: 0816 (0x10)24 (0x18)
> 32 (0x20)
> data:   5db4  0200  
>    
> {noformat}
> The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
> 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
> turns off the null-tracking bit associated with the field at index 1.
> In addition, the decimal at field index 0 is now null because of the 
> corruption of the null-tracking bit set.
> When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
> {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
> than call {{setNullAt}} (see.)
> This bug could get exercised during codegen fallback. Take for example this 
> case where I forcibly made codegen fail for the {{Greatest}} expression:
> {noformat}
> spark-sql> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
>   at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>   at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
>   at org.codehaus.janino.Parser.read(Parser.java:3787)
> ...
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
> to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: ';' expected instead of 'boolean'
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google

[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-41395:
--
Affects Version/s: 3.2.3

> InterpretedMutableProjection can corrupt unsafe buffer when used with decimal 
> data
> --
>
> Key: SPARK-41395
> URL: https://issues.apache.org/jira/browse/SPARK-41395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The following returns the wrong answer:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> +-+-+
> |max(col1)|max(col2)|
> +-+-+
> |null |239.88   |
> +-+-+
> {noformat}
> This is because {{InterpretedMutableProjection}} inappropriately uses 
> {{InternalRow#setNullAt}} to set null for decimal types with precision > 
> {{Decimal.MAX_LONG_DIGITS}}.
> The path to corruption goes like this:
> Unsafe buffer at start:
> {noformat}
>   offset/len for   offset/len for
>   1st decimal  2nd decimal
> offset: 0816 (0x10)24 (0x18)
> 32 (0x20)
> data:   0300 1800 2800  
>   
> {noformat}
> When processing the first incoming row ([null, null]), 
> {{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. 
> As a result, the pointers to the storage areas for the two decimals in the 
> variable length region get zeroed out.
> Buffer after projecting first row (null, null):
> {noformat}
>   offset/len for   offset/len for
>   1st decimal  2nd decimal
> offset: 0816 (0x10)24 (0x18)
> 32 (0x20)
> data:   0300    
>   
> {noformat}
> When it's time to project the second row into the buffer, 
> UnsafeRow#setDecimal uses the zero offsets, which causes 
> {{UnsafeRow#setDecimal}} to overwrite the null-tracking bit set with decimal 
> data:
> {noformat}
> null-tracking
> bit area
> offset: 0816 (0x10)24 (0x18)
> 32 (0x20)
> data:   5db4  0200  
>    
> {noformat}
> The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
> 245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
> turns off the null-tracking bit associated with the field at index 1.
> In addition, the decimal at field index 0 is now null because of the 
> corruption of the null-tracking bit set.
> When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
> {{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
> than call {{setNullAt}} (see.)
> This bug could get exercised during codegen fallback. Take for example this 
> case where I forcibly made codegen fail for the {{Greatest}} expression:
> {noformat}
> spark-sql> select max(col1), max(col2) from values
> (cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
> (cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
> as data(col1, col2);
> 22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 58, Column 1: ';' expected instead of 'if'
>   at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>   at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
>   at org.codehaus.janino.Parser.read(Parser.java:3787)
> ...
> 22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
> to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 43, Column 1: ';' expected instead of 'boolean'
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google.common

[jira] [Resolved] (SPARK-41390) Update the script used to generate register function in UDFRegistration

2022-12-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41390.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38916
[https://github.com/apache/spark/pull/38916]

> Update the script used to generate register function in UDFRegistration 
> 
>
> Key: SPARK-41390
> URL: https://issues.apache.org/jira/browse/SPARK-41390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> SPARK-35065 used {{QueryCompilationErrors.invalidFunctionArgumentsError}} 
> instead of {{throw new AnalysisException(...)}} for the {{register}} functions in 
> {{UDFRegistration}}, but the script used to generate xx has not been 
> updated, so this PR updates the script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41390) Update the script used to generate register function in UDFRegistration

2022-12-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41390:


Assignee: Yang Jie

> Update the script used to generate register function in UDFRegistration 
> 
>
> Key: SPARK-41390
> URL: https://issues.apache.org/jira/browse/SPARK-41390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> SPARK-35065 used {{QueryCompilationErrors.invalidFunctionArgumentsError}} 
> instead of {{throw new AnalysisException(...)}} for the {{register}} functions in 
> {{UDFRegistration}}, but the script used to generate xx has not been 
> updated, so this PR updates the script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-41395:
--
Description: 
The following returns the wrong answer:

{noformat}
set spark.sql.codegen.wholeStage=false;
set spark.sql.codegen.factoryMode=NO_CODEGEN;

select max(col1), max(col2) from values
(cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
(cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
as data(col1, col2);

+-+-+
|max(col1)|max(col2)|
+-+-+
|null |239.88   |
+-+-+
{noformat}
This is because {{InterpretedMutableProjection}} inappropriately uses 
{{InternalRow#setNullAt}} to set null for decimal types with precision > 
{{Decimal.MAX_LONG_DIGITS}}.

The path to corruption goes like this:

Unsafe buffer at start:

{noformat}
  offset/len for   offset/len for
  1st decimal  2nd decimal

offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   0300 1800 2800  
  
{noformat}

When processing the first incoming row ([null, null]), 
{{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. As 
a result, the pointers to the storage areas for the two decimals in the 
variable length region get zeroed out.

Buffer after projecting first row (null, null):
{noformat}
  offset/len for   offset/len for
  1st decimal  2nd decimal

offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   0300    
  
{noformat}

When it's time to project the second row into the buffer, UnsafeRow#setDecimal 
uses the zero offsets, which causes {{UnsafeRow#setDecimal}} to overwrite the 
null-tracking bit set with decimal data:

{noformat}
null-tracking
bit area
offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   5db4  0200  
   
{noformat}
The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
turns off the null-tracking bit associated with the field at index 1.


In addition, the decimal at field index 0 is now null because of the corruption 
of the null-tracking bit set.


When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
{{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
than call {{setNullAt}} (see.)

This bug could get exercised during codegen fallback. Take for example this 
case where I forcibly made codegen fail for the {{Greatest}} expression:

{noformat}
spark-sql> select max(col1), max(col2) from values
(cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
(cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
as data(col1, col2);

22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, 
Column 1: ';' expected instead of 'if'
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, 
Column 1: ';' expected instead of 'if'
at 
org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
at org.codehaus.janino.Parser.read(Parser.java:3787)
...
22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
to interpreter mode
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, 
Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 43, Column 1: ';' expected instead of 'boolean'
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1583)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1580)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 36 more
...

NULL239.88   <== incorrect result, should be (77.77, 245.00)
Time taken: 6.132 seconds, Fetched 1 row(s)
spark-sql>
{n

[jira] [Created] (SPARK-41395) InterpretedMutableProjection can corrupt unsafe buffer when used with decimal data

2022-12-05 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-41395:
-

 Summary: InterpretedMutableProjection can corrupt unsafe buffer 
when used with decimal data
 Key: SPARK-41395
 URL: https://issues.apache.org/jira/browse/SPARK-41395
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Bruce Robbins


The following returns the wrong answer:

{noformat}
set spark.sql.codegen.wholeStage=false;
set spark.sql.codegen.factoryMode=NO_CODEGEN;

select max(col1), max(col2) from values
(cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
(cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
as data(col1, col2);

+-+-+
|max(col1)|max(col2)|
+-+-+
|null |239.88   |
+-+-+
{noformat}
This is because {{InterpretedMutableProjection}} inappropriately uses 
{{InternalRow#setNullAt}} to set null for decimal types with precision > 
{{Decimal.MAX_LONG_DIGITS}}.

The path to corruption goes like this:

Unsafe buffer at start:

{noformat}
  offset/len for   offset/len for
  1st decimal  2nd decimal

offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   0300 1800 2800  
  
{noformat}

When processing the first incoming row ([null, null]), 
{{InterpretedMutableProjection}} calls {{setNullAt}} for the decimal types. As 
a result, the pointers to the storage areas for the two decimals in the 
variable length region get zeroed out.

Buffer after projecting first row (null, null):
{noformat}
  offset/len for   offset/len for
  1st decimal  2nd decimal

offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   0300    
  
{noformat}

When it's time to project the second row into the buffer, UnsafeRow#setDecimal 
uses the zero offsets, which causes {{UnsafeRow#setDecimal}} to overwrite the 
null-tracking bit set with decimal data:

{noformat}
null-tracking
bit area
offset: 0816 (0x10)24 (0x18)32 
(0x20)
data:   5db4  0200  
   
{noformat}
The null-tracking bit set is overwritten with 239.88 (0x5db4) rather than 
245.00 (0x5fb4) because setDecimal indirectly calls setNotNullAt(1), which 
turns off the null-tracking bit associated with the field at index 1.


In addition, the decimal at field index 0 is now null because of the corruption 
of the null-tracking bit set.


When a decimal type with precision > {{Decimal.MAX_LONG_DIGITS}} is null, 
{{InterpretedMutableProjection}} should write a null {{Decimal}} value rather 
than call {{setNullAt}} (see.)

This bug could get exercised during codegen fallback. Take for example this case 
where I forcibly made codegen fail for the {{Greatest}} expression:

{noformat}
spark-sql> select max(col1), max(col2) from values
(cast(null  as decimal(27,2)), cast(null   as decimal(27,2))),
(cast(77.77 as decimal(27,2)), cast(245.00 as decimal(27,2)))
as data(col1, col2);

22/12/05 08:18:54 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, 
Column 1: ';' expected instead of 'if'
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 58, 
Column 1: ';' expected instead of 'if'
at 
org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:149)
at org.codehaus.janino.Parser.read(Parser.java:3787)
...
22/12/05 08:18:56 WARN MutableProjection: Expr codegen error and falling back 
to interpreter mode
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, 
Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 43, Column 1: ';' expected instead of 'boolean'
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1583)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1580)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
com.goog

[jira] [Comment Edited] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2022-12-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643524#comment-17643524
 ] 

Bjørn Jørgensen edited comment on SPARK-18502 at 12/5/22 7:36 PM:
--

I just answered this problem in u...@spark.org 
 
 
df = spark.createDataFrame(
    [("china", "asia"), ("colombia", "south america`")],
    ["country", "continent`"]
)
df.show()

 
{code:java}
+--------+--------------+
| country|    continent`|
+--------+--------------+
|   china|          asia|
|colombia|south america`|
+--------+--------------+ {code}
 
 

df.select("continent`").show(1)
(...)

AnalysisException: Syntax error in attribute name: continent`.

 

clean_df = df.toDF(*(c.replace('`', '_') for c in df.columns))
clean_df.show()
{code:java}
+--------+--------------+
| country|    continent_|
+--------+--------------+
|   china|          asia|
|colombia|south america`|
+--------+--------------+ {code}
 

clean_df.select("continent_").show(2)
{code:java}
+--------------+
|    continent_|
+--------------+
|          asia|
|south america`|
+--------------+ {code}

Examples are from [MungingData Avoiding Dots / Periods in PySpark Column 
Names|https://mungingdata.com/pyspark/avoid-dots-periods-column-names/]


was (Author: bjornjorgensen):
I just answered this problem in u...@spark.org 
 
 
df = spark.createDataFrame(
    [("china", "asia"), ("colombia", "south america`")],
    ["country", "continent`"]
)
df.show()

 
 
+--------+--------------+
| country|    continent`|
+--------+--------------+
|   china|          asia|
|colombia|south america`|
+--------+--------------+

df.select("continent`").show(1)
(...)
AnalysisException: Syntax error in attribute name: continent`.

clean_df = df.toDF(*(c.replace('`', '_') for c in df.columns))
clean_df.show()

+--------+--------------+
| country|    continent_|
+--------+--------------+
|   china|          asia|
|colombia|south america`|
+--------+--------------+

clean_df.select("continent_").show(2)

+--------------+
|    continent_|
+--------------+
|          asia|
|south america`|
+--------------+
Examples are from [MungingData Avoiding Dots / Periods in PySpark Column 
Names|https://mungingdata.com/pyspark/avoid-dots-periods-column-names/]

> Spark does not handle columns that contain backquote (`)
> 
>
> Key: SPARK-18502
> URL: https://issues.apache.org/jira/browse/SPARK-18502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Barry Becker
>Priority: Minor
>  Labels: bulk-closed
>
> I know that if a column contains dots or hyphens we can put 
> backquotes/backticks around it, but what if the column contains a backtick 
> (`)? Can the back tick be escaped by some means?
> Here is an example of the sort of error I see
> {code}
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90)
>  org.apache.spark.sql.Column.(Column.scala:113) 
> org.apache.spark.sql.Column$.apply(Column.scala:36) 
> org.apache.spark.sql.functions$.min(functions.scala:407) 
> com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2022-12-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643524#comment-17643524
 ] 

Bjørn Jørgensen commented on SPARK-18502:
-

I just answered this problem in u...@spark.org 
 
 
df = spark.createDataFrame(
    [("china", "asia"), ("colombia", "south america`")],
    ["country", "continent`"]
)
df.show()

 
 
+--------+--------------+
| country|    continent`|
+--------+--------------+
|   china|          asia|
|colombia|south america`|
+--------+--------------+

df.select("continent`").show(1)
(...)
AnalysisException: Syntax error in attribute name: continent`.

clean_df = df.toDF(*(c.replace('`', '_') for c in df.columns))
clean_df.show()

+--------+--------------+
| country|    continent_|
+--------+--------------+
|   china|          asia|
|colombia|south america`|
+--------+--------------+

clean_df.select("continent_").show(2)

+--------------+
|    continent_|
+--------------+
|          asia|
|south america`|
+--------------+
Examples are from [MungingData Avoiding Dots / Periods in PySpark Column 
Names|https://mungingdata.com/pyspark/avoid-dots-periods-column-names/]

> Spark does not handle columns that contain backquote (`)
> 
>
> Key: SPARK-18502
> URL: https://issues.apache.org/jira/browse/SPARK-18502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Barry Becker
>Priority: Minor
>  Labels: bulk-closed
>
> I know that if a column contains dots or hyphens we can put 
> backquotes/backticks around it, but what if the column contains a backtick 
> (`)? Can the back tick be escaped by some means?
> Here is an example of the sort of error I see
> {code}
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90)
>  org.apache.spark.sql.Column.(Column.scala:113) 
> org.apache.spark.sql.Column$.apply(Column.scala:36) 
> org.apache.spark.sql.functions$.min(functions.scala:407) 
> com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643510#comment-17643510
 ] 

Apache Spark commented on SPARK-41394:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38920

> Skip MemoryProfilerTests when pandas is not installed
> -
>
> Key: SPARK-41394
> URL: https://issues.apache.org/jira/browse/SPARK-41394
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41394:


Assignee: (was: Apache Spark)

> Skip MemoryProfilerTests when pandas is not installed
> -
>
> Key: SPARK-41394
> URL: https://issues.apache.org/jira/browse/SPARK-41394
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643509#comment-17643509
 ] 

Apache Spark commented on SPARK-41394:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38920

> Skip MemoryProfilerTests when pandas is not installed
> -
>
> Key: SPARK-41394
> URL: https://issues.apache.org/jira/browse/SPARK-41394
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41394:


Assignee: Apache Spark

> Skip MemoryProfilerTests when pandas is not installed
> -
>
> Key: SPARK-41394
> URL: https://issues.apache.org/jira/browse/SPARK-41394
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41394) Skip MemoryProfilerTests when pandas is not installed

2022-12-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-41394:
-

 Summary: Skip MemoryProfilerTests when pandas is not installed
 Key: SPARK-41394
 URL: https://issues.apache.org/jira/browse/SPARK-41394
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, Tests
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39257) use spark.read.jdbc() to read data from SQL databse into dataframe, it fails silently, when the session is killed from SQL server side

2022-12-05 Thread Sandeep Katta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Katta resolved SPARK-39257.
---
Resolution: Not A Problem

The issue is caused by the mssql-jdbc driver and is fixed in version 12.1.0 by 
PR [1942|https://github.com/microsoft/mssql-jdbc/pull/1942].

> use spark.read.jdbc() to read data from SQL databse into dataframe, it fails 
> silently, when the session is killed from SQL server side
> --
>
> Key: SPARK-39257
> URL: https://issues.apache.org/jira/browse/SPARK-39257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.2, 3.2.1
> Environment: {*}Spark version{*}: spark 3.0.1/3.1.2/3.2.1
> *Microsoft JDBC Driver* *for SQL server:* 
> mssql-jdbc-8.2.1.jre8/mssql-jdbc-10.2.1.jre8.jar
>Reporter: Xinran Tao
>Priority: Major
>
> I'm using *spark.read.jdbc()* to read from a SQL database into a dataframe, 
> which utilizes the *Microsoft JDBC Driver for SQL Server* to get data from the 
> SQL server.
> *codes:*
>  
> {code:java}
> %scala
> val token = "xxx"
> val jdbcHostname = "xinrandatabseserver.database.windows.net"
> val jdbcDatabase = "xinranSQLDatabase"
> val jdbcPort = 1433
> val jdbcUrl = 
> "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname,
>  jdbcPort, jdbcDatabase)+ ";accessToken="
> import java.util.Properties
> val connectionProperties = new Properties()
> val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
> connectionProperties.setProperty("Driver", driverClass)
> connectionProperties.setProperty("accesstoken", token)
> val sql_pushdown = "(select UNITS from payment_balance_new) emp_alias"
> val df_stripe_dispute = spark.read.option("connectRetryCount", 
> 200).option("numPartitions",1).jdbc(url=jdbcUrl, table=sql_pushdown, 
> properties=connectionProperties)
> df_stripe_dispute.count()
> {code}
>  
>  
> The session was accidentally killed by some automatic scripts from the SQL server 
> side, but no errors show up on the Spark side and no failure was observed. 
> But from the count() result, the records are far fewer than they should be.
>  
> If I directly use the *Microsoft JDBC Driver for SQL Server* to run the 
> query and print the data out, which doesn't involve Spark, a connection reset 
> error is thrown.
> *codes:*
>  
> {code:java}
> %scala
> import java.sql.DriverManager
> import java.sql.Connection
> import java.util.Properties;
> val jdbcHostname = "xinrandatabseserver.database.windows.net"
> val jdbcDatabase = "xinranSQLDatabase"
> val jdbcPort = "1433"
> val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
> val token = ""
> val jdbcUrl = 
> "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname,
>  jdbcPort, jdbcDatabase)+ ";accessToken="+token
>  
> var connection:Connection = null
> val info:Properties = new Properties();
> info.setProperty("accesstoken", token);
>     
> // make the connection
> Class.forName(driver)
> connection = DriverManager.getConnection(jdbcUrl,info )
> // create the statement, and run the select query
> val statement = connection.createStatement()
> val resultSet = statement.executeQuery("select UNITS from 
> payment_balance_new")
> while ( resultSet.next() ) {
>   println("__"+resultSet.getString(1))
> }
> {code}
>  
> *errors:*
>  
> {code:java}
> com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2998)
>  at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2034) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6446) at 
> com.microsoft.sqlserver.jdbc.TDSReader.nextPacket(IOBuffer.java:6396) at 
> com.microsoft.sqlserver.jdbc.TDSReader.ensurePayload(IOBuffer.java:6374) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readBytes(IOBuffer.java:6675) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readWrappedBytes(IOBuffer.java:6696) 
> at com.microsoft.sqlserver.jdbc.TDSReader.readInt(IOBuffer.java:6645) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readUnsignedInt(IOBuffer.java:6659) at 
> com.microsoft.sqlserver.jdbc.PLPInputStream.readBytesInternal(PLPInputStream.java:309)
>  at 
> com.microsoft.sqlserver.jdbc.PLPInputStream.getBytes(PLPInputStream.java:105) 
> at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:757) at 
> com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:3748) at 
> com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:247) at 
> com.microsoft.sqlserver.jdbc.Column.ge

[jira] [Commented] (SPARK-39257) use spark.read.jdbc() to read data from SQL databse into dataframe, it fails silently, when the session is killed from SQL server side

2022-12-05 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643501#comment-17643501
 ] 

Sandeep Katta commented on SPARK-39257:
---

Closing this JIRA as this is fixed by mssql-jdbc PR 
[1942|https://github.com/microsoft/mssql-jdbc/pull/1942].

> use spark.read.jdbc() to read data from SQL databse into dataframe, it fails 
> silently, when the session is killed from SQL server side
> --
>
> Key: SPARK-39257
> URL: https://issues.apache.org/jira/browse/SPARK-39257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.2, 3.2.1
> Environment: {*}Spark version{*}: spark 3.0.1/3.1.2/3.2.1
> *Microsoft JDBC Driver* *for SQL server:* 
> mssql-jdbc-8.2.1.jre8/mssql-jdbc-10.2.1.jre8.jar
>Reporter: Xinran Tao
>Priority: Major
>
> I'm using *spark.read.jdbc()* to read from a SQL database into a dataframe, 
> which utilizes the *Microsoft JDBC Driver for SQL Server* to get data from the 
> SQL server.
> *codes:*
>  
> {code:java}
> %scala
> val token = "xxx"
> val jdbcHostname = "xinrandatabseserver.database.windows.net"
> val jdbcDatabase = "xinranSQLDatabase"
> val jdbcPort = 1433
> val jdbcUrl = 
> "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname,
>  jdbcPort, jdbcDatabase)+ ";accessToken="
> import java.util.Properties
> val connectionProperties = new Properties()
> val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
> connectionProperties.setProperty("Driver", driverClass)
> connectionProperties.setProperty("accesstoken", token)
> val sql_pushdown = "(select UNITS from payment_balance_new) emp_alias"
> val df_stripe_dispute = spark.read.option("connectRetryCount", 
> 200).option("numPartitions",1).jdbc(url=jdbcUrl, table=sql_pushdown, 
> properties=connectionProperties)
> df_stripe_dispute.count()
> {code}
>  
>  
> The session was accidentally killed by some automatic scripts from the SQL server 
> side, but no errors show up on the Spark side and no failure was observed. 
> But from the count() result, the records are far fewer than they should be.
>  
> If I directly use the *Microsoft JDBC Driver for SQL Server* to run the 
> query and print the data out, which doesn't involve Spark, a connection reset 
> error is thrown.
> *codes:*
>  
> {code:java}
> %scala
> import java.sql.DriverManager
> import java.sql.Connection
> import java.util.Properties;
> val jdbcHostname = "xinrandatabseserver.database.windows.net"
> val jdbcDatabase = "xinranSQLDatabase"
> val jdbcPort = "1433"
> val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
> val token = ""
> val jdbcUrl = 
> "jdbc:sqlserver://%s:%s;databaseName=%s;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net".format(jdbcHostname,
>  jdbcPort, jdbcDatabase)+ ";accessToken="+token
>  
> var connection:Connection = null
> val info:Properties = new Properties();
> info.setProperty("accesstoken", token);
>     
> // make the connection
> Class.forName(driver)
> connection = DriverManager.getConnection(jdbcUrl,info )
> // create the statement, and run the select query
> val statement = connection.createStatement()
> val resultSet = statement.executeQuery("select UNITS from 
> payment_balance_new")
> while ( resultSet.next() ) {
>   println("__"+resultSet.getString(1))
> }
> {code}
>  
> *errors:*
>  
> {code:java}
> com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2998)
>  at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:2034) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:6446) at 
> com.microsoft.sqlserver.jdbc.TDSReader.nextPacket(IOBuffer.java:6396) at 
> com.microsoft.sqlserver.jdbc.TDSReader.ensurePayload(IOBuffer.java:6374) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readBytes(IOBuffer.java:6675) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readWrappedBytes(IOBuffer.java:6696) 
> at com.microsoft.sqlserver.jdbc.TDSReader.readInt(IOBuffer.java:6645) at 
> com.microsoft.sqlserver.jdbc.TDSReader.readUnsignedInt(IOBuffer.java:6659) at 
> com.microsoft.sqlserver.jdbc.PLPInputStream.readBytesInternal(PLPInputStream.java:309)
>  at 
> com.microsoft.sqlserver.jdbc.PLPInputStream.getBytes(PLPInputStream.java:105) 
> at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:757) at 
> com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:3748) at 
> com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:247) at 
> com.microsoft.sqlserver.jdbc.Column.getValu

[jira] [Commented] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin

2022-12-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643492#comment-17643492
 ] 

Steve Loughran commented on SPARK-41392:


MBP m1 with

{code}
 uname -a
Darwin stevel-MBP16 21.6.0 Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:56 
PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T6000 arm64

{code}

java 8

{code}
 java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (Zulu 8.60.0.21-CA-macos-aarch64) (build 
1.8.0_322-b06)
OpenJDK 64-Bit Server VM (Zulu 8.60.0.21-CA-macos-aarch64) (build 25.322-b06, 
mixed mode)

{code}
build/mvn invokes the Homebrew Maven, which I run at -T 1 because sometimes the 
build hangs (a Maven bug, presumably):

{code}
build/mvn -v
Using `mvn` from path: /opt/homebrew/bin/mvn
Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
Maven home: /opt/homebrew/Cellar/maven/3.8.6/libexec
Java version: 1.8.0_322, vendor: Azul Systems, Inc., runtime: 
/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "mac os x", version: "12.6.1", arch: "aarch64", family: "mac"
{code}

This setup works with older Hadoop releases (including the forthcoming 3.3.5); 
somehow the plugin can't cope with the trunk build.


> spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
> ---
>
> Key: SPARK-41392
> URL: https://issues.apache.org/jira/browse/SPARK-41392
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Steve Loughran
>Priority: Minor
>
> on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE
> {code}
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> {code}
> full stack
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
> [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
> [ERROR] urls[0] = 
> file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
> [ERROR] urls[1] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
> [ERROR] urls[2] = 
> file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
> [ERROR] urls[3] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
> [ERROR] urls[4] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
> [ERROR] urls[5] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
> [ERROR] urls[6] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
> [ERROR] urls[7] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
> [ERROR] urls[8] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
> [ERROR] urls[9] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
> [ERROR] urls[10] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
> [ERROR] urls[11] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
> [ERROR] urls[12] = 
> file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
> [ERROR] urls[13] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
> [ERROR] urls[14] = 
> file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
> [ERROR] urls[15] = 
> file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
> [ERROR] urls[16] = 
> file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
> [ERROR] urls[17] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
> [ERROR] urls[18] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
> [ERROR] urls[19] = 
> file:/Users/st

[jira] [Commented] (SPARK-41372) Support DataFrame TempView

2022-12-05 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643473#comment-17643473
 ] 

Xinrong Meng commented on SPARK-41372:
--

Resolved by https://github.com/apache/spark/pull/38891.

> Support DataFrame TempView
> --
>
> Key: SPARK-41372
> URL: https://issues.apache.org/jira/browse/SPARK-41372
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41372) Support DataFrame TempView

2022-12-05 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-41372.
--
Resolution: Resolved

> Support DataFrame TempView
> --
>
> Key: SPARK-41372
> URL: https://issues.apache.org/jira/browse/SPARK-41372
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643461#comment-17643461
 ] 

Apache Spark commented on SPARK-40419:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38919

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We ported Python UDFs, Scala UDFs and Scalar Pandas UDFs into the SQL test 
> cases in SPARK-27921, but Grouped Aggregate Pandas UDFs are not tested from 
> SQL at all.
> We should leverage the same approach to cover grouped aggregate Pandas UDFs 
> as well.
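For reference, a minimal sketch of the kind of function such tests would exercise: a grouped-aggregate Pandas UDF registered for use from SQL (a sketch assuming PySpark 3.x with pandas and pyarrow installed; the names are illustrative):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def pandas_mean(v: pd.Series) -> float:
    # Series-to-scalar type hints make this a grouped aggregate Pandas UDF
    return v.mean()

# Register it so it can be called from SQL, as the *.sql test cases would do.
spark.udf.register("pandas_mean", pandas_mean)

spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"]) \
    .createOrReplaceTempView("t")
spark.sql("SELECT id, pandas_mean(v) AS mean_v FROM t GROUP BY id").show()
{code}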



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin

2022-12-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643414#comment-17643414
 ] 

Yang Jie commented on SPARK-41392:
--

Could you share the complete build command and the toolchain used, for 
example the Java and Maven versions?
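For reference, an invocation along these lines is what typically drives such a build (the exact flags are an assumption on my part; please post the actual command):

{noformat}
# assumed build command - adjust to the one actually used
./build/mvn -DskipTests -Dhadoop.version=3.4.0-SNAPSHOT clean install
# toolchain versions being asked about
java -version
mvn -version
{noformat}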

 

> spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin
> ---
>
> Key: SPARK-41392
> URL: https://issues.apache.org/jira/browse/SPARK-41392
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Steve Loughran
>Priority: Minor
>
> on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE
> {code}
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> {code}
> full stack
> {code}
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile 
> (scala-test-compile-first) on project spark-sql_2.12: Execution 
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required 
> class was missing while executing 
> net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
> org/bouncycastle/jce/provider/BouncyCastleProvider
> [ERROR] -
> [ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
> [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
> [ERROR] urls[0] = 
> file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
> [ERROR] urls[1] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
> [ERROR] urls[2] = 
> file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
> [ERROR] urls[3] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
> [ERROR] urls[4] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
> [ERROR] urls[5] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
> [ERROR] urls[6] = 
> file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
> [ERROR] urls[7] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
> [ERROR] urls[8] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
> [ERROR] urls[9] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
> [ERROR] urls[10] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
> [ERROR] urls[11] = 
> file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
> [ERROR] urls[12] = 
> file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
> [ERROR] urls[13] = 
> file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
> [ERROR] urls[14] = 
> file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
> [ERROR] urls[15] = 
> file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
> [ERROR] urls[16] = 
> file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
> [ERROR] urls[17] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
> [ERROR] urls[18] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
> [ERROR] urls[19] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar
> [ERROR] urls[20] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar
> [ERROR] urls[21] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar
> [ERROR] urls[22] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar
> [ERROR] urls[23] = 
> file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar
> [ERROR] urls[24] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar
> [ERROR] urls[25] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar
> [ERROR] urls[26] = 
> file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar
> [E

[jira] [Assigned] (SPARK-41393) Upgrade slf4j to 2.0.5

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41393:


Assignee: Apache Spark

> Upgrade slf4j to 2.0.5
> --
>
> Key: SPARK-41393
> URL: https://issues.apache.org/jira/browse/SPARK-41393
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://www.slf4j.org/news.html#2.0.5



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41393) Upgrade slf4j to 2.0.5

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41393:


Assignee: (was: Apache Spark)

> Upgrade slf4j to 2.0.5
> --
>
> Key: SPARK-41393
> URL: https://issues.apache.org/jira/browse/SPARK-41393
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html#2.0.5



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41393) Upgrade slf4j to 2.0.5

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643411#comment-17643411
 ] 

Apache Spark commented on SPARK-41393:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38918

> Upgrade slf4j to 2.0.5
> --
>
> Key: SPARK-41393
> URL: https://issues.apache.org/jira/browse/SPARK-41393
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://www.slf4j.org/news.html#2.0.5



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41393) Upgrade slf4j to 2.0.5

2022-12-05 Thread Yang Jie (Jira)
Yang Jie created SPARK-41393:


 Summary: Upgrade slf4j to 2.0.5
 Key: SPARK-41393
 URL: https://issues.apache.org/jira/browse/SPARK-41393
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


https://www.slf4j.org/news.html#2.0.5



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41389) Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044`

2022-12-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41389.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38913
[https://github.com/apache/spark/pull/38913]

> Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044`
> ---
>
> Key: SPARK-41389
> URL: https://issues.apache.org/jira/browse/SPARK-41389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41389) Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044`

2022-12-05 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41389:


Assignee: Yang Jie

> Reuse `WRONG_NUM_ARGS` instead of `_LEGACY_ERROR_TEMP_1044`
> ---
>
> Key: SPARK-41389
> URL: https://issues.apache.org/jira/browse/SPARK-41389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40642) wrong doc on memory tuning regarding String object memory size, changed since version>=9

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40642.
--
Resolution: Won't Fix

> wrong doc on memory tuning regarding String object memory size, changed since 
> version>=9
> 
>
> Key: SPARK-40642
> URL: https://issues.apache.org/jira/browse/SPARK-40642
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2
>Reporter: Arnaud Nauwynck
>Priority: Trivial
>
> The documentation is wrong regarding memory consumption of java.lang.String
> https://spark.apache.org/docs/latest/tuning.html#memory-tuning
> internally, the source for this doc section is written here:
> https://github.com/apache/spark/blob/master/docs/tuning.md?plain=1#L100
> {noformat}
> * Java `String`s have about 40 bytes of overhead over the raw string data 
> (since they store it in an
>   array of `Char`s and keep extra data such as the length), and store each 
> character
>   as *two* bytes due to `String`'s internal usage of UTF-16 encoding. Thus a 
> 10-character string can
>   easily consume 60 bytes.
> {noformat}
> Reason: since Java 9, the JVM has optimized the problem described in the doc.
> It used to be ~16 bytes of header plus the characters stored internally as 
> UTF-16 (two bytes each).
> Note that before JDK 9 (since JDK 6) there was also an internal HotSpot flag, 
> -XX:+UseCompressedStrings, but it was not enabled by default.
> Since OpenJDK 9, with the implementation of JEP 254 ( 
> https://openjdk.org/jeps/254 ), strings are stored internally with one byte 
> per character when they contain only Latin-1 text, and as UTF-16 otherwise. 
> There is now an extra byte field ("coder") in java.lang.String that records 
> which encoding is used.
> This field is described here in the OpenJDK source code: 
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L170
> The computation for the memory size of a String used to be "40 + 2*charCount"; 
> it is now "44 + 1*charCount" for Latin-1 text, and "44 + 2*charCount" 
> otherwise.
> The object overhead is 44 rather than 40 + 1 because of alignment when the 
> extra "byte" field is added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40141) Task listener overloads no longer needed with JDK 8+

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40141:
-
Priority: Minor  (was: Major)

> Task listener overloads no longer needed with JDK 8+
> 
>
> Key: SPARK-40141
> URL: https://issues.apache.org/jira/browse/SPARK-40141
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ryan Johnson
>Priority: Minor
>
> TaskContext defines methods for registering completion and failure listeners, 
> and the respective listener types qualify as functional interfaces in JDK 8+. 
> This leads to awkward ambiguous-overload errors with the overload of each 
> method that takes a function directly instead of a listener. Now that JDK 8 
> is the minimum allowed, we can remove the unnecessary overloads, which not 
> only simplifies the code but also removes a source of frustration, since it 
> can be nearly impossible to predict when an ambiguous overload will be 
> triggered.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39948) exclude velocity 1.5 jar

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39948.
--
Resolution: Not A Problem

> exclude velocity 1.5 jar
> 
>
> Key: SPARK-39948
> URL: https://issues.apache.org/jira/browse/SPARK-39948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: melin
>Priority: Major
>
> hive-exec pulls in Velocity as a transitive dependency. The Velocity version 
> is old and has many known security issues.
> https://issues.apache.org/jira/browse/HIVE-25726
>  
> !image-2022-08-02-14-05-55-756.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40141) Task listener overloads no longer needed with JDK 8+

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40141.
--
Resolution: Won't Fix

> Task listener overloads no longer needed with JDK 8+
> 
>
> Key: SPARK-40141
> URL: https://issues.apache.org/jira/browse/SPARK-40141
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Ryan Johnson
>Priority: Major
>
> TaskContext defines methods for registering completion and failure listeners, 
> and the respective listener types qualify as functional interfaces in JDK 8+. 
> This leads to awkward ambiguous-overload errors with the overload of each 
> method that takes a function directly instead of a listener. Now that JDK 8 
> is the minimum allowed, we can remove the unnecessary overloads, which not 
> only simplifies the code but also removes a source of frustration, since it 
> can be nearly impossible to predict when an ambiguous overload will be 
> triggered.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40284) spark concurrent overwrite mode writes data to files in HDFS format, all request data write success

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40284.
--
Resolution: Not A Problem

> spark  concurrent overwrite mode writes data to files in HDFS format, all 
> request data write success
> 
>
> Key: SPARK-40284
> URL: https://issues.apache.org/jira/browse/SPARK-40284
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Liu
>Priority: Major
>
> We use Spark as a service: the same Spark service needs to handle multiple 
> requests, but I hit a problem with this.
> When multiple requests overwrite the same directory at the same time, the 
> results of both overwrite requests may end up written successfully. I think 
> this does not match the semantics of an overwrite.
> First I ran write SQL1, then write SQL2, and I found that both results were 
> present in the end, which seems unreasonable.
> {code:java}
> sparkSession.udf.register("sleep",  (time: Long) => Thread.sleep(time))
> -- write sql1
> sparkSession.sql("select 1 as id, sleep(4) as 
> time").write.mode(SaveMode.Overwrite).parquet("path")
> -- write sql2
>  sparkSession.sql("select 2 as id, 1 as 
> time").write.mode(SaveMode.Overwrite).parquet("path") {code}
> Reading the Spark source, I saw that all of this logic lives in the 
> InsertIntoHadoopFsRelationCommand class.
>  
> When the target directory already exists, Spark deletes it directly and then 
> writes to the _temporary directory it creates for the request. However, when 
> multiple requests write concurrently, the data from all of them can end up 
> appended. For the write SQL above, the following sequence occurs:
> 1. Write SQL1 executes; Spark creates the _temporary directory for SQL1 and 
> continues.
> 2. Write SQL2 executes; Spark deletes the entire target directory and creates 
> its own _temporary directory.
> 3. SQL2 writes its data.
> 4. SQL1 finishes its computation, but its _temporary/0/attempt_id directory 
> no longer exists, so the request fails. The task is retried, and because the 
> _temporary directory is not deleted on retry, the result of SQL1 ends up 
> appended to the target directory.
>  
> Given the above, could Spark do a directory check before the write task, or 
> use some other mechanism, to avoid this kind of problem?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40253) Data read exception in orc format

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40253.
--
Resolution: Won't Fix

>  Data read exception in orc format
> --
>
> Key: SPARK-40253
> URL: https://issues.apache.org/jira/browse/SPARK-40253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>Reporter: yihangqiao
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
> stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
> offset: 0 limit: 0
> When running batches with spark-sql and using the "create table xxx as 
> select" syntax, the select part uses a static value as the default (0.00 as 
> column_name) without specifying the data type of that value. Because the 
> data type is not explicit, the field's metadata in the written ORC file is 
> incomplete (the write itself succeeds), but on read, any query that touches 
> this column fails to parse the ORC file, and the following error occurs:
>  
> {code:java}
> create table testgg as select 0.00 as gg;select * from testgg;Caused by: 
> java.io.IOException: Error reading file: 
> viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-0-e7df51a1-98b9-4472-9899-3c132b97885b-c000
>        at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)    
>    at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
>        at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
>        at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>        at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>        at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)       at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>        at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>        at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>        at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at 
> org.apache.spark.scheduler.Task.run(Task.scala:121)       at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)   
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)Caused by: 
> java.io.EOFException: Read past end of RLE integer from compressed stream 
> Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 
> limit: 0       at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>        at 
> org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205)
>        at 
> org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1279)

[jira] [Resolved] (SPARK-40286) Load Data from S3 deletes data source file

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40286.
--
Resolution: Not A Problem

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using Spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a Hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file gets wiped from the directory. The file is found and does 
> populate the table with data. I also tried adding the `LOCAL` clause, but 
> that throws an error when looking for the file. The documentation doesn't 
> explicitly state that deleting the source file is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36853) Code failing on checkstyle

2022-12-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36853.
--
Resolution: Won't Fix

> Code failing on checkstyle
> --
>
> Key: SPARK-36853
> URL: https://issues.apache.org/jira/browse/SPARK-36853
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Abhinav Kumar
>Priority: Trivial
> Attachments: image-2021-10-18-01-57-00-714.png, 
> spark_mvn_clean_install_skip_tests_in_windows.log
>
>
> There are more - just pasting a sample:
>  
> [INFO] There are 32 errors reported by Checkstyle 8.43 with 
> dev/checkstyle.xml ruleset.
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 107).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 116).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 104).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 125).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 109).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 114).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 143).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 119).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 152).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 124).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 161).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 129).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 170).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 179).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 139).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 188).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 144).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 197).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 149).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 206).
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] 
> (naming) MethodName: Method name 'Once' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] 
> (naming) MethodName: Method name 'AvailableNow' must match pattern 
> '^[a-z][a-z0

[jira] [Commented] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-12-05 Thread Ahmed Mahran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643374#comment-17643374
 ] 

Ahmed Mahran commented on SPARK-41008:
--

Thanks, I'll try to have a PR ready in a couple of days.

> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Minor
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 
> IsotonicRegression_pyspark
> # The P(positives | model_score):
> # 0.6 -> 0.5 (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20 -> 0.25 (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
> "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],   
>       
> "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],         
> "weight": 1,     }
> )
> # The fraction of positives for each of the distinct model_scores would be 
> the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 
> 0.25]
> # The sklearn implementation of Isotonic Regression. 
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> tc_regressor_sklearn = 
> IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], 
> sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]
> # The pyspark implementation of Isotonic Regression. 
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', 
> F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = 
> IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', 
> weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]
> # The result from the pyspark implementation seems incorrect. Similar small 
> # toy examples lead to similarly unexpected results from the pyspark 
> # implementation.
> # Strangely enough, for 'large' datasets, the difference between the 
> # calibrated model_scores generated by the two implementations disappears. 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-12-05 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643373#comment-17643373
 ] 

Sean R. Owen commented on SPARK-41008:
--

No need for an option; this seems like a bug fix. Yes, if you can propose a 
pull request that fixes it, by all means.

> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Minor
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 
> IsotonicRegression_pyspark
> # The P(positives | model_score):
> # 0.6 -> 0.5 (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20 -> 0.25 (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
> "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],   
>       
> "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],         
> "weight": 1,     }
> )
> # The fraction of positives for each of the distinct model_scores would be 
> the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 
> 0.25]
> # The sklearn implementation of Isotonic Regression. 
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> tc_regressor_sklearn = 
> IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], 
> sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]
> # The pyspark implementation of Isotonic Regression. 
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', 
> F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = 
> IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', 
> weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]
> # The result from the pyspark implementation seems incorrect. Similar small 
> # toy examples lead to similarly unexpected results from the pyspark 
> # implementation.
> # Strangely enough, for 'large' datasets, the difference between the 
> # calibrated model_scores generated by the two implementations disappears. 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41167) Optimize LikeSimplification rule to improve multi like performance

2022-12-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-41167.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38682
[https://github.com/apache/spark/pull/38682]

> Optimize LikeSimplification rule to improve multi like performance
> --
>
> Key: SPARK-41167
> URL: https://issues.apache.org/jira/browse/SPARK-41167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
> Fix For: 3.4.0
>
>
> We can improve multi-like performance by reordering the match expressions.
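As an illustration of the kind of multi-LIKE predicate involved (a sketch of my own, not the benchmark from the pull request), LikeSimplification already rewrites simple prefix/suffix/contains patterns into cheaper string operations, and the optimized plan can be inspected like this:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("https://spark.apache.org/docs/latest/index.html",),
     ("file:///tmp/static/logo.png",)],
    ["url"],
)

# Several LIKE patterns against one column; check the optimized plan to see
# which of them become StartsWith/EndsWith/Contains after LikeSimplification.
df.filter(
    "url LIKE 'https://%' OR url LIKE '%.html' OR url LIKE '%/static/%'"
).explain(True)
{code}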



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41167) Optimize LikeSimplification rule to improve multi like performance

2022-12-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-41167:
---

Assignee: Wan Kun

> Optimize LikeSimplification rule to improve multi like performance
> --
>
> Key: SPARK-41167
> URL: https://issues.apache.org/jira/browse/SPARK-41167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Assignee: Wan Kun
>Priority: Major
>
> We can improve multi-like performance by reordering the match expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2022-12-05 Thread Maziyar PANAHI (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643344#comment-17643344
 ] 

Maziyar PANAHI commented on SPARK-32530:


Not sure if this matters, but as a Scala developer who primarily builds Scala 
applications to use Apache Spark natively, I strongly support the decision to 
make this an official part of the ASF project.

I also agree there is a maintenance cost; however, unlike .NET, it is much 
easier for any of us from the Java/Scala world to contribute to Kotlin. I 
think it is a price worth paying for the sake of longevity. Java and Scala are 
clearly not going anywhere, but they are not the first choice for newcomers 
either. More JVM-native languages like Kotlin can really help bring more users 
and contributors to the Spark ecosystem in the long term.

> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won’t get enough popularity and will 
> bring more costs than benefits. It can be mitigated by the fact that we don't 
> need to change any existing API and support can be potentially dropped at any 
> time.
> We also believe that existing API is rather low maintenance. It does not 
> bring anything more complex than already exists in the Spark codebase. 
> Furthermore, the implementation is compact - less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on 
> feedback from both Spark and Kotlin communities. As the Kotlin data community 
> continues to grow, we see Kotlin API for Apache Spark as an important part in 
> the evolving Kotlin ecosystem, and intend to fully support it. 
> h2. How long will it take?
> A  working implementation is already available, and if the community will 
> have any proposal of changes for this implementation to be improved, these 
> can be implemented quickly — in weeks if not days.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41392) spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in scala-maven plugin

2022-12-05 Thread Steve Loughran (Jira)
Steve Loughran created SPARK-41392:
--

 Summary: spark builds against hadoop trunk/3.4.0-SNAPSHOT fail in 
scala-maven plugin
 Key: SPARK-41392
 URL: https://issues.apache.org/jira/browse/SPARK-41392
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.4.0
Reporter: Steve Loughran


on hadoop trunk (but not the 3.3.x line), spark builds fail with a CNFE

{code}
net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
org/bouncycastle/jce/provider/BouncyCastleProvider

{code}

full stack

{code}
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile 
(scala-test-compile-first) on project spark-sql_2.12: Execution 
scala-test-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile failed: A required 
class was missing while executing 
net.alchim31.maven:scala-maven-plugin:4.7.2:testCompile: 
org/bouncycastle/jce/provider/BouncyCastleProvider
[ERROR] -
[ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.7.2
[ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy
[ERROR] urls[0] = 
file:/Users/stevel/.m2/repository/net/alchim31/maven/scala-maven-plugin/4.7.2/scala-maven-plugin-4.7.2.jar
[ERROR] urls[1] = 
file:/Users/stevel/.m2/repository/org/apache/maven/shared/maven-dependency-tree/3.2.0/maven-dependency-tree-3.2.0.jar
[ERROR] urls[2] = 
file:/Users/stevel/.m2/repository/org/eclipse/aether/aether-util/1.0.0.v20140518/aether-util-1.0.0.v20140518.jar
[ERROR] urls[3] = 
file:/Users/stevel/.m2/repository/org/apache/maven/reporting/maven-reporting-api/3.1.1/maven-reporting-api-3.1.1.jar
[ERROR] urls[4] = 
file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-sink-api/1.11.1/doxia-sink-api-1.11.1.jar
[ERROR] urls[5] = 
file:/Users/stevel/.m2/repository/org/apache/maven/doxia/doxia-logging-api/1.11.1/doxia-logging-api-1.11.1.jar
[ERROR] urls[6] = 
file:/Users/stevel/.m2/repository/org/apache/maven/maven-archiver/3.6.0/maven-archiver-3.6.0.jar
[ERROR] urls[7] = 
file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-io/3.4.0/plexus-io-3.4.0.jar
[ERROR] urls[8] = 
file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-interpolation/1.26/plexus-interpolation-1.26.jar
[ERROR] urls[9] = 
file:/Users/stevel/.m2/repository/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
[ERROR] urls[10] = 
file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-utils/3.4.2/plexus-utils-3.4.2.jar
[ERROR] urls[11] = 
file:/Users/stevel/.m2/repository/org/codehaus/plexus/plexus-archiver/4.5.0/plexus-archiver-4.5.0.jar
[ERROR] urls[12] = 
file:/Users/stevel/.m2/repository/commons-io/commons-io/2.11.0/commons-io-2.11.0.jar
[ERROR] urls[13] = 
file:/Users/stevel/.m2/repository/org/apache/commons/commons-compress/1.21/commons-compress-1.21.jar
[ERROR] urls[14] = 
file:/Users/stevel/.m2/repository/org/iq80/snappy/snappy/0.4/snappy-0.4.jar
[ERROR] urls[15] = 
file:/Users/stevel/.m2/repository/org/tukaani/xz/1.9/xz-1.9.jar
[ERROR] urls[16] = 
file:/Users/stevel/.m2/repository/com/github/luben/zstd-jni/1.5.2-4/zstd-jni-1.5.2-4.jar
[ERROR] urls[17] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/zinc_2.13/1.7.1/zinc_2.13-1.7.1.jar
[ERROR] urls[18] = 
file:/Users/stevel/.m2/repository/org/scala-lang/scala-library/2.13.8/scala-library-2.13.8.jar
[ERROR] urls[19] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-core_2.13/1.7.1/zinc-core_2.13-1.7.1.jar
[ERROR] urls[20] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-apiinfo_2.13/1.7.1/zinc-apiinfo_2.13-1.7.1.jar
[ERROR] urls[21] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-bridge_2.13/1.7.1/compiler-bridge_2.13-1.7.1.jar
[ERROR] urls[22] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-classpath_2.13/1.7.1/zinc-classpath_2.13-1.7.1.jar
[ERROR] urls[23] = 
file:/Users/stevel/.m2/repository/org/scala-lang/scala-compiler/2.13.8/scala-compiler-2.13.8.jar
[ERROR] urls[24] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/compiler-interface/1.7.1/compiler-interface-1.7.1.jar
[ERROR] urls[25] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/util-interface/1.7.0/util-interface-1.7.0.jar
[ERROR] urls[26] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/zinc-persist-core-assembly/1.7.1/zinc-persist-core-assembly-1.7.1.jar
[ERROR] urls[27] = 
file:/Users/stevel/.m2/repository/org/scala-lang/modules/scala-parallel-collections_2.13/0.2.0/scala-parallel-collections_2.13-0.2.0.jar
[ERROR] urls[28] = 
file:/Users/stevel/.m2/repository/org/scala-sbt/io_2.13/1.7.0/io_2.13-1.7.0.jar
[ERROR] urls[29] = 
file:/Users/stevel/.m2/repository/com/swoval/file-tree-views/2.1.9/file-tree-views-2.1.9.jar
[ERROR] urls[30] = 
file:/Users/stevel/.m2/repository/net/java/dev/jna/jna/5.12.0/jna-5.12.0.jar
[ERROR] urls[31] = 
file:/Users/stevel/.m2/repository/net/java/dev/jna/

[jira] [Commented] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643297#comment-17643297
 ] 

Apache Spark commented on SPARK-41391:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38917

> The output column name of `groupBy.agg(count_distinct)` is incorrect
> 
>
> Key: SPARK-41391
> URL: https://issues.apache.org/jira/browse/SPARK-41391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
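The ticket has no description; a minimal way to observe the generated column name (a sketch assuming PySpark 3.2+, where count_distinct is available) would be:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import count_distinct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "a")], ["id", "value"])

result = df.groupBy("id").agg(count_distinct("value"))
# Inspect the auto-generated name of the aggregate column.
print(result.columns)
{code}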




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643296#comment-17643296
 ] 

Apache Spark commented on SPARK-41391:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38917

> The output column name of `groupBy.agg(count_distinct)` is incorrect
> 
>
> Key: SPARK-41391
> URL: https://issues.apache.org/jira/browse/SPARK-41391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41391:


Assignee: (was: Apache Spark)

> The output column name of `groupBy.agg(count_distinct)` is incorrect
> 
>
> Key: SPARK-41391
> URL: https://issues.apache.org/jira/browse/SPARK-41391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41391:


Assignee: Apache Spark

> The output column name of `groupBy.agg(count_distinct)` is incorrect
> 
>
> Key: SPARK-41391
> URL: https://issues.apache.org/jira/browse/SPARK-41391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41391) The output column name of `groupBy.agg(count_distinct)` is incorrect

2022-12-05 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-41391:
--
Summary: The output column name of `groupBy.agg(count_distinct)` is 
incorrect  (was: The output column name of `groupby.agg(count_distinct)` is 
incorrect)

> The output column name of `groupBy.agg(count_distinct)` is incorrect
> 
>
> Key: SPARK-41391
> URL: https://issues.apache.org/jira/browse/SPARK-41391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41391) The output column name of `groupby.agg(count_distinct)` is incorrect

2022-12-05 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41391:
-

 Summary: The output column name of `groupby.agg(count_distinct)` 
is incorrect
 Key: SPARK-41391
 URL: https://issues.apache.org/jira/browse/SPARK-41391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41390) Update the script used to generate register function in UDFRegistration

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41390:


Assignee: Apache Spark

> Update the script used to generate register function in UDFRegistration 
> 
>
> Key: SPARK-41390
> URL: https://issues.apache.org/jira/browse/SPARK-41390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-35065 switched the {{register}} functions in {{UDFRegistration}} to use 
> {{QueryCompilationErrors.invalidFunctionArgumentsError}} instead of 
> {{throw new AnalysisException(...)}}, but the script used to generate xx was 
> not updated, so this PR updates the script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41390) Update the script used to generate register function in UDFRegistration

2022-12-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41390:


Assignee: (was: Apache Spark)

> Update the script used to generate register function in UDFRegistration 
> 
>
> Key: SPARK-41390
> URL: https://issues.apache.org/jira/browse/SPARK-41390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> SPARK-35065 switched the {{register}} functions in {{UDFRegistration}} to use 
> {{QueryCompilationErrors.invalidFunctionArgumentsError}} instead of 
> {{throw new AnalysisException(...)}}, but the script used to generate xx was 
> not updated, so this PR updates the script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41390) Update the script used to generate register function in UDFRegistration

2022-12-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643279#comment-17643279
 ] 

Apache Spark commented on SPARK-41390:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38916

> Update the script used to generate register function in UDFRegistration 
> 
>
> Key: SPARK-41390
> URL: https://issues.apache.org/jira/browse/SPARK-41390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> SPARK-35065 switched the {{register}} functions in {{UDFRegistration}} to use 
> {{QueryCompilationErrors.invalidFunctionArgumentsError}} instead of 
> {{throw new AnalysisException(...)}}, but the script used to generate xx was 
> not updated, so this PR updates the script.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


