[jira] [Created] (SPARK-37776) Upgrade silencer to 1.7.7

2021-12-28 Thread William Hyun (Jira)
William Hyun created SPARK-37776:


 Summary: Upgrade silencer to 1.7.7
 Key: SPARK-37776
 URL: https://issues.apache.org/jira/browse/SPARK-37776
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: William Hyun
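
Silencer (com.github.ghik) is the Scala compiler plugin used to suppress selected compiler warnings. As a rough sketch of what pinning 1.7.7 looks like in a plain sbt build (Spark wires the plugin through its own build, so the placement below is an assumption, not Spark's actual change):

{code:java}
// Minimal sbt sketch (assumed generic project, not Spark's own build wiring):
// pin the silencer compiler plugin and its companion annotation library to 1.7.7.
libraryDependencies ++= Seq(
  compilerPlugin("com.github.ghik" % "silencer-plugin" % "1.7.7" cross CrossVersion.full),
  "com.github.ghik" % "silencer-lib" % "1.7.7" % Provided cross CrossVersion.full
)
{code}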









[jira] [Resolved] (SPARK-37774) Upgrade log4j from 2.17 to 2.17.1

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37774.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35051
[https://github.com/apache/spark/pull/35051]

> Upgrade log4j from 2.17 to 2.17.1
> -
>
> Key: SPARK-37774
> URL: https://issues.apache.org/jira/browse/SPARK-37774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Major
> Fix For: 3.3.0
>
>
> There is another CVE ([CVE-2021-44832|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44832])
> in 2.17 (https://issues.apache.org/jira/browse/LOG4J2-3293).
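
For applications that pull in log4j 2.x transitively and cannot wait for a Spark release with this bump, a minimal sbt sketch of forcing the patched version (which log4j modules a given project actually needs is an assumption):

{code:java}
// build.sbt sketch: force log4j 2.17.1, which contains the fix for CVE-2021-44832.
// Adjust the module list to whatever your dependency tree really pulls in.
dependencyOverrides ++= Seq(
  "org.apache.logging.log4j" % "log4j-api"  % "2.17.1",
  "org.apache.logging.log4j" % "log4j-core" % "2.17.1"
)
{code}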






[jira] [Assigned] (SPARK-37774) Upgrade log4j from 2.17 to 2.17.1

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37774:
-

Assignee: Chia-Ping Tsai

> Upgrade log4j from 2.17 to 2.17.1
> -
>
> Key: SPARK-37774
> URL: https://issues.apache.org/jira/browse/SPARK-37774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Major
>
> There is another CVE ([CVE-2021-44832|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44832])
> in 2.17 (https://issues.apache.org/jira/browse/LOG4J2-3293).






[jira] [Created] (SPARK-37775) [PYSPARK] Fix mlflow doctest

2021-12-28 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-37775:
---

 Summary: [PYSPARK] Fix mlflow doctest
 Key: SPARK-37775
 URL: https://issues.apache.org/jira/browse/SPARK-37775
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
 Environment: {code:java}
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 149, in pyspark.pandas.mlflow.load_model
Failed example:
mlflow.set_experiment("my_experiment")
Expected nothing
Got:

**
   1 of  26 in pyspark.pandas.mlflow.load_model
***Test Failed*** 1 failures. {code}
Reporter: Yikun Jiang









[jira] [Updated] (SPARK-37775) [PYSPARK] Fix mlflow doctest

2021-12-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37775:

Description: 
{code:java}
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 149, in pyspark.pandas.mlflow.load_model
Failed example:
mlflow.set_experiment("my_experiment")
Expected nothing
Got:

**
   1 of  26 in pyspark.pandas.mlflow.load_model
***Test Failed*** 1 failures. {code}

> [PYSPARK] Fix mlflow doctest
> 
>
> Key: SPARK-37775
> URL: https://issues.apache.org/jira/browse/SPARK-37775
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:java}
> **
> File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 149, in pyspark.pandas.mlflow.load_model
> Failed example:
> mlflow.set_experiment("my_experiment")
> Expected nothing
> Got:
> artifact_location='file:///__w/spark/spark/python/target/5d5d5841-43b7-41c4-bc0a-1c5763b355ca/tmpfhad7kycpandas_on_spark_mlflow/0', experiment_id='0', lifecycle_stage='active', name='my_experiment', tags={}>
> **
>    1 of  26 in pyspark.pandas.mlflow.load_model
> ***Test Failed*** 1 failures. {code}






[jira] [Updated] (SPARK-37775) [PYSPARK] Fix mlflow doctest

2021-12-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37775:

Environment: (was: {code:java}
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 149, in pyspark.pandas.mlflow.load_model
Failed example:
mlflow.set_experiment("my_experiment")
Expected nothing
Got:

**
   1 of  26 in pyspark.pandas.mlflow.load_model
***Test Failed*** 1 failures. {code})

> [PYSPARK] Fix mlflow doctest
> 
>
> Key: SPARK-37775
> URL: https://issues.apache.org/jira/browse/SPARK-37775
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Commented] (SPARK-37644) Support datasource v2 complete aggregate pushdown

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466325#comment-17466325
 ] 

Apache Spark commented on SPARK-37644:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35052

> Support datasource v2 complete aggregate pushdown 
> --
>
> Key: SPARK-37644
> URL: https://issues.apache.org/jira/browse/SPARK-37644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, Spark supports aggregate push-down with a partial-agg and a final-agg.
> For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg by
> running the aggregation completely on the database.
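
As a rough illustration of what complete push-down enables (the connection details below are placeholders, not taken from this ticket), the DSv2 JDBC source can already hand simple grouped aggregates to the database when {{pushDownAggregate}} is enabled:

{code:java}
import org.apache.spark.sql.functions.max

// Hypothetical JDBC source; pushDownAggregate lets the DSv2 JDBC reader translate
// supported aggregates into SQL that the database executes itself.
val sales = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop")
  .option("dbtable", "sales")
  .option("pushDownAggregate", "true")
  .load()

// A grouped MAX like this can run entirely on the database side, so Spark would not
// need its own partial-agg/final-agg pair for it.
sales.groupBy("region").agg(max("amount")).show()
{code}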






[jira] [Updated] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-28 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-37578:

Issue Type: Improvement  (was: Bug)

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
>
> // Collect the output metrics reported by each finished task
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
>   val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>
>   // Both sums stay at 0 for this DSv2 write, which is the reported problem
>   assert(bytesWritten.sum > 0)
>   assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  






[jira] [Assigned] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-28 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-37578:
---

Assignee: L. C. Hsieh

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Assignee: L. C. Hsieh
>Priority: Major
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
>
> // Collect the output metrics reported by each finished task
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
>   val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>
>   // Both sums stay at 0 for this DSv2 write, which is the reported problem
>   assert(bytesWritten.sum > 0)
>   assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  






[jira] [Resolved] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-28 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-37578.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35028
[https://github.com/apache/spark/pull/35028]

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
>
> // Collect the output metrics reported by each finished task
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
>   val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>
>   // Both sums stay at 0 for this DSv2 write, which is the reported problem
>   assert(bytesWritten.sum > 0)
>   assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  






[jira] [Assigned] (SPARK-37774) Upgrade log4j from 2.17 to 2.17.1

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37774:


Assignee: (was: Apache Spark)

> Upgrade log4j from 2.17 to 2.17.1
> -
>
> Key: SPARK-37774
> URL: https://issues.apache.org/jira/browse/SPARK-37774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Priority: Major
>
> There is another CVE ([CVE-2021-44832|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44832])
> in 2.17 (https://issues.apache.org/jira/browse/LOG4J2-3293).






[jira] [Commented] (SPARK-37774) Upgrade log4j from 2.17 to 2.17.1

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466310#comment-17466310
 ] 

Apache Spark commented on SPARK-37774:
--

User 'chia7712' has created a pull request for this issue:
https://github.com/apache/spark/pull/35051

> Upgrade log4j from 2.17 to 2.17.1
> -
>
> Key: SPARK-37774
> URL: https://issues.apache.org/jira/browse/SPARK-37774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Priority: Major
>
> There is another CVE ([CVE-2021-44832|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44832])
> in 2.17 (https://issues.apache.org/jira/browse/LOG4J2-3293).






[jira] [Assigned] (SPARK-37774) Upgrade log4j from 2.17 to 2.17.1

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37774:


Assignee: Apache Spark

> Upgrade log4j from 2.17 to 2.17.1
> -
>
> Key: SPARK-37774
> URL: https://issues.apache.org/jira/browse/SPARK-37774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Assignee: Apache Spark
>Priority: Major
>
> There is another CVE ([CVE-2021-44832|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44832])
> in 2.17 (https://issues.apache.org/jira/browse/LOG4J2-3293).






[jira] [Assigned] (SPARK-37773) Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5

2021-12-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37773:


Assignee: Xinrong Meng

> Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5
> ---
>
> Key: SPARK-37773
> URL: https://issues.apache.org/jira/browse/SPARK-37773
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Disable certain doctests for pandas<=1.0.5






[jira] [Resolved] (SPARK-37773) Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5

2021-12-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37773.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35050
[https://github.com/apache/spark/pull/35050]

> Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5
> ---
>
> Key: SPARK-37773
> URL: https://issues.apache.org/jira/browse/SPARK-37773
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Disable certain doctests for pandas<=1.0.5






[jira] [Created] (SPARK-37774) Upgrade log4j from 2.17 to 2.17.1

2021-12-28 Thread Chia-Ping Tsai (Jira)
Chia-Ping Tsai created SPARK-37774:
--

 Summary: Upgrade log4j from 2.17 to 2.17.1
 Key: SPARK-37774
 URL: https://issues.apache.org/jira/browse/SPARK-37774
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Chia-Ping Tsai


There is another CVE ([CVE-2021-44832|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44832])
in 2.17 (https://issues.apache.org/jira/browse/LOG4J2-3293).






[jira] [Assigned] (SPARK-37773) Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37773:


Assignee: Apache Spark

> Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5
> ---
>
> Key: SPARK-37773
> URL: https://issues.apache.org/jira/browse/SPARK-37773
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Disable certain doctests for pandas<=1.0.5






[jira] [Assigned] (SPARK-37773) Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37773:


Assignee: (was: Apache Spark)

> Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5
> ---
>
> Key: SPARK-37773
> URL: https://issues.apache.org/jira/browse/SPARK-37773
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Disable certain doctests for pandas<=1.0.5






[jira] [Commented] (SPARK-37773) Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466305#comment-17466305
 ] 

Apache Spark commented on SPARK-37773:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/35050

> Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5
> ---
>
> Key: SPARK-37773
> URL: https://issues.apache.org/jira/browse/SPARK-37773
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Disable certain doctests for pandas<=1.0.5






[jira] [Updated] (SPARK-37773) Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5

2021-12-28 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37773:
-
Summary: Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5  
(was: Disable certain doctests for pandas<=1.0.5)

> Disable certain doctests of `ps.to_timedelta` for pandas<=1.0.5
> ---
>
> Key: SPARK-37773
> URL: https://issues.apache.org/jira/browse/SPARK-37773
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Disable certain doctests for pandas<=1.0.5






[jira] [Created] (SPARK-37773) Disable certain doctests for pandas<=1.0.5

2021-12-28 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-37773:


 Summary: Disable certain doctests for pandas<=1.0.5
 Key: SPARK-37773
 URL: https://issues.apache.org/jira/browse/SPARK-37773
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Disable certain doctests for pandas<=1.0.5






[jira] [Assigned] (SPARK-37766) Regenerate benchmark results

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37766:
-

Assignee: Dongjoon Hyun

> Regenerate benchmark results
> 
>
> Key: SPARK-37766
> URL: https://issues.apache.org/jira/browse/SPARK-37766
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
>







[jira] [Resolved] (SPARK-37766) Regenerate benchmark results

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37766.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35046
[https://github.com/apache/spark/pull/35046]

> Regenerate benchmark results
> 
>
> Key: SPARK-37766
> URL: https://issues.apache.org/jira/browse/SPARK-37766
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 3.3.0
>
>







[jira] [Updated] (SPARK-37758) [PYSPARK] Enable PySpark scheduled job on ARM based self-hosted runner

2021-12-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37758:

Summary: [PYSPARK] Enable PySpark scheduled job on ARM based self-hosted 
runner  (was: [Scala] Enable PySpark scheduled job on ARM based self-hosted 
runner)

> [PYSPARK] Enable PySpark scheduled job on ARM based self-hosted runner
> --
>
> Key: SPARK-37758
> URL: https://issues.apache.org/jira/browse/SPARK-37758
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Updated] (SPARK-37757) [Scala] Enable Spark test scheduled job on ARM based self-hosted runner

2021-12-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37757:

Summary: [Scala] Enable Spark test scheduled job on ARM based self-hosted 
runner  (was: Enable Spark test scheduled job on ARM based self-hosted runner)

> [Scala] Enable Spark test scheduled job on ARM based self-hosted runner
> ---
>
> Key: SPARK-37757
> URL: https://issues.apache.org/jira/browse/SPARK-37757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Updated] (SPARK-37758) [Scala] Enable PySpark scheduled job on ARM based self-hosted runner

2021-12-28 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-37758:

Summary: [Scala] Enable PySpark scheduled job on ARM based self-hosted 
runner  (was: Enable PySpark scheduled job on ARM based self-hosted runner)

> [Scala] Enable PySpark scheduled job on ARM based self-hosted runner
> 
>
> Key: SPARK-37758
> URL: https://issues.apache.org/jira/browse/SPARK-37758
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Created] (SPARK-37772) [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 docker image

2021-12-28 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-37772:
---

 Summary: [PYSPARK] Publish ApacheSparkGitHubActionImage arm64 
docker image
 Key: SPARK-37772
 URL: https://issues.apache.org/jira/browse/SPARK-37772
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.3.0
Reporter: Yikun Jiang









[jira] [Comment Edited] (SPARK-37758) Enable PySpark scheduled job on ARM based self-hosted runner

2021-12-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466294#comment-17466294
 ] 

Yikun Jiang edited comment on SPARK-37758 at 12/29/21, 3:03 AM:


* DistributedSuite.caching in memory, replicated (encryption = on)
 * DistributedSuite.caching on disk, replicated 2 (encryption = on)
 * DistributedSuite.caching in memory and disk, serialized, replicated 
(encryption = on)

These tests are flaky on ARM; all other tests passed.

30355 tests run, 679 skipped, 3 failed.

[1][https://github.com/Yikun/spark/pull/47/checks?check_run_id=4652714227]


was (Author: yikunkero):
* DistributedSuite.caching in memory, replicated (encryption = on)
 * DistributedSuite.caching on disk, replicated 2 (encryption = on)
 * DistributedSuite.caching in memory and disk, serialized, replicated 
(encryption = on)

These tests are flaky in ARM

 

[1][https://github.com/Yikun/spark/pull/47/checks?check_run_id=4652714227]

> Enable PySpark scheduled job on ARM based self-hosted runner
> 
>
> Key: SPARK-37758
> URL: https://issues.apache.org/jira/browse/SPARK-37758
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Comment Edited] (SPARK-37758) Enable PySpark scheduled job on ARM based self-hosted runner

2021-12-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466294#comment-17466294
 ] 

Yikun Jiang edited comment on SPARK-37758 at 12/29/21, 3:02 AM:


* DistributedSuite.caching in memory, replicated (encryption = on)
 * DistributedSuite.caching on disk, replicated 2 (encryption = on)
 * DistributedSuite.caching in memory and disk, serialized, replicated 
(encryption = on)

These tests are flaky in ARM

 

[1][https://github.com/Yikun/spark/pull/47/checks?check_run_id=4652714227]


was (Author: yikunkero):
* DistributedSuite.caching in memory, replicated (encryption = on)
 * DistributedSuite.caching on disk, replicated 2 (encryption = on)
 * DistributedSuite.caching in memory and disk, serialized, replicated 
(encryption = on)

These tests are flaky in ARM

 

[1]https://github.com/Yikun/spark/pull/47/checks?check_run_id=4652714227

> Enable PySpark scheduled job on ARM based self-hosted runner
> 
>
> Key: SPARK-37758
> URL: https://issues.apache.org/jira/browse/SPARK-37758
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Comment Edited] (SPARK-37758) Enable PySpark scheduled job on ARM based self-hosted runner

2021-12-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466294#comment-17466294
 ] 

Yikun Jiang edited comment on SPARK-37758 at 12/29/21, 3:02 AM:


* DistributedSuite.caching in memory, replicated (encryption = on)
 * DistributedSuite.caching on disk, replicated 2 (encryption = on)
 * DistributedSuite.caching in memory and disk, serialized, replicated 
(encryption = on)

These tests are flaky in ARM

 

[1]https://github.com/Yikun/spark/pull/47/checks?check_run_id=4652714227


was (Author: yikunkero):
* DistributedSuite.caching in memory, replicated (encryption = on)
 * DistributedSuite.caching on disk, replicated 2 (encryption = on)
 * DistributedSuite.caching in memory and disk, serialized, replicated 
(encryption = on)

These tests are flaky in ARM

> Enable PySpark scheduled job on ARM based self-hosted runner
> 
>
> Key: SPARK-37758
> URL: https://issues.apache.org/jira/browse/SPARK-37758
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Commented] (SPARK-37758) Enable PySpark scheduled job on ARM based self-hosted runner

2021-12-28 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466294#comment-17466294
 ] 

Yikun Jiang commented on SPARK-37758:
-

* DistributedSuite.caching in memory, replicated (encryption = on)
 * DistributedSuite.caching on disk, replicated 2 (encryption = on)
 * DistributedSuite.caching in memory and disk, serialized, replicated 
(encryption = on)

These tests are flaky in ARM

> Enable PySpark scheduled job on ARM based self-hosted runner
> 
>
> Key: SPARK-37758
> URL: https://issues.apache.org/jira/browse/SPARK-37758
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Commented] (SPARK-37757) Enable Spark test scheduled job on ARM based self-hosted runner

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466292#comment-17466292
 ] 

Apache Spark commented on SPARK-37757:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35049

> Enable Spark test scheduled job on ARM based self-hosted runner
> ---
>
> Key: SPARK-37757
> URL: https://issues.apache.org/jira/browse/SPARK-37757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Assigned] (SPARK-37757) Enable Spark test scheduled job on ARM based self-hosted runner

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37757:


Assignee: Apache Spark

> Enable Spark test scheduled job on ARM based self-hosted runner
> ---
>
> Key: SPARK-37757
> URL: https://issues.apache.org/jira/browse/SPARK-37757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-37757) Enable Spark test scheduled job on ARM based self-hosted runner

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37757:


Assignee: (was: Apache Spark)

> Enable Spark test scheduled job on ARM based self-hosted runner
> ---
>
> Key: SPARK-37757
> URL: https://issues.apache.org/jira/browse/SPARK-37757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Commented] (SPARK-37757) Enable Spark test scheduled job on ARM based self-hosted runner

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466291#comment-17466291
 ] 

Apache Spark commented on SPARK-37757:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35049

> Enable Spark test scheduled job on ARM based self-hosted runner
> ---
>
> Key: SPARK-37757
> URL: https://issues.apache.org/jira/browse/SPARK-37757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Commented] (SPARK-37727) Show ignored confs & hide warnings for conf already set in SparkSession.builder.getOrCreate

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466286#comment-17466286
 ] 

Apache Spark commented on SPARK-37727:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35048

> Show ignored confs & hide warnings for conf already set in 
> SparkSession.builder.getOrCreate
> ---
>
> Key: SPARK-37727
> URL: https://issues.apache.org/jira/browse/SPARK-37727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, {{SparkSession.builder.getOrCreate()}} warns even when the configurations
> being set duplicate ones that are already in effect, and users cannot tell which
> configurations they actually need to fix. See the example below:
> {code}
> ./bin/spark-shell --conf spark.abc=abc
> {code}
> {code}
> import org.apache.spark.sql.SparkSession
> spark.sparkContext.setLogLevel("DEBUG")
> SparkSession.builder.config("spark.abc", "abc").getOrCreate
> {code}
> {code}
> ...
> 21:12:40.601 [main] WARN  org.apache.spark.sql.SparkSession - Using an 
> existing SparkSession; some spark core configurations may not take effect.
> {code}
> This is straightforward to sort out when there are only a few configurations, but
> it is difficult for users to figure out when there are many, especially when the
> configurations are defined in property files like {{spark-defaults.conf}} that are
> sometimes maintained separately by system admins.
> See also https://github.com/apache/spark/pull/34757#discussion_r769248275
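
A rough user-side sketch of the kind of diagnosis this asks for (not the change in the pull request above): compare the options you tried to set against what the existing session actually uses.

{code:java}
// Assumes a spark-shell style `spark` session; the requested map is an example.
val requested = Map("spark.abc" -> "abc", "spark.sql.shuffle.partitions" -> "64")
val active = spark.conf.getAll

requested.foreach { case (key, wanted) =>
  active.get(key) match {
    case Some(existing) if existing != wanted =>
      println(s"ignored: $key stays '$existing' (requested '$wanted')")
    case None =>
      println(s"not set on the existing session: $key")
    case _ => // identical value, so the warning carries no information
  }
}
{code}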






[jira] [Assigned] (SPARK-37175) Performance improvement to hash joins with many duplicate keys

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37175:


Assignee: (was: Apache Spark)

> Performance improvement to hash joins with many duplicate keys
> --
>
> Key: SPARK-37175
> URL: https://issues.apache.org/jira/browse/SPARK-37175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
> Attachments: hash_rel_examples.txt
>
>
> I noticed that HashedRelations with many duplicate keys perform significantly
> slower than HashedRelations with a similar number of entries but few or no
> duplicate keys.
> A hypothesis:
>  * Because of the order in which rows are appended to the map, rows for a 
> given key are typically non-adjacent in memory, resulting in poor locality.
>  * The map would perform better if all rows for a given key are next to each 
> other in memory.
> To test this hypothesis, I made a [somewhat brute force change to 
> HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play]
>  to reorganize the map such that all rows for a given key are adjacent in 
> memory. This yielded some performance improvements, at least in my contrived 
> examples:
> (Run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):
> Example 1:
>  Shuffled Hash Join, LongHashedRelation:
>  Stream side: 300M rows
>  Build side: 90M rows, but only 200K unique keys
>  136G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled hash join (No reorganization)|1092| |
> |Shuffled hash join (with reorganization)|234|4.6 times faster than regular SHJ|
> |Sort merge join|164|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 2:
>  Broadcast Hash Join, LongHashedRelation:
>  Stream side: 350M rows
>  Build side 9M rows, but only 18K unique keys
>  175G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|872| |
> |Broadcast hash join (with reorganization)|263|3 times faster than regular BHJ|
> |Sort merge join|174|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 3:
>  Shuffled Hash Join, UnsafeHashedRelation
>  Stream side: 300M rows
>  Build side 90M rows, but only 200K unique keys
>  135G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled Hash Join (No reorganization)|3154| |
> |Shuffled Hash Join (with reorganization)|533|5.9 times faster|
> |Sort merge join|190|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 4:
>  Broadcast Hash Join, UnsafeHashedRelation:
>  Stream side: 70M rows
>  Build side 9M rows, but only 18K unique keys
>  35G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|849| |
> |Broadcast hash join (with reorganization)|130|6.5 times faster|
> |Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> The code for these examples is attached here [^hash_rel_examples.txt]
> Even the brute force approach could be useful in production if
>  * Toggled by a feature flag
>  * Reorganizes only if the ratio of keys to rows drops below some threshold
>  * Falls back to using the original map if building the new map results in a 
> memory-related SparkException.
> Another incidental lesson is that sort merge join seems to outperform 
> broadcast hash join when the build side has lots of duplicate keys. So maybe 
> a long term improvement would be to avoid hash joins (broadcast or shuffle) 
> if there are many duplicate keys.
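
A toy sketch of the locality idea described above (this is not Spark's HashedRelation): rebuild a key-to-rows multimap once so that all rows for a key sit in one contiguous array, which is what makes repeated probes for heavily duplicated keys cache-friendly.

{code:java}
import scala.collection.mutable

// Rows arrive in arbitrary key order, so duplicates of the same key end up scattered
// across the append-only buffer (poor locality when a probe walks them).
val appended = mutable.ArrayBuffer[(Long, String)]()
appended ++= Seq((1L, "a1"), (2L, "b1"), (1L, "a2"), (3L, "c1"), (1L, "a3"))

// "Reorganized" layout: group once so each key's rows are adjacent in a single array.
val reorganized: Map[Long, Array[String]] =
  appended.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2).toArray }

// A probe for key 1 now scans one contiguous array instead of chasing scattered entries.
assert(reorganized(1L).sameElements(Array("a1", "a2", "a3")))
{code}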






[jira] [Assigned] (SPARK-37175) Performance improvement to hash joins with many duplicate keys

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37175:


Assignee: Apache Spark

> Performance improvement to hash joins with many duplicate keys
> --
>
> Key: SPARK-37175
> URL: https://issues.apache.org/jira/browse/SPARK-37175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
> Attachments: hash_rel_examples.txt
>
>
> I noticed that HashedRelations with many duplicate keys perform significantly
> slower than HashedRelations with a similar number of entries but few or no
> duplicate keys.
> A hypothesis:
>  * Because of the order in which rows are appended to the map, rows for a 
> given key are typically non-adjacent in memory, resulting in poor locality.
>  * The map would perform better if all rows for a given key are next to each 
> other in memory.
> To test this hypothesis, I made a [somewhat brute force change to 
> HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play]
>  to reorganize the map such that all rows for a given key are adjacent in 
> memory. This yielded some performance improvements, at least in my contrived 
> examples:
> (Run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):
> Example 1:
>  Shuffled Hash Join, LongHashedRelation:
>  Stream side: 300M rows
>  Build side: 90M rows, but only 200K unique keys
>  136G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled hash join (No reorganization)|1092| |
> |Shuffled hash join (with reorganization)|234|4.6 times faster than regular SHJ|
> |Sort merge join|164|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 2:
>  Broadcast Hash Join, LongHashedRelation:
>  Stream side: 350M rows
>  Build side 9M rows, but only 18K unique keys
>  175G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|872| |
> |Broadcast hash join (with reorganization)|263|3 times faster than regular BHJ|
> |Sort merge join|174|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 3:
>  Shuffled Hash Join, UnsafeHashedRelation
>  Stream side: 300M rows
>  Build side 90M rows, but only 200K unique keys
>  135G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled Hash Join (No reorganization)|3154| |
> |Shuffled Hash Join (with reorganization)|533|5.9 times faster|
> |Sort merge join|190|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 4:
>  Broadcast Hash Join, UnsafeHashedRelation:
>  Stream side: 70M rows
>  Build side 9M rows, but only 18K unique keys
>  35G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|849| |
> |Broadcast hash join (with reorganization)|130|6.5 times faster|
> |Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> The code for these examples is attached here [^hash_rel_examples.txt]
> Even the brute force approach could be useful in production if
>  * Toggled by a feature flag
>  * Reorganizes only if the ratio of keys to rows drops below some threshold
>  * Falls back to using the original map if building the new map results in a 
> memory-related SparkException.
> Another incidental lesson is that sort merge join seems to outperform 
> broadcast hash join when the build side has lots of duplicate keys. So maybe 
> a long term improvement would be to avoid hash joins (broadcast or shuffle) 
> if there are many duplicate keys.






[jira] [Commented] (SPARK-37175) Performance improvement to hash joins with many duplicate keys

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466287#comment-17466287
 ] 

Apache Spark commented on SPARK-37175:
--

User 'sumeetgajjar' has created a pull request for this issue:
https://github.com/apache/spark/pull/35047

> Performance improvement to hash joins with many duplicate keys
> --
>
> Key: SPARK-37175
> URL: https://issues.apache.org/jira/browse/SPARK-37175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
> Attachments: hash_rel_examples.txt
>
>
> I noticed that HashedRelations with many duplicate keys perform significantly
> slower than HashedRelations with a similar number of entries but few or no
> duplicate keys.
> A hypothesis:
>  * Because of the order in which rows are appended to the map, rows for a 
> given key are typically non-adjacent in memory, resulting in poor locality.
>  * The map would perform better if all rows for a given key are next to each 
> other in memory.
> To test this hypothesis, I made a [somewhat brute force change to 
> HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play]
>  to reorganize the map such that all rows for a given key are adjacent in 
> memory. This yielded some performance improvements, at least in my contrived 
> examples:
> (Run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):
> Example 1:
>  Shuffled Hash Join, LongHashedRelation:
>  Stream side: 300M rows
>  Build side: 90M rows, but only 200K unique keys
>  136G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled hash join (No reorganization)|1092| |
> |Shuffled hash join (with reorganization)|234|4.6 times faster than regular SHJ|
> |Sort merge join|164|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 2:
>  Broadcast Hash Join, LongHashedRelation:
>  Stream side: 350M rows
>  Build side 9M rows, but only 18K unique keys
>  175G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|872| |
> |Broadcast hash join (with reorganization)|263|3 times faster than regular BHJ|
> |Sort merge join|174|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 3:
>  Shuffled Hash Join, UnsafeHashedRelation
>  Stream side: 300M rows
>  Build side 90M rows, but only 200K unique keys
>  135G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled Hash Join (No reorganization)|3154| |
> |Shuffled Hash Join (with reorganization)|533|5.9 times faster|
> |Sort merge join|190|This beats the SHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> Example 4:
>  Broadcast Hash Join, UnsafeHashedRelation:
>  Stream side: 70M rows
>  Build side 9M rows, but only 18K unique keys
>  35G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|849| |
> |Broadcast hash join (with reorganization)|130|6.5 times faster|
> |Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, I presume because of better cache locality on both sides of the join|
> The code for these examples is attached here [^hash_rel_examples.txt]
> Even the brute force approach could be useful in production if
>  * Toggled by a feature flag
>  * Reorganizes only if the ratio of keys to rows drops below some threshold
>  * Falls back to using the original map if building the new map results in a 
> memory-related SparkException.
> Another incidental lesson is that sort merge join seems to outperform 
> broadcast hash join when the build side has lots of duplicate keys. So maybe 
> a long term improvement would be to avoid hash joins (broadcast or shuffle) 
> if there are many duplicate keys.






[jira] [Updated] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException

2021-12-28 Thread Ivan Sadikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov updated SPARK-37771:
-
Description: 
There is a race condition between creating a Hive client and loading classes
that do not appear in the shared-prefixes config ({{spark.sql.hive.metastore.sharedPrefixes}}).
For example, we confirmed that the code fails for the following configuration:
{code:java}
spark.sql.hive.metastore.version 0.13.0
spark.sql.hive.metastore.jars maven
spark.sql.hive.metastore.sharedPrefixes 
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem{code}
And the following code:
{code:java}
-- Prerequisite commands to set up the table
-- drop table if exists ivan_test_2;
-- create table ivan_test_2 (a int, part string) using csv location 
's3://bucket/hive-test' partitioned by (part);
-- insert into ivan_test_2 values (1, 'a'); 

-- Command that triggers failure
ALTER TABLE ivan_test_2 ADD PARTITION (part='b') LOCATION 
's3://bucket/hive-test'{code}
 

Stacktrace (line numbers might differ):
{code:java}
21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
21/12/22 04:37:05 DEBUG IsolatedClientLoader: hive class: 
com.amazonaws.auth.EnvironmentVariableCredentialsProvider - null
21/12/22 04:37:05 ERROR S3AFileSystem: Failed to initialize S3AFileSystem for 
path s3://bucket/hive-test
java.io.IOException: From option fs.s3a.aws.credentials.provider 
java.lang.ClassNotFoundException: Class 
com.amazonaws.auth.EnvironmentVariableCredentialsProvider not found
    at 
org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:725)
    at 
org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:688)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:411)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
    at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createLocationForAddedPartition(HiveMetaStore.java:1993)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:1865)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:1910)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
    at com.sun.proxy.$Proxy58.add_partitions_req(Unknown Source)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:457)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy59.add_partitions(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.createPartitions(Hive.java:1514)
    at 
org.apache.spark.sql.hive.client.Shim_v0_13.createPartitions(HiveShim.scala:773)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createPartitions$1(HiveClientImpl.scala:683)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:346)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:247)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:283)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:239)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:326)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.createPartitions(HiveClientImpl.scala:676)
    at 
org.apache.spark.sql.hive.client.PoolingHiveClient.$anonfun$createPartitions$1(PoolingHiveClient.scala:345)

[jira] [Updated] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException

2021-12-28 Thread Ivan Sadikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov updated SPARK-37771:
-
Description: 
There is a race condition between creating a Hive client and loading classes
that do not appear in the shared-prefixes config ({{spark.sql.hive.metastore.sharedPrefixes}}).
For example, we confirmed that the code fails for the following configuration:
{code:java}
spark.sql.hive.metastore.version 0.13.0
spark.sql.hive.metastore.jars maven
spark.sql.hive.metastore.sharedPrefixes 
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem{code}
And the following code:
{code:java}
-- Prerequisite commands to set up the table
-- drop table if exists ivan_test_2;
-- create table ivan_test_2 (a int, part string) using csv location 
's3://bucket/hive-test' partitioned by (part);
-- insert into ivan_test_2 values (1, 'a'); 

-- Command that triggers failure
ALTER TABLE ivan_test_2 ADD PARTITION (part='b') LOCATION 
's3://bucket/hive-test'{code}
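
For reference, spark.sql.hive.metastore.sharedPrefixes is the comma-separated list of 
class prefixes that are loaded by the classloader shared between Spark SQL and the 
isolated Hive client. A sketch of a non-empty setting that would keep the AWS SDK 
classes on the shared side is below; this only illustrates the mechanism and is not a 
fix verified for the race described here:
{code:java}
spark.sql.hive.metastore.sharedPrefixes com.amazonaws{code}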
 

Stacktrace (line numbers might differ):
{code:java}
21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
21/12/22 04:37:05 DEBUG IsolatedClientLoader: shared class: 
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
21/12/22 04:37:05 DEBUG IsolatedClientLoader: hive class: 
com.amazonaws.auth.EnvironmentVariableCredentialsProvider - null
21/12/22 04:37:05 ERROR S3AFileSystem: Failed to initialize S3AFileSystem for 
path s3://bucket/hive-test
java.io.IOException: From option fs.s3a.aws.credentials.provider 
java.lang.ClassNotFoundException: Class 
com.amazonaws.auth.EnvironmentVariableCredentialsProvider not found
    at 
org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:725)
    at 
org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:688)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:411)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
    at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createLocationForAddedPartition(HiveMetaStore.java:1993)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_core(HiveMetaStore.java:1865)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partitions_req(HiveMetaStore.java:1910)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
    at com.sun.proxy.$Proxy58.add_partitions_req(Unknown Source)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.add_partitions(HiveMetaStoreClient.java:457)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy59.add_partitions(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.createPartitions(Hive.java:1514)
    at 
org.apache.spark.sql.hive.client.Shim_v0_13.createPartitions(HiveShim.scala:773)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createPartitions$1(HiveClientImpl.scala:683)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:346)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$retryLocked$1(HiveClientImpl.scala:247)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:283)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:239)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:326)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.createPartitions(HiveClientImpl.scala:676)
    at 
org.apache.spark.sql.hive.client.PoolingHiveClient.$anonfun$createPartitions$1(PoolingHiveClient.scala:345)

[jira] [Updated] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException

2021-12-28 Thread Ivan Sadikov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov updated SPARK-37771:
-
Issue Type: Bug  (was: Improvement)

> Race condition in withHiveState and limited logic in IsolatedClientLoader 
> result in ClassNotFoundException
> --
>
> Key: SPARK-37771
> URL: https://issues.apache.org/jira/browse/SPARK-37771
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.2, 3.2.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37771) Race condition in withHiveState and limited logic in IsolatedClientLoader result in ClassNotFoundException

2021-12-28 Thread Ivan Sadikov (Jira)
Ivan Sadikov created SPARK-37771:


 Summary: Race condition in withHiveState and limited logic in 
IsolatedClientLoader result in ClassNotFoundException
 Key: SPARK-37771
 URL: https://issues.apache.org/jira/browse/SPARK-37771
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0, 3.1.2, 3.1.0
Reporter: Ivan Sadikov






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37770) Performance improvements for ColumnVector `putByteArray`

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37770:
---

 Summary: Performance improvements for ColumnVector `putByteArray`
 Key: SPARK-37770
 URL: https://issues.apache.org/jira/browse/SPARK-37770
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37769) Filter on the metadata struct

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37769:
---

 Summary: Filter on the metadata struct
 Key: SPARK-37769
 URL: https://issues.apache.org/jira/browse/SPARK-37769
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao


Be able to skip reading some files based on filters.
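
A rough sketch of the kind of query this targets, assuming the hidden _metadata column 
added by SPARK-37273 (the field name used below is an assumption for illustration):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Filter on the metadata struct; the goal of this ticket is for such a
# predicate to let Spark skip reading files that cannot match it.
df = spark.read.parquet("/tmp/events")
df.select("*", "_metadata").where(
    "_metadata.file_name = 'part-00000.parquet'"
).show()
{code}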



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37705) Write session time zone in the Parquet file metadata so that rebase can use it instead of JVM timezone

2021-12-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37705.
--
Fix Version/s: 3.2.1
   Resolution: Fixed

Issue resolved by pull request 35042
[https://github.com/apache/spark/pull/35042]

> Write session time zone in the Parquet file metadata so that rebase can use 
> it instead of JVM timezone
> --
>
> Key: SPARK-37705
> URL: https://issues.apache.org/jira/browse/SPARK-37705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.1
>
>
> We could write the session time zone that was used to write timestamps into the 
> Parquet file metadata, so we can use the same time zone to reconstruct the values 
> instead of the JVM time zone (which could be different).
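
A minimal sketch of the write path this refers to, using only existing settings (the 
time zone value below is an arbitrary example; recording it in the Parquet file metadata 
is what this ticket proposes):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Timestamps are written under the session time zone; without the proposed
# metadata, a later rebase has to fall back to the reader's JVM time zone.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
df = spark.sql("SELECT TIMESTAMP '1000-01-01 00:00:00' AS ts")
df.write.mode("overwrite").parquet("/tmp/ts_parquet")
{code}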



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37768) Schema pruning for the metadata struct

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37768:
---

 Summary: Schema pruning for the metadata struct
 Key: SPARK-37768
 URL: https://issues.apache.org/jira/browse/SPARK-37768
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37767) Follow-up Improvements of Hidden File Metadata Support for Spark SQL

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37767:
---

 Summary: Follow-up Improvements of Hidden File Metadata Support 
for Spark SQL
 Key: SPARK-37767
 URL: https://issues.apache.org/jira/browse/SPARK-37767
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao


Follow-up of https://issues.apache.org/jira/browse/SPARK-37273



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37766) Regenerate benchmark results

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37766:


Assignee: (was: Apache Spark)

> Regenerate benchmark results
> 
>
> Key: SPARK-37766
> URL: https://issues.apache.org/jira/browse/SPARK-37766
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37766) Regenerate benchmark results

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37766:


Assignee: Apache Spark

> Regenerate benchmark results
> 
>
> Key: SPARK-37766
> URL: https://issues.apache.org/jira/browse/SPARK-37766
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37766) Regenerate benchmark results

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466272#comment-17466272
 ] 

Apache Spark commented on SPARK-37766:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35046

> Regenerate benchmark results
> 
>
> Key: SPARK-37766
> URL: https://issues.apache.org/jira/browse/SPARK-37766
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37766) Regenerate benchmark results

2021-12-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37766:
-

 Summary: Regenerate benchmark results
 Key: SPARK-37766
 URL: https://issues.apache.org/jira/browse/SPARK-37766
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36361.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35044
[https://github.com/apache/spark/pull/35044]

> Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
> -
>
> Key: SPARK-36361
> URL: https://issues.apache.org/jira/browse/SPARK-36361
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.3.0
>
>
> SPARK-36092 requires the coverage package to be installed in both Python 3.9 and 
> PyPy. Currently this is being manually installed.
> To save installation time, it would be great to have it installed in the 
> image we use.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37761.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35044
[https://github.com/apache/spark/pull/35044]

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37765) PySpark Dynamic DataFrame for easier inheritance

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37765:


Assignee: Apache Spark

> PySpark Dynamic DataFrame for easier inheritance
> 
>
> Key: SPARK-37765
> URL: https://issues.apache.org/jira/browse/SPARK-37765
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Pablo Alcain
>Assignee: Apache Spark
>Priority: Major
>
> In typical development settings, multiple tables with very different concepts 
> are mapped to the same `DataFrame` class. The inheritance from the pyspark 
> `DataFrame` class is a bit cumbersome because of the chainable methods and it 
> also makes it difficult to abstract regularly used queries. The proposal is 
> to generate a `DynamicDataFrame` that allows easy inheritance retaining 
> `DataFrame` methods without losing chainability, either for the newly 
> generated queries or for the usual dataframe ones.
> In our experience, this allowed us to iterate *much* faster, generating 
> business-centric classes in a couple of lines of code. Here's an example of 
> what the application code would look like. Attached in the end is a summary 
> of the different strategies that are usually pursued when trying to abstract 
> queries.
> {code:python}
> import pyspark
> from pyspark.sql import DynamicDataFrame
> from pyspark.sql import functions as F
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> class Inventory(DynamicDataFrame):
> def update_prices(self, factor: float = 2.0):
> return self.withColumn("price", F.col("price") * factor)
> base_dataframe = spark.createDataFrame(
> data=[["product_1", 2.0], ["product_2", 4.0]],
> schema=["name", "price"],
> )
> print("Doing an inheritance mediated by DynamicDataFrame")
> inventory = Inventory(base_dataframe)
> inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
> print("inventory_updated.show():")
> inventory_updated.show()
> print("After multiple uses of the query we still have the desired type")
> print(f"type(inventory_updated): {type(inventory_updated)}")
> print("We can still use the usual dataframe methods")
> expensive_inventory = inventory_updated.filter(F.col("price") > 25)
> print("expensive_inventory.show():")
> expensive_inventory.show()
> print("And retain the desired type")
> print(f"type(expensive_inventory): {type(expensive_inventory)}")
> {code}
> The PR linked to this ticket is an implementation of the DynamicDataFrame 
> used in this snippet.
>  
> Other strategies found for handling the query abstraction:
> 1. Functions: using functions that take dataframes and return them 
> transformed. It had a couple of pitfalls: we had to manage the namespaces 
> carefully, there was no clear new object, and the "chainability" didn't 
> feel very pyspark-y.
> 2. MonkeyPatching DataFrame: we monkeypatched 
> ([https://stackoverflow.com/questions/5626193/what-is-monkey-patching]) 
> methods with the regularly done queries inside the DataFrame class. This one 
> kept it pyspark-y, but there was no easy way to handle segregated namespaces.
> 3. Inheritances: create the class `MyBusinessDataFrame`, inherit from 
> `DataFrame` and implement the methods there. This one solves all the issues, 
> but with a caveat: the chainable methods cast the result explicitly to 
> `DataFrame` (see 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910]
>  e.g.). Therefore, every time you use one of the parent's methods you'd have to 
> re-cast to `MyBusinessDataFrame`, making the code cumbersome.
>  
> (see 
> [https://mail-archives.apache.org/mod_mbox/spark-dev/202111.mbox/browser] for 
> the link to the original mail in which we proposed this feature)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37765) PySpark Dynamic DataFrame for easier inheritance

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37765:


Assignee: (was: Apache Spark)

> PySpark Dynamic DataFrame for easier inheritance
> 
>
> Key: SPARK-37765
> URL: https://issues.apache.org/jira/browse/SPARK-37765
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Pablo Alcain
>Priority: Major
>
> In typical development settings, multiple tables with very different concepts 
> are mapped to the same `DataFrame` class. The inheritance from the pyspark 
> `DataFrame` class is a bit cumbersome because of the chainable methods and it 
> also makes it difficult to abstract regularly used queries. The proposal is 
> to generate a `DynamicDataFrame` that allows easy inheritance retaining 
> `DataFrame` methods without losing chainability, either for the newly 
> generated queries or for the usual dataframe ones.
> In our experience, this allowed us to iterate *much* faster, generating 
> business-centric classes in a couple of lines of code. Here's an example of 
> what the application code would look like. Attached in the end is a summary 
> of the different strategies that are usually pursued when trying to abstract 
> queries.
> {code:python}
> import pyspark
> from pyspark.sql import DynamicDataFrame
> from pyspark.sql import functions as F
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> class Inventory(DynamicDataFrame):
> def update_prices(self, factor: float = 2.0):
> return self.withColumn("price", F.col("price") * factor)
> base_dataframe = spark.createDataFrame(
> data=[["product_1", 2.0], ["product_2", 4.0]],
> schema=["name", "price"],
> )
> print("Doing an inheritance mediated by DynamicDataFrame")
> inventory = Inventory(base_dataframe)
> inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
> print("inventory_updated.show():")
> inventory_updated.show()
> print("After multiple uses of the query we still have the desired type")
> print(f"type(inventory_updated): {type(inventory_updated)}")
> print("We can still use the usual dataframe methods")
> expensive_inventory = inventory_updated.filter(F.col("price") > 25)
> print("expensive_inventory.show():")
> expensive_inventory.show()
> print("And retain the desired type")
> print(f"type(expensive_inventory): {type(expensive_inventory)}")
> {code}
> The PR linked to this ticket is an implementation of the DynamicDataFrame 
> used in this snippet.
>  
> Other strategies found for handling the query abstraction:
> 1. Functions: using functions that take dataframes and return them 
> transformed. It had a couple of pitfalls: we had to manage the namespaces 
> carefully, there was no clear new object, and the "chainability" didn't 
> feel very pyspark-y.
> 2. MonkeyPatching DataFrame: we monkeypatched 
> ([https://stackoverflow.com/questions/5626193/what-is-monkey-patching]) 
> methods with the regularly done queries inside the DataFrame class. This one 
> kept it pyspark-y, but there was no easy way to handle segregated namespaces.
> 3. Inheritances: create the class `MyBusinessDataFrame`, inherit from 
> `DataFrame` and implement the methods there. This one solves all the issues, 
> but with a caveat: the chainable methods cast the result explicitly to 
> `DataFrame` (see 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910]
>  e.g.). Therefore, every time you use one of the parent's methods you'd have to 
> re-cast to `MyBusinessDataFrame`, making the code cumbersome.
>  
> (see 
> [https://mail-archives.apache.org/mod_mbox/spark-dev/202111.mbox/browser] for 
> the link to the original mail in which we proposed this feature)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37765) PySpark Dynamic DataFrame for easier inheritance

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466252#comment-17466252
 ] 

Apache Spark commented on SPARK-37765:
--

User 'pabloalcain' has created a pull request for this issue:
https://github.com/apache/spark/pull/35045

> PySpark Dynamic DataFrame for easier inheritance
> 
>
> Key: SPARK-37765
> URL: https://issues.apache.org/jira/browse/SPARK-37765
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Pablo Alcain
>Priority: Major
>
> In typical development settings, multiple tables with very different concepts 
> are mapped to the same `DataFrame` class. The inheritance from the pyspark 
> `DataFrame` class is a bit cumbersome because of the chainable methods and it 
> also makes it difficult to abstract regularly used queries. The proposal is 
> to generate a `DynamicDataFrame` that allows easy inheritance retaining 
> `DataFrame` methods without losing chainability, either for the newly 
> generated queries or for the usual dataframe ones.
> In our experience, this allowed us to iterate *much* faster, generating 
> business-centric classes in a couple of lines of code. Here's an example of 
> what the application code would look like. Attached in the end is a summary 
> of the different strategies that are usually pursued when trying to abstract 
> queries.
> {code:python}
> import pyspark
> from pyspark.sql import DynamicDataFrame
> from pyspark.sql import functions as F
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> class Inventory(DynamicDataFrame):
> def update_prices(self, factor: float = 2.0):
> return self.withColumn("price", F.col("price") * factor)
> base_dataframe = spark.createDataFrame(
> data=[["product_1", 2.0], ["product_2", 4.0]],
> schema=["name", "price"],
> )
> print("Doing an inheritance mediated by DynamicDataFrame")
> inventory = Inventory(base_dataframe)
> inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
> print("inventory_updated.show():")
> inventory_updated.show()
> print("After multiple uses of the query we still have the desired type")
> print(f"type(inventory_updated): {type(inventory_updated)}")
> print("We can still use the usual dataframe methods")
> expensive_inventory = inventory_updated.filter(F.col("price") > 25)
> print("expensive_inventory.show():")
> expensive_inventory.show()
> print("And retain the desired type")
> print(f"type(expensive_inventory): {type(expensive_inventory)}")
> {code}
> The PR linked to this ticket is an implementation of the DynamicDataFrame 
> used in this snippet.
>  
> Other strategies found for handling the query abstraction:
> 1. Functions: using functions that take dataframes and return them 
> transformed. It had a couple of pitfalls: we had to manage the namespaces 
> carefully, there was no clear new object, and the "chainability" didn't 
> feel very pyspark-y.
> 2. MonkeyPatching DataFrame: we monkeypatched 
> ([https://stackoverflow.com/questions/5626193/what-is-monkey-patching]) 
> methods with the regularly done queries inside the DataFrame class. This one 
> kept it pyspark-y, but there was no easy way to handle segregated namespaces.
> 3. Inheritances: create the class `MyBusinessDataFrame`, inherit from 
> `DataFrame` and implement the methods there. This one solves all the issues, 
> but with a caveat: the chainable methods cast the result explicitly to 
> `DataFrame` (see 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910]
>  e.g.). Therefore, every time you use one of the parent's methods you'd have to 
> re-cast to `MyBusinessDataFrame`, making the code cumbersome.
>  
> (see 
> [https://mail-archives.apache.org/mod_mbox/spark-dev/202111.mbox/browser] for 
> the link to the original mail in which we proposed this feature)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37765) PySpark Dynamic DataFrame for easier inheritance

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466251#comment-17466251
 ] 

Apache Spark commented on SPARK-37765:
--

User 'pabloalcain' has created a pull request for this issue:
https://github.com/apache/spark/pull/35045

> PySpark Dynamic DataFrame for easier inheritance
> 
>
> Key: SPARK-37765
> URL: https://issues.apache.org/jira/browse/SPARK-37765
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Pablo Alcain
>Priority: Major
>
> In typical development settings, multiple tables with very different concepts 
> are mapped to the same `DataFrame` class. The inheritance from the pyspark 
> `DataFrame` class is a bit cumbersome because of the chainable methods and it 
> also makes it difficult to abstract regularly used queries. The proposal is 
> to generate a `DynamicDataFrame` that allows easy inheritance retaining 
> `DataFrame` methods without losing chainability, either for the newly 
> generated queries or for the usual dataframe ones.
> In our experience, this allowed us to iterate *much* faster, generating 
> business-centric classes in a couple of lines of code. Here's an example of 
> what the application code would look like. Attached in the end is a summary 
> of the different strategies that are usually pursued when trying to abstract 
> queries.
> {code:python}
> import pyspark
> from pyspark.sql import DynamicDataFrame
> from pyspark.sql import functions as F
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> class Inventory(DynamicDataFrame):
> def update_prices(self, factor: float = 2.0):
> return self.withColumn("price", F.col("price") * factor)
> base_dataframe = spark.createDataFrame(
> data=[["product_1", 2.0], ["product_2", 4.0]],
> schema=["name", "price"],
> )
> print("Doing an inheritance mediated by DynamicDataFrame")
> inventory = Inventory(base_dataframe)
> inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
> print("inventory_updated.show():")
> inventory_updated.show()
> print("After multiple uses of the query we still have the desired type")
> print(f"type(inventory_updated): {type(inventory_updated)}")
> print("We can still use the usual dataframe methods")
> expensive_inventory = inventory_updated.filter(F.col("price") > 25)
> print("expensive_inventory.show():")
> expensive_inventory.show()
> print("And retain the desired type")
> print(f"type(expensive_inventory): {type(expensive_inventory)}")
> {code}
> The PR linked to this ticket is an implementation of the DynamicDataFrame 
> used in this snippet.
>  
> Other strategies found for handling the query abstraction:
> 1. Functions: using functions that take dataframes and return them 
> transformed. It had a couple of pitfalls: we had to manage the namespaces 
> carefully, there was no clear new object, and the "chainability" didn't 
> feel very pyspark-y.
> 2. MonkeyPatching DataFrame: we monkeypatched 
> ([https://stackoverflow.com/questions/5626193/what-is-monkey-patching]) 
> methods with the regularly done queries inside the DataFrame class. This one 
> kept it pyspark-y, but there was no easy way to handle segregated namespaces.
> 3. Inheritances: create the class `MyBusinessDataFrame`, inherit from 
> `DataFrame` and implement the methods there. This one solves all the issues, 
> but with a caveat: the chainable methods cast the result explicitly to 
> `DataFrame` (see 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910]
>  e.g.). Therefore, every time you use one of the parent's methods you'd have to 
> re-cast to `MyBusinessDataFrame`, making the code cumbersome.
>  
> (see 
> [https://mail-archives.apache.org/mod_mbox/spark-dev/202111.mbox/browser] for 
> the link to the original mail in which we proposed this feature)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37765) PySpark Dynamic DataFrame for easier inheritance

2021-12-28 Thread Pablo Alcain (Jira)
Pablo Alcain created SPARK-37765:


 Summary: PySpark Dynamic DataFrame for easier inheritance
 Key: SPARK-37765
 URL: https://issues.apache.org/jira/browse/SPARK-37765
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Pablo Alcain


In typical development settings, multiple tables with very different concepts 
are mapped to the same `DataFrame` class. The inheritance from the pyspark 
`DataFrame` class is a bit cumbersome because of the chainable methods and it 
also makes it difficult to abstract regularly used queries. The proposal is to 
generate a `DynamicDataFrame` that allows easy inheritance retaining 
`DataFrame` methods without losing chainability, either for the newly generated 
queries or for the usual dataframe ones.

In our experience, this allowed us to iterate *much* faster, generating 
business-centric classes in a couple of lines of code. Here's an example of 
what the application code would look like. Attached in the end is a summary of 
the different strategies that are usually pursued when trying to abstract 
queries.
{code:python}
import pyspark
from pyspark.sql import DynamicDataFrame
from pyspark.sql import functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()


class Inventory(DynamicDataFrame):
def update_prices(self, factor: float = 2.0):
return self.withColumn("price", F.col("price") * factor)


base_dataframe = spark.createDataFrame(
data=[["product_1", 2.0], ["product_2", 4.0]],
schema=["name", "price"],
)
print("Doing an inheritance mediated by DynamicDataFrame")
inventory = Inventory(base_dataframe)
inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
print("inventory_updated.show():")
inventory_updated.show()
print("After multiple uses of the query we still have the desired type")
print(f"type(inventory_updated): {type(inventory_updated)}")
print("We can still use the usual dataframe methods")
expensive_inventory = inventory_updated.filter(F.col("price") > 25)
print("expensive_inventory.show():")
expensive_inventory.show()
print("And retain the desired type")
print(f"type(expensive_inventory): {type(expensive_inventory)}")
{code}
The PR linked to this ticket is an implementation of the DynamicDataFrame used 
in this snippet.

 

Other strategies found for handling the query abstraction:


1. Functions: using functions that take dataframes and return them 
transformed. It had a couple of pitfalls: we had to manage the namespaces 
carefully, there was no clear new object, and the "chainability" didn't feel 
very pyspark-y.
2. MonkeyPatching DataFrame: we monkeypatched 
([https://stackoverflow.com/questions/5626193/what-is-monkey-patching]) methods 
with the regularly done queries inside the DataFrame class. This one kept it 
pyspark-y, but there was no easy way to handle segregated namespaces.
3. Inheritances: create the class `MyBusinessDataFrame`, inherit from 
`DataFrame` and implement the methods there. This one solves all the issues, 
but with a caveat: the chainable methods cast the result explicitly to 
`DataFrame` (see 
[https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910]
 e.g.). Therefore, every time you use one of the parent's methods you'd have to 
re-cast to `MyBusinessDataFrame`, making the code cumbersome.

 

(see [https://mail-archives.apache.org/mod_mbox/spark-dev/202111.mbox/browser] 
for the link to the original mail in which we proposed this feature)
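
A minimal sketch of the wrapping idea described above, assuming PySpark's internal 
`_jdf`/`sql_ctx` constructor arguments; it is illustrative only and not the 
implementation in the linked PR:
{code:python}
from pyspark.sql import DataFrame


class DynamicDataFrame(DataFrame):
    # Illustrative sketch: re-wrap the plain DataFrame returned by chainable
    # methods so the caller's subclass (and its custom methods) survives.
    def __init__(self, df: DataFrame):
        super().__init__(df._jdf, df.sql_ctx)  # internal attributes in Spark 3.2

    def withColumn(self, colName, col):
        return type(self)(super().withColumn(colName, col))

    def filter(self, condition):
        return type(self)(super().filter(condition))
{code}
With a wrapper along these lines, the chained calls in the snippet above keep returning 
the subclass for the methods that are re-wrapped this way.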



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36361:
-

Assignee: Dongjoon Hyun

> Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
> -
>
> Key: SPARK-36361
> URL: https://issues.apache.org/jira/browse/SPARK-36361
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> SPARK-36092 requires the coverage package to be installed in both Python 3.9 and 
> PyPy. Currently this is being manually installed.
> To save installation time, it would be great to have it installed in the 
> image we use.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37761:
-

Assignee: Dongjoon Hyun

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466211#comment-17466211
 ] 

Apache Spark commented on SPARK-37761:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35044

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466212#comment-17466212
 ] 

Apache Spark commented on SPARK-37761:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35044

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37761:


Assignee: (was: Apache Spark)

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37761:


Assignee: Apache Spark

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466210#comment-17466210
 ] 

Apache Spark commented on SPARK-37761:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35044

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466209#comment-17466209
 ] 

Apache Spark commented on SPARK-36361:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35044

> Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
> -
>
> Key: SPARK-36361
> URL: https://issues.apache.org/jira/browse/SPARK-36361
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-36092 requires the coverage package to be installed in both Python 3.9 and 
> PyPy. Currently this is being manually installed.
> To save installation time, it would be great to have it installed in the 
> image we use.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466208#comment-17466208
 ] 

Apache Spark commented on SPARK-36361:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35044

> Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
> -
>
> Key: SPARK-36361
> URL: https://issues.apache.org/jira/browse/SPARK-36361
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-36092 requires the coverage package to be installed in both Python 3.9 and 
> PyPy. Currently this is being manually installed.
> To save installation time, it would be great to have it installed in the 
> image we use.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36361:


Assignee: (was: Apache Spark)

> Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
> -
>
> Key: SPARK-36361
> URL: https://issues.apache.org/jira/browse/SPARK-36361
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-36092 requires the coverage package to be installed in both Python 3.9 and 
> PyPy. Currently this is being manually installed.
> To save installation time, it would be great to have it installed in the 
> image we use.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36361:


Assignee: Apache Spark

> Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
> -
>
> Key: SPARK-36361
> URL: https://issues.apache.org/jira/browse/SPARK-36361
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-36092 requires the coverage package to be installed in both Python 3.9 and 
> PyPy. Currently this is being manually installed.
> To save installation time, it would be great to have it installed in the 
> image we use.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36361) Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466201#comment-17466201
 ] 

Dongjoon Hyun commented on SPARK-36361:
---

Oh, it was filed in July. Sorry for missing this for a long time, 
[~hyukjin.kwon].

> Install coverage in Python 3.9 and PyPy 3 in GitHub Actions image
> -
>
> Key: SPARK-36361
> URL: https://issues.apache.org/jira/browse/SPARK-36361
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-36092 requires the coverage package to be installed in both Python 3.9 and 
> PyPy. Currently this is being manually installed.
> To save installation time, it would be great to have it installed in the 
> image we use.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37761) Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image

2021-12-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466196#comment-17466196
 ] 

Dongjoon Hyun commented on SPARK-37761:
---

Sure! I noticed it and have been working on it, [~itholic]. Thank you for 
pinging me again.

> Install matplotlib in Python 3.9 and PyPy 3 in GitHub Actions image
> ---
>
> Key: SPARK-37761
> URL: https://issues.apache.org/jira/browse/SPARK-37761
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Same as SPARK-36361,
> SPARK-37756 requires installing `matplotlib` in both Python 3.9 and PyPy.
> It would be great to have it installed in the image we use to save 
> installation time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-2421) Spark should treat writable as serializable for keys

2021-12-28 Thread Guillaume Desforges (Jira)


[ https://issues.apache.org/jira/browse/SPARK-2421 ]


Guillaume Desforges deleted comment on SPARK-2421:


was (Author: JIRAUSER280319):
It seems that the issue was indeed resolved with the introduction of 
SerializableWritable

https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/SerializableWritable.html

> Spark should treat writable as serializable for keys
> 
>
> Key: SPARK-2421
> URL: https://issues.apache.org/jira/browse/SPARK-2421
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
>Priority: Major
>
> It seems that Spark requires the key to be serializable (the class implements 
> the Serializable interface). In the Hadoop world, the Writable interface is used for the 
> same purpose. A lot of existing classes, while writable, are not considered 
> by Spark as Serializable. It would be nice if Spark could treat Writable as 
> serializable and automatically serialize and de-serialize these classes using 
> the Writable interface.
> This was identified in HIVE-7279, but its benefits would be seen globally.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys

2021-12-28 Thread Guillaume Desforges (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466171#comment-17466171
 ] 

Guillaume Desforges commented on SPARK-2421:


It seems that the issue was indeed resolved with the introduction of 
SerializableWritable

https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/SerializableWritable.html

> Spark should treat writable as serializable for keys
> 
>
> Key: SPARK-2421
> URL: https://issues.apache.org/jira/browse/SPARK-2421
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
>Priority: Major
>
> It seems that Spark requires the key to be serializable (the class implements 
> the Serializable interface). In the Hadoop world, the Writable interface is used for the 
> same purpose. A lot of existing classes, while writable, are not considered 
> by Spark as Serializable. It would be nice if Spark could treat Writable as 
> serializable and automatically serialize and de-serialize these classes using 
> the Writable interface.
> This was identified in HIVE-7279, but its benefits would be seen globally.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37751) Apache Commons Crypto doesn't support Java 11

2021-12-28 Thread Shipeng Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shipeng Feng updated SPARK-37751:
-
Priority: Blocker  (was: Major)

> Apache Commons Crypto doesn't support Java 11
> -
>
> Key: SPARK-37751
> URL: https://issues.apache.org/jira/browse/SPARK-37751
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 3.1.2, 3.2.0
> Environment: Spark 3.2.0 on kubernetes
>Reporter: Shipeng Feng
>Priority: Blocker
>
> For Kubernetes, we are using Java 11 in Docker, 
> [https://github.com/apache/spark/blob/v3.2.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile:]
> {code:java}
> ARG java_image_tag=11-jre-slim
> {code}
> We have a simple app:
> {code:scala}
> object SimpleApp {
>   def main(args: Array[String]) {
> val session = SparkSession.builder.getOrCreate
>   
> // the size of demo.csv is 5GB
> val rdd = session.read.option("header", "true").option("inferSchema", 
> "true").csv("/data/demo.csv").rdd
> val lines = rdd.repartition(200)
> val count = lines.count()
>   }
> }
> {code}
>  
> Enable AES-based encryption for RPC connection by the following config:
> {code:java}
> --conf spark.authenticate=true
> --conf spark.network.crypto.enabled=true
> {code}
> This would cause the following error:
> {code:java}
> java.lang.IllegalArgumentException: Frame length should be positive: 
> -6119185687804983867
>   at 
> org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:150)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>   at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Unknown Source) {code}
> The error disappears in 8-jre-slim. It seems that Apache Commons Crypto 1.1.0 
> only works with Java 8: 
> [https://commons.apache.org/proper/commons-crypto/download_crypto.cgi]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37705) Write session time zone in the Parquet file metadata so that rebase can use it instead of JVM timezone

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466149#comment-17466149
 ] 

Apache Spark commented on SPARK-37705:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35042

> Write session time zone in the Parquet file metadata so that rebase can use 
> it instead of JVM timezone
> --
>
> Key: SPARK-37705
> URL: https://issues.apache.org/jira/browse/SPARK-37705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> We could write the session time zone that was used when writing timestamps into the
> Parquet file metadata, so that the same time zone can be used to reconstruct the values
> instead of the JVM time zone (which could be different).
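> A minimal sketch of the scenario, assuming the existing configs
> spark.sql.session.timeZone and spark.sql.parquet.datetimeRebaseModeInWrite; the
> metadata key this ticket would add is not shown here:
> {code:java}
> // Rebase of ancient timestamps currently depends on the writer's time zone settings;
> // the proposal is to record the session time zone in the Parquet footer for readers.
> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
> Seq(java.sql.Timestamp.valueOf("1000-01-01 00:00:00")).toDF("ts")
>   .write.mode("overwrite").parquet("/tmp/ts_rebase_demo")
> spark.read.parquet("/tmp/ts_rebase_demo").show(false)
> {code}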



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37764) Reserve bucket information when relation conversion from metastore relations to data source relations

2021-12-28 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf resolved SPARK-37764.
--
Resolution: Invalid

> Reserve bucket information when relation conversion from metastore relations 
> to data source relations
> -
>
> Key: SPARK-37764
> URL: https://issues.apache.org/jira/browse/SPARK-37764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, bucket information is discarded during relation conversion from metastore
> relations to data source relations.
> The PR aims to fix this by getting the bucket information from the CatalogTable and
> passing it to the data source relation.
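> For illustration, a hedged sketch of the kind of table this affects (table name is
> hypothetical; spark.sql.hive.convertMetastoreParquet is the existing conversion switch
> and requires a Hive-enabled session):
> {code:java}
> spark.sql("""
>   CREATE TABLE bucketed_src (id INT, name STRING)
>   CLUSTERED BY (id) INTO 8 BUCKETS
>   STORED AS PARQUET
> """)
> // With the conversion enabled, the HiveTableRelation becomes a data source relation;
> // this ticket is about keeping the bucket spec across that conversion.
> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
> spark.table("bucketed_src").queryExecution.optimizedPlan
> {code}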



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37691) Support ANSI Aggregation Function: percentile_disc

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466064#comment-17466064
 ] 

Apache Spark commented on SPARK-37691:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35041

> Support ANSI Aggregation Function: percentile_disc
> --
>
> Key: SPARK-37691
> URL: https://issues.apache.org/jira/browse/SPARK-37691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> PERCENTILE_DISC is an ANSI aggregate function. Many databases support it.
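> A hedged sketch of the ANSI syntax this would add (table and column names are
> hypothetical; the function is only available in Spark once this ticket lands):
> {code:java}
> // Discrete percentile: returns an actual value from the column, unlike percentile_cont
> spark.sql("""
>   SELECT dept,
>          percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
>   FROM employees
>   GROUP BY dept
> """).show()
> {code}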



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37691) Support ANSI Aggregation Function: percentile_disc

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37691:


Assignee: (was: Apache Spark)

> Support ANSI Aggregation Function: percentile_disc
> --
>
> Key: SPARK-37691
> URL: https://issues.apache.org/jira/browse/SPARK-37691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> PERCENTILE_DISC is an ANSI aggregate function. Many databases support it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37691) Support ANSI Aggregation Function: percentile_disc

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37691:


Assignee: Apache Spark

> Support ANSI Aggregation Function: percentile_disc
> --
>
> Key: SPARK-37691
> URL: https://issues.apache.org/jira/browse/SPARK-37691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> PERCENTILE_DISC is an ANSI aggregate function. Many databases support it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37691) Support ANSI Aggregation Function: percentile_disc

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466062#comment-17466062
 ] 

Apache Spark commented on SPARK-37691:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35041

> Support ANSI Aggregation Function: percentile_disc
> --
>
> Key: SPARK-37691
> URL: https://issues.apache.org/jira/browse/SPARK-37691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> PERCENTILE_DISC is an ANSI aggregate function. Many databases support it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37754) Fix black version in dev/reformat-python

2021-12-28 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37754:
--

Assignee: Chia-Ping Tsai

> Fix black version in dev/reformat-python
> 
>
> Key: SPARK-37754
> URL: https://issues.apache.org/jira/browse/SPARK-37754
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Trivial
>
> SPARK-37737 updated black from 21.5b2 to 21.12b0 and applied dev/reformat-python.
> However, reformat-python still recommends installing black:21.5b2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37754) Fix black version in dev/reformat-python

2021-12-28 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37754.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35033
[https://github.com/apache/spark/pull/35033]

> Fix black version in dev/reformat-python
> 
>
> Key: SPARK-37754
> URL: https://issues.apache.org/jira/browse/SPARK-37754
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chia-Ping Tsai
>Assignee: Chia-Ping Tsai
>Priority: Trivial
> Fix For: 3.3.0
>
>
> SPARK-37737 updated black from 21.5b2 to 21.12b0 and applied dev/reformat-python.
> However, reformat-python still recommends installing black:21.5b2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37764) Reserve bucket information when relation conversion from metastore relations to data source relations

2021-12-28 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-37764:
-
Description: 
Currently, bucket information is discarded during relation conversion from
metastore relations to data source relations.

The PR aims to fix this by getting the bucket information from the CatalogTable and
passing it to the data source relation.

  was:Currently, we discarded bucket information when 


> Reserve bucket information when relation conversion from metastore relations 
> to data source relations
> -
>
> Key: SPARK-37764
> URL: https://issues.apache.org/jira/browse/SPARK-37764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, bucket information is discarded during relation conversion from metastore
> relations to data source relations.
> The PR aims to fix this by getting the bucket information from the CatalogTable and
> passing it to the data source relation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37764) Reserve bucket information when relation conversion from metastore relations to data source relations

2021-12-28 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-37764:
-
Description: Currently, we discarded bucket information when 

> Reserve bucket information when relation conversion from metastore relations 
> to data source relations
> -
>
> Key: SPARK-37764
> URL: https://issues.apache.org/jira/browse/SPARK-37764
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, we discarded bucket information when 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37764) Reserve bucket information when relation conversion from metastore relations to data source relations

2021-12-28 Thread yikf (Jira)
yikf created SPARK-37764:


 Summary: Reserve bucket information when relation conversion from 
metastore relations to data source relations
 Key: SPARK-37764
 URL: https://issues.apache.org/jira/browse/SPARK-37764
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: yikf
 Fix For: 3.3.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37367) Reenable exception test in DDLParserSuite.create view -- basic

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466022#comment-17466022
 ] 

Apache Spark commented on SPARK-37367:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35040

> Reenable exception test in DDLParserSuite.create view -- basic
> --
>
> Key: SPARK-37367
> URL: https://issues.apache.org/jira/browse/SPARK-37367
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37308 disabled a test due to unknown flakiness. We should re-enable it after
> investigation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37367) Reenable exception test in DDLParserSuite.create view -- basic

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466021#comment-17466021
 ] 

Apache Spark commented on SPARK-37367:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35040

> Reenable exception test in DDLParserSuite.create view -- basic
> --
>
> Key: SPARK-37367
> URL: https://issues.apache.org/jira/browse/SPARK-37367
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37308 disabled a test due to unknown flakiness. We should re-enable it after
> investigation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37367) Reenable exception test in DDLParserSuite.create view -- basic

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37367:


Assignee: (was: Apache Spark)

> Reenable exception test in DDLParserSuite.create view -- basic
> --
>
> Key: SPARK-37367
> URL: https://issues.apache.org/jira/browse/SPARK-37367
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37308 disabled a test due to unknown flakiness. We should re-enable it after
> investigation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37367) Reenable exception test in DDLParserSuite.create view -- basic

2021-12-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37367:


Assignee: Apache Spark

> Reenable exception test in DDLParserSuite.create view -- basic
> --
>
> Key: SPARK-37367
> URL: https://issues.apache.org/jira/browse/SPARK-37367
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-37308 disabled a test due to unknown flakiness. We should re-enable it after
> investigation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37503) Improve SparkSession/PySpark SparkSession startup

2021-12-28 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466009#comment-17466009
 ] 

angerszhu commented on SPARK-37503:
---

Yea, thanks


> Improve SparkSession/PySpark SparkSession startup
> -
>
> Key: SPARK-37503
> URL: https://issues.apache.org/jira/browse/SPARK-37503
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37507) Add the TO_BINARY() function

2021-12-28 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465995#comment-17465995
 ] 

angerszhu commented on SPARK-37507:
---

Yea, ok.

> Add the TO_BINARY() function
> 
>
> Key: SPARK-37507
> URL: https://issues.apache.org/jira/browse/SPARK-37507
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> to_binary(expr, fmt) is a common function available in many other systems to
> provide a unified entry point for string-to-binary conversion, where fmt can
> be utf8, base64, hex, or base2 (or whatever the reverse operation
> to_char() supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex
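> For context, a hedged sketch: the existing functions cover individual formats today,
> while the proposed to_binary(expr, fmt) would unify them (the to_binary calls are
> hypothetical until this ticket is implemented):
> {code:java}
> // unhex and unbase64 exist in current Spark; both decode strings to BINARY
> spark.sql("SELECT unhex('537061726B') AS from_hex, unbase64('U3Bhcms=') AS from_base64").show()
> // Proposed unified form:
> //   SELECT to_binary('537061726B', 'hex'), to_binary('U3Bhcms=', 'base64')
> {code}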



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465994#comment-17465994
 ] 

Apache Spark commented on SPARK-37728:
--

User 'yym1995' has created a pull request for this issue:
https://github.com/apache/spark/pull/35038

> reading nested columns with ORC vectorized reader can cause 
> ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-37728
> URL: https://issues.apache.org/jira/browse/SPARK-37728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yimin Yang
>Assignee: Yimin Yang
>Priority: Major
> Fix For: 3.3.0
>
>
> When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
> nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is 
> a simple reproduction:
> 1) create an ORC file which contains records of type Array[Array[String]]:
> {code:java}
> ./bin/spark-shell {code}
> {code:java}
> case class Item(record: Array[Array[String]])
> val data = new Array[Array[Array[String]]](100)
> for (i <- 0 to 99) {
>   val temp = new Array[Array[String]](50)
>   for (j <- 0 to 49) {
>     temp(j) = new Array[String](1000)
>     for (k <- 0 to 999) {
>       temp(j)(k) = k.toString
>     }
>   }
>   data(i) = temp
> }
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = rdd.map(x => Item(x)).toDF
> df.write.orc("file:///home/user_name/data") {code}
>  
> 2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
> {code:java}
> ./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
> spark.sql.codegen.wholeStage=true --conf 
> spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
> spark.sql.orc.columnarReaderBatchSize=4096 {code}
> {code:java}
> val df = spark.read.orc("file:///home/user_name/data")
> df.show(100) {code}
>  
> Then Spark threw ArrayIndexOutOfBoundsException:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)

[jira] [Commented] (SPARK-37503) Improve SparkSession/PySpark SparkSession startup

2021-12-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465993#comment-17465993
 ] 

Hyukjin Kwon commented on SPARK-37503:
--

oh you did it correctly. I just switched it from "Fixed" to "Done"

> Improve SparkSession/PySpark SparkSession startup
> -
>
> Key: SPARK-37503
> URL: https://issues.apache.org/jira/browse/SPARK-37503
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37728) reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException

2021-12-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465992#comment-17465992
 ] 

Apache Spark commented on SPARK-37728:
--

User 'yym1995' has created a pull request for this issue:
https://github.com/apache/spark/pull/35038

> reading nested columns with ORC vectorized reader can cause 
> ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-37728
> URL: https://issues.apache.org/jira/browse/SPARK-37728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yimin Yang
>Assignee: Yimin Yang
>Priority: Major
> Fix For: 3.3.0
>
>
> When spark.sql.orc.enableNestedColumnVectorizedReader is set to true, reading 
> nested columns of ORC files can cause ArrayIndexOutOfBoundsException. Here is 
> a simple reproduction:
> 1) create an ORC file which contains records of type Array[Array[String]]:
> {code:java}
> ./bin/spark-shell {code}
> {code:java}
> case class Item(record: Array[Array[String]])
> val data = new Array[Array[Array[String]]](100)
> for (i <- 0 to 99) {
>   val temp = new Array[Array[String]](50)
>   for (j <- 0 to 49) {
>     temp(j) = new Array[String](1000)
>     for (k <- 0 to 999) {
>       temp(j)(k) = k.toString
>     }
>   }
>   data(i) = temp
> }
> val rdd = spark.sparkContext.parallelize(data, 1)
> val df = rdd.map(x => Item(x)).toDF
> df.write.orc("file:///home/user_name/data") {code}
>  
> 2) read the orc with spark.sql.orc.enableNestedColumnVectorizedReader=true
> {code:java}
> ./bin/spark-shell --conf spark.sql.orc.enableVectorizedReader=true --conf 
> spark.sql.codegen.wholeStage=true --conf 
> spark.sql.orc.enableNestedColumnVectorizedReader=true --conf 
> spark.sql.orc.columnarReaderBatchSize=4096 {code}
> {code:java}
> val df = spark.read.orc("file:///home/user_name/data")
> df.show(100) {code}
>  
> Then Spark threw ArrayIndexOutOfBoundsException:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1162)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1162)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:940)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2227)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2248)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:490)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3833)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3824)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3822)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:3053)

[jira] [Commented] (SPARK-37507) Add the TO_BINARY() function

2021-12-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465991#comment-17465991
 ] 

Hyukjin Kwon commented on SPARK-37507:
--

Oh, actually [~XinrongM] is already working on it. Sorry, she should have left a
comment here. Is that okay, [~angerszhuuu] and [~beliefer]?

> Add the TO_BINARY() function
> 
>
> Key: SPARK-37507
> URL: https://issues.apache.org/jira/browse/SPARK-37507
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> to_binary(expr, fmt) is a common function available in many other systems to
> provide a unified entry point for string-to-binary conversion, where fmt can
> be utf8, base64, hex, or base2 (or whatever the reverse operation
> to_char() supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37503) Improve SparkSession/PySpark SparkSession startup

2021-12-28 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465990#comment-17465990
 ] 

angerszhu commented on SPARK-37503:
---

Seems I didn't see the Resolved tag; maybe I don't have the privilege?

> Improve SparkSession/PySpark SparkSession startup
> -
>
> Key: SPARK-37503
> URL: https://issues.apache.org/jira/browse/SPARK-37503
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37507) Add the TO_BINARY() function

2021-12-28 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465989#comment-17465989
 ] 

angerszhu commented on SPARK-37507:
---

Checked with [~beliefer]; I will take this one.

> Add the TO_BINARY() function
> 
>
> Key: SPARK-37507
> URL: https://issues.apache.org/jira/browse/SPARK-37507
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> to_binary(expr, fmt) is a common function available in many other systems to
> provide a unified entry point for string-to-binary conversion, where fmt can
> be utf8, base64, hex, or base2 (or whatever the reverse operation
> to_char() supports).
> [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html]
> [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html]
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes]
> [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw]
> Related Spark functions: unbase64, unhex



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-37503) Improve SparkSession/PySpark SparkSession startup

2021-12-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-37503:
--

> Improve SparkSession/PySpark SparkSession startup
> -
>
> Key: SPARK-37503
> URL: https://issues.apache.org/jira/browse/SPARK-37503
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37503) Improve SparkSession/PySpark SparkSession startup

2021-12-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37503.
--
Resolution: Done

> Improve SparkSession/PySpark SparkSession startup
> -
>
> Key: SPARK-37503
> URL: https://issues.apache.org/jira/browse/SPARK-37503
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


