[jira] [Assigned] (SPARK-39490) Support `ipFamilyPolicy` and `ipFamilies` in Driver Service

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39490:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Support `ipFamilyPolicy` and `ipFamilies` in Driver Service
> ---
>
> Key: SPARK-39490
> URL: https://issues.apache.org/jira/browse/SPARK-39490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
>  - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
>  -- v1.16 [alpha]
>  -- v1.21 [beta]
>  -- v1.23 [stable]
> To support IPv6-only environments, we need to control these features.
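For concreteness, `ipFamilyPolicy` and `ipFamilies` are standard fields of the Kubernetes Service spec (values such as SingleStack/PreferDualStack/RequireDualStack and IPv4/IPv6). A minimal sketch of how a job might opt in on the Spark side, assuming driver-service config keys along these lines (the key names here are assumptions, not confirmed by this thread; the actual names are defined by the pull request referenced in the follow-up comment):

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: the two config keys below are assumed names for illustration.
// The values are the standard Kubernetes dual-stack Service settings.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.service.ipFamilyPolicy", "PreferDualStack") // assumed key
  .set("spark.kubernetes.driver.service.ipFamilies", "IPv6")                // assumed key
{code}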






[jira] [Commented] (SPARK-39490) Support `ipFamilyPolicy` and `ipFamilies` in Driver Service

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554911#comment-17554911
 ] 

Apache Spark commented on SPARK-39490:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36887

> Support `ipFamilyPolicy` and `ipFamilies` in Driver Service
> ---
>
> Key: SPARK-39490
> URL: https://issues.apache.org/jira/browse/SPARK-39490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
>  - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
>  -- v1.16 [alpha]
>  -- v1.21 [beta]
>  -- v1.23 [stable]
> To support IPv6-only environments, we need to control these features.






[jira] [Assigned] (SPARK-39490) Support `ipFamilyPolicy` and `ipFamilies` in Driver Service

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39490:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Support `ipFamilyPolicy` and `ipFamilies` in Driver Service
> ---
>
> Key: SPARK-39490
> URL: https://issues.apache.org/jira/browse/SPARK-39490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
>  - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
>  -- v1.16 [alpha]
>  -- v1.21 [beta]
>  -- v1.23 [stable]
> To support IPv6-only environments, we need to control these features.






[jira] [Updated] (SPARK-39490) Support `ipFamilyPolicy` and `ipFamilies` in Driver Service

2022-06-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39490:
--
Description: 
K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
 - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
 -- v1.16 [alpha]
 -- v1.21 [beta]
 -- v1.23 [stable]

To support IPv6-only environments, we need to control these features.

  was:
K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
 - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
 -- v1.16 [alpha]
 -- v1.21 [beta]
 -- v1.23 [stable]


> Support `ipFamilyPolicy` and `ipFamilies` in Driver Service
> ---
>
> Key: SPARK-39490
> URL: https://issues.apache.org/jira/browse/SPARK-39490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
>  - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
>  -- v1.16 [alpha]
>  -- v1.21 [beta]
>  -- v1.23 [stable]
> To support IPv6-only environments, we need to control these features.






[jira] [Updated] (SPARK-39490) Support `ipFamilyPolicy` and `ipFamilies` in Driver Service

2022-06-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39490:
--
Description: 
K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
 - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
 -- v1.16 [alpha]
 -- v1.21 [beta]
 -- v1.23 [stable]

> Support `ipFamilyPolicy` and `ipFamilies` in Driver Service
> ---
>
> Key: SPARK-39490
> URL: https://issues.apache.org/jira/browse/SPARK-39490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> K8s IPv4/IPv6 dual-stack Feature reached `Stable` stage at v1.23.
>  - [https://kubernetes.io/docs/concepts/services-networking/dual-stack/]
>  -- v1.16 [alpha]
>  -- v1.21 [beta]
>  -- v1.23 [stable]






[jira] [Updated] (SPARK-39490) Support `ipFamilyPolicy` and `ipFamilies` in Driver Service

2022-06-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39490:
--
Summary: Support `ipFamilyPolicy` and `ipFamilies` in Driver Service  (was: 
Support ipFamilyPolicy and ipFamilies in Driver Service)

> Support `ipFamilyPolicy` and `ipFamilies` in Driver Service
> ---
>
> Key: SPARK-39490
> URL: https://issues.apache.org/jira/browse/SPARK-39490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-39490) Support ipFamilyPolicy and ipFamilies in Driver Service

2022-06-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39490:
-

Assignee: Dongjoon Hyun

> Support ipFamilyPolicy and ipFamilies in Driver Service
> ---
>
> Key: SPARK-39490
> URL: https://issues.apache.org/jira/browse/SPARK-39490
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-39490) Support ipFamilyPolicy and ipFamilies in Driver Service

2022-06-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-39490:
-

 Summary: Support ipFamilyPolicy and ipFamilies in Driver Service
 Key: SPARK-39490
 URL: https://issues.apache.org/jira/browse/SPARK-39490
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-39399) proxy-user support not working for Spark on k8s in cluster deploy mode

2022-06-15 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554904#comment-17554904
 ] 

pralabhkumar commented on SPARK-39399:
--

ping [~hyukjin.kwon], please help us with this, or point us to someone who can 
take it forward.

> proxy-user support not working for Spark on k8s in cluster deploy mode
> --
>
> Key: SPARK-39399
> URL: https://issues.apache.org/jira/browse/SPARK-39399
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant
>Priority: Major
>
> As part of https://issues.apache.org/jira/browse/SPARK-25355, proxy-user 
> support was added for Spark on K8s. However, the PR only passed the proxy user 
> from the spark-submit command through to childArgs; the actual authentication 
> as the proxy user does not work in cluster deploy mode for Spark on K8s.
> We get an AccessControlException when trying to access kerberized HDFS 
> through a proxy user.
> Spark-Submit:
> $SPARK_HOME/bin/spark-submit \
> --master  \
> --deploy-mode cluster \
> --name with_proxy_user_di \
> --proxy-user  \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.kubernetes.container.image= \
> --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \
> --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \
> --conf spark.kubernetes.driver.limit.cores=1 \
> --conf spark.executor.instances=1 \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.namespace= \
> --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
> --conf spark.eventLog.enabled=true \
> --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \
> --conf spark.kubernetes.file.upload.path=hdfs:///tmp \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/log4j/log4j.properties \
> $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar
> Driver Logs:
> {code:java}
> ++ id -u
> + myuid=185
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 185
> + uidentry=
> + set -e
> + '[' -z '' ']'
> + '[' -w /etc/passwd ']'
> + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + '[' -n '' ']'
> + '[' -z ']'
> + '[' -z ']'
> + '[' -n '' ']'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*'
> + case "$1" in
> + shift 1
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor 
> java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful 
> kerberos logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos 
> logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, 
> valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private 
> 
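For context on what the missing piece looks like on the Hadoop side, here is a minimal sketch of standard proxy-user impersonation with UserGroupInformation (generic Hadoop usage, assuming hadoop-common on the classpath; this is not Spark's internal code path): the Kerberos-authenticated real user wraps HDFS access in a proxy UGI's doAs, which is what the report says is effectively not happening for the driver in K8s cluster mode.

{code:scala}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Generic Hadoop proxy-user impersonation (illustrative, not Spark internals):
// the authenticated real user creates a proxy UGI and runs HDFS access inside
// doAs, subject to the cluster's hadoop.proxyuser.* ACLs.
val realUser  = UserGroupInformation.getLoginUser
val proxyUser = UserGroupInformation.createProxyUser("proxy_user", realUser)
proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // FileSystem / HDFS calls made here execute as "proxy_user"
  }
})
{code}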

[jira] [Commented] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554899#comment-17554899
 ] 

Hyukjin Kwon commented on SPARK-39074:
--

Reverted at 
https://github.com/apache/spark/commit/ae10ff8837385871c3f72b2b7bb97dd235872602

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Minor
>
> The CI workflow "Report test results" fails when there are no artifacts to be 
> downloaded from the triggering workflow. In some situations, the triggering 
> workflow is not skipped, but all test jobs are skipped in case no code 
> changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> Downloading no test files can have two reasons:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested deliberately.
> You want to be notified in the first situation to fix the CI. Therefore, CI 
> should fail when code is built and tests are run but no test result files have 
> been found.
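A minimal sketch of the behaviour being asked for (illustrative only; the helper and directory name are hypothetical, and the real change lives in the GitHub Actions workflows): fail the job that ran the tests when it produced no result files, rather than failing the downstream download.

{code:scala}
import java.nio.file.{Files, Paths}

// Hypothetical helper illustrating the desired check: if tests were run but no
// result files exist, fail here instead of in the downstream "download" step.
def assertTestResultsExist(resultsDir: String, testsWereRun: Boolean): Unit = {
  val dir = Paths.get(resultsDir)
  val hasResults = Files.isDirectory(dir) && Files.list(dir).findAny().isPresent
  if (testsWereRun && !hasResults) {
    sys.error(s"Tests were run but no test result files were found under $resultsDir")
  }
}
{code}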






[jira] [Assigned] (SPARK-39383) Support V2 data sources with DEFAULT values

2022-06-15 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-39383:
--

Assignee: Daniel

> Support V2 data sources with DEFAULT values
> ---
>
> Key: SPARK-39383
> URL: https://issues.apache.org/jira/browse/SPARK-39383
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-39383) Support V2 data sources with DEFAULT values

2022-06-15 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-39383.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36880
[https://github.com/apache/spark/pull/36880]

> Support V2 data sources with DEFAULT values
> ---
>
> Key: SPARK-39383
> URL: https://issues.apache.org/jira/browse/SPARK-39383
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Reopened] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-39074:
--
  Assignee: (was: Enrico Minack)

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Minor
> Fix For: 3.4.0
>
>
> The CI workflow "Report test results" fails when there are no artifacts to be 
> downloaded from the triggering workflow. In some situations, the triggering 
> workflow is not skipped, but all test jobs are skipped in case no code 
> changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> Downloading no test files can have two reasons:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested deliberately.
> You want to be notified in the first situation to fix the CI. Therefore, CI 
> should fail when code is built and tests are run but no test result files have 
> been found.






[jira] [Commented] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554895#comment-17554895
 ] 

Hyukjin Kwon commented on SPARK-39074:
--

Fixed in 
https://github.com/apache/spark/commit/ae10ff8837385871c3f72b2b7bb97dd235872602

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Minor
>
> The CI workflow "Report test results" fails when there are no artifacts to be 
> downloaded from the triggering workflow. In some situations, the triggering 
> workflow is not skipped, but all test jobs are skipped in case no code 
> changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> Downloading no test files can have two reasons:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested deliberately.
> You want to be notified in the first situation to fix the CI. Therefore, CI 
> should fail when code is built and tests are run but no test result files have 
> been found.






[jira] [Assigned] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39074:


Assignee: (was: Apache Spark)

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Minor
>
> The CI workflow "Report test results" fails when there are no artifacts to be 
> downloaded from the triggering workflow. In some situations, the triggering 
> workflow is not skipped, but all test jobs are skipped in case no code 
> changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> Downloading no test files can have two reasons:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested deliberately.
> You want to be notified in the first situation to fix the CI. Therefore, CI 
> should fail when code is built and tests are run but no test result files have 
> been found.






[jira] [Assigned] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39074:


Assignee: Apache Spark

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Minor
>
> The CI workflow "Report test results" fails when there are no artifacts to be 
> downloaded from the triggering workflow. In some situations, the triggering 
> workflow is not skipped, but all test jobs are skipped in case no code 
> changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> Downloading no test files can have two reasons:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested deliberately.
> You want to be notified in the first situation to fix the CI. Therefore, CI 
> should fail when code is built and tests are run but no test result files have 
> been found.






[jira] [Updated] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39074:
-
Fix Version/s: (was: 3.4.0)

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Minor
>
> The CI workflow "Report test results" fails when there are no artifacts to be 
> downloaded from the triggering workflow. In some situations, the triggering 
> workflow is not skipped, but all test jobs are skipped in case no code 
> changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> Downloading no test files can have two reasons:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested deliberately.
> You want to be notified in the first situation to fix the CI. Therefore, CI 
> should fail when code is built and tests are run but no test result files have 
> been found.






[jira] [Commented] (SPARK-39489) Improve EventLoggingListener and ReplayListener performance by replacing Json4S ASTs with Jackson trees

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554874#comment-17554874
 ] 

Apache Spark commented on SPARK-39489:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/36885

> Improve EventLoggingListener and ReplayListener performance by replacing 
> Json4S ASTs with Jackson trees
> ---
>
> Key: SPARK-39489
> URL: https://issues.apache.org/jira/browse/SPARK-39489
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark's event log JsonProtocol currently uses Json4s ASTs to generate and 
> parse JSON. Performance overheads from Json4s account for a significant 
> proportion of all time spent in JsonProtocol. If we replace Json4s usage with 
> direct usage of Jackson APIs then we can significantly improve performance 
> (~2x improvement for writing and reading in my own local microbenchmarks).
> This performance improvement translates to faster history server load times 
> and reduced load on the Spark driver (and reduced likelihood of dropping 
> events because the listener cannot keep up, therefore reducing the likelihood 
> of inconsistent Spark UIs).
> Reducing our usage of Json4s is also a step towards being able to eventually 
> remove our dependency on Json4s: Spark's current use of Json4s creates 
> library conflicts for end users who want to adopt Json4s 4 (see discussion on 
> PRs for SPARK-36408). If Spark can eventually remove its Json4s dependency 
> then we will completely eliminate such conflicts.






[jira] [Created] (SPARK-39489) Improve EventLoggingListener and ReplayListener performance by replacing Json4S ASTs with Jackson trees

2022-06-15 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-39489:
--

 Summary: Improve EventLoggingListener and ReplayListener 
performance by replacing Json4S ASTs with Jackson trees
 Key: SPARK-39489
 URL: https://issues.apache.org/jira/browse/SPARK-39489
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


Spark's event log JsonProtocol currently uses Json4s ASTs to generate and parse 
JSON. Performance overheads from Json4s account for a significant proportion of 
all time spent in JsonProtocol. If we replace Json4s usage with direct usage of 
Jackson APIs then we can significantly improve performance (~2x improvement for 
writing and reading in my own local microbenchmarks).

This performance improvement translates to faster history server load times and 
reduced load on the Spark driver (and reduced likelihood of dropping events 
because the listener cannot keep up, therefore reducing the likelihood of 
inconsistent Spark UIs).

Reducing our usage of Json4s is also a step towards being able to eventually 
remove our dependency on Json4s: Spark's current use of Json4s creates library 
conflicts for end users who want to adopt Json4s 4 (see discussion on PRs for 
SPARK-36408). If Spark can eventually remove its Json4s dependency then we will 
completely eliminate such conflicts.
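As a rough sketch of the direction described above (illustrative only, not the code from the pull request), generating and parsing JSON through Jackson's tree model directly rather than building Json4s ASTs:

{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper

// Illustrative sketch: build and read JSON with Jackson's tree model
// (ObjectNode / JsonNode) instead of constructing Json4s JObject ASTs.
val mapper = new ObjectMapper()

// Writing: assemble an ObjectNode and serialize it.
val event = mapper.createObjectNode()
event.put("Event", "SparkListenerTaskEnd")
event.put("Stage ID", 1)
val json = mapper.writeValueAsString(event) // {"Event":"SparkListenerTaskEnd","Stage ID":1}

// Reading: parse back into a Jackson tree and extract fields from it.
val parsed  = mapper.readTree(json)
val stageId = parsed.get("Stage ID").intValue()
{code}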






[jira] [Assigned] (SPARK-39488) Simplify the error handling of TempResolvedColumn

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39488:


Assignee: (was: Apache Spark)

> Simplify the error handling of TempResolvedColumn
> -
>
> Key: SPARK-39488
> URL: https://issues.apache.org/jira/browse/SPARK-39488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-39488) Simplify the error handling of TempResolvedColumn

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39488:


Assignee: Apache Spark

> Simplify the error handling of TempResolvedColumn
> -
>
> Key: SPARK-39488
> URL: https://issues.apache.org/jira/browse/SPARK-39488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-39488) Simplify the error handling of TempResolvedColumn

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554851#comment-17554851
 ] 

Apache Spark commented on SPARK-39488:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/36809

> Simplify the error handling of TempResolvedColumn
> -
>
> Key: SPARK-39488
> URL: https://issues.apache.org/jira/browse/SPARK-39488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Created] (SPARK-39488) Simplify the error handling of TempResolvedColumn

2022-06-15 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-39488:
---

 Summary: Simplify the error handling of TempResolvedColumn
 Key: SPARK-39488
 URL: https://issues.apache.org/jira/browse/SPARK-39488
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan









[jira] [Resolved] (SPARK-39476) Disable Unwrap cast optimize when casting from Long to Float/ Double or from Integer to Float

2022-06-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-39476.
-
Fix Version/s: 3.3.1
   3.2.2
   3.1.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 36873
[https://github.com/apache/spark/pull/36873]

> Disable Unwrap cast optimize when casting from Long to Float/ Double or from 
> Integer to Float
> -
>
> Key: SPARK-39476
> URL: https://issues.apache.org/jira/browse/SPARK-39476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1, 3.3.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
> Fix For: 3.3.1, 3.2.2, 3.1.3, 3.4.0
>
>
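The ticket itself carries no description, but the motivation is presumably floating-point precision: distinct Long (or Integer) values can collapse to the same Float/Double, so a filter on cast(col as float/double) cannot be safely rewritten ("unwrapped") into a filter on the original integer column. A small illustration:

{code:scala}
// Distinct integral values can map to the same floating-point value, so
// unwrapping cast(longCol AS FLOAT) = <literal> into longCol = <long> is lossy.
val a = 16777217L                    // 2^24 + 1, not representable as a Float
val b = 16777216L                    // 2^24
println(a.toFloat == b.toFloat)      // true

val c = 9007199254740993L            // 2^53 + 1, not representable as a Double
val d = 9007199254740992L            // 2^53
println(c.toDouble == d.toDouble)    // true
{code}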







[jira] [Assigned] (SPARK-39476) Disable Unwrap cast optimize when casting from Long to Float/ Double or from Integer to Float

2022-06-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-39476:
---

Assignee: EdisonWang

> Disable Unwrap cast optimize when casting from Long to Float/ Double or from 
> Integer to Float
> -
>
> Key: SPARK-39476
> URL: https://issues.apache.org/jira/browse/SPARK-39476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1, 3.3.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
>







[jira] [Resolved] (SPARK-39465) Log4j version upgrade to 2.17.2

2022-06-15 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-39465.

Resolution: Done

> Log4j version upgrade to 2.17.2
> ---
>
> Key: SPARK-39465
> URL: https://issues.apache.org/jira/browse/SPARK-39465
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Java API
>Affects Versions: 3.2.1
> Environment: Production 
>Reporter: Chethan G B
>Priority: Major
>
> Hi Team,
> There were talks about upgrading log4j to the latest available version as part 
> of a security fix.
> We wanted to know whether it has already been upgraded.
>  
> Note: We are using the dependencies below:
>  
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-core_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-sql_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> Kindly let us know when the log4j upgrade will be available for users?






[jira] [Reopened] (SPARK-39465) Log4j version upgrade to 2.17.2

2022-06-15 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-39465:


> Log4j version upgrade to 2.17.2
> ---
>
> Key: SPARK-39465
> URL: https://issues.apache.org/jira/browse/SPARK-39465
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Java API
>Affects Versions: 3.2.1
> Environment: Production 
>Reporter: Chethan G B
>Priority: Major
>
> Hi Team,
> There were talks about upgrading log4j to the latest available version as part 
> of a security fix.
> We wanted to know whether it has already been upgraded.
>  
> Note: We are using the dependencies below:
>  
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-core_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-sql_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> Kindly let us know when the log4j upgrade will be available for users?






[jira] [Comment Edited] (SPARK-39465) Log4j version upgrade to 2.17.2

2022-06-15 Thread Josh Rosen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554844#comment-17554844
 ] 

Josh Rosen edited comment on SPARK-39465 at 6/16/22 1:21 AM:
-

Spark uses Log4J 2.x starting in Spark 3.3.0+; see SPARK-37814 

The migration from Log4J 1.x to Log4J 2.x is too large of a change for us to 
backport to existing Spark versions (see [related discussion on another 
ticket|https://issues.apache.org/jira/browse/SPARK-37883?focusedCommentId=17481521=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17481521]).

As a result, if you want to use Log4J 2.x then you will need to upgrade to 
Spark 3.3.0.

The [Spark 3.3.0 release vote just passed 
yesterday|https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm], so 
the release should be published in the next couple of days. 


was (Author: joshrosen):
Spark uses Log4J 2.x starting in Spark 3.3.0+; see SPARK-37814 

The migration from Log4J 1.x to Log4J 2.x is too large of a change for us to 
backport to existing Spark versions (see related discussion on another ticket).

As a result, if you want to use Log4J 2.x then you will need to upgrade to 
Spark 3.3.0.

The [Spark 3.3.0 release vote just passed 
yesterday|https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm], so 
the release should be published in the next couple of days. 

> Log4j version upgrade to 2.17.2
> ---
>
> Key: SPARK-39465
> URL: https://issues.apache.org/jira/browse/SPARK-39465
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Java API
>Affects Versions: 3.2.1
> Environment: Production 
>Reporter: Chethan G B
>Priority: Major
>
> Hi Team,
> There were talks about upgrading log4j to the latest available version as part 
> of a security fix.
> We wanted to know whether it has already been upgraded.
>  
> Note: We are using the dependencies below:
>  
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-core_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-sql_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> Kindly let us know when the log4j upgrade will be available for users?






[jira] [Resolved] (SPARK-39465) Log4j version upgrade to 2.17.2

2022-06-15 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-39465.

Resolution: Won't Fix

> Log4j version upgrade to 2.17.2
> ---
>
> Key: SPARK-39465
> URL: https://issues.apache.org/jira/browse/SPARK-39465
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Java API
>Affects Versions: 3.2.1
> Environment: Production 
>Reporter: Chethan G B
>Priority: Major
>
> Hi Team,
> There were talks about upgrading log4j to the latest available version as part 
> of a security fix.
> We wanted to know whether it has already been upgraded.
>  
> Note: We are using the dependencies below:
>  
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-core_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-sql_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> Kindly let us know when the log4j upgrade will be available for users?






[jira] [Commented] (SPARK-39465) Log4j version upgrade to 2.17.2

2022-06-15 Thread Josh Rosen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554844#comment-17554844
 ] 

Josh Rosen commented on SPARK-39465:


Spark uses Log4J 2.x starting in Spark 3.3.0+; see SPARK-37814 

The migration from Log4J 1.x to Log4J 2.x is too large of a change for us to 
backport to existing Spark versions (see related discussion on another ticket).

As a result, if you want to use Log4J 2.x then you will need to upgrade to 
Spark 3.3.0.

The [Spark 3.3.0 release vote just passed 
yesterday|https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm], so 
the release should be published in the next couple of days. 

> Log4j version upgrade to 2.17.2
> ---
>
> Key: SPARK-39465
> URL: https://issues.apache.org/jira/browse/SPARK-39465
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Java API
>Affects Versions: 3.2.1
> Environment: Production 
>Reporter: Chethan G B
>Priority: Major
>
> Hi Team,
> There were talks about upgrading log4j to the latest available version as part 
> of a security fix.
> We wanted to know whether it has already been upgraded.
>  
> Note: We are using the dependencies below:
>  
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-core_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-sql_2.12</artifactId>
>           <version>3.0.1</version>
> </dependency>
> Kindly let us know when the log4j upgrade will be available for users?






[jira] [Assigned] (SPARK-39485) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39485:


Assignee: (was: Apache Spark)

> When fetching hiveMetastoreJars from path, IsolatedClientLoader should get 
> hive settings from origLoader.
> -
>
> Key: SPARK-39485
> URL: https://issues.apache.org/jira/browse/SPARK-39485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: SeongHoon Ku
>Priority: Major
>
> Hi all,
> I made a Spark application where the deploy mode is YARN cluster, 
> spark.sql.hive.metastore.jars is set to "path", and the Hive metastore version 
> is 2.3.2.
> "spark.yarn.dist.files" was set so that the driver could refer to the 
> Hive-related XML files in cluster mode.
> {code}
> spark.yarn.dist.files 
> viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
> {code}
> The application failed with the following error.
> {code}
> 22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.spark.sql.AnalysisException: 
> java.lang.ExceptionInInitializerError: null
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
>   at 
> org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> 

[jira] [Assigned] (SPARK-39485) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39485:


Assignee: Apache Spark

> When fetching hiveMetastoreJars from path, IsolatedClientLoader should get 
> hive settings from origLoader.
> -
>
> Key: SPARK-39485
> URL: https://issues.apache.org/jira/browse/SPARK-39485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: SeongHoon Ku
>Assignee: Apache Spark
>Priority: Major
>
> Hi all,
> I made a Spark application where the deploy mode is YARN cluster, 
> spark.sql.hive.metastore.jars is set to "path", and the Hive metastore version 
> is 2.3.2.
> "spark.yarn.dist.files" was set so that the driver could refer to the 
> Hive-related XML files in cluster mode.
> {code}
> spark.yarn.dist.files 
> viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
> {code}
> The application failed with the following error.
> {code}
> 22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.spark.sql.AnalysisException: 
> java.lang.ExceptionInInitializerError: null
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
>   at 
> org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> 

[jira] [Commented] (SPARK-39485) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554840#comment-17554840
 ] 

Apache Spark commented on SPARK-39485:
--

User 'koodin9' has created a pull request for this issue:
https://github.com/apache/spark/pull/36884

> When fetching hiveMetastoreJars from path, IsolatedClientLoader should get 
> hive settings from origLoader.
> -
>
> Key: SPARK-39485
> URL: https://issues.apache.org/jira/browse/SPARK-39485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: SeongHoon Ku
>Priority: Major
>
> Hi all,
> I made a Spark application where the deploy mode is YARN cluster, 
> spark.sql.hive.metastore.jars is set to "path", and the Hive metastore version 
> is 2.3.2.
> "spark.yarn.dist.files" was set so that the driver could refer to the 
> Hive-related XML files in cluster mode.
> {code}
> spark.yarn.dist.files 
> viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
> {code}
> The application failed with the following error.
> {code}
> 22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.spark.sql.AnalysisException: 
> java.lang.ExceptionInInitializerError: null
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
>   at 
> org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> 

[jira] [Updated] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-06-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39061:
-
Fix Version/s: 3.3.1
   (was: 3.3.0)

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.2, 3.3.1
>
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1    -1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct<a:int,b:int>)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.
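
Editorial sketch, not part of the original report: a minimal PySpark version of the cast workaround described above, assuming a pyspark shell where a SparkSession named spark already exists. The struct type in the cast is assumed to be struct<a:int,b:int>, matching the named_struct in the reported query.

{code}
# Hedged sketch of the workaround: give the NULL array entry an explicit struct
# type so that the element fields are treated as nullable and the generated
# projection writes real NULLs instead of default values.
df = spark.sql("""
    SELECT inline(array(
        named_struct('a', 1, 'b', 2),
        CAST(NULL AS STRUCT<a: INT, b: INT>)))
""")
df.show()
# Expected with the workaround (and after the fix):
# +----+----+
# |   a|   b|
# +----+----+
# |   1|   2|
# |null|null|
# +----+----+
{code}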



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-06-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39061:


Assignee: Bruce Robbins

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1    -1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct<a:int,b:int>)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-06-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39061.
--
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 36883
[https://github.com/apache/spark/pull/36883]

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness
> Fix For: 3.3.0, 3.2.2
>
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1    -1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct<a:int,b:int>)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv

2022-06-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554836#comment-17554836
 ] 

Hyukjin Kwon commented on SPARK-38292:
--

please go ahead!

> Support `na_filter` for pyspark.pandas.read_csv
> ---
>
> Key: SPARK-38292
> URL: https://issues.apache.org/jira/browse/SPARK-38292
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas supports the `na_filter` parameter for the `read_csv` function 
> (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
> We also want to support this to follow the behavior of pandas.
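
Editorial sketch, not from the ticket: a plain pandas example of what na_filter controls; the proposal is for pyspark.pandas.read_csv to accept the same parameter. The file contents below are illustrative.

{code}
import io
import pandas as pd

csv_data = "name,score\nalice,\nbob,7\n"

# Default (na_filter=True): empty fields are detected and returned as NaN.
print(pd.read_csv(io.StringIO(csv_data)))

# na_filter=False: NA detection is skipped and empty fields stay as empty
# strings, which can also speed up parsing of large files with no missing data.
print(pd.read_csv(io.StringIO(csv_data), na_filter=False))
{code}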



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39482) Add build and test documentation on IPv6

2022-06-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39482.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36879
[https://github.com/apache/spark/pull/36879]

> Add build and test documentation on IPv6
> 
>
> Key: SPARK-39482
> URL: https://issues.apache.org/jira/browse/SPARK-39482
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39457) Support IPv6-only environment

2022-06-15 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554830#comment-17554830
 ] 

DB Tsai commented on SPARK-39457:
-

If there is any IPv6 issue on the Hadoop client side, we might hit it once we get 
Spark fully working in a pure IPv6 environment. We will test it once we get there.

> Support IPv6-only environment
> -
>
> Key: SPARK-39457
> URL: https://issues.apache.org/jira/browse/SPARK-39457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: DB Tsai
>Priority: Major
>  Labels: releasenotes
>
> Spark doesn't fully work in a pure IPv6 environment that doesn't have IPv4 at 
> all. This is an umbrella JIRA tracking the support of pure IPv6 deployment.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-39486) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread SeongHoon Ku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SeongHoon Ku closed SPARK-39486.


Duplicate of https://issues.apache.org/jira/browse/SPARK-39485

> When fetching hiveMetastoreJars from path, IsolatedClientLoader should get 
> hive settings from origLoader.
> -
>
> Key: SPARK-39486
> URL: https://issues.apache.org/jira/browse/SPARK-39486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: SeongHoon Ku
>Priority: Major
>
> Hi all, 
> I built a Spark application where the deploy mode is YARN cluster, 
> spark.sql.hive.metastore.jars is set to path, and the Hive metastore version 
> is 2.3.2.
> "spark.yarn.dist.files" was set so that the driver could refer to the 
> Hive-related XML files in cluster mode.
> {code}
> spark.yarn.dist.files 
> viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
> {code}
> The application failed with the following error.
> {code}
> 22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.spark.sql.AnalysisException: 
> java.lang.ExceptionInInitializerError: null
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
>   at 
> org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> 
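
Editorial sketch, not part of the report: a minimal PySpark setup matching the configuration described above. The jar path is an illustrative placeholder, and spark.sql.hive.metastore.jars.path is only consulted when spark.sql.hive.metastore.jars is set to path; the YARN cluster deploy mode and spark.yarn.dist.files would be supplied through spark-submit and are omitted here.

{code}
from pyspark.sql import SparkSession

# Hedged sketch of the reported setup: Hive metastore client jars are loaded
# from an explicit path rather than from the Spark classpath.
spark = (
    SparkSession.builder
    .appName("hive-metastore-jars-path-sketch")
    .config("spark.sql.hive.metastore.version", "2.3.2")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path",
            "viewfs:///path/to/hive-2.3.2-jars/*.jar")  # placeholder location
    .enableHiveSupport()
    .getOrCreate()
)

# In the reported environment, the first metastore access (e.g. SHOW DATABASES)
# is where the ExceptionInInitializerError surfaces.
spark.sql("SHOW DATABASES").show()
{code}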

[jira] [Closed] (SPARK-39487) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread SeongHoon Ku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SeongHoon Ku closed SPARK-39487.


Duplicate of https://issues.apache.org/jira/browse/SPARK-39485

> When fetching hiveMetastoreJars from path, IsolatedClientLoader should get 
> hive settings from origLoader.
> -
>
> Key: SPARK-39487
> URL: https://issues.apache.org/jira/browse/SPARK-39487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: SeongHoon Ku
>Priority: Major
>
> Hi all, 
> I built a Spark application where the deploy mode is YARN cluster, 
> spark.sql.hive.metastore.jars is set to path, and the Hive metastore version 
> is 2.3.2.
> "spark.yarn.dist.files" was set so that the driver could refer to the 
> Hive-related XML files in cluster mode.
> {code}
> spark.yarn.dist.files 
> viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
> {code}
> The application failed with the following error.
> {code}
> 22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.spark.sql.AnalysisException: 
> java.lang.ExceptionInInitializerError: null
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
>   at 
> org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> 

[jira] [Resolved] (SPARK-39487) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread SeongHoon Ku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SeongHoon Ku resolved SPARK-39487.
--
Resolution: Duplicate

> When fetching hiveMetastoreJars from path, IsolatedClientLoader should get 
> hive settings from origLoader.
> -
>
> Key: SPARK-39487
> URL: https://issues.apache.org/jira/browse/SPARK-39487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: SeongHoon Ku
>Priority: Major
>
> Hi all, 
> I built a Spark application where the deploy mode is YARN cluster, 
> spark.sql.hive.metastore.jars is set to path, and the Hive metastore version 
> is 2.3.2.
> "spark.yarn.dist.files" was set so that the driver could refer to the 
> Hive-related XML files in cluster mode.
> {code}
> spark.yarn.dist.files 
> viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
> {code}
> The application failed with the following error.
> {code}
> 22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.spark.sql.AnalysisException: 
> java.lang.ExceptionInInitializerError: null
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
>   at 
> org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> 

[jira] [Resolved] (SPARK-39486) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread SeongHoon Ku (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SeongHoon Ku resolved SPARK-39486.
--
Resolution: Duplicate

> When fetching hiveMetastoreJars from path, IsolatedClientLoader should get 
> hive settings from origLoader.
> -
>
> Key: SPARK-39486
> URL: https://issues.apache.org/jira/browse/SPARK-39486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: SeongHoon Ku
>Priority: Major
>
> Hi all, 
> I built a Spark application where the deploy mode is YARN cluster, 
> spark.sql.hive.metastore.jars is set to path, and the Hive metastore version 
> is 2.3.2.
> "spark.yarn.dist.files" was set so that the driver could refer to the 
> Hive-related XML files in cluster mode.
> {code}
> spark.yarn.dist.files 
> viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
> {code}
> The application failed with the following error.
> {code}
> 22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.spark.sql.AnalysisException: 
> java.lang.ExceptionInInitializerError: null
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
>   at 
> org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> 

[jira] [Created] (SPARK-39487) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread SeongHoon Ku (Jira)
SeongHoon Ku created SPARK-39487:


 Summary: When fetching hiveMetastoreJars from path, 
IsolatedClientLoader should get hive settings from origLoader.
 Key: SPARK-39487
 URL: https://issues.apache.org/jira/browse/SPARK-39487
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: SeongHoon Ku


Hi all, 

I built a Spark application where the deploy mode is YARN cluster, 
spark.sql.hive.metastore.jars is set to path, and the Hive metastore version is 
2.3.2.
"spark.yarn.dist.files" was set so that the driver could refer to the 
Hive-related XML files in cluster mode.
{code}
spark.yarn.dist.files 
viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
{code}

The application failed with the following error.

{code}
22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with FAILED (diag message: User class threw exception: 
org.apache.spark.sql.AnalysisException: java.lang.ExceptionInInitializerError: 
null
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
at 
org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
at 
org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at 

[jira] [Created] (SPARK-39486) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread SeongHoon Ku (Jira)
SeongHoon Ku created SPARK-39486:


 Summary: When fetching hiveMetastoreJars from path, 
IsolatedClientLoader should get hive settings from origLoader.
 Key: SPARK-39486
 URL: https://issues.apache.org/jira/browse/SPARK-39486
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: SeongHoon Ku


Hi all, 

I built a Spark application where the deploy mode is YARN cluster, 
spark.sql.hive.metastore.jars is set to path, and the Hive metastore version is 
2.3.2.
"spark.yarn.dist.files" was set so that the driver could refer to the 
Hive-related XML files in cluster mode.
{code}
spark.yarn.dist.files 
viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
{code}

The application failed with the following error.

{code}
22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with FAILED (diag message: User class threw exception: 
org.apache.spark.sql.AnalysisException: java.lang.ExceptionInInitializerError: 
null
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
at 
org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
at 
org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at 

[jira] [Created] (SPARK-39485) When fetching hiveMetastoreJars from path, IsolatedClientLoader should get hive settings from origLoader.

2022-06-15 Thread SeongHoon Ku (Jira)
SeongHoon Ku created SPARK-39485:


 Summary: When fetching hiveMetastoreJars from path, 
IsolatedClientLoader should get hive settings from origLoader.
 Key: SPARK-39485
 URL: https://issues.apache.org/jira/browse/SPARK-39485
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: SeongHoon Ku


Hi all, 

I built a Spark application where the deploy mode is YARN cluster, 
spark.sql.hive.metastore.jars is set to path, and the Hive metastore version is 
2.3.2.
"spark.yarn.dist.files" was set so that the driver could refer to the 
Hive-related XML files in cluster mode.
{code}
spark.yarn.dist.files 
viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hive-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hivemetastore-site.xml,viewfs:///app/spark-3.2.1-bin-without-hadoop/conf/hiveserver2-site.xml
{code}

The application failed with the following error.

{code}
22/06/14 13:51:46 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with FAILED (diag message: User class threw exception: 
org.apache.spark.sql.AnalysisException: java.lang.ExceptionInInitializerError: 
null
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:111)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:298)
at 
org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:205)
at 
org.apache.spark.sql.execution.datasources.v2.ShowNamespacesExec.run(ShowNamespacesExec.scala:42)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at 

[jira] [Assigned] (SPARK-39469) Infer date type for CSV schema inference

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39469:


Assignee: (was: Apache Spark)

> Infer date type for CSV schema inference
> 
>
> Key: SPARK-39469
> URL: https://issues.apache.org/jira/browse/SPARK-39469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Jonathan Cui
>Priority: Major
>
> 1. If a column contains only dates, it should be of “date” type in the 
> inferred schema
>  * If the date format and the timestamp format are identical (e.g. both are 
> yyyy/mm/dd), entries will default to being interpreted as Date
> 2. If a column contains dates and timestamps, it should be of “timestamp” 
> type in the inferred schema
>  
> A similar issue was opened in the past but was reverted due to the lack of 
> strict pattern matching. 
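
Editorial sketch, not from the ticket: what the proposed inference could look like from the DataFrame reader, assuming a pyspark shell with a SparkSession named spark. The column names, formats, and expected schema are illustrative; on the affected versions this date inference is not yet implemented.

{code}
rows = [
    "event_date,event_ts",
    "2022/06/15,2022/06/15 10:30:00",
]

df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .option("dateFormat", "yyyy/MM/dd")
    .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
    .csv(spark.sparkContext.parallelize(rows))
)

df.printSchema()
# Proposed behavior: event_date inferred as date, event_ts as timestamp.
# Without the change, the date-only column typically falls back to string.
{code}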



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39469) Infer date type for CSV schema inference

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554809#comment-17554809
 ] 

Apache Spark commented on SPARK-39469:
--

User 'Jonathancui123' has created a pull request for this issue:
https://github.com/apache/spark/pull/36871

> Infer date type for CSV schema inference
> 
>
> Key: SPARK-39469
> URL: https://issues.apache.org/jira/browse/SPARK-39469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Jonathan Cui
>Priority: Major
>
> 1. If a column contains only dates, it should be of “date” type in the 
> inferred schema
>  * If the date format and the timestamp format are identical (e.g. both are 
> yyyy/mm/dd), entries will default to being interpreted as Date
> 2. If a column contains dates and timestamps, it should be of “timestamp” 
> type in the inferred schema
>  
> A similar issue was opened in the past but was reverted due to the lack of 
> strict pattern matching. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39469) Infer date type for CSV schema inference

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554808#comment-17554808
 ] 

Apache Spark commented on SPARK-39469:
--

User 'Jonathancui123' has created a pull request for this issue:
https://github.com/apache/spark/pull/36871

> Infer date type for CSV schema inference
> 
>
> Key: SPARK-39469
> URL: https://issues.apache.org/jira/browse/SPARK-39469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Jonathan Cui
>Priority: Major
>
> 1. If a column contains only dates, it should be of “date” type in the 
> inferred schema
>  * If the date format and the timestamp format are identical (e.g. both are 
> yyyy/mm/dd), entries will default to being interpreted as Date
> 2. If a column contains dates and timestamps, it should be of “timestamp” 
> type in the inferred schema
>  
> A similar issue was opened in the past but was reverted due to the lack of 
> strict pattern matching. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39469) Infer date type for CSV schema inference

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39469:


Assignee: Apache Spark

> Infer date type for CSV schema inference
> 
>
> Key: SPARK-39469
> URL: https://issues.apache.org/jira/browse/SPARK-39469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Jonathan Cui
>Assignee: Apache Spark
>Priority: Major
>
> 1. If a column contains only dates, it should be of “date” type in the 
> inferred schema
>  * If the date format and the timestamp format are identical (e.g. both are 
> yyyy/mm/dd), entries will default to being interpreted as Date
> 2. If a column contains dates and timestamps, it should be of “timestamp” 
> type in the inferred schema
>  
> A similar issue was opened in the past but was reverted due to the lack of 
> strict pattern matching. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39469) Infer date type for CSV schema inference

2022-06-15 Thread Jonathan Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Cui updated SPARK-39469:
-
Description: 
1. If a column contains only dates, it should be of “date” type in the inferred 
schema
 * If the date format and the timestamp format are identical (e.g. both are 
yyyy/mm/dd), entries will default to being interpreted as Date

2. If a column contains dates and timestamps, it should be of “timestamp” type 
in the inferred schema

 

A similar issue was opened in the past but was reverted due to the lack of 
strict pattern matching. 

  was:
1. If a column contains only dates, it should be of “date” type in the inferred 
schema
 * If the date format and the timestamp format are identical (e.g. both are 
yyyy/mm/dd), entries will default to being interpreted as Date

2. If a column contains dates and timestamps, it should be of “timestamp” type 
in the inferred schema


> Infer date type for CSV schema inference
> 
>
> Key: SPARK-39469
> URL: https://issues.apache.org/jira/browse/SPARK-39469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Jonathan Cui
>Priority: Major
>
> 1. If a column contains only dates, it should be of “date” type in the 
> inferred schema
>  * If the date format and the timestamp format are identical (e.g. both are 
> yyyy/mm/dd), entries will default to being interpreted as Date
> 2. If a column contains dates and timestamps, it should be of “timestamp” 
> type in the inferred schema
>  
> A similar issue was opened in the past but was reverted due to the lack of 
> strict pattern matching. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39061:


Assignee: (was: Apache Spark)

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1    -1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct<a:int,b:int>)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.
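
Editorial sketch, not from the report: one way to observe the root cause described above, assuming a pyspark shell with a SparkSession named spark. The key detail is the nullability of the struct fields that the code generator sees.

{code}
# Inspect the schema produced for an array of named_struct values plus a NULL.
df = spark.sql("SELECT array(named_struct('a', 1, 'b', 2), NULL) AS arr")
df.printSchema()
# On affected versions the element struct's fields are reported with
# `nullable = false`, so the generated unsafe projection never writes NULL
# for them and falls back to default values instead.
{code}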



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554801#comment-17554801
 ] 

Apache Spark commented on SPARK-39061:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/36883

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1    -1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct<a:int,b:int>)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.
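A minimal PySpark sketch of inspecting the inferred nullability described above (the session setup and variable names are assumed here, not taken from the report):

{code:python}
# Minimal sketch, assuming a local Spark 3.x session; not part of the original report.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Same shape as the failing query: an array of structs with a bare null entry.
df = spark.sql("SELECT array(named_struct('a', 1, 'b', 2), null) AS arr")

# Per the report, the element struct's fields come back as nullable = false
# even though the array itself contains a null entry.
elem_type = df.schema["arr"].dataType.elementType
for field in elem_type.fields:
    print(field.name, field.nullable)
{code}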



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39061) Incorrect results or NPE when using Inline function against an array of dynamically created structs

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39061:


Assignee: Apache Spark

> Incorrect results or NPE when using Inline function against an array of 
> dynamically created structs
> ---
>
> Key: SPARK-39061
> URL: https://issues.apache.org/jira/browse/SPARK-39061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> The following query returns incorrect results:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
> 1 2
> -1 -1
> Time taken: 4.053 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> In Hive, the last row is {{NULL, NULL}}:
> {noformat}
> Beeline version 2.3.9 by Apache Hive
> 0: jdbc:hive2://localhost:1> select inline(array(named_struct('a', 1, 
> 'b', 2), null));
> +---+---+
> |   a   |   b   |
> +---+---+
> | 1 | 2 |
> | NULL  | NULL  |
> +---+---+
> 2 rows selected (1.355 seconds)
> 0: jdbc:hive2://localhost:1> 
> {noformat}
> If the struct has string fields, you get a {{NullPointerException}}:
> {noformat}
> spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
> 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>  ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) ~[?:?]
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  ~[spark-sql_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
> {noformat}
> You can work around the issue by casting the null entry of the array:
> {noformat}
> spark-sql> select inline(array(named_struct('a', 1, 'b', 2), cast(null as 
> struct<a:int,b:int>)));
> 1 2
> NULL  NULL
> Time taken: 0.068 seconds, Fetched 2 row(s)
> spark-sql>
> {noformat}
> As far as I can tell, this issue only happens with arrays of structs where 
> the structs are created in an inline table or in a projection.
> The fields of the struct are not getting set to {{nullable = true}} when 
> there is no example in the array where the field is set to {{null}}. As a 
> result, {{GenerateUnsafeProjection.createCode}} generates bad code: it has no 
> code to create a row of null columns, so it just creates a row from variables 
> set with default values.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39468) Improve RpcAddress to add [] to IPv6 if needed

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554800#comment-17554800
 ] 

Apache Spark commented on SPARK-39468:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36882

> Improve RpcAddress to add [] to IPv6 if needed
> --
>
> Key: SPARK-39468
> URL: https://issues.apache.org/jira/browse/SPARK-39468
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>
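The ticket has no description, but the title describes the usual host:port formatting rule for IPv6 literals. A purely illustrative Python sketch of that rule (the helper name and logic are hypothetical and are not Spark's actual RpcAddress code):

{code:python}
# Illustrative only; not Spark's RpcAddress implementation.
def host_port(host, port):
    # A bare IPv6 literal contains ':' itself, so wrap it in [] to keep
    # "host:port" unambiguous; IPv4 addresses and hostnames pass through.
    if ":" in host and not host.startswith("["):
        host = "[" + host + "]"
    return "%s:%d" % (host, port)

print(host_port("::1", 7077))       # -> [::1]:7077
print(host_port("10.0.0.1", 7077))  # -> 10.0.0.1:7077
{code}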




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39484) V2 write for type struct fails to handle case sensitivity on field names during resolution of V2 write command

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554777#comment-17554777
 ] 

Apache Spark commented on SPARK-39484:
--

User 'edgarRd' has created a pull request for this issue:
https://github.com/apache/spark/pull/36881

> V2 write for type struct fails to handle case sensitivity on field names 
> during resolution of V2 write command
> --
>
> Key: SPARK-39484
> URL: https://issues.apache.org/jira/browse/SPARK-39484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1
> Environment: {{{}master{}}}, {{3.1.1}}
>Reporter: Edgar Rodriguez
>Priority: Minor
>
> Summary:
> When a V2 write uses an input with a {{struct}} type which contains 
> differences in the casing of field names, the {{caseSensitive}} config is not 
> being honored, always doing a strict case sensitive comparison.
> Repro:
> {code:java}
> CREATE TABLE tmp.test_table_to (key int, object struct) USING 
> ICEBERG;
> CREATE TABLE tmp.test_table_from (key int, object struct) USING 
> HIVE;
> INSERT OVERWRITE tmp.test_table_to SELECT 1 as key, object FROM 
> tmp.test_table_from;{code}
> The above results in Exception:
> {code:java}
> Error in query: unresolved operator 'OverwriteByExpression RelationV2[key#3, 
> object#4] spark_catalog.tmp.test_table_to, true, false;
> 'OverwriteByExpression RelationV2[key#3, object#4] 
> spark_catalog.tmp.test_table_to, true, false
> +- Project [1 AS key#0, object#2]
>    +- SubqueryAlias spark_catalog.tmp.test_table_from
>       +- HiveTableRelation [`tmp`.`test_table_from`, 
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: 
> [key#1, object#2], Partition Cols: []]{code}
>  
> If the casing matches in the struct field names, the v2 write works as 
> expected.
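For reference, {{spark.sql.caseSensitive}} is the analyzer setting the report says is not being honored for nested struct fields. A hedged sketch of driving the repro above with it set explicitly (assumes the two tables from the repro already exist and an active {{spark}} session):

{code:python}
# Sketch only; assumes tmp.test_table_to / tmp.test_table_from from the repro above.
# Per the report, the INSERT below still fails on struct field-name casing even
# with case-insensitive resolution enabled.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.sql("""
    INSERT OVERWRITE tmp.test_table_to
    SELECT 1 AS key, object FROM tmp.test_table_from
""")
{code}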



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39484) V2 write for type struct fails to handle case sensitivity on field names during resolution of V2 write command

2022-06-15 Thread Edgar Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554776#comment-17554776
 ] 

Edgar Rodriguez commented on SPARK-39484:
-

Proposed solution PR: https://github.com/apache/spark/pull/36881

> V2 write for type struct fails to handle case sensitivity on field names 
> during resolution of V2 write command
> --
>
> Key: SPARK-39484
> URL: https://issues.apache.org/jira/browse/SPARK-39484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1
> Environment: {{{}master{}}}, {{3.1.1}}
>Reporter: Edgar Rodriguez
>Priority: Minor
>
> Summary:
> When a V2 write uses an input with a {{struct}} type which contains 
> differences in the casing of field names, the {{caseSensitive}} config is not 
> being honored, always doing a strict case sensitive comparison.
> Repro:
> {code:java}
> CREATE TABLE tmp.test_table_to (key int, object struct) USING 
> ICEBERG;
> CREATE TABLE tmp.test_table_from (key int, object struct) USING 
> HIVE;
> INSERT OVERWRITE tmp.test_table_to SELECT 1 as key, object FROM 
> tmp.test_table_from;{code}
> The above results in Exception:
> {code:java}
> Error in query: unresolved operator 'OverwriteByExpression RelationV2[key#3, 
> object#4] spark_catalog.tmp.test_table_to, true, false;
> 'OverwriteByExpression RelationV2[key#3, object#4] 
> spark_catalog.tmp.test_table_to, true, false
> +- Project [1 AS key#0, object#2]
>    +- SubqueryAlias spark_catalog.tmp.test_table_from
>       +- HiveTableRelation [`tmp`.`test_table_from`, 
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: 
> [key#1, object#2], Partition Cols: []]{code}
>  
> If the casing matches in the struct field names, the v2 write works as 
> expected.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39484) V2 write for type struct fails to handle case sensitivity on field names during resolution of V2 write command

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39484:


Assignee: Apache Spark

> V2 write for type struct fails to handle case sensitivity on field names 
> during resolution of V2 write command
> --
>
> Key: SPARK-39484
> URL: https://issues.apache.org/jira/browse/SPARK-39484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1
> Environment: {{{}master{}}}, {{3.1.1}}
>Reporter: Edgar Rodriguez
>Assignee: Apache Spark
>Priority: Minor
>
> Summary:
> When a V2 write uses an input with a {{struct}} type which contains 
> differences in the casing of field names, the {{caseSensitive}} config is not 
> being honored, always doing a strict case sensitive comparison.
> Repro:
> {code:java}
> CREATE TABLE tmp.test_table_to (key int, object struct) USING 
> ICEBERG;
> CREATE TABLE tmp.test_table_from (key int, object struct) USING 
> HIVE;
> INSERT OVERWRITE tmp.test_table_to SELECT 1 as key, object FROM 
> tmp.test_table_from;{code}
> The above results in Exception:
> {code:java}
> Error in query: unresolved operator 'OverwriteByExpression RelationV2[key#3, 
> object#4] spark_catalog.tmp.test_table_to, true, false;
> 'OverwriteByExpression RelationV2[key#3, object#4] 
> spark_catalog.tmp.test_table_to, true, false
> +- Project [1 AS key#0, object#2]
>    +- SubqueryAlias spark_catalog.tmp.test_table_from
>       +- HiveTableRelation [`tmp`.`test_table_from`, 
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: 
> [key#1, object#2], Partition Cols: []]{code}
>  
> If the casing matches in the struct field names, the v2 write works as 
> expected.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39484) V2 write for type struct fails to handle case sensitivity on field names during resolution of V2 write command

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39484:


Assignee: (was: Apache Spark)

> V2 write for type struct fails to handle case sensitivity on field names 
> during resolution of V2 write command
> --
>
> Key: SPARK-39484
> URL: https://issues.apache.org/jira/browse/SPARK-39484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1
> Environment: {{{}master{}}}, {{3.1.1}}
>Reporter: Edgar Rodriguez
>Priority: Minor
>
> Summary:
> When a V2 write uses an input with a {{struct}} type which contains 
> differences in the casing of field names, the {{caseSensitive}} config is not 
> being honored, always doing a strict case sensitive comparison.
> Repro:
> {code:java}
> CREATE TABLE tmp.test_table_to (key int, object struct) USING 
> ICEBERG;
> CREATE TABLE tmp.test_table_from (key int, object struct) USING 
> HIVE;
> INSERT OVERWRITE tmp.test_table_to SELECT 1 as key, object FROM 
> tmp.test_table_from;{code}
> The above results in Exception:
> {code:java}
> Error in query: unresolved operator 'OverwriteByExpression RelationV2[key#3, 
> object#4] spark_catalog.tmp.test_table_to, true, false;
> 'OverwriteByExpression RelationV2[key#3, object#4] 
> spark_catalog.tmp.test_table_to, true, false
> +- Project [1 AS key#0, object#2]
>    +- SubqueryAlias spark_catalog.tmp.test_table_from
>       +- HiveTableRelation [`tmp`.`test_table_from`, 
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: 
> [key#1, object#2], Partition Cols: []]{code}
>  
> If the casing matches in the struct field names, the v2 write works as 
> expected.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39484) V2 write for type struct fails to handle case sensitivity on field names during resolution of V2 write command

2022-06-15 Thread Edgar Rodriguez (Jira)
Edgar Rodriguez created SPARK-39484:
---

 Summary: V2 write for type struct fails to handle case sensitivity 
on field names during resolution of V2 write command
 Key: SPARK-39484
 URL: https://issues.apache.org/jira/browse/SPARK-39484
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.1.1
 Environment: {{{}master{}}}, {{3.1.1}}
Reporter: Edgar Rodriguez


Summary:

When a V2 write uses an input whose {{struct}} type differs from the target's
only in the casing of field names, the {{caseSensitive}} config is not honored;
a strict case-sensitive comparison is always performed.

Repro:
{code:java}
CREATE TABLE tmp.test_table_to (key int, object struct) USING 
ICEBERG;
CREATE TABLE tmp.test_table_from (key int, object struct) USING 
HIVE;
INSERT OVERWRITE tmp.test_table_to SELECT 1 as key, object FROM 
tmp.test_table_from;{code}
The above results in Exception:
{code:java}
Error in query: unresolved operator 'OverwriteByExpression RelationV2[key#3, 
object#4] spark_catalog.tmp.test_table_to, true, false;
'OverwriteByExpression RelationV2[key#3, object#4] 
spark_catalog.tmp.test_table_to, true, false
+- Project [1 AS key#0, object#2]
   +- SubqueryAlias spark_catalog.tmp.test_table_from
      +- HiveTableRelation [`tmp`.`test_table_from`, 
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: [key#1, 
object#2], Partition Cols: []]{code}
 

If the casing matches in the struct field names, the v2 write works as expected.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39383) Support V2 data sources with DEFAULT values

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554761#comment-17554761
 ] 

Apache Spark commented on SPARK-39383:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/36880

> Support V2 data sources with DEFAULT values
> ---
>
> Key: SPARK-39383
> URL: https://issues.apache.org/jira/browse/SPARK-39383
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39383) Support V2 data sources with DEFAULT values

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554760#comment-17554760
 ] 

Apache Spark commented on SPARK-39383:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/36880

> Support V2 data sources with DEFAULT values
> ---
>
> Key: SPARK-39383
> URL: https://issues.apache.org/jira/browse/SPARK-39383
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39483) Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554733#comment-17554733
 ] 

Apache Spark commented on SPARK-39483:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36870

>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array
> ---
>
> Key: SPARK-39483
> URL: https://issues.apache.org/jira/browse/SPARK-39483
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array.
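To illustrate the idea, a small sketch of building a Spark SQL schema from a NumPy {{dtype}}; the mapping table and helper name are hypothetical and only outline the behaviour this sub-task describes, they are not PySpark's implementation:

{code:python}
# Hypothetical illustration of constructing a schema from np.dtype.
import numpy as np
from pyspark.sql.types import (BooleanType, DoubleType, FloatType, IntegerType,
                               LongType, StructField, StructType)

_DTYPE_TO_SPARK = {
    np.dtype("bool"): BooleanType(),
    np.dtype("int32"): IntegerType(),
    np.dtype("int64"): LongType(),
    np.dtype("float32"): FloatType(),
    np.dtype("float64"): DoubleType(),
}

def schema_from_ndarray(arr):
    # One column per position of a 2-D array; every column shares the array's dtype.
    spark_type = _DTYPE_TO_SPARK[arr.dtype]
    n_cols = arr.shape[1] if arr.ndim == 2 else 1
    return StructType([StructField("_%d" % (i + 1), spark_type) for i in range(n_cols)])

print(schema_from_ndarray(np.zeros((3, 2), dtype=np.int64)))
{code}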



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39483) Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554734#comment-17554734
 ] 

Apache Spark commented on SPARK-39483:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36870

>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array
> ---
>
> Key: SPARK-39483
> URL: https://issues.apache.org/jira/browse/SPARK-39483
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39483) Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39483:


Assignee: (was: Apache Spark)

>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array
> ---
>
> Key: SPARK-39483
> URL: https://issues.apache.org/jira/browse/SPARK-39483
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39483) Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39483:


Assignee: Apache Spark

>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array
> ---
>
> Key: SPARK-39483
> URL: https://issues.apache.org/jira/browse/SPARK-39483
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
>  Construct the schema from `np.dtype` when `createDataFrame` from a NumPy 
> array.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39483) Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array

2022-06-15 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39483:


 Summary:  Construct the schema from `np.dtype` when 
`createDataFrame` from a NumPy array
 Key: SPARK-39483
 URL: https://issues.apache.org/jira/browse/SPARK-39483
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


 Construct the schema from `np.dtype` when `createDataFrame` from a NumPy array.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39482) Add build and test documentation on IPv6

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39482:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Add build and test documentation on IPv6
> 
>
> Key: SPARK-39482
> URL: https://issues.apache.org/jira/browse/SPARK-39482
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39482) Add build and test documentation on IPv6

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554698#comment-17554698
 ] 

Apache Spark commented on SPARK-39482:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36879

> Add build and test documentation on IPv6
> 
>
> Key: SPARK-39482
> URL: https://issues.apache.org/jira/browse/SPARK-39482
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39482) Add build and test documentation on IPv6

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39482:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Add build and test documentation on IPv6
> 
>
> Key: SPARK-39482
> URL: https://issues.apache.org/jira/browse/SPARK-39482
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39482) Add build and test documentation on IPv6

2022-06-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39482:
--
Summary: Add build and test documentation on IPv6  (was: Add IPv6 
documentation)

> Add build and test documentation on IPv6
> 
>
> Key: SPARK-39482
> URL: https://issues.apache.org/jira/browse/SPARK-39482
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39482) Add IPv6 documentation

2022-06-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39482:
-

Assignee: Dongjoon Hyun

> Add IPv6 documentation
> --
>
> Key: SPARK-39482
> URL: https://issues.apache.org/jira/browse/SPARK-39482
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39482) Add IPv6 documentation

2022-06-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-39482:
-

 Summary: Add IPv6 documentation
 Key: SPARK-39482
 URL: https://issues.apache.org/jira/browse/SPARK-39482
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39481) Pandas UDF executed twice if used in projection followed by filter

2022-06-15 Thread Timothy Dijamco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Dijamco updated SPARK-39481:

Description: 
In this scenario, a Pandas UDF will be executed twice:
 # Projection that applies a Pandas UDF
 # Filter

In the {{explain}} output of the example below, the Optimized Logical Plan and 
Physical Plan contain {{ArrowEvalPython}} twice:
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[1]').getOrCreate()

df = spark.createDataFrame(
    [
        [1, 'one'],
        [2, 'two'],
        [3, 'three'],
    ],
    'int_col int, string_col string',
)

@F.pandas_udf('int')
def copy_int_col(s):
    return s

df = df.withColumn('int_col_copy', copy_int_col(df['int_col']))
df = df.filter(F.col('int_col_copy') >= 3)

df.explain(True)
{code}
{code:java}
== Parsed Logical Plan ==
'Filter ('int_col_copy >= 3)
+- Project [int_col#322, string_col#323, copy_int_col(int_col#322) AS 
int_col_copy#327]
   +- LogicalRDD [int_col#322, string_col#323], false

== Analyzed Logical Plan ==
int_col: int, string_col: string, int_col_copy: int
Filter (int_col_copy#327 >= 3)
+- Project [int_col#322, string_col#323, copy_int_col(int_col#322) AS 
int_col_copy#327]
   +- LogicalRDD [int_col#322, string_col#323], false

== Optimized Logical Plan ==
Project [int_col#322, string_col#323, pythonUDF0#332 AS int_col_copy#327]
+- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#332], 200
   +- Project [int_col#322, string_col#323]
      +- Filter (pythonUDF0#331 >= 3)
         +- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#331], 200
            +- LogicalRDD [int_col#322, string_col#323], false

== Physical Plan ==
*(3) Project [int_col#322, string_col#323, pythonUDF0#332 AS int_col_copy#327]
+- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#332], 200
   +- *(2) Project [int_col#322, string_col#323]
      +- *(2) Filter (pythonUDF0#331 >= 3)
         +- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#331], 200
            +- *(1) Scan ExistingRDD[int_col#322,string_col#323]
{code}
If the Pandas UDF is marked as non-deterministic (e.g. {{{}copy_int_col = 
copy_int_col.asNondeterministic(){}}}), then it is not executed twice.

  was:
In this scenario, a Pandas UDF will be executed twice:
 # Projection that applies a Pandas UDF
 # Filter

In the {{explain}} output of the example below, the Optimized Logical Plan and 
Physical Plan contain {{ArrowEvalPython}} twice:
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[1]').getOrCreate()

df = spark.createDataFrame(
    [
        [1, 'one'],
        [2, 'two'],
        [3, 'three'],
    ],
    'int_col int, string_col string',
)

@F.pandas_udf('int')
def copy_int_col(s):
    return s

df = df.withColumn('int_col_copy', copy_int_col(df['int_col']))
df = df.filter(F.col('int_col_copy') >= 3)

df.explain(True)
{code}
{code:java}
== Parsed Logical Plan ==
'Filter ('int_col_copy >= 3)
+- Project [int_col#322, string_col#323, copy_int_col(int_col#322) AS 
int_col_copy#327]
   +- LogicalRDD [int_col#322, string_col#323], false

== Analyzed Logical Plan ==
int_col: int, string_col: string, int_col_copy: int
Filter (int_col_copy#327 >= 3)
+- Project [int_col#322, string_col#323, copy_int_col(int_col#322) AS 
int_col_copy#327]
   +- LogicalRDD [int_col#322, string_col#323], false

== Optimized Logical Plan ==
Project [int_col#322, string_col#323, pythonUDF0#332 AS int_col_copy#327]
+- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#332], 200
   +- Project [int_col#322, string_col#323]
      +- Filter (pythonUDF0#331 >= 3)
         +- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#331], 200
            +- LogicalRDD [int_col#322, string_col#323], false

== Physical Plan ==
*(3) Project [int_col#322, string_col#323, pythonUDF0#332 AS int_col_copy#327]
+- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#332], 200
   +- *(2) Project [int_col#322, string_col#323]
      +- *(2) Filter (pythonUDF0#331 >= 3)
         +- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#331], 200
            +- *(1) Scan ExistingRDD[int_col#322,string_col#323]
{code}
If the Pandas UDF is marked as non-deterministic (e.g. {{{}copy_int_col = 
copy_int_col.asNondeterministic(){}}}, then it is not executed twice.


> Pandas UDF executed twice if used in projection followed by filter
> --
>
> Key: SPARK-39481
> URL: https://issues.apache.org/jira/browse/SPARK-39481
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Timothy Dijamco
>Priority: Minor
>
> In this scenario, a Pandas UDF will be executed twice:

[jira] [Created] (SPARK-39481) Pandas UDF executed twice if used in projection followed by filter

2022-06-15 Thread Timothy Dijamco (Jira)
Timothy Dijamco created SPARK-39481:
---

 Summary: Pandas UDF executed twice if used in projection followed 
by filter
 Key: SPARK-39481
 URL: https://issues.apache.org/jira/browse/SPARK-39481
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Timothy Dijamco


In this scenario, a Pandas UDF will be executed twice:
 # Projection that applies a Pandas UDF
 # Filter

In the {{explain}} output of the example below, the Optimized Logical Plan and 
Physical Plan contain {{ArrowEvalPython}} twice:
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[1]').getOrCreate()

df = spark.createDataFrame(
    [
        [1, 'one'],
        [2, 'two'],
        [3, 'three'],
    ],
    'int_col int, string_col string',
)

@F.pandas_udf('int')
def copy_int_col(s):
    return s

df = df.withColumn('int_col_copy', copy_int_col(df['int_col']))
df = df.filter(F.col('int_col_copy') >= 3)

df.explain(True)
{code}
{code:java}
== Parsed Logical Plan ==
'Filter ('int_col_copy >= 3)
+- Project [int_col#322, string_col#323, copy_int_col(int_col#322) AS 
int_col_copy#327]
   +- LogicalRDD [int_col#322, string_col#323], false

== Analyzed Logical Plan ==
int_col: int, string_col: string, int_col_copy: int
Filter (int_col_copy#327 >= 3)
+- Project [int_col#322, string_col#323, copy_int_col(int_col#322) AS 
int_col_copy#327]
   +- LogicalRDD [int_col#322, string_col#323], false

== Optimized Logical Plan ==
Project [int_col#322, string_col#323, pythonUDF0#332 AS int_col_copy#327]
+- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#332], 200
   +- Project [int_col#322, string_col#323]
      +- Filter (pythonUDF0#331 >= 3)
         +- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#331], 200
            +- LogicalRDD [int_col#322, string_col#323], false

== Physical Plan ==
*(3) Project [int_col#322, string_col#323, pythonUDF0#332 AS int_col_copy#327]
+- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#332], 200
   +- *(2) Project [int_col#322, string_col#323]
      +- *(2) Filter (pythonUDF0#331 >= 3)
         +- ArrowEvalPython [copy_int_col(int_col#322)], [pythonUDF0#331], 200
            +- *(1) Scan ExistingRDD[int_col#322,string_col#323]
{code}
If the Pandas UDF is marked as non-deterministic (e.g. {{{}copy_int_col = 
copy_int_col.asNondeterministic(){}}}, then it is not executed twice.
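A minimal sketch of that workaround, reusing {{spark}}, {{F}}, and {{copy_int_col}} from the example above (only the variable names {{copy_int_col_nd}} and {{df2}} are new):

{code:python}
# Sketch of the workaround: mark the Pandas UDF non-deterministic so the optimizer
# does not push a second copy of it below the filter.
copy_int_col_nd = copy_int_col.asNondeterministic()

df2 = spark.createDataFrame(
    [[1, 'one'], [2, 'two'], [3, 'three']],
    'int_col int, string_col string',
)
df2 = df2.withColumn('int_col_copy', copy_int_col_nd(df2['int_col']))
df2 = df2.filter(F.col('int_col_copy') >= 3)
df2.explain(True)  # per the report, ArrowEvalPython now appears only once
{code}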



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39480:


Assignee: Apache Spark

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png, 
> image-2022-06-15-22-52-56-792.png, image-2022-06-15-22-53-35-937.png, 
> image-2022-06-15-22-54-07-288.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!
>  
> We integrated our bit-packing decode implementation into parquet-mr, tested 
> the parquet batch reader ability from Spark VectorizedParquetRecordReader 
> which get parquet column data by the batch way. We construct parquet file 
> with different row count and column count, the column data type is Int32, the 
> maximum int value is 127 which satisfies bit pack encode with bit width=7,   
> the count of the row is from 10k to 100 million and the count of the column 
> is from 1 to 4.
> !image-2022-06-15-22-52-56-792.png|width=328,height=167!
> !image-2022-06-15-22-53-35-937.png|width=354,height=175!
> !image-2022-06-15-22-54-07-288.png|width=352,height=173!
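As a rough orientation for readers, a plain-Python round-trip of 8 values packed at bit width 7; it is only a sketch of what bit-(un)packing means here, with an assumed LSB-first byte layout, and is neither parquet-mr's {{unpack8Values}} nor the vectorized JDK implementation discussed above:

{code:python}
# Illustrative only: pack/unpack 8 integers at bit width 7, LSB-first into 7 bytes.
def pack8_values_width7(values):
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0x7F) << (7 * i)
    return word.to_bytes(7, "little")

def unpack8_values_width7(data, in_pos=0):
    word = int.from_bytes(data[in_pos:in_pos + 7], "little")
    return [(word >> (7 * i)) & 0x7F for i in range(8)]

vals = [1, 17, 42, 99, 127, 0, 5, 64]
assert unpack8_values_width7(pack8_values_width7(vals)) == vals
print(unpack8_values_width7(pack8_values_width7(vals)))
{code}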



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39480:


Assignee: (was: Apache Spark)

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png, 
> image-2022-06-15-22-52-56-792.png, image-2022-06-15-22-53-35-937.png, 
> image-2022-06-15-22-54-07-288.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!
>  
> We integrated our bit-packing decode implementation into parquet-mr, tested 
> the parquet batch reader ability from Spark VectorizedParquetRecordReader 
> which get parquet column data by the batch way. We construct parquet file 
> with different row count and column count, the column data type is Int32, the 
> maximum int value is 127 which satisfies bit pack encode with bit width=7,   
> the count of the row is from 10k to 100 million and the count of the column 
> is from 1 to 4.
> !image-2022-06-15-22-52-56-792.png|width=328,height=167!
> !image-2022-06-15-22-53-35-937.png|width=354,height=175!
> !image-2022-06-15-22-54-07-288.png|width=352,height=173!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554672#comment-17554672
 ] 

Apache Spark commented on SPARK-39480:
--

User 'Fang-Xie' has created a pull request for this issue:
https://github.com/apache/spark/pull/36878

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png, 
> image-2022-06-15-22-52-56-792.png, image-2022-06-15-22-53-35-937.png, 
> image-2022-06-15-22-54-07-288.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!
>  
> We integrated our bit-packing decode implementation into parquet-mr, tested 
> the parquet batch reader ability from Spark VectorizedParquetRecordReader 
> which get parquet column data by the batch way. We construct parquet file 
> with different row count and column count, the column data type is Int32, the 
> maximum int value is 127 which satisfies bit pack encode with bit width=7,   
> the count of the row is from 10k to 100 million and the count of the column 
> is from 1 to 4.
> !image-2022-06-15-22-52-56-792.png|width=328,height=167!
> !image-2022-06-15-22-53-35-937.png|width=354,height=175!
> !image-2022-06-15-22-54-07-288.png|width=352,height=173!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Description: 
Spark currently uses parquet-mr as its Parquet reader/writer library, but the
built-in bit-packing encode/decode is not efficient enough.

Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector
in OpenJDK 18 brings a prominent performance improvement.

Because the Vector API has been part of OpenJDK since version 16, this
optimization requires JDK 16 or higher.

*Below are our test results*

The functional test is based on the open-source parquet-mr bit-pack decoding
function *_public final void unpack8Values(final byte[] in, final int inPos,
final int[] out, final int outPos)_*,

compared with our Vector API implementation *_public final void
unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int
outPos)_*.

We tested the 10 pairs of decode functions (open-source parquet bit unpacking
vs. our optimized vectorized SIMD implementation) for bit widths
\{1,2,3,4,5,6,7,8,9,10}; the results are below:

!image-2022-06-15-22-50-12-759.png|width=513,height=259!

 

We integrated our bit-packing decode implementation into parquet-mr and tested
the Parquet batch-reading path through Spark's VectorizedParquetRecordReader,
which reads Parquet column data in batches. We constructed Parquet files with
varying row and column counts; the column data type is Int32 and the maximum
int value is 127, which fits bit-pack encoding with bit width 7. The row count
ranges from 10k to 100 million and the column count from 1 to 4.

!image-2022-06-15-22-52-56-792.png|width=328,height=167!

!image-2022-06-15-22-53-35-937.png|width=354,height=175!

!image-2022-06-15-22-54-07-288.png|width=352,height=173!

  was:
Current Spark use Parquet-mr as parquet reader/writer library, but the built-in 
bit-packing en/decode is not efficient enough. 

Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector in 
Open JDK18 brings prominent performance improvement.

Due to Vector API is added to OpenJDK since 16, So this optimization request 
JDK16 or higher.

*Below are our test results*

Functional test is based on open-source parquet-mr Bit-pack decoding function: 
*_public final void unpack8Values(final byte[] in, final int inPos, final int[] 
out, final int outPos)_* __

compared with our implementation with vector API *_public final void 
unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int 
outPos)_*

We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
vectorized SIMD implementation) decode function with bit 
width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:

!image-2022-06-15-22-50-12-759.png|width=513,height=259!


> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png, 
> image-2022-06-15-22-52-56-792.png, image-2022-06-15-22-53-35-937.png, 
> image-2022-06-15-22-54-07-288.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!
>  
> We integrated our bit-packing decode implementation into parquet-mr, tested 
> the parquet batch reader ability from Spark VectorizedParquetRecordReader 
> which get parquet column data by the batch way. We construct parquet file 
> with different row count and column count, the column data type is Int32, the 
> maximum int value is 127 which satisfies bit pack encode with bit width=7,   
> the count of the row is from 10k to 100 million and the count of the column 
> is from 1 to 4.
> !image-2022-06-15-22-52-56-792.png|width=328,height=167!
> 

[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Attachment: image-2022-06-15-22-54-07-288.png

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png, 
> image-2022-06-15-22-52-56-792.png, image-2022-06-15-22-53-35-937.png, 
> image-2022-06-15-22-54-07-288.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Attachment: image-2022-06-15-22-53-35-937.png

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png, 
> image-2022-06-15-22-52-56-792.png, image-2022-06-15-22-53-35-937.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Attachment: image-2022-06-15-22-52-56-792.png

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png, 
> image-2022-06-15-22-52-56-792.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Attachment: (was: image-2022-06-15-22-48-46-554.png)

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-50-12-759.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Description: 
Spark currently uses parquet-mr as its Parquet reader/writer library, but the
built-in bit-packing encode/decode is not efficient enough.

Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector
in OpenJDK 18 brings a prominent performance improvement.

Because the Vector API has been part of OpenJDK since version 16, this
optimization requires JDK 16 or higher.

*Below are our test results*

The functional test is based on the open-source parquet-mr bit-pack decoding
function *_public final void unpack8Values(final byte[] in, final int inPos,
final int[] out, final int outPos)_*,

compared with our Vector API implementation *_public final void
unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int
outPos)_*.

We tested the 10 pairs of decode functions (open-source parquet bit unpacking
vs. our optimized vectorized SIMD implementation) for bit widths
\{1,2,3,4,5,6,7,8,9,10}; the results are below:

!image-2022-06-15-22-50-12-759.png|width=513,height=259!

  was:
Current Spark use Parquet-mr as parquet reader/writer library, but the built-in 
bit-packing en/decode is not efficient enough. 

Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector in 
Open JDK18 brings prominent performance improvement.

Due to Vector API is added to OpenJDK since 16, So this optimization request 
JDK16 or higher.

*Below are our test results*

Functional test is based on open-source parquet-mr Bit-pack decoding function: 
*_public final void unpack8Values(final byte[] in, final int inPos, final int[] 
out, final int outPos)_* __ 

compared with our implementation with vector API *_public final void 
unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int 
outPos)_*

We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
vectorized SIMD implementation) decode function with bit 
width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:

 


> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-48-46-554.png, 
> image-2022-06-15-22-50-12-759.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but 
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*,
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs (open-source Parquet bit unpacking vs. our optimized 
> vectorized SIMD implementation) of decode functions with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
> !image-2022-06-15-22-50-12-759.png|width=513,height=259!
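For readers unfamiliar with the technique being benchmarked above, the 
following is a minimal, self-contained sketch of a scalar unpack8Values-style 
routine next to a jdk.incubator.vector variant, assuming the little-endian 
packed layout of Parquet's RLE/bit-packing hybrid and bit widths of at most 8. 
Note that parquet-mr generates one packer class per bit width, so the real 
methods take no bitWidth parameter; the class name, the shift/mask strategy, 
and the 512-bit species are illustrative assumptions, not the actual 
parquet-mr or Spark patch. It needs JDK 16 or later and 
--add-modules jdk.incubator.vector.

{code:java}
import java.util.Arrays;

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;

public class BitUnpackSketch {

  // Scalar baseline: unpack 8 values of `bitWidth` bits from `in`
  // (little-endian packed, as in the Parquet RLE/bit-packing hybrid).
  static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    long packed = 0L;
    for (int i = 0; i < bitWidth; i++) {            // 8 values * bitWidth bits = bitWidth bytes
      packed |= (in[inPos + i] & 0xFFL) << (8 * i);
    }
    long mask = (1L << bitWidth) - 1;
    for (int i = 0; i < 8; i++) {
      out[outPos + i] = (int) ((packed >>> (i * bitWidth)) & mask);
    }
  }

  // Vectorized variant: broadcast the packed bits into 8 long lanes, shift
  // each lane by a different amount, mask, then narrow the lanes to ints.
  static void unpack8ValuesVec(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    long packed = 0L;
    for (int i = 0; i < bitWidth; i++) {
      packed |= (in[inPos + i] & 0xFFL) << (8 * i);
    }
    long mask = (1L << bitWidth) - 1;
    long[] shifts = new long[8];                    // lane i extracts bits [i*bitWidth, (i+1)*bitWidth)
    for (int i = 0; i < 8; i++) {
      shifts[i] = (long) i * bitWidth;
    }
    LongVector sh = LongVector.fromArray(LongVector.SPECIES_512, shifts, 0);
    LongVector v = LongVector.broadcast(LongVector.SPECIES_512, packed)
        .lanewise(VectorOperators.LSHR, sh)
        .lanewise(VectorOperators.AND, mask);
    // 8 long lanes (512 bit) -> 8 int lanes (256 bit), then a single store.
    IntVector iv = (IntVector) v.convertShape(VectorOperators.L2I, IntVector.SPECIES_256, 0);
    iv.intoArray(out, outPos);
  }

  public static void main(String[] args) {
    byte[] in = {(byte) 0b11011000, (byte) 0b00100111, (byte) 0b10110001};  // bitWidth = 3 -> 3 bytes
    int[] a = new int[8];
    int[] b = new int[8];
    unpack8Values(in, 0, a, 0, 3);
    unpack8ValuesVec(in, 0, b, 0, 3);
    System.out.println(Arrays.toString(a) + " / " + Arrays.toString(b));    // should match
  }
}
{code}

The speedup comes from performing the eight shift-and-mask extractions as one 
lane-wise operation instead of a scalar loop; a production version would 
precompute the shift vectors per bit width and load directly from the input 
buffer rather than assembling a long first.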



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Attachment: image-2022-06-15-22-50-12-759.png

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-48-46-554.png, 
> image-2022-06-15-22-50-12-759.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but 
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*,
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs (open-source Parquet bit unpacking vs. our optimized 
> vectorized SIMD implementation) of decode functions with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554639#comment-17554639
 ] 

Fang-Xie commented on SPARK-39480:
--

!image-2022-06-15-22-48-46-554.png!

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-48-46-554.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but 
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*,
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs (open-source Parquet bit unpacking vs. our optimized 
> vectorized SIMD implementation) of decode functions with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39480 ]


Fang-Xie deleted comment on SPARK-39480:
--

was (Author: JIRAUSER288151):
!image-2022-06-15-22-48-46-554.png!

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-48-46-554.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but 
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*,
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs (open-source Parquet bit unpacking vs. our optimized 
> vectorized SIMD implementation) of decode functions with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fang-Xie updated SPARK-39480:
-
Attachment: image-2022-06-15-22-48-46-554.png

> Parquet bit-packing de/encode optimization
> --
>
> Key: SPARK-39480
> URL: https://issues.apache.org/jira/browse/SPARK-39480
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Fang-Xie
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2022-06-15-22-48-46-554.png
>
>
> Spark currently uses parquet-mr as its Parquet reader/writer library, but 
> the built-in bit-packing en/decoding is not efficient enough.
> Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since JDK 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*,
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs (open-source Parquet bit unpacking vs. our optimized 
> vectorized SIMD implementation) of decode functions with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39480) Parquet bit-packing de/encode optimization

2022-06-15 Thread Fang-Xie (Jira)
Fang-Xie created SPARK-39480:


 Summary: Parquet bit-packing de/encode optimization
 Key: SPARK-39480
 URL: https://issues.apache.org/jira/browse/SPARK-39480
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Fang-Xie
 Fix For: 3.3.0


Spark currently uses parquet-mr as its Parquet reader/writer library, but the 
built-in bit-packing en/decoding is not efficient enough.

Our optimization of Parquet bit-packing en/decoding with jdk.incubator.vector 
in OpenJDK 18 brings a prominent performance improvement.

Because the Vector API has been part of OpenJDK since JDK 16, this 
optimization requires JDK 16 or higher.

*Below are our test results*

The functional test is based on the open-source parquet-mr bit-pack decoding 
function *_public final void unpack8Values(final byte[] in, final int inPos, 
final int[] out, final int outPos)_*,

compared with our Vector API implementation *_public final void 
unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int 
outPos)_*.

We tested 10 pairs (open-source Parquet bit unpacking vs. our optimized 
vectorized SIMD implementation) of decode functions with bit 
width=\{1,2,3,4,5,6,7,8,9,10}; the test results are below:

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-39074:
-
Priority: Minor  (was: Major)

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
> Fix For: 3.4.0
>
>
> The CI workflow "Report test results" fails when there are no artifacts to 
> be downloaded from the triggering workflow. In some situations, the 
> triggering workflow is not skipped, but all test jobs are skipped because no 
> code changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> There are two reasons why no test files may be downloaded:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested, deliberately.
> You want to be notified in the first situation so that you can fix the CI. 
> Therefore, CI should fail when code is built and tests are run but no test 
> result files are found.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-39074:


Assignee: Enrico Minack

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>
> The CI workflow "Report test results" fails when there are no artifacts to 
> be downloaded from the triggering workflow. In some situations, the 
> triggering workflow is not skipped, but all test jobs are skipped because no 
> code changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> There are two reasons why no test files may be downloaded:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested, deliberately.
> You want to be notified in the first situation so that you can fix the CI. 
> Therefore, CI should fail when code is built and tests are run but no test 
> result files are found.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-39074.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36413
[https://github.com/apache/spark/pull/36413]

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
> Fix For: 3.4.0
>
>
> The CI workflow "Report test results" fails when there are no artifacts to 
> be downloaded from the triggering workflow. In some situations, the 
> triggering workflow is not skipped, but all test jobs are skipped because no 
> code changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> There are two reasons why no test files may be downloaded:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested, deliberately.
> You want to be notified in the first situation so that you can fix the CI. 
> Therefore, CI should fail when code is built and tests are run but no test 
> result files are found.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv

2022-06-15 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554551#comment-17554551
 ] 

pralabhkumar commented on SPARK-38292:
--

[~itholic] I would like to work on this.

> Support `na_filter` for pyspark.pandas.read_csv
> ---
>
> Key: SPARK-38292
> URL: https://issues.apache.org/jira/browse/SPARK-38292
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas supports the `na_filter` parameter for the `read_csv` function 
> (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
> We also want to support it, in order to follow the behavior of pandas.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39477) Remove "Number of queries" info from the golden files of SQLQueryTestSuite

2022-06-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39477.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/36875

> Remove "Number of queries" info from the golden files of SQLQueryTestSuite
> --
>
> Key: SPARK-39477
> URL: https://issues.apache.org/jira/browse/SPARK-39477
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39479) DS V2 supports push down math functions(non ANSI)

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554463#comment-17554463
 ] 

Apache Spark commented on SPARK-39479:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/36877

> DS V2 supports push down math functions(non ANSI)
> -
>
> Key: SPARK-39479
> URL: https://issues.apache.org/jira/browse/SPARK-39479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark has a lot of math functions that are not defined in the 
> ANSI standard, but these functions are commonly used.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39479) DS V2 supports push down math functions(non ANSI)

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39479:


Assignee: Apache Spark

> DS V2 supports push down math functions(non ANSI)
> -
>
> Key: SPARK-39479
> URL: https://issues.apache.org/jira/browse/SPARK-39479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark has a lot of math functions that are not defined in the 
> ANSI standard, but these functions are commonly used.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39479) DS V2 supports push down math functions(non ANSI)

2022-06-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39479:


Assignee: (was: Apache Spark)

> DS V2 supports push down math functions(non ANSI)
> -
>
> Key: SPARK-39479
> URL: https://issues.apache.org/jira/browse/SPARK-39479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark has a lot of math functions that are not defined in the 
> ANSI standard, but these functions are commonly used.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39479) DS V2 supports push down math functions(non ANSI)

2022-06-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554462#comment-17554462
 ] 

Apache Spark commented on SPARK-39479:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/36877

> DS V2 supports push down math functions(non ANSI)
> -
>
> Key: SPARK-39479
> URL: https://issues.apache.org/jira/browse/SPARK-39479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark has a lot of math functions that are not defined in the 
> ANSI standard, but these functions are commonly used.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39479) DS V2 supports push down math functions(non ANSI)

2022-06-15 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-39479:
--

 Summary: DS V2 supports push down math functions(non ANSI)
 Key: SPARK-39479
 URL: https://issues.apache.org/jira/browse/SPARK-39479
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Currently, Spark has a lot of math functions that are not defined in the ANSI 
standard, but these functions are commonly used.
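
To make "push down" concrete: a JDBC-style dialect needs a mapping from 
Spark's non-ANSI math function names to whatever the target engine calls them, 
and must decline anything it cannot translate so that Spark evaluates it after 
fetching the rows. The sketch below only illustrates that mapping; the class 
and method names and the assumed target spellings are invented for this 
example and are not Spark's actual DS V2 interfaces.

{code:java}
import java.util.Optional;

public class MathFunctionPushdownSketch {

  // Translate one Spark (non-ANSI) math function call into the SQL text of a
  // hypothetical target database. Returning empty means "cannot push down";
  // Spark would then evaluate the function itself after fetching the rows.
  static Optional<String> compileMathFunction(String sparkName, String... args) {
    String joined = String.join(", ", args);
    switch (sparkName) {
      case "LOG10":
      case "CBRT":
      case "DEGREES":
      case "RADIANS":
      case "SINH":
      case "COSH":
      case "TANH":
        return Optional.of(sparkName + "(" + joined + ")");  // same spelling on both sides (assumed)
      case "LOG2":
        return Optional.of("LOG(2, " + joined + ")");        // assumed: target only has LOG(base, x)
      default:
        return Optional.empty();                              // unsupported -> no pushdown
    }
  }

  public static void main(String[] args) {
    System.out.println(compileMathFunction("LOG2", "price"));   // Optional[LOG(2, price)]
    System.out.println(compileMathFunction("ATAN2", "y", "x")); // Optional.empty -> evaluated by Spark
  }
}
{code}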



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39430) The inconsistent timezone in Spark History Server UI

2022-06-15 Thread Surbhi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surbhi updated SPARK-39430:
---
Priority: Critical  (was: Major)

> The inconsistent timezone in Spark History Server UI
> 
>
> Key: SPARK-39430
> URL: https://issues.apache.org/jira/browse/SPARK-39430
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit, Web UI
>Affects Versions: 3.2.1
>Reporter: Surbhi
>Priority: Critical
> Attachments: Screenshot 2022-06-10 at 12.59.36 AM.png, Screenshot 
> 2022-06-10 at 12.59.50 AM.png
>
>
> The Spark History Server is running in the UTC timezone, but we are trying 
> to view the History Server in the IST timezone.
> The History Server landing page shows time in IST, but the Jobs, Stages, 
> Storage, Environment, and Executors tabs show time in UTC.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


