[jira] [Assigned] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36184:
---

Assignee: Wenchen Fan

> Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that 
> adds extra shuffles
> -
>
> Key: SPARK-36184
> URL: https://issues.apache.org/jira/browse/SPARK-36184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36184.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33396
[https://github.com/apache/spark/pull/33396]

> Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that 
> adds extra shuffles
> -
>
> Key: SPARK-36184
> URL: https://issues.apache.org/jira/browse/SPARK-36184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36201) Add check for inner field of schema

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36201:


Assignee: (was: Apache Spark)

> Add check for inner field of schema
> ---
>
> Key: SPARK-36201
> URL: https://issues.apache.org/jira/browse/SPARK-36201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36201) Add check for inner field of schema

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36201:


Assignee: Apache Spark

> Add check for inner field of schema
> ---
>
> Key: SPARK-36201
> URL: https://issues.apache.org/jira/browse/SPARK-36201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36201) Add check for inner field of schema

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383125#comment-17383125
 ] 

Apache Spark commented on SPARK-36201:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33409

> Add check for inner field of schema
> ---
>
> Key: SPARK-36201
> URL: https://issues.apache.org/jira/browse/SPARK-36201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35806) Mapping the `mode` argument to pandas in DataFrame.to_csv

2021-07-19 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35806:

Description: 
pandas and pandas-on-Spark both have an argument named `mode` in 
[DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html], 
but the acceptable strings are different.

pandas accepts "w", "w+", "a", and "a+", whereas pandas-on-Spark accepts 
"append", "overwrite", "ignore", "error", or "errorifexists".

We should map these acceptable strings to pandas.

e.g. "w" will work as Spark's "overwrite". In addition, mode can take Spark's 
"overwrite" too.

  was:
pandas and pandas-on-Spark both have an argument named `mode` in 
[DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html].

And pandas has the same argument, but the acceptable strings are different.

So, we should map the acceptable strings to pandas.

e.g. mode=w will work as Spark's overwrite. In addition, mode can take Spark's 
overwrite too.


> Mapping the `mode` argument to pandas in DataFrame.to_csv
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html], 
> but the acceptable strings are different.
> pandas accepts "w", "w+", "a", and "a+", whereas pandas-on-Spark accepts 
> "append", "overwrite", "ignore", "error", or "errorifexists".
> We should map these acceptable strings to pandas.
> e.g. "w" will work as Spark's "overwrite". In addition, mode can take Spark's 
> "overwrite" too.

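A minimal sketch of the kind of mapping this could use (the helper name and the 
exact mapping table below are assumptions, not the final implementation):

{code:python}
# Hypothetical mapping from pandas-style mode strings to Spark save modes.
_PANDAS_TO_SPARK_MODE = {
    "w": "overwrite",
    "w+": "overwrite",
    "a": "append",
    "a+": "append",
}

def _validate_mode(mode: str) -> str:
    # Keep accepting Spark's own modes, and translate the pandas-style ones.
    spark_modes = {"append", "overwrite", "ignore", "error", "errorifexists"}
    if mode in spark_modes:
        return mode
    try:
        return _PANDAS_TO_SPARK_MODE[mode]
    except KeyError:
        raise ValueError("Unknown mode: %s" % mode)
{code}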


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383130#comment-17383130
 ] 

Hyukjin Kwon commented on SPARK-36088:
--

You might have to call 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L393-L419
 logic when {{isKubernetesClient}} is on. Are you interested in submitting a PR?

> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running Spark in a k8s cluster, there are 2 deploy modes: cluster and 
> client. In my tests, in cluster mode, *spark.archives* can extract the 
> archive file to the working directory of the executors and the driver. But in 
> client mode, *spark.archives* only extracts the archive file to the 
> working directory of the executors.
>  
> However, I need *spark.archives* to send the virtual environment tar file 
> packaged by conda to both the driver and executors under client mode (so that 
> the executors and the driver have the same Python environment).
>  
> Why does *spark.archives* not extract the archive file into the working 
> directory of the driver under client mode?
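For context, a rough sketch of the client-mode setup this is about, assuming a 
conda-pack archive shipped via spark.archives (the archive name, alias, and 
master URL below are placeholders):

{code:python}
import os
from pyspark.sql import SparkSession

# Hypothetical archive built beforehand with: conda pack -o pyspark_conda_env.tar.gz
# "#environment" is the alias the archive is unpacked under in each working directory.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (SparkSession.builder
         .master("k8s://https://<k8s-apiserver>:443")  # placeholder
         .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
         .getOrCreate())

# Per this issue, in client mode the executors get ./environment unpacked in
# their working directories, but the driver's working directory does not.
{code}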



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36203) Spark SQL can't use "group by" on the column of map type.

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383143#comment-17383143
 ] 

Hyukjin Kwon commented on SPARK-36203:
--

Can you show a fully self-contained reproducer? BTW, Spark 2.4 is EOL, so this 
won't be fixed in Spark 2.4.x. Can you also try it on a higher version of 
Spark?
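
A minimal PySpark sketch of what such a reproducer could look like (a 
hypothetical stand-in for the table described below, with a map-typed 
`extend_value` column):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for test.test_table with a map-typed column.
df = spark.sql("SELECT id, 'x' AS cols, map('k', id) AS extend_value FROM range(3)")

# distinct() falls under the same analysis check as set operations, which reject
# map-typed columns, so this is expected to raise AnalysisException.
df.distinct().show()
{code}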

> Spark SQL can't use "group by" on the column of map type.
> -
>
> Key: SPARK-36203
> URL: https://issues.apache.org/jira/browse/SPARK-36203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Bruce Wong
>Priority: Major
>
> I want to know why 'group by' can't be used on a column of map type.
>  
> *sql:*
> select distinct id, cols, extend_value from test.test_table
> -- extend_value's type is map.
> *error:*
> {color:#FF}SQL execution error: org.apache.spark.sql.AnalysisException: Cannot have 
> map type columns in DataFrame which calls set operations(intersect, except, 
> etc.), but the type of column extend_value is map;{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36203) Spark SQL can't use "group by" on the column of map type.

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36203.
--
Resolution: Incomplete

> Spark SQL can't use "group by" on the column of map type.
> -
>
> Key: SPARK-36203
> URL: https://issues.apache.org/jira/browse/SPARK-36203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Bruce Wong
>Priority: Major
>
> I want to know why 'group by' can't be used on a column of map type.
>  
> *sql:*
> select distinct id, cols, extend_value from test.test_table
> -- extend_value's type is map.
> *error:*
> {color:#FF}SQL execution error: org.apache.spark.sql.AnalysisException: Cannot have 
> map type columns in DataFrame which calls set operations(intersect, except, 
> etc.), but the type of column extend_value is map;{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36161) dropDuplicates does not type check argument

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36161:


Assignee: Apache Spark

> dropDuplicates does not type check argument
> ---
>
> Key: SPARK-36161
> URL: https://issues.apache.org/jira/browse/SPARK-36161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Samuel Moseley
>Assignee: Apache Spark
>Priority: Major
>
> When a single string is given to the {{dropDuplicates}} method in PySpark, a 
> cryptic error is returned. Rather than returning a cryptic error, handle it 
> gracefully or raise a clear exception.
>  
> Proposal: Model after {{dropna}} behavior. If a single string is given, use it 
> to create a list; otherwise, if the argument is not a list/tuple, raise an exception.
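A rough sketch of the proposed dropna-style handling (the helper name is made 
up; this is not the actual PySpark change):

{code:python}
def _normalize_subset(subset):
    # A single string becomes a one-element list; anything other than a
    # list/tuple of column names raises a clear TypeError.
    if subset is None:
        return None
    if isinstance(subset, str):
        return [subset]
    if isinstance(subset, (list, tuple)):
        return list(subset)
    raise TypeError("subset should be a str, list or tuple of column names")
{code}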



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36161) dropDuplicates does not type check argument

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383175#comment-17383175
 ] 

Apache Spark commented on SPARK-36161:
--

User 'sammyjmoseley' has created a pull request for this issue:
https://github.com/apache/spark/pull/33364

> dropDuplicates does not type check argument
> ---
>
> Key: SPARK-36161
> URL: https://issues.apache.org/jira/browse/SPARK-36161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Samuel Moseley
>Priority: Major
>
> When a single string is given to the {{dropDuplicates}} method in PySpark, a 
> cryptic error is returned. Rather than returning a cryptic error, handle it 
> gracefully or raise a clear exception.
>  
> Proposal: Model after {{dropna}} behavior. If a single string is given, use it 
> to create a list; otherwise, if the argument is not a list/tuple, raise an exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36161) dropDuplicates does not type check argument

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36161:


Assignee: (was: Apache Spark)

> dropDuplicates does not type check argument
> ---
>
> Key: SPARK-36161
> URL: https://issues.apache.org/jira/browse/SPARK-36161
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Samuel Moseley
>Priority: Major
>
> When a single string is given to the {{dropDuplicates}} method in PySpark, a 
> cryptic error is returned. Rather than returning a cryptic error, handle it 
> gracefully or raise a clear exception.
>  
> Proposal: Model after {{dropna}} behavior. If a single string is given, use it 
> to create a list; otherwise, if the argument is not a list/tuple, raise an exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36184) Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles

2021-07-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36184:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Improvement)

> Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that 
> adds extra shuffles
> -
>
> Key: SPARK-36184
> URL: https://issues.apache.org/jira/browse/SPARK-36184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35806) Mapping the `mode` argument to pandas

2021-07-19 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35806:

Summary: Mapping the `mode` argument to pandas  (was: Rename the `mode` 
argument to avoid confusion with `mode` argument in pandas)

> Mapping the `mode` argument to pandas
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas on Spark has an argument named `mode` in the APIs below:
>  * 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]
>  * 
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html]
>  * 
> [DataFrame.to_table|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_table.html]
>  * 
> [DataFrame.to_delta|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_delta.html]
>  * 
> [DataFrame.to_parquet|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html]
>  * 
> [DataFrame.to_orc|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_orc.html]
>  * 
> [DataFrame.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html]
> And pandas has the same argument, but the usage is different.
> So we should rename the argument to avoid confusion with pandas'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35806) Mapping the `mode` argument to pandas

2021-07-19 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35806:

Description: 
pandas on Spark has an argument named `mode` in the APIs below:
 * 
[DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]

 * 
[DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html]

 * 
[DataFrame.to_table|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_table.html]

 * 
[DataFrame.to_delta|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_delta.html]

 * 
[DataFrame.to_parquet|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html]

 * 
[DataFrame.to_orc|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_orc.html]

 * 
[DataFrame.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html]

And pandas has the same argument, but the acceptable strings are different.

So, we should map the acceptable strings to pandas.

e.g. mode=w will work as Spark's overwrite. In addition, mode can take Spark's 
overwrite too.

  was:
pandas on Spark has an argument named `mode` in the APIs below:
 * 
[DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]
 * 
[DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html]
 * 
[DataFrame.to_table|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_table.html]
 * 
[DataFrame.to_delta|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_delta.html]
 * 
[DataFrame.to_parquet|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html]
 * 
[DataFrame.to_orc|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_orc.html]
 * 
[DataFrame.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html]

And pandas has the same argument, but the usage is different.

So we should rename the argument to avoid confusion with pandas'.


> Mapping the `mode` argument to pandas
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas on Spark has an argument named `mode` in the APIs below:
>  * 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]
>  * 
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html]
>  * 
> [DataFrame.to_table|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_table.html]
>  * 
> [DataFrame.to_delta|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_delta.html]
>  * 
> [DataFrame.to_parquet|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html]
>  * 
> [DataFrame.to_orc|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_orc.html]
>  * 
> [DataFrame.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html]
> And pandas has the same argument, but the acceptable strings are different.
> So, we should map the acceptable strings to pandas.
> e.g. mode=w will work as Spark's overwrite. In addition, mode can take 
> Spark's overwrite too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36197) InputFormat of PartitionDesc is not respected

2021-07-19 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-36197:


Assignee: Kent Yao

> InputFormat of PartitionDesc is not respected
> -
>
> Key: SPARK-36197
> URL: https://issues.apache.org/jira/browse/SPARK-36197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> A Hive partition can have a PartitionDesc that differs from the TableDesc for 
> describing the Serde/InputFormatClass/OutputFormatClass. For a Hive partitioned 
> table, we should respect this information in PartitionDesc first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36197) InputFormat of PartitionDesc is not respected

2021-07-19 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-36197.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33406
[https://github.com/apache/spark/pull/33406]

> InputFormat of PartitionDesc is not respected
> -
>
> Key: SPARK-36197
> URL: https://issues.apache.org/jira/browse/SPARK-36197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> A Hive partition can have a PartitionDesc that differs from the TableDesc for 
> describing the Serde/InputFormatClass/OutputFormatClass. For a Hive partitioned 
> table, we should respect this information in PartitionDesc first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33844) InsertIntoDir failed since query column name contains ',' cause column type and column names size not equal

2021-07-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33844:
--
Parent: SPARK-36200
Issue Type: Sub-task  (was: Improvement)

> InsertIntoDir failed since query column name contains ',' cause column type 
> and column names size not equal
> ---
>
> Key: SPARK-33844
> URL: https://issues.apache.org/jira/browse/SPARK-33844
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.2, 3.1.1, 3.2.0
>
>
>  
> Since Hive 2.3, COLUMN_NAME_DELIMITER is set to a special char when a column 
> name contains ',', because the column list and column types in the serde are 
> split by COLUMN_NAME_DELIMITER.
>  In Spark 2.4.0 + Hive 1.2.1, INSERT OVERWRITE DIR fails when a query result 
> schema column name contains ',', as:
> {code:java}
>  org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 14 elements 
> while columns.types has 11 elements! at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
>  at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
>  at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:119)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:121) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748){code}
>  This problem has been solved on the Hive side by 
> [https://github.com/apache/hive/blob/6f4c35c9e904d226451c465effdc5bfd31d395a0/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L1044-L1075],
>  but I think we can also handle it on the Spark side so that all versions work well.
>  
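A hedged sketch of a query that can hit this on affected versions, assuming a 
Hive-enabled session (the output path is arbitrary; the unaliased expression 
yields a generated column name containing commas):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The unaliased named_struct(...) produces a generated column name that contains
# commas, which trips the LazySimpleSerDe columns/columns.types count check on
# affected versions.
spark.sql("""
  INSERT OVERWRITE DIRECTORY '/tmp/insert_dir_demo' STORED AS TEXTFILE
  SELECT id, named_struct('a', id, 'b', id) FROM range(3)
""")
{code}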



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36206) Diagnose shuffle data corruption by checksum

2021-07-19 Thread wuyi (Jira)
wuyi created SPARK-36206:


 Summary: Diagnose shuffle data corruption by checksum
 Key: SPARK-36206
 URL: https://issues.apache.org/jira/browse/SPARK-36206
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: wuyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383131#comment-17383131
 ] 

Hyukjin Kwon commented on SPARK-36088:
--

cc [~dongjoon] and [~holdenkarau] FYI

> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running Spark in a k8s cluster, there are 2 deploy modes: cluster and 
> client. In my tests, in cluster mode, *spark.archives* can extract the 
> archive file to the working directory of the executors and the driver. But in 
> client mode, *spark.archives* only extracts the archive file to the 
> working directory of the executors.
>  
> However, I need *spark.archives* to send the virtual environment tar file 
> packaged by conda to both the driver and executors under client mode (so that 
> the executors and the driver have the same Python environment).
>  
> Why does *spark.archives* not extract the archive file into the working 
> directory of the driver under client mode?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383144#comment-17383144
 ] 

Hyukjin Kwon commented on SPARK-36187:
--

For questions, let's discuss on the Spark mailing list first before filing an 
issue.

> Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet 
> formats
> ---
>
> Key: SPARK-36187
> URL: https://issues.apache.org/jira/browse/SPARK-36187
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tony Zhang
>Priority: Minor
>
> Hi, my question here is specifically about [PR 
> #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
> SPARK-29302.
> To my understanding, the PR is to introduce a different staging directory at 
> job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, 
> the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is 
> not null: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
>  and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet 
> formats: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].
> However I didn't find similar behavior in Orc related code to set that 
> config. If I understand it correctly, without setting 
> SQLConf.OUTPUT_COMMITTER_CLASS properly (like for Orc format), 
> SQLHadoopMapReduceCommitProtocol will still use the original staging 
> directory, which may void the fix by the PR, in which case the commit 
> collision may still happen, thus the fix is now only effective for Parquet, 
> but not for non-Parquet files.
> Could someone confirm if it is a potential problem, or not? Thanks!
> [~duripeng] [~dagrawal3409]
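For illustration, a sketch of the scenario the question concerns: a 
dynamic-partition-overwrite write to a non-Parquet format (the path is 
arbitrary; the commit-protocol behavior itself is internal and not shown here):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

df = spark.range(10).selectExpr("id", "id % 2 AS part")

# An ORC dynamic-partition-overwrite write; per the question above, this path may
# not pick up the collision-avoiding staging directory that Parquet gets.
df.write.mode("overwrite").partitionBy("part").orc("/tmp/dpo_orc_demo")
{code}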



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36187.
--
Resolution: Incomplete

> Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet 
> formats
> ---
>
> Key: SPARK-36187
> URL: https://issues.apache.org/jira/browse/SPARK-36187
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tony Zhang
>Priority: Minor
>
> Hi, my question here is specifically about [PR 
> #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
> SPARK-29302.
> To my understanding, the PR is to introduce a different staging directory at 
> job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, 
> the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is 
> not null: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
>  and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet 
> formats: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].
> However I didn't find similar behavior in Orc related code to set that 
> config. If I understand it correctly, without setting 
> SQLConf.OUTPUT_COMMITTER_CLASS properly (like for Orc format), 
> SQLHadoopMapReduceCommitProtocol will still use the original staging 
> directory, which may void the fix by the PR, in which case the commit 
> collision may still happen, thus the fix is now only effective for Parquet, 
> but not for non-Parquet files.
> Could someone confirm if it is a potential problem, or not? Thanks!
> [~duripeng] [~dagrawal3409]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-35806) Mapping the `mode` argument to pandas

2021-07-19 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee reopened SPARK-35806:
-

Reopening the issue with a revised title & description.

We should map the arguments rather than just renaming them.

> Mapping the `mode` argument to pandas
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas on Spark has an argument named `mode` in the APIs below:
>  * 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]
>  * 
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html]
>  * 
> [DataFrame.to_table|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_table.html]
>  * 
> [DataFrame.to_delta|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_delta.html]
>  * 
> [DataFrame.to_parquet|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html]
>  * 
> [DataFrame.to_orc|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_orc.html]
>  * 
> [DataFrame.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html]
> And pandas has the same argument, but the acceptable strings are different.
> So, we should map the acceptable strings to pandas.
> e.g. mode=w will work as Spark's overwrite. In addition, mode can take 
> Spark's overwrite too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35806) Mapping the `mode` argument to pandas

2021-07-19 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383046#comment-17383046
 ] 

Haejoon Lee commented on SPARK-35806:
-

I'm working on this

> Mapping the `mode` argument to pandas
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas on Spark has an argument named `mode` in the APIs below:
>  * 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]
>  * 
> [DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html]
>  * 
> [DataFrame.to_table|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_table.html]
>  * 
> [DataFrame.to_delta|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_delta.html]
>  * 
> [DataFrame.to_parquet|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html]
>  * 
> [DataFrame.to_orc|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_orc.html]
>  * 
> [DataFrame.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html]
> And pandas has the same argument, but the acceptable strings are different.
> So, we should map the acceptable strings to pandas.
> e.g. mode=w will work as Spark's overwrite. In addition, mode can take 
> Spark's overwrite too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36086) The case of the delta table is inconsistent with parquet

2021-07-19 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383195#comment-17383195
 ] 

Wenchen Fan commented on SPARK-36086:
-

It seems we should improve the v2 DESCRIBE TABLE command to include more 
information.

> The case of the delta table is inconsistent with parquet
> 
>
> Key: SPARK-36086
> URL: https://issues.apache.org/jira/browse/SPARK-36086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {noformat}
> 1. Add delta-core_2.12-1.0.0-SNAPSHOT.jar to ${SPARK_HOME}/jars.
> 2. bin/spark-shell --conf 
> spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf 
> spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
> {noformat}
> {code:scala}
> spark.sql("create table t1 using parquet as select id, id as lower_id from 
> range(5)")
> spark.sql("CREATE VIEW v1 as SELECT * FROM t1")
> spark.sql("CREATE TABLE t2 USING DELTA PARTITIONED BY (LOWER_ID) SELECT 
> LOWER_ID, ID FROM v1")
> spark.sql("CREATE TABLE t3 USING PARQUET PARTITIONED BY (LOWER_ID) SELECT 
> LOWER_ID, ID FROM v1")
> spark.sql("desc extended t2").show(false)
> spark.sql("desc extended t3").show(false)
> {code}
> {noformat}
> scala> spark.sql("desc extended t2").show(false)
> +----------------------------+---------------------------------------------------------------------------+-------+
> |col_name                    |data_type                                                                  |comment|
> +----------------------------+---------------------------------------------------------------------------+-------+
> |lower_id                    |bigint                                                                     |       |
> |id                          |bigint                                                                     |       |
> |                            |                                                                           |       |
> |# Partitioning              |                                                                           |       |
> |Part 0                      |lower_id                                                                   |       |
> |                            |                                                                           |       |
> |# Detailed Table Information|                                                                           |       |
> |Name                        |default.t2                                                                 |       |
> |Location                    |file:/Users/yumwang/Downloads/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t2|       |
> |Provider                    |delta                                                                      |       |
> |Table Properties            |[Type=MANAGED,delta.minReaderVersion=1,delta.minWriterVersion=2]           |       |
> +----------------------------+---------------------------------------------------------------------------+-------+
> scala> spark.sql("desc extended t3").show(false)
> +----------------------------+-----------------------------+-------+
> |col_name                    |data_type                    |comment|
> +----------------------------+-----------------------------+-------+
> |ID                          |bigint                       |null   |
> |LOWER_ID                    |bigint                       |null   |
> |# Partition Information     |                             |       |
> |# col_name                  |data_type                    |comment|
> |LOWER_ID                    |bigint                       |null   |
> |                            |                             |       |
> |# Detailed Table Information|                             |       |
> |Database                    |default                      |       |
> |Table                       |t3                           |       |
> |Owner                       |yumwang                      |       |
> |Created Time                |Mon Jul 12 14:07:16 CST 2021 |       |

[jira] [Assigned] (SPARK-36175) Support TimestampNTZ in Avro data source

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36175:


Assignee: Apache Spark

> Support TimestampNTZ in Avro data source 
> -
>
> Key: SPARK-36175
> URL: https://issues.apache.org/jira/browse/SPARK-36175
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> As per the Avro spec 
> https://avro.apache.org/docs/1.10.2/spec.html#Local+timestamp+%28microsecond+precision%29,
>  Spark can convert TimestampNTZ type from/to Avro's Local timestamp type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36175) Support TimestampNTZ in Avro data source

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383025#comment-17383025
 ] 

Apache Spark commented on SPARK-36175:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33413

> Support TimestampNTZ in Avro data source 
> -
>
> Key: SPARK-36175
> URL: https://issues.apache.org/jira/browse/SPARK-36175
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Major
>
> As per the Avro spec 
> https://avro.apache.org/docs/1.10.2/spec.html#Local+timestamp+%28microsecond+precision%29,
>  Spark can convert TimestampNTZ type from/to Avro's Local timestamp type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36175) Support TimestampNTZ in Avro data source

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36175:


Assignee: (was: Apache Spark)

> Support TimestampNTZ in Avro data source 
> -
>
> Key: SPARK-36175
> URL: https://issues.apache.org/jira/browse/SPARK-36175
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Major
>
> As per the Avro spec 
> https://avro.apache.org/docs/1.10.2/spec.html#Local+timestamp+%28microsecond+precision%29,
>  Spark can convert TimestampNTZ type from/to Avro's Local timestamp type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36206) Diagnose shuffle data corruption by checksum

2021-07-19 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-36206:
-
Description: After adding checksums in SPARK-35276, we can leverage the 
checksums to do diagnosis for shuffle data corruption now.

> Diagnose shuffle data corruption by checksum
> 
>
> Key: SPARK-36206
> URL: https://issues.apache.org/jira/browse/SPARK-36206
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: wuyi
>Priority: Major
>
> After adding checksums in SPARK-35276, we can leverage the checksums to do 
> diagnosis for shuffle data corruption now.
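A minimal sketch of turning the shuffle checksums on, assuming the 
configuration keys introduced by SPARK-35276 (exact names and defaults may 
differ across versions):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Assumed configs from SPARK-35276; verify against your Spark version.
         .config("spark.shuffle.checksum.enabled", "true")
         .config("spark.shuffle.checksum.algorithm", "ADLER32")
         .getOrCreate())
{code}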



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35806) Mapping the `mode` argument to pandas in DataFrame.to_csv

2021-07-19 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35806:

Summary: Mapping the `mode` argument to pandas in DataFrame.to_csv  (was: 
Mapping the `mode` argument to pandas)

> Mapping the `mode` argument to pandas in DataFrame.to_csv
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html].
> And pandas has the same argument, but the acceptable strings are different.
> So, we should map the acceptable strings to pandas.
> e.g. mode=w will work as Spark's overwrite. In addition, mode can take 
> Spark's overwrite too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35806) Mapping the `mode` argument to pandas

2021-07-19 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35806:

Description: 
pandas and pandas-on-Spark both have an argument named `mode` in 
[DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html].

And pandas has the same argument, but the acceptable strings are different.

So, we should map the acceptable strings to pandas.

e.g. mode=w will work as Spark's overwrite. In addition, mode can take Spark's 
overwrite too.

  was:
pandas on Spark has an argument named `mode` in the APIs below:
 * 
[DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html]

 * 
[DataFrame.to_json|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_json.html]

 * 
[DataFrame.to_table|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_table.html]

 * 
[DataFrame.to_delta|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_delta.html]

 * 
[DataFrame.to_parquet|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html]

 * 
[DataFrame.to_orc|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_orc.html]

 * 
[DataFrame.to_spark_io|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html]

And pandas has the same argument, but the acceptable strings are different.

So, we should map the acceptable strings to pandas.

e.g. mode=w will work as Spark's overwrite. In addition, mode can take Spark's 
overwrite too.


> Mapping the `mode` argument to pandas
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html].
> And pandas has the same argument, but the acceptable strings are different.
> So, we should map the acceptable strings to pandas.
> e.g. mode=w will work as Spark's overwrite. In addition, mode can take 
> Spark's overwrite too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36134) jackson-databind RCE vulnerability

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36134.
--
Resolution: Invalid

> jackson-databind RCE vulnerability
> --
>
> Key: SPARK-36134
> URL: https://issues.apache.org/jira/browse/SPARK-36134
> Project: Spark
>  Issue Type: Task
>  Components: Java API
>Affects Versions: 3.1.2, 3.1.3
>Reporter: Sumit
>Priority: Major
> Attachments: Screenshot 2021-07-15 at 1.00.55 PM.png
>
>
> Need to upgrade the jackson-databind version to *2.9.3.1*.
> At the beginning of 2018, jackson-databind was reported to contain another 
> remote code execution (RCE) vulnerability (CVE-2017-17485) that affects 
> versions 2.9.3 and earlier, 2.7.9.1 and earlier, and 2.8.10 and earlier. This 
> vulnerability is caused by jackson-databind's incomplete blacklist. An 
> application that uses jackson-databind will become vulnerable when the 
> enableDefaultTyping method is called via the ObjectMapper object within the 
> application. An attacker can thus compromise the application by sending 
> maliciously crafted JSON input to gain direct control over a server. 
> Currently, a proof of concept (POC) exploit for this vulnerability has been 
> publicly available. All users who are affected by this vulnerability should 
> upgrade to the latest versions as soon as possible to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36185) Implement functions in CategoricalAccessor/CategoricalIndex

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383146#comment-17383146
 ] 

Hyukjin Kwon commented on SPARK-36185:
--

I think it's for Spark 3.2. Most of the fixes are being landed in branch-3.2 
for now (I guess it's considered an Alpha component, per 
https://spark.apache.org/versioning-policy.html ?).

> Implement functions in CategoricalAccessor/CategoricalIndex
> ---
>
> Key: SPARK-36185
> URL: https://issues.apache.org/jira/browse/SPARK-36185
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are functions we haven't implemented in {{CategoricalAccessor}} and 
> {{CategoricalIndex}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35806) Mapping the `mode` argument to pandas in DataFrame.to_csv

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35806:


Assignee: Apache Spark

> Mapping the `mode` argument to pandas in DataFrame.to_csv
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html], 
> but the acceptable strings are different.
> pandas accepts "w", "w+", "a", and "a+", whereas pandas-on-Spark accepts 
> "append", "overwrite", "ignore", "error", or "errorifexists".
> We should map these acceptable strings to pandas.
> e.g. "w" will work as Spark's "overwrite". In addition, mode can take Spark's 
> "overwrite" too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35806) Mapping the `mode` argument to pandas in DataFrame.to_csv

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383172#comment-17383172
 ] 

Apache Spark commented on SPARK-35806:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33414

> Mapping the `mode` argument to pandas in DataFrame.to_csv
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html], 
> but the acceptable strings are different.
> pandas accepts "w", "w+", "a", and "a+", whereas pandas-on-Spark accepts 
> "append", "overwrite", "ignore", "error", or "errorifexists".
> We should map these acceptable strings to pandas.
> e.g. "w" will work as Spark's "overwrite". In addition, mode can take Spark's 
> "overwrite" too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35806) Mapping the `mode` argument to pandas in DataFrame.to_csv

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35806:


Assignee: (was: Apache Spark)

> Mapping the `mode` argument to pandas in DataFrame.to_csv
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html], 
> but the acceptable strings are different.
> pandas accepts "w", "w+", "a", and "a+", whereas pandas-on-Spark accepts 
> "append", "overwrite", "ignore", "error", or "errorifexists".
> We should map these acceptable strings to pandas.
> e.g. "w" will work as Spark's "overwrite". In addition, mode can take Spark's 
> "overwrite" too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34806) Helper class for batch Dataset.observe()

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34806.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 31905
[https://github.com/apache/spark/pull/31905]

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
> Fix For: 3.3.0
>
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows collecting aggregate metrics over the data of a Dataset while it is 
> being processed during an action.
> These metrics are collected in a separate thread after registering 
> {{QueryExecutionListener}} for batch datasets and {{StreamingQueryListener}} 
> for stream datasets, respectively. While in a streaming context it makes 
> perfect sense to process incremental metrics in an event-based fashion, for 
> simple batch dataset processing, a single result should be retrievable 
> without the need to register listeners or handle threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.
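
For context, a rough sketch of how such a helper might be used on the batch side, based purely on the description above (an active {{spark}} session is assumed, and method names such as {{Observation.get}} are taken from the proposal; the final API may differ):

{code:scala}
import org.apache.spark.sql.Observation
import org.apache.spark.sql.functions._

// Hypothetical usage: attach an observation to a dataset, run an action,
// then read the collected metrics without registering a listener manually.
val observation = Observation("stats")
val observed = spark.range(100).toDF("id")
  .observe(observation, count(lit(1)).as("rows"), sum(col("id")).as("sum"))

observed.collect()            // an action must run before the metrics exist
val metrics = observation.get // blocks until the listener has delivered the result
{code}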



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34806) Helper class for batch Dataset.observe()

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34806:
---

Assignee: Enrico Minack

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
>
> The {{observe}} method was added to the {{Dataset}} API in 3.0.0. It 
> allows collecting aggregate metrics over the data of a Dataset while it is 
> being processed during an action.
> These metrics are collected in a separate thread after registering a 
> {{QueryExecutionListener}} for batch datasets or a {{StreamingQueryListener}} 
> for streaming datasets, respectively. While in a streaming context it makes 
> perfect sense to process incremental metrics in an event-based fashion, for 
> simple batch dataset processing a single result should be retrievable 
> without the need to register listeners or handle threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-07-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383128#comment-17383128
 ] 

Hyukjin Kwon commented on SPARK-36088:
--

Does your driver run inside a pod or on a physical host?

> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running Spark in a k8s cluster, there are 2 deploy modes: cluster and 
> client. From my testing, in cluster mode *spark.archives* extracts the 
> archive file into the working directory of both the executors and the driver. But in 
> client mode, *spark.archives* only extracts the archive file into the 
> working directory of the executors.
>  
> However, I need *spark.archives* to send the virtual environment tar file 
> packaged by conda to both the driver and executors under client mode (so that 
> the executors and the driver have the same Python environment).
>  
> Why does *spark.archives* not extract the archive file into the working 
> directory of the driver under client mode?
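
For reference, a minimal sketch of the configuration in question, assuming a conda-packed environment (the archive name and the {{#}} alias are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// spark.archives takes a comma-separated list of archives; the part after "#"
// is the directory name the archive is unpacked into in the working directory.
val spark = SparkSession.builder()
  .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
  .getOrCreate()
{code}

The question above is why, in client mode, this unpacking happens only on the executors and not in the driver's working directory.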



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36192) Better error messages when comparing against list

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36192:
-
Description: We shall throw TypeError messages rather than Spark exceptions.

> Better error messages when comparing against list 
> --
>
> Key: SPARK-36192
> URL: https://issues.apache.org/jira/browse/SPARK-36192
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: We shall throw TypeError messages rather than Spark 
> exceptions.
>Reporter: Xinrong Meng
>Priority: Major
>
> We shall throw TypeError messages rather than Spark exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36163.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/33370

> Propagate correct JDBC properties in JDBC connector provider and add 
> "connectionProvider" option
> 
>
> Key: SPARK-36163
> URL: https://issues.apache.org/jira/browse/SPARK-36163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Ivan
>Assignee: Ivan
>Priority: Major
> Fix For: 3.3.0
>
>
> There are a couple of issues with JDBC connection providers. The first is a 
> bug caused by 
> [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653]
>  where we would pass all properties, including JDBC data source keys, to the 
> JDBC driver which results in errors like {{java.sql.SQLException: 
> Unrecognized connection property 'url'}}.
> Connection properties are supposed to only include vendor properties, url 
> config is a JDBC option and should be excluded.
> The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with 
> {{jdbcOptions.asConnectionProperties.asScala.foreach}} which is 
> java.sql.Driver friendly.
>  
> I also investigated the problem with multiple providers and I think there are 
> a couple of oversights in {{ConnectionProvider}} implementation. I think it 
> is missing two things:
>  * Any {{JdbcConnectionProvider}} should take precedence over 
> {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be 
> selected if there was no match found when inferring providers that can handle 
> JDBC url.
>  * There is currently no way to select a specific provider that you want, 
> similar to how you can select a JDBC driver. The use case is, for example, 
> having connection providers for two databases that handle the same URL but 
> have slightly different semantics and you want to select one in one case and 
> the other one in others.
>  ** I think the first point could be discarded when the second one is 
> addressed.
> You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
> exclude ones that don’t need to be included, but I am not quite sure why it 
> was done that way - it is much simpler to allow users to enforce the provider 
> they want.
> This ticket fixes it by adding a {{connectionProvider}} option to the JDBC 
> data source that allows users to select a particular provider when the 
> ambiguity arises.
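
A hedged sketch of what the proposed option could look like from the user side ({{connectionProvider}} is the option name proposed above; the URL, table and provider names are illustrative):

{code:scala}
// Explicitly pin the JDBC connection provider instead of relying on
// URL-based inference, for the case where two providers handle the same URL.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")   // illustrative URL
  .option("dbtable", "public.events")                     // illustrative table
  .option("connectionProvider", "basic")                  // illustrative provider name
  .load()
{code}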



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36163:


Assignee: Ivan

> Propagate correct JDBC properties in JDBC connector provider and add 
> "connectionProvider" option
> 
>
> Key: SPARK-36163
> URL: https://issues.apache.org/jira/browse/SPARK-36163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Ivan
>Assignee: Ivan
>Priority: Major
>
> There are a couple of issues with JDBC connection providers. The first is a 
> bug caused by 
> [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653]
>  where we would pass all properties, including JDBC data source keys, to the 
> JDBC driver which results in errors like {{java.sql.SQLException: 
> Unrecognized connection property 'url'}}.
> Connection properties are supposed to only include vendor properties, url 
> config is a JDBC option and should be excluded.
> The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with 
> {{jdbcOptions.asConnectionProperties.asScala.foreach}} which is 
> java.sql.Driver friendly.
>  
> I also investigated the problem with multiple providers and I think there are 
> a couple of oversights in {{ConnectionProvider}} implementation. I think it 
> is missing two things:
>  * Any {{JdbcConnectionProvider}} should take precedence over 
> {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be 
> selected if there was no match found when inferring providers that can handle 
> JDBC url.
>  * There is currently no way to select a specific provider that you want, 
> similar to how you can select a JDBC driver. The use case is, for example, 
> having connection providers for two databases that handle the same URL but 
> have slightly different semantics and you want to select one in one case and 
> the other one in others.
>  ** I think the first point could be discarded when the second one is 
> addressed.
> You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
> exclude ones that don’t need to be included, but I am not quite sure why it 
> was done that way - it is much simpler to allow users to enforce the provider 
> they want.
> This ticket fixes it by adding a {{connectionProvider}} option to the JDBC 
> data source that allows users to select a particular provider when the 
> ambiguity arises.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2021-07-19 Thread tiejiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383192#comment-17383192
 ] 

tiejiang commented on SPARK-24965:
--

I have a similar question, see the link, can anyone answer it, thank you very 
much! :)

https://stackoverflow.com/questions/68437779/error-when-spark-sql-read-parquet-table-with-text-partition

> Spark SQL fails when reading a partitioned hive table with different formats 
> per partition
> --
>
> Key: SPARK-24965
> URL: https://issues.apache.org/jira/browse/SPARK-24965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kris Geusebroek
>Priority: Major
>  Labels: bulk-closed, pull-request-available
>
> When a Hive Parquet partitioned table contains a partition with a different 
> format (Avro, for example), select * fails with a read exception (the Avro file 
> is not a Parquet file).
> Selecting in Hive works as expected.
> To support this, a new SQL syntax also needed to be supported:
>  * ALTER TABLE   SET FILEFORMAT 
> This is included in the same PR since the unit test needs it to set up the 
> test data.
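
Spelled out with illustrative placeholders, the Hive-style statement referenced above would look roughly like this (table, partition and format are made up; whether Spark accepts it is exactly what the ticket proposes):

{code:scala}
// Proposed syntax sketch only: change the file format of a single partition.
sql("ALTER TABLE sales PARTITION (dt = '2018-08-01') SET FILEFORMAT AVRO")
{code}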



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36178:


Assignee: Dominik Gehl

> Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
> --
>
> Key: SPARK-36178
> URL: https://issues.apache.org/jira/browse/SPARK-36178
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Dominik Gehl
>Priority: Minor
>
> PySpark Catalog API currently isn't documented in 
> docs/source/reference/pyspark.sql.rst



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36178) Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36178.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33392
[https://github.com/apache/spark/pull/33392]

> Document PySpark Catalog APIs in docs/source/reference/pyspark.sql.rst
> --
>
> Key: SPARK-36178
> URL: https://issues.apache.org/jira/browse/SPARK-36178
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Dominik Gehl
>Priority: Minor
> Fix For: 3.2.0
>
>
> PySpark Catalog API currently isn't documented in 
> docs/source/reference/pyspark.sql.rst



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36091) Support TimestampNTZ type in expression TimeWindow

2021-07-19 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-36091:
--

Assignee: jiaan.geng

> Support TimestampNTZ  type in expression TimeWindow
> ---
>
> Key: SPARK-36091
> URL: https://issues.apache.org/jira/browse/SPARK-36091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36091) Support TimestampNTZ type in expression TimeWindow

2021-07-19 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36091.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33341
[https://github.com/apache/spark/pull/33341]

> Support TimestampNTZ  type in expression TimeWindow
> ---
>
> Key: SPARK-36091
> URL: https://issues.apache.org/jira/browse/SPARK-36091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36208:
---
Summary: SparkScriptTransformation should support ANSI interval types  
(was: SparkScriptTransformation )

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.
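
For illustration, the kind of no-serde script transform the change targets (a minimal sketch; the pass-through script {{cat}} and the column names are made up, and before this change the ANSI interval column would not be supported):

{code:scala}
// Run a script transform over rows that carry an ANSI day-time interval;
// 'cat' simply echoes its input, so the interval value must round-trip as text.
sql("""
  SELECT TRANSFORM(id, d) USING 'cat' AS (id STRING, d STRING)
  FROM (SELECT id, INTERVAL '1 02:03:04' DAY TO SECOND AS d FROM range(3))
""").show()
{code}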



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383330#comment-17383330
 ] 

Apache Spark commented on SPARK-36208:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33419

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36208:


Assignee: Apache Spark  (was: Kousuke Saruta)

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383329#comment-17383329
 ] 

Apache Spark commented on SPARK-36208:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33419

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36208) SparkScriptTransformation should support ANSI interval types

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36208:


Assignee: Kousuke Saruta  (was: Apache Spark)

> SparkScriptTransformation should support ANSI interval types
> 
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Dominik Gehl (Jira)
Dominik Gehl created SPARK-36209:


 Summary: 
https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
invalid link to Python doc
 Key: SPARK-36209
 URL: https://issues.apache.org/jira/browse/SPARK-36209
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.1.2
 Environment: On 
https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
the python doc points to 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
 which returns a "Not found"
Reporter: Dominik Gehl






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread koert kuipers (Jira)
koert kuipers created SPARK-36210:
-

 Summary: Preserve column insertion order in Dataset.withColumns
 Key: SPARK-36210
 URL: https://issues.apache.org/jira/browse/SPARK-36210
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: koert kuipers


Dataset.withColumns uses a Map (columnMap) to store the mapping of column name 
to column. However, this loses the order of the columns. Also, none of the 
operations used on the Map (find and filter) benefit from the map's lookup 
features, so it seems simpler to use a Seq instead, which also preserves the 
insertion order.
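
A small, Spark-independent illustration of the ordering concern (plain Scala, not the Dataset internals):

{code:scala}
// Once an immutable Map grows past a handful of entries it becomes a HashMap,
// so iteration order no longer matches insertion order; a Seq of pairs does.
val pairs = ('a' to 'f').map(c => c.toString -> 1)
println(pairs.toMap.keys.mkString(","))   // order may differ from a,b,c,d,e,f
println(pairs.map(_._1).mkString(","))    // always a,b,c,d,e,f
{code}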



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36205) Use set-env instead of set-output in GitHub Actions

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36205.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33412
[https://github.com/apache/spark/pull/33412]

> Use set-env instead of set-output in GitHub Actions
> ---
>
> Key: SPARK-36205
> URL: https://issues.apache.org/jira/browse/SPARK-36205
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.2.0
>
>
> In some places in GitHub Actions, we use set-output to set an environment 
> variable. We can just use set-env instead.
> The PR was opened first. Please refer to the PR open against this JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Dominik Gehl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Gehl updated SPARK-36209:
-
Description: 
On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
to the python doc points to 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
 which returns a "Not found"


> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36209:


Assignee: Apache Spark

> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Assignee: Apache Spark
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383354#comment-17383354
 ] 

Apache Spark commented on SPARK-36166:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33421

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36209:


Assignee: (was: Apache Spark)

> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36205) Use set-env instead of set-output in GitHub Actions

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36205:


Assignee: Hyukjin Kwon

> Use set-env instead of set-output in GitHub Actions
> ---
>
> Key: SPARK-36205
> URL: https://issues.apache.org/jira/browse/SPARK-36205
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> In some places in GitHub Actions, we use set-output to set an environment 
> variable. We can just use set-env instead.
> The PR was opened first. Please refer to the PR open against this JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36181:


Assignee: Dominik Gehl

> Update pyspark sql readwriter documentation to Scala level
> --
>
> Key: SPARK-36181
> URL: https://issues.apache.org/jira/browse/SPARK-36181
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Dominik Gehl
>Priority: Trivial
>
> Update pyspark sql readwriter documentation to the level of detail the Scala 
> documentation provides



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36181) Update pyspark sql readwriter documentation to Scala level

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36181.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33394
[https://github.com/apache/spark/pull/33394]

> Update pyspark sql readwriter documentation to Scala level
> --
>
> Key: SPARK-36181
> URL: https://issues.apache.org/jira/browse/SPARK-36181
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Dominik Gehl
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Update pyspark sql readwriter documentation to the level of detail the Scala 
> documentation provides



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35806) Mapping the `mode` argument to pandas in DataFrame.to_csv

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35806.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33414
[https://github.com/apache/spark/pull/33414]

> Mapping the `mode` argument to pandas in DataFrame.to_csv
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html], 
> but the acceptable strings are different.
> pandas accepts "w", "w+", "a", "a+", whereas pandas-on-Spark accepts 
> "append", "overwrite", "ignore", "error" or "errorifexists".
> We should map these acceptable strings to pandas: e.g. "w" should work as Spark's 
> "overwrite". In addition, mode should accept Spark's "overwrite" too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35806) Mapping the `mode` argument to pandas in DataFrame.to_csv

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35806:


Assignee: Haejoon Lee

> Mapping the `mode` argument to pandas in DataFrame.to_csv
> -
>
> Key: SPARK-35806
> URL: https://issues.apache.org/jira/browse/SPARK-35806
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> pandas and pandas-on-Spark both have an argument named `mode` in 
> [DataFrame.to_csv|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html], 
> but the acceptable strings are different.
> pandas accepts "w", "w+", "a", "a+", whereas pandas-on-Spark accepts 
> "append", "overwrite", "ignore", "error" or "errorifexists".
> We should map these acceptable strings to pandas: e.g. "w" should work as Spark's 
> "overwrite". In addition, mode should accept Spark's "overwrite" too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36093) The result incorrect if the partition path case is inconsistent

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383322#comment-17383322
 ] 

Apache Spark commented on SPARK-36093:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33418

> The result incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> ++--+
> scala> sql("SELECT * FROM t2 ").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> |   1|2021-06-27|
> |   1|2021-06-28|
> ++--+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> |   2|2021-06-29|
> |   2|2021-06-30|
> ++--+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36207:


Assignee: Apache Spark

> Export databaseExists in pyspark.sql.catalog
> 
>
> Key: SPARK-36207
> URL: https://issues.apache.org/jira/browse/SPARK-36207
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Apache Spark
>Priority: Minor
>
> Expose in PySpark the databaseExists method, which is part of the Scala implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36207:


Assignee: (was: Apache Spark)

> Export databaseExists in pyspark.sql.catalog
> 
>
> Key: SPARK-36207
> URL: https://issues.apache.org/jira/browse/SPARK-36207
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Priority: Minor
>
> Expose in PySpark the databaseExists method, which is part of the Scala implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36208) SparkScriptTransformation

2021-07-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36208:
---
Parent: SPARK-27790
Issue Type: Sub-task  (was: Bug)

> SparkScriptTransformation 
> --
>
> Key: SPARK-36208
> URL: https://issues.apache.org/jira/browse/SPARK-36208
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SparkScriptTransformation supports CalendarIntervalType so it's better to 
> support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383323#comment-17383323
 ] 

Apache Spark commented on SPARK-36207:
--

User 'dominikgehl' has created a pull request for this issue:
https://github.com/apache/spark/pull/33416

> Export databaseExists in pyspark.sql.catalog
> 
>
> Key: SPARK-36207
> URL: https://issues.apache.org/jira/browse/SPARK-36207
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Priority: Minor
>
> Expose in PySpark the databaseExists method, which is part of the Scala implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36207) Export databaseExists in pyspark.sql.catalog

2021-07-19 Thread Dominik Gehl (Jira)
Dominik Gehl created SPARK-36207:


 Summary: Export databaseExists in pyspark.sql.catalog
 Key: SPARK-36207
 URL: https://issues.apache.org/jira/browse/SPARK-36207
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.2
Reporter: Dominik Gehl


Expose in PySpark the databaseExists method, which is part of the Scala implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36208) SparkScriptTransformation

2021-07-19 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36208:
--

 Summary: SparkScriptTransformation 
 Key: SPARK-36208
 URL: https://issues.apache.org/jira/browse/SPARK-36208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


SparkScriptTransformation supports CalendarIntervalType so it's better to 
support ANSI interval types as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36093) The result incorrect if the partition path case is inconsistent

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383321#comment-17383321
 ] 

Apache Spark commented on SPARK-36093:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/33417

> The result incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> ++--+
> scala> sql("SELECT * FROM t2 ").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> |   1|2021-06-27|
> |   1|2021-06-28|
> ++--+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> |   2|2021-06-29|
> |   2|2021-06-30|
> ++--+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36209) https://spark.apache.org/docs/latest/sql-programming-guide.html contains invalid link to Python doc

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383355#comment-17383355
 ] 

Apache Spark commented on SPARK-36209:
--

User 'dominikgehl' has created a pull request for this issue:
https://github.com/apache/spark/pull/33420

> https://spark.apache.org/docs/latest/sql-programming-guide.html contains 
> invalid link to Python doc
> ---
>
> Key: SPARK-36209
> URL: https://issues.apache.org/jira/browse/SPARK-36209
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.1.2
> Environment: On 
> https://spark.apache.org/docs/latest/sql-programming-guide.html, the link to 
> the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"
>Reporter: Dominik Gehl
>Priority: Major
>
> On https://spark.apache.org/docs/latest/sql-programming-guide.html , the link 
> to the python doc points to 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
>  which returns a "Not found"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Luran He (Jira)
Luran He created SPARK-36211:


 Summary: type check fails for `F.udf(...).asNonDeterministic()
 Key: SPARK-36211
 URL: https://issues.apache.org/jira/browse/SPARK-36211
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.2
Reporter: Luran He


The following code should type-check, but doesn't:

{{import uuid}}

{{import pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the `udf` signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Luran He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luran He updated SPARK-36211:
-
Description: 
The following code should type-check, but doesn't:

{{import uuid}}

{{import pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong

  was:
The following code should type-check, but doesn't:

{{import uuid}}

{{pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the `udf` signature is wrong


> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Luran He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luran He updated SPARK-36211:
-
Description: 
The following code should type-check, but doesn't:


{{import uuid}}

{{import pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}


In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong

  was:
The following code should type-check, but doesn't:

{{import uuid}}

{{pyspark.sql.functions as F}}

{{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}

In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong


> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383415#comment-17383415
 ] 

Apache Spark commented on SPARK-36210:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/33423

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features, so it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36210:


Assignee: (was: Apache Spark)

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features, so it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36093) The result incorrect if the partition path case is inconsistent

2021-07-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36093.
-
Fix Version/s: 3.1.3
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 33417
[https://github.com/apache/spark/pull/33417]

> The result incorrect if the partition path case is inconsistent
> ---
>
> Key: SPARK-36093
> URL: https://issues.apache.org/jira/browse/SPARK-36093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3
>
>
> Please reproduce this issue using HDFS. Local HDFS can not reproduce this 
> issue.
> {code:scala}
> sql("create table t1(cal_dt date) using parquet")
> sql("insert into t1 values 
> (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
> sql("create view t1_v as select * from t1")
> sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS 
> FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
> sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'")
> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> sql("SELECT * FROM t2 ").show
> {code}
> {noformat}
> // It should not be empty.
> scala> sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND 
> '2021-06-30'").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> ++--+
> scala> sql("SELECT * FROM t2 ").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> |   1|2021-06-27|
> |   1|2021-06-28|
> ++--+
> scala> sql("SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN 
> '2021-06-29' AND '2021-06-30'").show
> ++--+
> |FLAG|CAL_DT|
> ++--+
> |   2|2021-06-29|
> |   2|2021-06-30|
> ++--+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383414#comment-17383414
 ] 

Apache Spark commented on SPARK-36210:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/33423

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features, so it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36210) Preserve column insertion order in Dataset.withColumns

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36210:


Assignee: Apache Spark

> Preserve column insertion order in Dataset.withColumns
> --
>
> Key: SPARK-36210
> URL: https://issues.apache.org/jira/browse/SPARK-36210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Minor
>
> Dataset.withColumns uses a Map (columnMap) to store the mapping of column 
> name to column. However, this loses the order of the columns. Also, none of the 
> operations used on the Map (find and filter) benefit from the map's lookup 
> features, so it seems simpler to use a Seq instead, which also preserves the 
> insertion order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36212) Add exception for Kafka readstream when decryption fails

2021-07-19 Thread Jon LaFlamme (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon LaFlamme updated SPARK-36212:
-
Fix Version/s: (was: 3.1.0)
   3.0.0

> Add exception for Kafka readstream when decryption fails
> 
>
> Key: SPARK-36212
> URL: https://issues.apache.org/jira/browse/SPARK-36212
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
>Reporter: Jon LaFlamme
>Priority: Minor
>  Labels: exceptions, warnings
> Fix For: 3.0.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> A silent failure is possible when reading from a Kafka broker under the 
> following circumstances:
> SDF.isStreaming = True
> SDF.printSchema() => returns expected schema
> Query results are empty.
> Issue: TLS decryption has failed, but there is no exception or warning.
> Request: Add a warning or throw an exception when decryption fails, so that 
> developers can efficiently diagnose the readstream problem.
>  
> This is my first ticket submitted. Please notify me if I should change 
> anything in this ticket to make it more conformant to community standards. 
> I'm still a beginner with Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36212) Add exception for Kafka readstream when decryption fails

2021-07-19 Thread Jon LaFlamme (Jira)
Jon LaFlamme created SPARK-36212:


 Summary: Add exception for Kafka readstream when decryption fails
 Key: SPARK-36212
 URL: https://issues.apache.org/jira/browse/SPARK-36212
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.0.0
 Environment: Spark 3.0.0
Reporter: Jon LaFlamme
 Fix For: 3.1.0


A silent failure is possible when reading from a Kafka broker under the 
following circumstances:

SDF.isStreaming = True

SDF.printSchema() => returns expected schema

Query results are empty.

Issue: TLS decryption has failed, but there is no exception or warning.

Request: Add a warning or throw an exception when decryption fails, so that 
developers can efficiently diagnose the readstream problem.
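For context, a minimal Scala sketch of the kind of TLS-secured readStream where this silent failure can surface; the broker address, topic, and truststore path are placeholders, and the kafka.* options are standard Kafka consumer settings forwarded by the Spark Kafka source:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-tls-readstream").getOrCreate()

// Placeholder broker, topic and truststore; kafka.* options are passed through
// to the underlying Kafka consumer.
val sdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9093")
  .option("subscribe", "events")
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
  .option("kafka.ssl.truststore.password", "********")
  .load()

sdf.isStreaming     // true
sdf.printSchema()   // shows the expected key/value/topic/partition/offset schema

// If TLS decryption fails (e.g. a bad truststore), the running query can simply
// return no rows instead of raising an exception or warning.
val query = sdf.writeStream.format("console").start()
query.awaitTermination()
{code}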

 

This is my first ticket submitted. Please notify me if I should change anything 
in this ticket to make it more conformant to community standards. I'm still a 
beginner with Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36211:


Assignee: Apache Spark

> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383385#comment-17383385
 ] 

Apache Spark commented on SPARK-36211:
--

User 'luranhe' has created a pull request for this issue:
https://github.com/apache/spark/pull/33399

> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36211) type check fails for `F.udf(...).asNonDeterministic()

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36211:


Assignee: (was: Apache Spark)

> type check fails for `F.udf(...).asNonDeterministic()
> -
>
> Key: SPARK-36211
> URL: https://issues.apache.org/jira/browse/SPARK-36211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Luran He
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The following code should type-check, but doesn't:
> 
> {{import uuid}}
> {{import pyspark.sql.functions as F}}
> {{my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()}}
> 
> In {{python/pyspark/sql/functions.pyi}} the {{udf}} signature is wrong



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34806) Helper class for batch Dataset.observe()

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383394#comment-17383394
 ] 

Apache Spark commented on SPARK-34806:
--

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/33422

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
> Fix For: 3.3.0
>
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows collecting aggregate metrics over the data of a Dataset while it is 
> being processed during an action.
> These metrics are collected in a separate thread after registering a 
> {{QueryExecutionListener}} for batch datasets and a {{StreamingQueryListener}} 
> for stream datasets, respectively. While in a streaming context it makes 
> perfect sense to process incremental metrics in an event-based fashion, for 
> simple batch dataset processing a single result should be retrievable without 
> the need to register listeners or handle threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.
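A rough Scala usage sketch of the proposed helper, following the method names described above; the {{Observation}} class, its {{on}} method (including the assumption that it returns the observed Dataset), and the metric names are all hypothetical at this point:

{code:scala}
// Hypothetical usage sketch: Observation and its on/get/option/waitCompleted/
// reset/close methods follow the description above and are not a confirmed API.
// `df` is assumed to be an existing batch Dataset with a numeric column "value";
// Dataset.observe itself exists since Spark 3.0.
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.functions.{col, count, lit, max}

val observation = new Observation("stats")

// Define the metrics via Dataset.observe and register the helper with the session.
val observed = observation.on(
  df.observe("stats", count(lit(1)).as("rows"), max(col("value")).as("max_value")))

observed.collect()  // an action must run, otherwise the calls below just time out

val metrics   = observation.get                            // blocks until the metrics arrive
val asOption  = observation.option(10, TimeUnit.SECONDS)   // None if the timeout is hit
val completed = observation.waitCompleted(10, TimeUnit.SECONDS)

observation.reset()  // observe another action with the same instance
observation.close()  // unregister the listener from the session
{code}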



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Kent Yao (Jira)
Kent Yao created SPARK-36213:


 Summary: Normalize PartitionSpec for DescTable with PartitionSpec
 Key: SPARK-36213
 URL: https://issues.apache.org/jira/browse/SPARK-36213
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2, 3.0.3, 2.4.8, 3.2.0
Reporter: Kent Yao


!image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383485#comment-17383485
 ] 

Apache Spark commented on SPARK-36213:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/33424

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> !image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36213:


Assignee: Apache Spark

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> !image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36213) Normalize PartitionSpec for DescTable with PartitionSpec

2021-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36213:


Assignee: (was: Apache Spark)

> Normalize PartitionSpec for DescTable with PartitionSpec
> 
>
> Key: SPARK-36213
> URL: https://issues.apache.org/jira/browse/SPARK-36213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> !image-2021-07-20-01-03-49-573.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-07-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383529#comment-17383529
 ] 

Thomas Graves commented on SPARK-25075:
---

Just wanted to check the plans for scala 2.13 in 3.2.  It looks like scala 2.12 
will still be the default, correct?

Are we planning on releasing the Spark tgz artifacts for 2.13 and 2.12 or only 
2.12?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35997) Implement comparison operators for CategoricalDtype in pandas API on Spark

2021-07-19 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-35997.
--
Resolution: Done

> Implement comparison operators for CategoricalDtype in pandas API on Spark
> --
>
> Key: SPARK-35997
> URL: https://issues.apache.org/jira/browse/SPARK-35997
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> In pandas API on Spark, "<, <=, >, >=" have not been implemented for 
> CategoricalDtype.
> We ought to match pandas' behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36214) Add add_categories to CategoricalAccessor and CategoricalIndex.

2021-07-19 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383641#comment-17383641
 ] 

Takuya Ueshin commented on SPARK-36214:
---

I'm working on this.

> Add add_categories to CategoricalAccessor and CategoricalIndex.
> ---
>
> Key: SPARK-36214
> URL: https://issues.apache.org/jira/browse/SPARK-36214
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36000) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-07-19 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383667#comment-17383667
 ] 

Xinrong Meng commented on SPARK-36000:
--

We might want to support spark.createDataFrame(data=[decimal.Decimal('NaN')], 
schema='decimal') first.

> Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
> 
>
> Key: SPARK-36000
> URL: https://issues.apache.org/jira/browse/SPARK-36000
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> {code:java}
> >>> import decimal as d
> >>> import pyspark.pandas as ps
> >>> import numpy as np
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  True)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 0   1
> 1   2
> 2None
> dtype: object
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  False)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51)
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
> ...
> {code}
> As shown in the code above, we cannot create a Series with `Decimal('NaN')` 
> when Arrow is disabled. We ought to fix that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32920) Add support in Spark driver to coordinate the finalization of the push/merge phase in push-based shuffle for a given shuffle and the initiation of the reduce stage

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383669#comment-17383669
 ] 

Apache Spark commented on SPARK-32920:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/33426

> Add support in Spark driver to coordinate the finalization of the push/merge 
> phase in push-based shuffle for a given shuffle and the initiation of the 
> reduce stage
> ---
>
> Key: SPARK-32920
> URL: https://issues.apache.org/jira/browse/SPARK-32920
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.2.0
>
>
> With push-based shuffle, we are currently decoupling map task executions from 
> the shuffle block push process. Thus, when all map tasks finish, we might 
> want to wait for some small extra time to allow more shuffle blocks to get 
> pushed and merged. This requires some extra coordination in the Spark driver 
> when it transitions from a shuffle map stage to the corresponding reduce 
> stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36176) Expose tableExists in pyspark.sql.catalog

2021-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36176.
--
Fix Version/s: 3.2.0
 Assignee: Dominik Gehl
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/33388

> Expose tableExists in pyspark.sql.catalog
> -
>
> Key: SPARK-36176
> URL: https://issues.apache.org/jira/browse/SPARK-36176
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Assignee: Dominik Gehl
>Priority: Minor
> Fix For: 3.2.0
>
>
> Expose tableExists, which is already part of the Scala Catalog implementation, in PySpark.
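For reference, the existing Scala-side API this ticket asks to mirror in PySpark; the database and table names below are placeholders:

{code:scala}
// Both overloads already exist on the Scala side (org.apache.spark.sql.catalog.Catalog);
// "salesdb" and "orders" are made-up names.
spark.catalog.tableExists("orders")             // table or view in the current database
spark.catalog.tableExists("salesdb", "orders")  // explicit database and table
{code}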



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36127) Support comparison between a Categorical and a scalar

2021-07-19 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36127.
---
Fix Version/s: 3.2.0
 Assignee: Xinrong Meng  (was: Apache Spark)
   Resolution: Fixed

Issue resolved by pull request 33373
https://github.com/apache/spark/pull/33373

> Support comparison between a Categorical and a scalar
> -
>
> Key: SPARK-36127
> URL: https://issues.apache.org/jira/browse/SPARK-36127
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383663#comment-17383663
 ] 

Apache Spark commented on SPARK-32919:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/33425

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.1.0
>
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this would be behind a 
> pluggable interface so that we can potentially leverage information tracked 
> outside of a Spark application for better load balancing or for a 
> disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32919) Add support in Spark driver to coordinate the shuffle map stage in push-based shuffle by selecting external shuffle services for merging shuffle partitions

2021-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383664#comment-17383664
 ] 

Apache Spark commented on SPARK-32919:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/33425

> Add support in Spark driver to coordinate the shuffle map stage in push-based 
> shuffle by selecting external shuffle services for merging shuffle partitions
> ---
>
> Key: SPARK-32919
> URL: https://issues.apache.org/jira/browse/SPARK-32919
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.1.0
>
>
> At the beginning of a shuffle map stage, the driver needs to select external 
> shuffle services as the mergers of the shuffle partitions for the 
> corresponding shuffle.
> We currently leverage the immediately available information about current and 
> past executor locations for this selection. Ideally, this would be behind a 
> pluggable interface so that we can potentially leverage information tracked 
> outside of a Spark application for better load balancing or for a 
> disaggregated deployment environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


