[jira] [Resolved] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
[ https://issues.apache.org/jira/browse/SPARK-36973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36973. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34251 [https://github.com/apache/spark/pull/34251] > Deduplicate prepare data method for HistogramPlotBase and KdePlotBase > - > > Key: SPARK-36973 > URL: https://issues.apache.org/jira/browse/SPARK-36973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
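For readers unfamiliar with this kind of refactor, here is a minimal sketch of what deduplicating a shared prepare-data step can look like. The helper name prepare_plot_data and the pandas-based column selection are illustrative assumptions, not the actual pyspark.pandas code merged in pull request 34251.

{code:python}
import pandas as pd

def prepare_plot_data(data: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Hypothetical shared helper: keep only numeric columns, optionally a subset."""
    numeric = data.select_dtypes(include="number")
    if columns is not None:
        numeric = numeric[[c for c in columns if c in numeric.columns]]
    return numeric

class HistogramPlotBase:
    @staticmethod
    def prepare_hist_data(data, columns=None):
        # Delegate to the shared helper instead of duplicating the selection logic.
        return prepare_plot_data(data, columns)

class KdePlotBase:
    @staticmethod
    def prepare_kde_data(data, columns=None):
        # Same helper, so a fix in one place benefits both plot bases.
        return prepare_plot_data(data, columns)
{code}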
[jira] [Assigned] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
[ https://issues.apache.org/jira/browse/SPARK-36973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36973: Assignee: dch nguyen > Deduplicate prepare data method for HistogramPlotBase and KdePlotBase > - > > Key: SPARK-36973 > URL: https://issues.apache.org/jira/browse/SPARK-36973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36994) Upgrade Apache Thrift
kaja girish created SPARK-36994:
---
Summary: Upgrade Apache Thrift
Key: SPARK-36994
URL: https://issues.apache.org/jira/browse/SPARK-36994
Project: Spark
Issue Type: Bug
Components: Security
Affects Versions: 3.0.1
Reporter: kaja girish

*Image:*
* spark:3.0.1

*Components Affected:*
* Apache Thrift

*Recommendation:*
* upgrade Apache Thrift

*CVE:*
|Component Name|Component Version Name|Vulnerability|Fixed version|
|Apache Thrift|0.11.0-4.|CVE-2019-0205|0.13.0|
|Apache Thrift|0.11.0-4.|CVE-2019-0210|0.13.0|
|Apache Thrift|0.11.0-4.|CVE-2020-13949|0.14.1|

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
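As a small illustration of how the table above reads, the following hedged Python sketch encodes the reported fixed versions and flags which CVEs a given Thrift version is still exposed to. The packaging dependency and helper name are assumptions for illustration only.

{code:python}
# Data taken from the CVE table above; packaging.version is used only for comparison.
from packaging.version import Version

FIXED_IN = {
    "CVE-2019-0205": Version("0.13.0"),
    "CVE-2019-0210": Version("0.13.0"),
    "CVE-2020-13949": Version("0.14.1"),
}

def vulnerable_cves(thrift_version: str):
    """Return the CVEs from the report that the given Thrift version predates."""
    v = Version(thrift_version)
    return [cve for cve, fixed in FIXED_IN.items() if v < fixed]

print(vulnerable_cves("0.11.0"))  # all three CVEs from the table
print(vulnerable_cves("0.14.1"))  # []
{code}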
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column C1 with data type Int and have only one record * File 2: Same schema with File 1 except column C1 having data type String and having>= X records X depends on the capacity of your computer, my case is 36, you can increase the number of row to find X. Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column C1 changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36, you can increase the number of row to find X. Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36, you can increase the number of row to find X. Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36 Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (
[jira] [Commented] (SPARK-36972) Add max_by/min_by API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428036#comment-17428036 ] Apache Spark commented on SPARK-36972: -- User 'yoda-mon' has created a pull request for this issue: https://github.com/apache/spark/pull/34269 > Add max_by/min_by API to PySpark > > > Key: SPARK-36972 > URL: https://issues.apache.org/jira/browse/SPARK-36972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Minor > Fix For: 3.3.0 > > > Related issues > - https://issues.apache.org/jira/browse/SPARK-27653 > * https://issues.apache.org/jira/browse/SPARK-36963 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
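A brief usage sketch of the proposed PySpark functions; the DataFrame, column names, and aliases are made up for illustration, and the exact API is whatever lands with the pull request above.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("sales", "alice", 100), ("sales", "bob", 120), ("eng", "carol", 90)],
    ["dept", "name", "salary"],
)

# max_by/min_by return the value of the first column at the max/min of the second.
df.groupBy("dept").agg(
    F.max_by("name", "salary").alias("top_earner"),
    F.min_by("name", "salary").alias("lowest_earner"),
).show()
{code}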
[jira] [Resolved] (SPARK-36976) Add max_by/min_by API to SparkR
[ https://issues.apache.org/jira/browse/SPARK-36976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36976. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34258 [https://github.com/apache/spark/pull/34258] > Add max_by/min_by API to SparkR > --- > > Key: SPARK-36976 > URL: https://issues.apache.org/jira/browse/SPARK-36976 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Minor > Fix For: 3.3.0 > > > Related issues > - https://issues.apache.org/jira/browse/SPARK-27653 > * https://issues.apache.org/jira/browse/SPARK-36963 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Attachment: (was: file1.parquet) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than X records > - > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > > Precondition: > Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= X records > X depends on the capacity of your computer, my case is 36 > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Attachment: (was: file2.parquet) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than X records > - > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > > Precondition: > Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= X records > X depends on the capacity of your computer, my case is 36 > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36976) Add max_by/min_by API to SparkR
[ https://issues.apache.org/jira/browse/SPARK-36976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36976: Assignee: Leona Yoda > Add max_by/min_by API to SparkR > --- > > Key: SPARK-36976 > URL: https://issues.apache.org/jira/browse/SPARK-36976 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Minor > > Related issues > - https://issues.apache.org/jira/browse/SPARK-27653 > * https://issues.apache.org/jira/browse/SPARK-36963 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36 Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Summary: ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records (was: ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than X records > - > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= 36 records > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. 
It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
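For anyone trying to reproduce this locally, below is a hedged variant of the reporter's script that swaps the S3 bucket for a local temp directory and generates the string rows programmatically; everything else follows the code quoted above, and on the affected versions it is expected to hit the same UnsupportedOperationException rather than skipping the second file.

{code:python}
import tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", True)

schema1 = StructType([
    StructField("program_sk", IntegerType(), True),
    StructField("client_sk", IntegerType(), True),
])
df1 = spark.createDataFrame([(1, 17)], schema1)

schema2 = StructType([
    StructField("program_sk", IntegerType(), True),
    StructField("client_sk", StringType(), True),
])
# 36 rows of repeated string values, mirroring the reporter's sample data so the
# Parquet column ends up dictionary-encoded.
rows = [(i, ["1", "3332", "19", ""][i % 4]) for i in range(36)]
df2 = spark.createDataFrame(rows, schema2)

path = tempfile.mkdtemp()  # local stand-in for the reporter's S3 path
df1.write.mode("overwrite").parquet(path)
df2.write.mode("append").parquet(path)

# Reporter's expectation: file 2 is ignored; observed behavior: the read fails.
spark.read.schema(schema1).parquet(path).show()
{code}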
[jira] [Updated] (SPARK-36993) Fix json_tuple throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-36993: --- Summary: Fix json_tuple throw NPE if fields exist no foldable null value (was: Fix json_tupe throw NPE if fields exist no foldable null value) > Fix json_tuple throw NPE if fields exist no foldable null value > --- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
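The failing query above can also be driven from PySpark; a minimal hedged repro follows, where the spark.sql wrapper is mine and the SQL is taken verbatim from the report.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# On the affected versions listed above (3.0.3, 3.1.2, 3.2.0, 3.3.0) this is
# expected to fail with the NullPointerException in JsonTuple.parseRow.
spark.sql(
    """SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a'))
       FROM (SELECT rand() AS c1)"""
).show()
{code}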
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, "3
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, "
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Summary: ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records (was: ignoreCorruptFiles does not work when schema change from int to string) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than 35 records > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= 36 records > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = self.spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://aduro-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = self.spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4,
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = self.spark.createDataFrame(sample_data, schema2) file_save_path = 's3://aduro-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3,
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = self.spark.createDataFrame(sample_data, schema2) file_save_path = 's3://aduro-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
> ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type o
[jira] [Commented] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428007#comment-17428007 ] Apache Spark commented on SPARK-36993: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34268 > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36993: Assignee: Apache Spark > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36993: Assignee: (was: Apache Spark) > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428006#comment-17428006 ] Apache Spark commented on SPARK-36993: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34268 > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Affects Version/s: 3.0.3 > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null field
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Summary: Fix json_tupe throw NPE if fields exist no foldable null field (was: Fix json_tupe throw NPE if fields exist no foldable null column) > Fix json_tupe throw NPE if fields exist no foldable null field > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Summary: Fix json_tupe throw NPE if fields exist no foldable null value (was: Fix json_tupe throw NPE if fields exist no foldable null field) > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null column
XiDuo You created SPARK-36993: - Summary: Fix json_tupe throw NPE if fields exist no foldable null column Key: SPARK-36993 URL: https://issues.apache.org/jira/browse/SPARK-36993 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.2.0, 3.3.0 Reporter: XiDuo You If json_tuple exists no foldable null field, Spark would throw NPE during eval field.toString. e.g. the query `SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS c1 );` will fail with: {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null column
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Description: If json_tuple exists no foldable null field, Spark would throw NPE during eval field.toString. e.g. the query will fail with: {code:java} SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS c1 ); {code} {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) {code} was: If json_tuple exists no foldable null field, Spark would throw NPE during eval field.toString. e.g. the query `SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS c1 );` will fail with: {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) {code} > Fix json_tupe throw NPE if fields exist no foldable null column > --- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. 
the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
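A minimal PySpark reproduction of the SPARK-36993 query above; only the spark.sql wrapper and session setup are added here, the SQL itself is taken from the report. On the affected versions this is expected to hit the NullPointerException shown in the description; after the fix the null field name should presumably produce a NULL output column instead.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# if(c1 < 1, null, 'a') is a non-foldable expression that evaluates to null,
# which is the case that trips json_tuple's field.toString call.
spark.sql(
    """SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a'))
       FROM (SELECT rand() AS c1)"""
).show()
{code}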
[jira] [Resolved] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
[ https://issues.apache.org/jira/browse/SPARK-36953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36953. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34219 [https://github.com/apache/spark/pull/34219] > Expose SQL state and error class in PySpark exceptions > -- > > Key: SPARK-36953 > URL: https://issues.apache.org/jira/browse/SPARK-36953 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > SPARK-34920 introduced error classes and states but they are not accessible in > PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
[ https://issues.apache.org/jira/browse/SPARK-36953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36953: Assignee: Hyukjin Kwon > Expose SQL state and error class in PySpark exceptions > -- > > Key: SPARK-36953 > URL: https://issues.apache.org/jira/browse/SPARK-36953 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > SPARK-34920 introduced error classes and states but they are not accessible in > PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
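A hypothetical usage sketch of what SPARK-36953 aims to enable from Python. The accessor names getErrorClass() and getSqlState() are assumptions mirroring the JVM-side SparkThrowable API; they are not spelled out in this ticket.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("SELECT * FROM table_that_does_not_exist")
except AnalysisException as exc:
    # Before this change only the message text is exposed to Python;
    # the accessors below are the assumed shape of the new API.
    print(exc.getErrorClass(), exc.getSqlState())
{code}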
[jira] [Resolved] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36794. - Resolution: Fixed Issue resolved by pull request 34247 [https://github.com/apache/spark/pull/34247] > Ignore duplicated join keys when building relation for SEMI/ANTI hash join > -- > > Key: SPARK-36794 > URL: https://issues.apache.org/jira/browse/SPARK-36794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > For LEFT SEMI and LEFT ANTI hash equi-join without extra join condition, we > only need to keep one row per unique join key(s) inside hash table > (`HashedRelation`) when building the hash table. This can help reduce the > size of hash table of join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-36794: Summary: Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash join (was: Ignore duplicated join keys when building relation for SEMI/ANTI hash join) > Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash > join > -- > > Key: SPARK-36794 > URL: https://issues.apache.org/jira/browse/SPARK-36794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > For LEFT SEMI and LEFT ANTI hash equi-join without extra join condition, we > only need to keep one row per unique join key(s) inside hash table > (`HashedRelation`) when building the hash table. This can help reduce the > size of hash table of join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
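An illustrative query shape for the SPARK-36794 optimization, not code from the patch: a LEFT SEMI equi-join with no extra condition, steered to a shuffled hash join via the SHUFFLE_HASH hint. Per the description, the build-side hash table for such a plan only needs one row per distinct key, so duplicated keys on the build side no longer inflate it. The data and column names below are made up for illustration.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(0, 1000000).withColumnRenamed("id", "k")
# Heavily duplicated join keys on the build side.
right = spark.range(0, 1000000).selectExpr("id % 100 AS k")

result = left.join(right.hint("SHUFFLE_HASH"), on="k", how="left_semi")
result.explain()  # look for ShuffledHashJoin ... LeftSemi in the physical plan
result.count()
{code}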
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36900: Assignee: Apache Spark > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36954) Fast fail with explicit err msg when calling withWatermark on non-streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-36954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangtengfei resolved SPARK-36954. -- Resolution: Not A Problem > Fast fail with explicit err msg when calling withWatermark on non-streaming > dataset > --- > > Key: SPARK-36954 > URL: https://issues.apache.org/jira/browse/SPARK-36954 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.2 >Reporter: huangtengfei >Priority: Minor > > [Dataset.withWatermark|https://github.com/apache/spark/blob/v3.2.0-rc7/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L740] > is a function specific to Structured Streaming. > Currently it can also be triggered on a batch dataset, where a dedicated rule eliminates it in the analysis phase. A user can call this API and nothing > happens, which may be a little confusing. > If that usage is not expected, maybe we can just fail fast with an explicit > message, and then we also do not have to keep an extra rule to do this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
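For context, a small sketch of the behaviour discussed in SPARK-36954: calling withWatermark on a batch DataFrame currently succeeds and silently has no effect, which is what the proposal would turn into an explicit failure. The column name and watermark interval below are arbitrary.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

batch_df = spark.range(10).withColumn("ts", F.current_timestamp())
# Runs today, but the watermark has no effect on a non-streaming Dataset.
batch_df.withWatermark("ts", "10 minutes").show()
{code}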
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36900: Assignee: (was: Apache Spark) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-36900: -- Assignee: (was: Sean R. Owen) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427992#comment-17427992 ] Hyukjin Kwon commented on SPARK-36900: -- Reverted in: https://github.com/apache/spark/commit/4b86fe4c71559df12ab8a1ebcf5662c4cf87ca7f (branch-3.2) https://github.com/apache/spark/commit/6ed13147c99b2f652748b716c70dd1937230cafd (master) https://github.com/apache/spark/commit/6e8cd3b1a7489c9b0c5779559e45b3cd5decc1ea (master) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36900: - Fix Version/s: (was: 3.2.1) (was: 3.3.0) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Sean R. Owen >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
[ https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427981#comment-17427981 ] Apache Spark commented on SPARK-36992: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34267 > Improve byte array sort perf by unify getPrefix function of UTF8String and > ByteArray > > > Key: SPARK-36992 > URL: https://issues.apache.org/jira/browse/SPARK-36992 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > When execute sort operator, we first compare the prefix. However the > getPrefix function of byte array is slow. We use first 8 bytes as the prefix, > so at most we will call 8 times with `Platform.getByte` which is slower than > call once with `Platform.getInt` or `Platform.getLong`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
[ https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36992: Assignee: (was: Apache Spark) > Improve byte array sort perf by unify getPrefix function of UTF8String and > ByteArray > > > Key: SPARK-36992 > URL: https://issues.apache.org/jira/browse/SPARK-36992 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > When execute sort operator, we first compare the prefix. However the > getPrefix function of byte array is slow. We use first 8 bytes as the prefix, > so at most we will call 8 times with `Platform.getByte` which is slower than > call once with `Platform.getInt` or `Platform.getLong`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
[ https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36992: Assignee: Apache Spark > Improve byte array sort perf by unify getPrefix function of UTF8String and > ByteArray > > > Key: SPARK-36992 > URL: https://issues.apache.org/jira/browse/SPARK-36992 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > When execute sort operator, we first compare the prefix. However the > getPrefix function of byte array is slow. We use first 8 bytes as the prefix, > so at most we will call 8 times with `Platform.getByte` which is slower than > call once with `Platform.getInt` or `Platform.getLong`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
XiDuo You created SPARK-36992: - Summary: Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray Key: SPARK-36992 URL: https://issues.apache.org/jira/browse/SPARK-36992 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You When executing the sort operator, we first compare the prefix. However, the getPrefix function for byte arrays is slow. We use the first 8 bytes as the prefix, so in the worst case we call `Platform.getByte` 8 times, which is slower than calling `Platform.getInt` or `Platform.getLong` once. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
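A conceptual sketch of the prefix idea only, not Spark's implementation: the sort prefix is the first 8 bytes of the array compared as one unsigned big-endian integer, so a single wide read (cf. Platform.getLong) can replace up to eight single-byte reads.
{code:python}
def sort_prefix(data: bytes) -> int:
    # Zero-pad arrays shorter than 8 bytes, then read all 8 bytes at once.
    head = data[:8].ljust(8, b"\x00")
    return int.from_bytes(head, "big")

# Prefix order agrees with unsigned byte-wise order on the first 8 bytes.
assert sort_prefix(b"ab") < sort_prefix(b"abc") < sort_prefix(b"abd")
{code}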
[jira] [Commented] (SPARK-36971) Query files directly with SQL is broken (with Glue)
[ https://issues.apache.org/jira/browse/SPARK-36971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427976#comment-17427976 ] Hyukjin Kwon commented on SPARK-36971: -- I suggest you do contact AWS or Databricks to follow up the issue. Databricks or AWS aren't Apache Spark. > Query files directly with SQL is broken (with Glue) > --- > > Key: SPARK-36971 > URL: https://issues.apache.org/jira/browse/SPARK-36971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Databricks Runtime 9.1 and 10.0 Beta >Reporter: Lauri Koobas >Priority: Major > > This is broken in DBR 9.1 (and 10.0 Beta): > {{ select * from json.`filename`}} > [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html] > I have tried with JSON and Parquet files. > The error: > {color:#FF}{{Error in SQL statement: SparkException: Unable to fetch > tables of db json}}{color} > Down in the stack trace this also exists: > {{{color:#FF}Caused by: NoSuchObjectException(message:Database json not > found. (Service: AWSGlue; Status Code: 400; Error Code: > EntityNotFoundException; ... )){color}}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36971) Query files directly with SQL is broken (with Glue)
[ https://issues.apache.org/jira/browse/SPARK-36971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36971. -- Resolution: Invalid > Query files directly with SQL is broken (with Glue) > --- > > Key: SPARK-36971 > URL: https://issues.apache.org/jira/browse/SPARK-36971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Databricks Runtime 9.1 and 10.0 Beta >Reporter: Lauri Koobas >Priority: Major > > This is broken in DBR 9.1 (and 10.0 Beta): > {{ select * from json.`filename`}} > [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html] > I have tried with JSON and Parquet files. > The error: > {color:#FF}{{Error in SQL statement: SparkException: Unable to fetch > tables of db json}}{color} > Down in the stack trace this also exists: > {{{color:#FF}Caused by: NoSuchObjectException(message:Database json not > found. (Service: AWSGlue; Status Code: 400; Error Code: > EntityNotFoundException; ... )){color}}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
> ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. 
> Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
> {code:java}
> spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)
> folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
> file1_path = f'{folder_path}/file1.parquet'
> file1_schema = spark.read.parquet(file1_path).schema
> file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
> file_all_df.show(n=10)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. 
> Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
> {code:java}
> spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)
> folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
> file1_path = f'{folder_path}/file1.parquet'
> file1_schema = spark.read.parquet(file1_path).schema
> file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
> file_all_df.show(n=10)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36985. -- Fix Version/s: 3.3.0 Assignee: Takuya Ueshin Resolution: Fixed Fixed in https://github.com/apache/spark/pull/34266 > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 3.3.0 > > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Attachment: file2.parquet file1.parquet > ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427970#comment-17427970 ] Hyukjin Kwon commented on SPARK-36989: -- Adding mypy tests would be super awesome! > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
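As a rough sketch of what such a data test looks like (the expected messages below are illustrative; the exact wording depends on the mypy version and on the actual pyspark annotations), the test feeds a snippet to mypy and asserts on the reported types and errors instead of executing the code:
{code:python}
# Sketch of a type-hint "data test": this snippet is meant to be checked by mypy,
# not executed (reveal_type is a mypy-only construct).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

reveal_type(df)  # expected note, roughly: Revealed type is "pyspark.sql.dataframe.DataFrame"
df.join(42)      # expected error: an int is not an acceptable `other` argument
{code}
An output matcher then compares mypy's actual messages against these expectations, which is why signature changes or mypy output changes surface as test failures.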
[jira] [Assigned] (SPARK-36961) Use PEP526 style variable type hints
[ https://issues.apache.org/jira/browse/SPARK-36961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36961: Assignee: Takuya Ueshin > Use PEP526 style variable type hints > > > Key: SPARK-36961 > URL: https://issues.apache.org/jira/browse/SPARK-36961 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > Now that we have started using newer Python syntax in the code base, we should > use PEP526 style variable type hints. > https://www.python.org/dev/peps/pep-0526/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36961) Use PEP526 style variable type hints
[ https://issues.apache.org/jira/browse/SPARK-36961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36961. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34227 [https://github.com/apache/spark/pull/34227] > Use PEP526 style variable type hints > > > Key: SPARK-36961 > URL: https://issues.apache.org/jira/browse/SPARK-36961 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 3.3.0 > > > Now that we have started using newer Python syntax in the code base, we should > use PEP526 style variable type hints. > https://www.python.org/dev/peps/pep-0526/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
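For readers unfamiliar with the reference above, a small illustrative sketch (variable names are made up): PEP 526 replaces type-comment annotations on variables with inline annotation syntax.
{code:python}
from typing import Dict, List

# Pre-PEP 526 style: type comments
counts = {}  # type: Dict[str, int]

# PEP 526 style: inline variable annotations
totals: Dict[str, int] = {}
names: List[str] = []
{code}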
[jira] [Resolved] (SPARK-36981) Upgrade joda-time to 2.10.12
[ https://issues.apache.org/jira/browse/SPARK-36981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-36981. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34253 > Upgrade joda-time to 2.10.12 > > > Key: SPARK-36981 > URL: https://issues.apache.org/jira/browse/SPARK-36981 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > joda-time 2.10.12 seems to support an updated TZDB. > https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12 > https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427966#comment-17427966 ] Apache Spark commented on SPARK-36985: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/34266 > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36985: Assignee: (was: Apache Spark) > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36985: Assignee: Apache Spark > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427967#comment-17427967 ] Apache Spark commented on SPARK-36985: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/34266 > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427958#comment-17427958 ] Apache Spark commented on SPARK-23626: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/34265 > DAGScheduler blocked due to JobSubmitted event > --- > > Key: SPARK-23626 > URL: https://issues.apache.org/jira/browse/SPARK-23626 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.1, 2.3.3, 2.4.3, 3.0.0 >Reporter: Ajith S >Priority: Major > > DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted > events have to be processed, because DAGSchedulerEventProcessLoop is single threaded > and blocks other events in the queue, such as TaskCompletion. > The JobSubmitted event is time consuming depending on the nature of the job > (for example: calculating parent stage dependencies, shuffle dependencies, > partitions) and thus it blocks all subsequent events from being processed. > > I see multiple JIRAs referring to this behavior: > https://issues.apache.org/jira/browse/SPARK-2647 > https://issues.apache.org/jira/browse/SPARK-4961 > > Similarly, in my cluster the partition calculation of some jobs is time consuming > (similar to the stack in SPARK-2647), which slows down the > DAGSchedulerEventProcessLoop and causes user jobs to slow down, even if > their tasks finish within seconds, because TaskCompletion events are processed > at a slower rate due to the blockage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
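To make the bottleneck concrete, here is a toy sketch in plain Python (not Spark code): with a single consumer, one slow JobSubmitted handler delays every event queued behind it, including otherwise cheap TaskCompletion events.
{code:python}
import queue
import time

events: "queue.Queue[str]" = queue.Queue()

def handle(event: str) -> None:
    if event == "JobSubmitted":
        time.sleep(2.0)  # stand-in for expensive stage/dependency/partition computation
    print(f"handled {event} at +{time.time() - start:.1f}s")

for e in ["JobSubmitted", "TaskCompletion", "TaskCompletion"]:
    events.put(e)

start = time.time()
while not events.empty():
    handle(events.get())  # single consumer, analogous to DAGSchedulerEventProcessLoop
{code}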
[jira] [Assigned] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules
[ https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36979: - Assignee: XiDuo You > Add RewriteLateralSubquery rule into nonExcludableRules > --- > > Key: SPARK-36979 > URL: https://issues.apache.org/jira/browse/SPARK-36979 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.2.0 > > > Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if > we set > `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`, > the lateral join query will fail with: > {code:java} > java.lang.AssertionError: assertion failed: No plan for LateralJoin > lateral-subquery#218 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules
[ https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36979: -- Issue Type: Bug (was: Improvement) > Add RewriteLateralSubquery rule into nonExcludableRules > --- > > Key: SPARK-36979 > URL: https://issues.apache.org/jira/browse/SPARK-36979 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.2.0 > > > Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if > we set > `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`, > the lateral join query will fail with: > {code:java} > java.lang.AssertionError: assertion failed: No plan for LateralJoin > lateral-subquery#218 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules
[ https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36979. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 34260 [https://github.com/apache/spark/pull/34260] > Add RewriteLateralSubquery rule into nonExcludableRules > --- > > Key: SPARK-36979 > URL: https://issues.apache.org/jira/browse/SPARK-36979 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Minor > Fix For: 3.2.0 > > > Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if > we set > `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`, > the lateral join query will fail with: > {code:java} > java.lang.AssertionError: assertion failed: No plan for LateralJoin > lateral-subquery#218 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
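A hedged repro sketch of the pre-fix behavior (view names are made up; LATERAL subqueries require Spark 3.2+): with the rule excluded, planning the lateral join is expected to fail with the "No plan for LateralJoin" assertion. After this fix the rule is non-excludable, so the setting below simply has no effect.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set(
    "spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery",
)

spark.range(5).createOrReplaceTempView("t1")
spark.range(5).createOrReplaceTempView("t2")

# Correlated LATERAL subquery; without RewriteLateralSubquery the physical planner
# has no strategy for the LateralJoin node.
spark.sql(
    "SELECT * FROM t1, LATERAL (SELECT t2.id AS id2 FROM t2 WHERE t2.id = t1.id)"
).show()
{code}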
[jira] [Commented] (SPARK-36991) Inline type hints for spark/python/pyspark/sql/streaming.py
[ https://issues.apache.org/jira/browse/SPARK-36991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427922#comment-17427922 ] Xinrong Meng commented on SPARK-36991: -- I am working on this. > Inline type hints for spark/python/pyspark/sql/streaming.py > --- > > Key: SPARK-36991 > URL: https://issues.apache.org/jira/browse/SPARK-36991 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Inline type hints for spark/python/pyspark/sql/streaming.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36991) Inline type hints for spark/python/pyspark/sql/streaming.py
Xinrong Meng created SPARK-36991: Summary: Inline type hints for spark/python/pyspark/sql/streaming.py Key: SPARK-36991 URL: https://issues.apache.org/jira/browse/SPARK-36991 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng Inline type hints for spark/python/pyspark/sql/streaming.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
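A brief sketch of what inlining means here (the signature below is simplified and hypothetical, not the real pyspark code): annotations that previously lived in a separate .pyi stub move into the .py source, so the type checker also sees the function bodies.
{code:python}
from typing import Optional

# Before (stub file streaming.pyi):
#     class DataStreamWriter:
#         def trigger(self, *, processingTime: Optional[str] = ...) -> "DataStreamWriter": ...

# After (annotations written inline in streaming.py):
class DataStreamWriter:
    def trigger(self, *, processingTime: Optional[str] = None) -> "DataStreamWriter":
        # the body is now type-checked as well
        self._processing_time = processingTime
        return self
{code}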
[jira] [Updated] (SPARK-36990) Long columns cannot read columns with INT32 type in the parquet file
[ https://issues.apache.org/jira/browse/SPARK-36990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Catalin Toda updated SPARK-36990: - Description: The code below does not work on both Spark 3.1 and Spark 3.2. Part of the issue is the fact that the fileSchema has logicalTypeAnnotation == null ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] which makes isUnsignedTypeMatched return false always: [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] I am not sure even if logicalTypeAnnotation would not be null if isUnsignedTypeMatched is supposed to return true for this use case. Python repro: {code:java} import os from pyspark.sql.functions import * from pyspark.sql import SparkSession from pyspark.sql.types import * spark = SparkSession.builder \ .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \ .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A") \ .getOrCreate() df = spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") df.write.mode("overwrite").parquet("s3://bucket/test") df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") df.show(1, False) {code} was: The code above does not work on both Spark 3.1 and Spark 3.2. Part of the issue is the fact that the fileSchema has logicalTypeAnnotation == null ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] which makes isUnsignedTypeMatched return false always: [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] I am not sure even if logicalTypeAnnotation would not be null if isUnsignedTypeMatched is supposed to return true for this use case. > Long columns cannot read columns with INT32 type in the parquet file > > > Key: SPARK-36990 > URL: https://issues.apache.org/jira/browse/SPARK-36990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Catalin Toda >Priority: Major > > The code below does not work on both Spark 3.1 and Spark 3.2. > Part of the issue is the fact that the fileSchema has logicalTypeAnnotation > == null > ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] > which makes isUnsignedTypeMatched return false always: > [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] > > I am not sure even if logicalTypeAnnotation would not be null if > isUnsignedTypeMatched is supposed to return true for this use case. 
> Python repro: > {code:java} > import os > from pyspark.sql.functions import * > from pyspark.sql import SparkSession > from pyspark.sql.types import * > spark = SparkSession.builder \ > .config("spark.hadoop.fs.s3.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") \ > .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", > "org.apache.hadoop.fs.s3a.S3A") \ > .getOrCreate() > df = > spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") > df.write.mode("overwrite").parquet("s3://bucket/test") > df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") > df.show(1, False) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
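The repro above assumes an S3 bucket and the S3A configuration; a local-filesystem variant (the path below is arbitrary) should exercise the same read path:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, LongType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 2), (2, 3)],
    StructType([StructField("id", IntegerType(), True),
                StructField("id2", IntegerType(), True)]),
).select("id")
df.write.mode("overwrite").parquet("/tmp/spark-36990")

# Reading INT32 parquet data back with a LongType read schema hits the
# isUnsignedTypeMatched path described above.
spark.read.schema(StructType([StructField("id", LongType(), True)])) \
    .parquet("/tmp/spark-36990").show(1, False)
{code}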
[jira] [Updated] (SPARK-36990) Long columns cannot read columns with INT32 type in the parquet file
[ https://issues.apache.org/jira/browse/SPARK-36990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Catalin Toda updated SPARK-36990: - Environment: (was: Python repro: {code:java} import os from pyspark.sql.functions import * from pyspark.sql import SparkSession from pyspark.sql.types import * spark = SparkSession.builder \ .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \ .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A") \ .getOrCreate() df = spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") df.write.mode("overwrite").parquet("s3://bucket/test") df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") df.show(1, False) {code}) > Long columns cannot read columns with INT32 type in the parquet file > > > Key: SPARK-36990 > URL: https://issues.apache.org/jira/browse/SPARK-36990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Catalin Toda >Priority: Major > > The code above does not work on both Spark 3.1 and Spark 3.2. > Part of the issue is the fact that the fileSchema has logicalTypeAnnotation > == null > ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] > which makes isUnsignedTypeMatched return false always: > [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] > > I am not sure even if logicalTypeAnnotation would not be null if > isUnsignedTypeMatched is supposed to return true for this use case. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36951) Inline type hints for python/pyspark/sql/column.py
[ https://issues.apache.org/jira/browse/SPARK-36951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36951. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34226 https://github.com/apache/spark/pull/34226 > Inline type hints for python/pyspark/sql/column.py > -- > > Key: SPARK-36951 > URL: https://issues.apache.org/jira/browse/SPARK-36951 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/column.py for type check of function > bodies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36990) Long columns cannot read columns with INT32 type in the parquet file
Catalin Toda created SPARK-36990: Summary: Long columns cannot read columns with INT32 type in the parquet file Key: SPARK-36990 URL: https://issues.apache.org/jira/browse/SPARK-36990 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.2.0 Environment: Python repro: {code:java} import os from pyspark.sql.functions import * from pyspark.sql import SparkSession from pyspark.sql.types import * spark = SparkSession.builder \ .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \ .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A") \ .getOrCreate() df = spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") df.write.mode("overwrite").parquet("s3://bucket/test") df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") df.show(1, False) {code} Reporter: Catalin Toda The code above does not work on both Spark 3.1 and Spark 3.2. Part of the issue is the fact that the fileSchema has logicalTypeAnnotation == null ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] which makes isUnsignedTypeMatched return false always: [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] I am not sure even if logicalTypeAnnotation would not be null if isUnsignedTypeMatched is supposed to return true for this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-36989: --- Description: Before the migration, {{pyspark-stubs}} contained a set of [data tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], modeled after, and using internal test utilities, of mypy. These were omitted during the migration for a few reasons: * Simplicity. * Relative slowness. * Dependence on non public API. Data tests are useful for a number of reasons: * Improve test coverage for type hints. * Checking if type checkers infer expected types. * Checking if type checkers reject incorrect code. * Detecting unusual errors with code that otherwise type checks, Especially, the last two functions are not fulfilled by simple validation of existing codebase. Data tests are not required for all annotations and can be restricted to code that has high possibility of failure: * Complex overloaded signatures. * Complex generics. * Generic {{self}} annotations * Code containing {{type: ignore}} The biggest risk, is that output matchers have to be updated when signature changes and / or mypy output changes. Example of problem detected with data tests can be found in SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]). was: Before the migration, {{pyspark-stubs}} contained a set of data tests, modeled after, and using internal test utilities, of mypy. These were omitted during the migration for a few reasons: * Simplicity. * Relative slowness. * Dependence on non public API. Data tests are useful for a number of reasons: * Improve test coverage for type hints. * Checking if type checkers infer expected types. * Checking if type checkers reject incorrect code. * Detecting unusual errors with code that otherwise type checks, Especially, the last two functions are not fulfilled by simple validation of existing codebase. Data tests are not required for all annotations and can be restricted to code that has high possibility of failure: * Complex overloaded signatures. * Complex generics. * Generic {{self}} annotations * Code containing {{type: ignore}} The biggest risk, is that output matchers have to be updated when signature changes and / or mypy output changes. Example of problem detected with data tests can be found in SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]). > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. 
> * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427889#comment-17427889 ] Maciej Szymkiewicz edited comment on SPARK-36989 at 10/12/21, 7:23 PM: --- FYI [~hyukjin.kwon], [~ueshin], [~XinrongM] was (Author: zero323): FYI [~hyukjin.kwon] [~XinrongM] [~ueshin] > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of data tests, > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427889#comment-17427889 ] Maciej Szymkiewicz commented on SPARK-36989: FYI [~hyukjin.kwon] [~XinrongM] [~ueshin] > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of data tests, > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427888#comment-17427888 ] Maciej Szymkiewicz commented on SPARK-36989: Currently I am working on [some fixes|https://github.com/typeddjango/pytest-mypy-plugins/commits?author=zero323] to [typeddjango/pytest-mypy-plugins|https://github.com/typeddjango/pytest-mypy-plugins] and I hope it will allow us to bring data test to Spark, without depending on internal mypy testing suite (which, adding to being internal, requires rather specific project layout). > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of data tests, > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36462: Assignee: (was: Apache Spark) > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36462: Assignee: Apache Spark > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427886#comment-17427886 ] Apache Spark commented on SPARK-36462: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/34264 > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427884#comment-17427884 ] Apache Spark commented on SPARK-36462: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/34264 > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36989) Migrate type hint data tests
Maciej Szymkiewicz created SPARK-36989: -- Summary: Migrate type hint data tests Key: SPARK-36989 URL: https://issues.apache.org/jira/browse/SPARK-36989 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz Before the migration, {{pyspark-stubs}} contained a set of data tests, modeled after, and using internal test utilities, of mypy. These were omitted during the migration for a few reasons: * Simplicity. * Relative slowness. * Dependence on non public API. Data tests are useful for a number of reasons: * Improve test coverage for type hints. * Checking if type checkers infer expected types. * Checking if type checkers reject incorrect code. * Detecting unusual errors with code that otherwise type checks, Especially, the last two functions are not fulfilled by simple validation of existing codebase. Data tests are not required for all annotations and can be restricted to code that has high possibility of failure: * Complex overloaded signatures. * Complex generics. * Generic {{self}} annotations * Code containing {{type: ignore}} The biggest risk, is that output matchers have to be updated when signature changes and / or mypy output changes. Example of problem detected with data tests can be found in SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427852#comment-17427852 ] Apache Spark commented on SPARK-36978: -- User 'utkarsh39' has created a pull request for this issue: https://github.com/apache/spark/pull/34263 > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
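To observe the current behavior, a small hedged sketch (column names are made up; the exact plan text varies by Spark version) that filters on a nested field and prints the optimized plan:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([((1, 2),), ((3, 4),)], "structCol struct<a:int,b:int>")

# With the conservative rule, the optimized plan is expected to contain
# isnotnull(structCol) rather than isnotnull(structCol.b), which can prevent
# nested-column pruning of structCol.
df.filter(col("structCol.b") > 0).explain(True)
{code}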
[jira] [Assigned] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36978: Assignee: Apache Spark > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Assignee: Apache Spark >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36978: Assignee: (was: Apache Spark) > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427851#comment-17427851 ] Apache Spark commented on SPARK-36978: -- User 'utkarsh39' has created a pull request for this issue: https://github.com/apache/spark/pull/34263 > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Utkarsh Agarwal updated SPARK-36978: Description: [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] optimization rule generates {{IsNotNull}} constraints corresponding to null intolerant predicates. The {{IsNotNull}} constraints are generated on the attribute inside the corresponding predicate. e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int column {{structCol.b}} where {{structCol}} is a struct column results in a constraint {{IsNotNull(structCol)}}. This generation of constraints on the root level nested type is extremely conservative as it could lead to materialization of the the entire struct. The constraint should instead be generated on the nested field being referenced by the predicate. In the above example, the constraint should be {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} The new constraints also create opportunities for nested pruning. Currently {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. However the constraint {{IsNotNull(structCol.b)}} could create opportunities to prune {{structCol}}. was: [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] optimization rule generates {{IsNotNull}} constraints corresponding to null intolerant predicates. The {{IsNotNull}} constraints are generated on the attribute inside the corresponding predicate. e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int column {{structCol.b}} where {{structCol}} is a struct column results in a constraint {{IsNotNull(structCol)}}. This generation of constraints on the root level nested type is extremely conservative as it could lead to materialization of the the entire struct. The constraint should instead be generated on the nested field being referenced by the predicate. In the above example, the constraint should be {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. 
> This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}. > > The new constraints also create opportunities for nested pruning. Currently the > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However, the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
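To make the proposal concrete, here is a minimal sketch (hypothetical column names) that filters on a nested field and prints the optimized plan so the inferred {{IsNotNull}} constraint can be inspected. With the current rule the plan should contain {{isnotnull(structCol)}}; with the proposed change it would instead constrain {{structCol.b}} only.
{code:scala}
// Minimal sketch with made-up column names: build a struct column, filter on a
// nested field, and inspect which IsNotNull constraint the optimizer injected.
import org.apache.spark.sql.functions.{lit, struct}
import spark.implicits._

val df = spark.range(5)
  .select(struct($"id".as("b"), lit("x").as("c")).as("structCol"))

val q = df.filter($"structCol.b" > 0)

// Expect isnotnull(structCol) today; the proposal is isnotnull(structCol.b),
// which keeps nested-column pruning of structCol possible.
println(q.queryExecution.optimizedPlan.numberedTreeString)
{code}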
[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns
[ https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427825#comment-17427825 ] Shardul Mahadik commented on SPARK-36877: - {quote} Getting RDD means the physical plan is finalized. With AQE, finalizing the physical plan means running all the query stages except for the last stage.{quote} Ack! Makes sense. {quote}> shouldn't it reuse the result from previous stages? One DataFrame means one query, and today Spark can't reuse shuffle/broadcast/subquery across queries.{quote} But isn't this the same DF. I am calling {{df.rdd}} and then {{df.write}} where {{df}} is the same. So it is not across queries. > Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing > reruns > -- > > Key: SPARK-36877 > URL: https://issues.apache.org/jira/browse/SPARK-36877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1 >Reporter: Shardul Mahadik >Priority: Major > Attachments: Screen Shot 2021-09-28 at 09.32.20.png > > > In one of our jobs we perform the following operation: > {code:scala} > val df = /* some expensive multi-table/multi-stage join */ > val numPartitions = df.rdd.getNumPartitions > df.repartition(x).write. > {code} > With AQE enabled, we found that the expensive stages were being run twice > causing significant performance regression after enabling AQE; once when > calling {{df.rdd}} and again when calling {{df.write}}. > A more concrete example: > {code:scala} > scala> sql("SET spark.sql.adaptive.enabled=true") > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1") > res1: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> val df1 = spark.range(10).withColumn("id2", $"id") > df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), > "id").join(spark.range(10), "id") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df3 = df2.groupBy("id2").count() > df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint] > scala> df3.rdd.getNumPartitions > res2: Int = 10(0 + 16) / > 16] > scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1") > {code} > In the screenshot below, you can see that the first 3 stages (0 to 4) were > rerun again (5 to 9). > I have two questions: > 1) Should calling df.rdd trigger actual job execution when AQE is enabled? > 2) Should calling df.write later cause rerun of the stages? If df.rdd has > already partially executed the stages, shouldn't it reuse the result from > previous stages? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
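Until this is addressed, a sketch of a user-level mitigation (not the fix discussed in this ticket) is to persist the expensive DataFrame so the stages executed while finalizing the plan for {{df.rdd}} are not recomputed by the later write:
{code:scala}
// Workaround sketch: materialize the expensive result once, so that df.rdd and
// the subsequent write both read from the cache instead of re-running the joins.
import spark.implicits._

val expensive = spark.range(10).withColumn("id2", $"id")
  .join(spark.range(10), "id")
  .groupBy("id2").count()            // stand-in for the expensive multi-stage plan

expensive.persist()
expensive.count()                    // one full run populates the cache

val numPartitions = expensive.rdd.getNumPartitions   // scans the cached data
expensive.repartition(5).write.mode("overwrite").orc("/tmp/orc1")

expensive.unpersist()
{code}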
[jira] [Resolved] (SPARK-36970) Manual disabled format `B` for `date_format` function to compatibility with Java 8 behavior.
[ https://issues.apache.org/jira/browse/SPARK-36970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36970. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34237 [https://github.com/apache/spark/pull/34237] > Manual disabled format `B` for `date_format` function to compatibility with > Java 8 behavior. > > > Key: SPARK-36970 > URL: https://issues.apache.org/jira/browse/SPARK-36970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > The `date_format` function has some behavioral differences when using JDK 8 > and JDK 17 as following: > the result of {{select date_format('2018-11-17 13:33:33.333', 'B')}} in > {{datetime-formatting-invalid.sql}} with Java 8 is: > {code:java} > -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct<> > -- !query output > java.lang.IllegalArgumentException > Unknown pattern letter: B > {code} > and with Java 17 the result is: > {code:java} > - datetime-formatting-invalid.sql *** FAILED *** > datetime-formatting-invalid.sql > Expected "struct<[]>", but got "struct<[date_format(2018-11-17 > 13:33:33.333, B):string]>" Schema did not match for query #34 > select date_format('2018-11-17 13:33:33.333', 'B'): -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct > -- !query output > in the afternoon (SQLQueryTestSuite.scala:469) > {code} > > From the javadoc we can find that 'B' is used to represent `{{Pattern letters > to output a day period`}} in Java 17. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
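The underlying difference can be reproduced with {{java.time}} directly, independent of Spark, since {{date_format}} delegates to {{DateTimeFormatter}}; a small sketch (the exact output depends on the JDK version and default locale):
{code:scala}
// Pattern letter 'B' is rejected by java.time on Java 8 but formats a "day period"
// on Java 16+; this is what makes the SQL test output differ between JDKs.
import java.time.LocalTime
import java.time.format.DateTimeFormatter

try {
  val fmt = DateTimeFormatter.ofPattern("B")      // IllegalArgumentException on Java 8
  println(LocalTime.of(13, 33, 33).format(fmt))   // e.g. "in the afternoon" on Java 17
} catch {
  case e: IllegalArgumentException => println(s"Rejected on this JDK: ${e.getMessage}")
}
{code}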
[jira] [Assigned] (SPARK-36970) Manual disabled format `B` for `date_format` function to compatibility with Java 8 behavior.
[ https://issues.apache.org/jira/browse/SPARK-36970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-36970: Assignee: Yang Jie > Manual disabled format `B` for `date_format` function to compatibility with > Java 8 behavior. > > > Key: SPARK-36970 > URL: https://issues.apache.org/jira/browse/SPARK-36970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > The `date_format` function has some behavioral differences when using JDK 8 > and JDK 17 as following: > the result of {{select date_format('2018-11-17 13:33:33.333', 'B')}} in > {{datetime-formatting-invalid.sql}} with Java 8 is: > {code:java} > -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct<> > -- !query output > java.lang.IllegalArgumentException > Unknown pattern letter: B > {code} > and with Java 17 the result is: > {code:java} > - datetime-formatting-invalid.sql *** FAILED *** > datetime-formatting-invalid.sql > Expected "struct<[]>", but got "struct<[date_format(2018-11-17 > 13:33:33.333, B):string]>" Schema did not match for query #34 > select date_format('2018-11-17 13:33:33.333', 'B'): -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct > -- !query output > in the afternoon (SQLQueryTestSuite.scala:469) > {code} > > From the javadoc we can find that 'B' is used to represent `{{Pattern letters > to output a day period`}} in Java 17. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What ciphers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Summary: What ciphers spark support for internode communication? (was: What chipers spark support for internode communication?) > What ciphers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mentions this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, > but says nothing about the ciphers.}} > {{Does it support everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
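For what it is worth, {{spark.network.crypto.config.*}} only tunes commons-crypto; whether RPC encryption happens at all is governed by {{spark.network.crypto.enabled}}, so leaving the {{config.*}} keys unset means commons-crypto falls back to its own defaults rather than disabling encryption. A hedged sketch of such a configuration (the keys follow the commons-crypto property names per the quoted doc text, and the values are illustrative, not Spark defaults):
{code:scala}
// Hedged sketch (illustrative values, not Spark defaults): AES-based RPC encryption
// with commons-crypto settings passed through the spark.network.crypto.config.* prefix.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.authenticate", "true")                // AES-based RPC encryption requires auth
  .set("spark.network.crypto.enabled", "true")
  // forwarded to commons-crypto as commons.crypto.cipher.transformation
  .set("spark.network.crypto.config.cipher.transformation", "AES/CTR/NoPadding")
  // forwarded as commons.crypto.cipher.classes: JceCipher (JVM ciphers) or OpenSslCipher
  .set("spark.network.crypto.config.cipher.classes",
       "org.apache.commons.crypto.cipher.JceCipher")
{code}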
[jira] [Assigned] (SPARK-36867) Misleading Error Message with Invalid Column and Group By
[ https://issues.apache.org/jira/browse/SPARK-36867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36867: --- Assignee: Wenchen Fan > Misleading Error Message with Invalid Column and Group By > - > > Key: SPARK-36867 > URL: https://issues.apache.org/jira/browse/SPARK-36867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Alan Jackoway >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.0 > > > When you run a query with an invalid column that also does a group by on a > constructed column, the error message you get back references a missing > column for the group by rather than the invalid column. > You can reproduce this in pyspark in 3.1.2 with the following code: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Group By Issue").getOrCreate() > data = spark.createDataFrame( > [("2021-09-15", 1), ("2021-09-16", 2), ("2021-09-17", 10), ("2021-09-18", > 25), ("2021-09-19", 500), ("2021-09-20", 50), ("2021-09-21", 100)], > schema=["d", "v"] > ) > data.createOrReplaceTempView("data") > # This is valid > spark.sql("select sum(v) as value, date(date_trunc('week', d)) as week from > data group by week").show() > # This is invalid because val is the wrong variable > spark.sql("select sum(val) as value, date(date_trunc('week', d)) as week from > data group by week").show() > {code} > The error message for the second spark.sql line is > {quote} > pyspark.sql.utils.AnalysisException: cannot resolve '`week`' given input > columns: [data.d, data.v]; line 1 pos 81; > 'Aggregate ['week], ['sum('val) AS value#21, cast(date_trunc(week, cast(d#0 > as timestamp), Some(America/New_York)) as date) AS week#22] > +- SubqueryAlias data >+- LogicalRDD [d#0, v#1L], false > {quote} > but the actual problem is that I used the wrong variable name in a different > part of the query. Nothing is wrong with {{week}} in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
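As a side note, once the misspelled column is corrected the alias in the GROUP BY resolves fine; a quick check against the same temp view (shown here via {{spark.sql}}, which behaves the same from Scala or Python):
{code:scala}
// The alias-in-GROUP-BY form is valid; only the misspelled column (val -> v) was wrong.
spark.sql(
  """select sum(v) as value, date(date_trunc('week', d)) as week
    |from data group by week""".stripMargin).show()
{code}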
[jira] [Resolved] (SPARK-36867) Misleading Error Message with Invalid Column and Group By
[ https://issues.apache.org/jira/browse/SPARK-36867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36867. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34244 [https://github.com/apache/spark/pull/34244] > Misleading Error Message with Invalid Column and Group By > - > > Key: SPARK-36867 > URL: https://issues.apache.org/jira/browse/SPARK-36867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Alan Jackoway >Priority: Major > Fix For: 3.3.0 > > > When you run a query with an invalid column that also does a group by on a > constructed column, the error message you get back references a missing > column for the group by rather than the invalid column. > You can reproduce this in pyspark in 3.1.2 with the following code: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Group By Issue").getOrCreate() > data = spark.createDataFrame( > [("2021-09-15", 1), ("2021-09-16", 2), ("2021-09-17", 10), ("2021-09-18", > 25), ("2021-09-19", 500), ("2021-09-20", 50), ("2021-09-21", 100)], > schema=["d", "v"] > ) > data.createOrReplaceTempView("data") > # This is valid > spark.sql("select sum(v) as value, date(date_trunc('week', d)) as week from > data group by week").show() > # This is invalid because val is the wrong variable > spark.sql("select sum(val) as value, date(date_trunc('week', d)) as week from > data group by week").show() > {code} > The error message for the second spark.sql line is > {quote} > pyspark.sql.utils.AnalysisException: cannot resolve '`week`' given input > columns: [data.d, data.v]; line 1 pos 81; > 'Aggregate ['week], ['sum('val) AS value#21, cast(date_trunc(week, cast(d#0 > as timestamp), Some(America/New_York)) as date) AS week#22] > +- SubqueryAlias data >+- LogicalRDD [d#0, v#1L], false > {quote} > but the actual problem is that I used the wrong variable name in a different > part of the query. Nothing is wrong with {{week}} in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36914) Implement dropIndex and listIndexes in JDBC (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-36914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36914. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34236 [https://github.com/apache/spark/pull/34236] > Implement dropIndex and listIndexes in JDBC (MySQL dialect) > --- > > Key: SPARK-36914 > URL: https://issues.apache.org/jira/browse/SPARK-36914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
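The ticket carries no description; for context, a hedged sketch of the kind of MySQL statements these two operations would plausibly map to (plain JDBC with placeholder table, index, and connection names, not a claim about the implementation in the linked PR):
{code:scala}
// Hedged sketch with placeholder names: the MySQL statements that dropIndex and
// listIndexes would plausibly correspond to on the database side.
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
try {
  val stmt = conn.createStatement()
  stmt.executeUpdate("DROP INDEX idx_people_name ON people")   // dropIndex
  val rs = stmt.executeQuery("SHOW INDEX FROM people")         // listIndexes
  while (rs.next()) println(rs.getString("Key_name"))
} finally conn.close()
{code}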
[jira] [Assigned] (SPARK-36914) Implement dropIndex and listIndexes in JDBC (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-36914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36914: --- Assignee: Huaxin Gao > Implement dropIndex and listIndexes in JDBC (MySQL dialect) > --- > > Key: SPARK-36914 > URL: https://issues.apache.org/jira/browse/SPARK-36914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mentions this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mentions this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, > but says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mentions this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Does it support everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mentions this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mentions this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, > but says nothing about the ciphers.}} > {{Does it support everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} \{{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} \{{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} \{{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} \{{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 > Environment: {{Spark documentation mention this:}} > {{https://spark.apache.org/docs/3.0.0/security.html}} > {{}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {{}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > \{{}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > \{{}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Environment: (was: {{Spark documentation mention this:}} {{https://spark.apache.org/docs/3.0.0/security.html}} {{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}}) > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > \{{}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > \{{}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36988) What chipers spark support for internode communication?
zoli created SPARK-36988: Summary: What chipers spark support for internode communication? Key: SPARK-36988 URL: https://issues.apache.org/jira/browse/SPARK-36988 Project: Spark Issue Type: Question Components: Security Affects Versions: 3.1.2 Environment: {{Spark documentation mention this:}} {{https://spark.apache.org/docs/3.0.0/security.html}} {{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} Reporter: zoli -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Summary: ignoreCorruptFiles does not work when schema change from int to string (was: ignoreCorruptFiles does work when schema change from int to string) > ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > > Precondition: > Folder A contains two parquet files: > * File 1: has some columns, one of which is column X with data type Int > * File 2: same schema as File 1, except column X has data type String > Read file 1 to get its schema. > Read folder A with the schema of file 1. > Expected: the read succeeds and file 2 is ignored, since the data type of > column X changed to string. > Actual: File 2 is not ignored and the read fails with: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427672#comment-17427672 ] Apache Spark commented on SPARK-36987: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34261 > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
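For readers unfamiliar with the statement the doc will cover, Spark SQL accepts a Hive-style FROM-first form; a small hedged example (the table name is made up, and the exact scope of the new page is defined in the linked PR):
{code:scala}
// Hedged example of the FROM-first statement form with a made-up temp view.
spark.range(3).createOrReplaceTempView("t")
spark.sql("FROM t SELECT id WHERE id > 0").show()
{code}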
[jira] [Assigned] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36987: Assignee: Apache Spark > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36987: Assignee: (was: Apache Spark) > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427673#comment-17427673 ] Apache Spark commented on SPARK-36987: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34261 > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org