[jira] [Resolved] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
[ https://issues.apache.org/jira/browse/SPARK-36973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36973. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34251 [https://github.com/apache/spark/pull/34251] > Deduplicate prepare data method for HistogramPlotBase and KdePlotBase > - > > Key: SPARK-36973 > URL: https://issues.apache.org/jira/browse/SPARK-36973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Minor > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
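For readers unfamiliar with this kind of refactor, here is a minimal sketch of what deduplicating a shared prepare-data step can look like. The helper name prepare_plot_data and the pandas-based column selection are illustrative assumptions, not the actual pyspark.pandas code merged in pull request 34251.

{code:python}
import pandas as pd

def prepare_plot_data(data: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Hypothetical shared helper: keep only numeric columns, optionally a subset."""
    numeric = data.select_dtypes(include="number")
    if columns is not None:
        numeric = numeric[[c for c in columns if c in numeric.columns]]
    return numeric

class HistogramPlotBase:
    @staticmethod
    def prepare_hist_data(data, columns=None):
        # Delegate to the shared helper instead of duplicating the selection logic.
        return prepare_plot_data(data, columns)

class KdePlotBase:
    @staticmethod
    def prepare_kde_data(data, columns=None):
        # Same helper, so a fix in one place benefits both plot bases.
        return prepare_plot_data(data, columns)
{code}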
[jira] [Assigned] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
[ https://issues.apache.org/jira/browse/SPARK-36973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36973: Assignee: dch nguyen > Deduplicate prepare data method for HistogramPlotBase and KdePlotBase > - > > Key: SPARK-36973 > URL: https://issues.apache.org/jira/browse/SPARK-36973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36994) Upgrade Apache Thrift
kaja girish created SPARK-36994:
---
Summary: Upgrade Apache Thrift
Key: SPARK-36994
URL: https://issues.apache.org/jira/browse/SPARK-36994
Project: Spark
Issue Type: Bug
Components: Security
Affects Versions: 3.0.1
Reporter: kaja girish

*Image:*
* spark:3.0.1

*Components Affected:*
* Apache Thrift

*Recommendation:*
* upgrade Apache Thrift

*CVE:*
|Component Name|Component Version Name|Vulnerability|Fixed version|
|Apache Thrift|0.11.0-4.|CVE-2019-0205|0.13.0|
|Apache Thrift|0.11.0-4.|CVE-2019-0210|0.13.0|
|Apache Thrift|0.11.0-4.|CVE-2020-13949|0.14.1|

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
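As a small illustration of how the table above reads, the following hedged Python sketch encodes the reported fixed versions and flags which CVEs a given Thrift version is still exposed to. The packaging dependency and helper name are assumptions for illustration only.

{code:python}
# Data taken from the CVE table above; packaging.version is used only for comparison.
from packaging.version import Version

FIXED_IN = {
    "CVE-2019-0205": Version("0.13.0"),
    "CVE-2019-0210": Version("0.13.0"),
    "CVE-2020-13949": Version("0.14.1"),
}

def vulnerable_cves(thrift_version: str):
    """Return the CVEs from the report that the given Thrift version predates."""
    v = Version(thrift_version)
    return [cve for cve, fixed in FIXED_IN.items() if v < fixed]

print(vulnerable_cves("0.11.0"))  # all three CVEs from the table
print(vulnerable_cves("0.14.1"))  # []
{code}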
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column C1 with data type Int and have only one record * File 2: Same schema with File 1 except column C1 having data type String and having>= X records X depends on the capacity of your computer, my case is 36, you can increase the number of row to find X. Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column C1 changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36, you can increase the number of row to find X. Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36, you can increase the number of row to find X. Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36 Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (
[jira] [Commented] (SPARK-36972) Add max_by/min_by API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428036#comment-17428036 ] Apache Spark commented on SPARK-36972: -- User 'yoda-mon' has created a pull request for this issue: https://github.com/apache/spark/pull/34269 > Add max_by/min_by API to PySpark > > > Key: SPARK-36972 > URL: https://issues.apache.org/jira/browse/SPARK-36972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Minor > Fix For: 3.3.0 > > > Related issues > - https://issues.apache.org/jira/browse/SPARK-27653 > * https://issues.apache.org/jira/browse/SPARK-36963 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
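A brief usage sketch of the proposed PySpark functions; the DataFrame, column names, and aliases are made up for illustration, and the exact API is whatever lands with the pull request above.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("sales", "alice", 100), ("sales", "bob", 120), ("eng", "carol", 90)],
    ["dept", "name", "salary"],
)

# max_by/min_by return the value of the first column at the max/min of the second.
df.groupBy("dept").agg(
    F.max_by("name", "salary").alias("top_earner"),
    F.min_by("name", "salary").alias("lowest_earner"),
).show()
{code}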
[jira] [Resolved] (SPARK-36976) Add max_by/min_by API to SparkR
[ https://issues.apache.org/jira/browse/SPARK-36976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36976. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34258 [https://github.com/apache/spark/pull/34258] > Add max_by/min_by API to SparkR > --- > > Key: SPARK-36976 > URL: https://issues.apache.org/jira/browse/SPARK-36976 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Minor > Fix For: 3.3.0 > > > Related issues > - https://issues.apache.org/jira/browse/SPARK-27653 > * https://issues.apache.org/jira/browse/SPARK-36963 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Attachment: (was: file1.parquet) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than X records > - > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > > Precondition: > Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= X records > X depends on the capacity of your computer, my case is 36 > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Attachment: (was: file2.parquet) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than X records > - > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > > Precondition: > Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= X records > X depends on the capacity of your computer, my case is 36 > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36976) Add max_by/min_by API to SparkR
[ https://issues.apache.org/jira/browse/SPARK-36976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36976: Assignee: Leona Yoda > Add max_by/min_by API to SparkR > --- > > Key: SPARK-36976 > URL: https://issues.apache.org/jira/browse/SPARK-36976 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.3.0 >Reporter: Leona Yoda >Assignee: Leona Yoda >Priority: Minor > > Related issues > - https://issues.apache.org/jira/browse/SPARK-27653 > * https://issues.apache.org/jira/browse/SPARK-36963 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= X records X depends on the capacity of your computer, my case is 36 Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Summary: ignoreCorruptFiles does not work when schema change from int to string when a file having more than X records (was: ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than X records > - > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= 36 records > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. 
It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
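For anyone trying to reproduce this locally, below is a hedged variant of the reporter's script that swaps the S3 bucket for a local temp directory and generates the string rows programmatically; everything else follows the code quoted above, and on the affected versions it is expected to hit the same UnsupportedOperationException rather than skipping the second file.

{code:python}
import tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", True)

schema1 = StructType([
    StructField("program_sk", IntegerType(), True),
    StructField("client_sk", IntegerType(), True),
])
df1 = spark.createDataFrame([(1, 17)], schema1)

schema2 = StructType([
    StructField("program_sk", IntegerType(), True),
    StructField("client_sk", StringType(), True),
])
# 36 rows of repeated string values, mirroring the reporter's sample data so the
# Parquet column ends up dictionary-encoded.
rows = [(i, ["1", "3332", "19", ""][i % 4]) for i in range(36)]
df2 = spark.createDataFrame(rows, schema2)

path = tempfile.mkdtemp()  # local stand-in for the reporter's S3 path
df1.write.mode("overwrite").parquet(path)
df2.write.mode("append").parquet(path)

# Reporter's expectation: file 2 is ignored; observed behavior: the read fails.
spark.read.schema(schema1).parquet(path).show()
{code}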
[jira] [Updated] (SPARK-36993) Fix json_tuple throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-36993: --- Summary: Fix json_tuple throw NPE if fields exist no foldable null value (was: Fix json_tupe throw NPE if fields exist no foldable null value) > Fix json_tuple throw NPE if fields exist no foldable null value > --- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
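The failing query above can also be driven from PySpark; a minimal hedged repro follows, where the spark.sql wrapper is mine and the SQL is taken verbatim from the report.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# On the affected versions listed above (3.0.3, 3.1.2, 3.2.0, 3.3.0) this is
# expected to fail with the NullPointerException in JsonTuple.parseRow.
spark.sql(
    """SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a'))
       FROM (SELECT rand() AS c1)"""
).show()
{code}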
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: Spark 3.1 run locally on my Macbook Pro(16G Ram,i7, 2015) In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, "3
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, "
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Summary: ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records (was: ignoreCorruptFiles does not work when schema change from int to string) > ignoreCorruptFiles does not work when schema change from int to string when a > file having more than 35 records > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > and have only one record > * File 2: Same schema with File 1 except column X having data type String > and having>= 36 records > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > > If i remove one record from file2. It works well > > Code with exist file > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' > file1_path = f'{folder_path}/file1.parquet' > file1_schema = spark.read.parquet(file1_path).schema > file_all_df = spark.read.schema(file1_schema).parquet( folder_path) > file_all_df.show(n=10) > {code} > Code with creating file > > {code:java} > spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) > schema1 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", IntegerType(), True), > ]) > sample_data = [(1, 17)] > df1 = spark.createDataFrame(sample_data, schema1) > schema2 = StructType([ > StructField("program_sk", IntegerType(), True), > StructField("client_sk", StringType(), True), > ]) > sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), > (2, "1"), (2, "3332"), (3, "19"), (4, ""), > (3, "1"), (2, "3332"), (3, "19"), (4, ""), > (4, "1"), (2, "3332"), (3, "19"), (4, ""), > (5, "1"), (2, "3332"), (3, "19"), (4, ""), > (6, "1"), (2, "3332"), (3, "19"), (4, ""), > (7, "1"), (2, "3332"), (3, "19"), (4, ""), > (8, "1"), (2, "3332"), (3, "19"), (4, ""), > (9, "1"), (2, "3332"), (3, "19"), (4, ""), > ] > df2 = self.spark.createDataFrame(sample_data, schema2) > file_save_path = 's3://aduro-data-dev/adp_data_lake/test_ignore_corrupt/' > df1.write \ > .mode('overwrite') \ > .format('parquet') \ > .save(f'{file_save_path}') > df2.write \ > .mode('append') \ > .format('parquet') \ > .save(f'{file_save_path}') > df = spark.read.schema(schema1).parquet(file_save_path) > df.show(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string when a file having more than 35 records
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = self.spark.createDataFrame(sample_data, schema2) file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4,
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int and have only one record * File 2: Same schema with File 1 except column X having data type String and having>= 36 records Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` If i remove one record from file2. It works well Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = self.spark.createDataFrame(sample_data, schema2) file_save_path = 's3://aduro-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3,
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` Code with exist file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files' file1_path = f'{folder_path}/file1.parquet' file1_schema = spark.read.parquet(file1_path).schema file_all_df = spark.read.schema(file1_schema).parquet( folder_path) file_all_df.show(n=10) {code} Code with creating file {code:java} spark.conf.set('spark.sql.files.ignoreCorruptFiles', True) schema1 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", IntegerType(), True), ]) sample_data = [(1, 17)] df1 = spark.createDataFrame(sample_data, schema1) schema2 = StructType([ StructField("program_sk", IntegerType(), True), StructField("client_sk", StringType(), True), ]) sample_data = [(1, "1"), (2, "3332"), (3, "19"), (4, ""), (2, "1"), (2, "3332"), (3, "19"), (4, ""), (3, "1"), (2, "3332"), (3, "19"), (4, ""), (4, "1"), (2, "3332"), (3, "19"), (4, ""), (5, "1"), (2, "3332"), (3, "19"), (4, ""), (6, "1"), (2, "3332"), (3, "19"), (4, ""), (7, "1"), (2, "3332"), (3, "19"), (4, ""), (8, "1"), (2, "3332"), (3, "19"), (4, ""), (9, "1"), (2, "3332"), (3, "19"), (4, ""), ] df2 = self.spark.createDataFrame(sample_data, schema2) file_save_path = 's3://aduro-data-dev/adp_data_lake/test_ignore_corrupt/' df1.write \ .mode('overwrite') \ .format('parquet') \ .save(f'{file_save_path}') df2.write \ .mode('append') \ .format('parquet') \ .save(f'{file_save_path}') df = spark.read.schema(schema1).parquet(file_save_path) df.show(){code} was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. 
Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
> ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type o
[jira] [Commented] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428007#comment-17428007 ] Apache Spark commented on SPARK-36993: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34268 > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36993: Assignee: Apache Spark > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36993: Assignee: (was: Apache Spark) > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428006#comment-17428006 ] Apache Spark commented on SPARK-36993: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34268 > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Affects Version/s: 3.0.3 > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null field
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Summary: Fix json_tupe throw NPE if fields exist no foldable null field (was: Fix json_tupe throw NPE if fields exist no foldable null column) > Fix json_tupe throw NPE if fields exist no foldable null field > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null value
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Summary: Fix json_tupe throw NPE if fields exist no foldable null value (was: Fix json_tupe throw NPE if fields exist no foldable null field) > Fix json_tupe throw NPE if fields exist no foldable null value > -- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null column
XiDuo You created SPARK-36993: - Summary: Fix json_tupe throw NPE if fields exist no foldable null column Key: SPARK-36993 URL: https://issues.apache.org/jira/browse/SPARK-36993 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.2.0, 3.3.0 Reporter: XiDuo You If json_tuple exists no foldable null field, Spark would throw NPE during eval field.toString. e.g. the query `SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS c1 );` will fail with: {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36993) Fix json_tupe throw NPE if fields exist no foldable null column
[ https://issues.apache.org/jira/browse/SPARK-36993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-36993: -- Description: If json_tuple exists no foldable null field, Spark would throw NPE during eval field.toString. e.g. the query will fail with: {code:java} SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS c1 ); {code} {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) {code} was: If json_tuple exists no foldable null field, Spark would throw NPE during eval field.toString. e.g. the query `SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS c1 );` will fail with: {code:java} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) at org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) {code} > Fix json_tupe throw NPE if fields exist no foldable null column > --- > > Key: SPARK-36993 > URL: https://issues.apache.org/jira/browse/SPARK-36993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > If json_tuple exists no foldable null field, Spark would throw NPE during > eval field.toString. > e.g. 
the query will fail with: > {code:java} > SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a')) FROM ( SELECT rand() AS > c1 ); > {code} > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$parseRow$2(jsonExpressions.scala:435) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.parseRow(jsonExpressions.scala:435) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.$anonfun$eval$6(jsonExpressions.scala:413) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
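A minimal PySpark reproduction of the SPARK-36993 query above; only the spark.sql wrapper and session setup are added here, the SQL itself is taken from the report. On the affected versions this is expected to hit the NullPointerException shown in the description; after the fix the null field name should presumably produce a NULL output column instead.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# if(c1 < 1, null, 'a') is a non-foldable expression that evaluates to null,
# which is the case that trips json_tuple's field.toString call.
spark.sql(
    """SELECT json_tuple('{"a":"1"}', if(c1 < 1, null, 'a'))
       FROM (SELECT rand() AS c1)"""
).show()
{code}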
[jira] [Resolved] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
[ https://issues.apache.org/jira/browse/SPARK-36953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36953. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34219 [https://github.com/apache/spark/pull/34219] > Expose SQL state and error class in PySpark exceptions > -- > > Key: SPARK-36953 > URL: https://issues.apache.org/jira/browse/SPARK-36953 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > SPARK-34920 introduced error classes and states but they are not accessible in > PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
[ https://issues.apache.org/jira/browse/SPARK-36953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36953: Assignee: Hyukjin Kwon > Expose SQL state and error class in PySpark exceptions > -- > > Key: SPARK-36953 > URL: https://issues.apache.org/jira/browse/SPARK-36953 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > SPARK-34920 introduced error classes and states but they are not accessible in > PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
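A hypothetical usage sketch of what SPARK-36953 aims to enable from Python. The accessor names getErrorClass() and getSqlState() are assumptions mirroring the JVM-side SparkThrowable API; they are not spelled out in this ticket.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

try:
    spark.sql("SELECT * FROM table_that_does_not_exist")
except AnalysisException as exc:
    # Before this change only the message text is exposed to Python;
    # the accessors below are the assumed shape of the new API.
    print(exc.getErrorClass(), exc.getSqlState())
{code}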
[jira] [Resolved] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36794. - Resolution: Fixed Issue resolved by pull request 34247 [https://github.com/apache/spark/pull/34247] > Ignore duplicated join keys when building relation for SEMI/ANTI hash join > -- > > Key: SPARK-36794 > URL: https://issues.apache.org/jira/browse/SPARK-36794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > For LEFT SEMI and LEFT ANTI hash equi-join without extra join condition, we > only need to keep one row per unique join key(s) inside hash table > (`HashedRelation`) when building the hash table. This can help reduce the > size of hash table of join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash join
[ https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-36794: Summary: Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash join (was: Ignore duplicated join keys when building relation for SEMI/ANTI hash join) > Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash > join > -- > > Key: SPARK-36794 > URL: https://issues.apache.org/jira/browse/SPARK-36794 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > For LEFT SEMI and LEFT ANTI hash equi-join without extra join condition, we > only need to keep one row per unique join key(s) inside hash table > (`HashedRelation`) when building the hash table. This can help reduce the > size of hash table of join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
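An illustrative query shape for the SPARK-36794 optimization, not code from the patch: a LEFT SEMI equi-join with no extra condition, steered to a shuffled hash join via the SHUFFLE_HASH hint. Per the description, the build-side hash table for such a plan only needs one row per distinct key, so duplicated keys on the build side no longer inflate it. The data and column names below are made up for illustration.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(0, 1000000).withColumnRenamed("id", "k")
# Heavily duplicated join keys on the build side.
right = spark.range(0, 1000000).selectExpr("id % 100 AS k")

result = left.join(right.hint("SHUFFLE_HASH"), on="k", how="left_semi")
result.explain()  # look for ShuffledHashJoin ... LeftSemi in the physical plan
result.count()
{code}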
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36900: Assignee: Apache Spark > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36954) Fast fail with explicit err msg when calling withWatermark on non-streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-36954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangtengfei resolved SPARK-36954. -- Resolution: Not A Problem > Fast fail with explicit err msg when calling withWatermark on non-streaming > dataset > --- > > Key: SPARK-36954 > URL: https://issues.apache.org/jira/browse/SPARK-36954 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.2 >Reporter: huangtengfei >Priority: Minor > > [Dataset.withWatermark|https://github.com/apache/spark/blob/v3.2.0-rc7/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L740] > is a function specific to Structured Streaming. > Currently it can also be triggered on a batch dataset, where a dedicated rule eliminates it in the analysis phase. A user can call this API and nothing > happens, which may be a little confusing. > If that usage is not expected, maybe we can just fail fast with an explicit > message, and then we also do not have to keep an extra rule to do this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
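For context, a small sketch of the behaviour discussed in SPARK-36954: calling withWatermark on a batch DataFrame currently succeeds and silently has no effect, which is what the proposal would turn into an explicit failure. The column name and watermark interval below are arbitrary.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

batch_df = spark.range(10).withColumn("ts", F.current_timestamp())
# Runs today, but the watermark has no effect on a non-streaming Dataset.
batch_df.withWatermark("ts", "10 minutes").show()
{code}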
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36900: Assignee: (was: Apache Spark) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-36900: -- Assignee: (was: Sean R. Owen) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427992#comment-17427992 ] Hyukjin Kwon commented on SPARK-36900: -- Reverted in: https://github.com/apache/spark/commit/4b86fe4c71559df12ab8a1ebcf5662c4cf87ca7f (branch-3.2) https://github.com/apache/spark/commit/6ed13147c99b2f652748b716c70dd1937230cafd (master) https://github.com/apache/spark/commit/6e8cd3b1a7489c9b0c5779559e45b3cd5decc1ea (master) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36900: - Fix Version/s: (was: 3.2.1) (was: 3.3.0) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Sean R. Owen >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
[ https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427981#comment-17427981 ] Apache Spark commented on SPARK-36992: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34267 > Improve byte array sort perf by unify getPrefix function of UTF8String and > ByteArray > > > Key: SPARK-36992 > URL: https://issues.apache.org/jira/browse/SPARK-36992 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > When execute sort operator, we first compare the prefix. However the > getPrefix function of byte array is slow. We use first 8 bytes as the prefix, > so at most we will call 8 times with `Platform.getByte` which is slower than > call once with `Platform.getInt` or `Platform.getLong`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
[ https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36992: Assignee: (was: Apache Spark) > Improve byte array sort perf by unify getPrefix function of UTF8String and > ByteArray > > > Key: SPARK-36992 > URL: https://issues.apache.org/jira/browse/SPARK-36992 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > When execute sort operator, we first compare the prefix. However the > getPrefix function of byte array is slow. We use first 8 bytes as the prefix, > so at most we will call 8 times with `Platform.getByte` which is slower than > call once with `Platform.getInt` or `Platform.getLong`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
[ https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36992: Assignee: Apache Spark > Improve byte array sort perf by unify getPrefix function of UTF8String and > ByteArray > > > Key: SPARK-36992 > URL: https://issues.apache.org/jira/browse/SPARK-36992 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > When execute sort operator, we first compare the prefix. However the > getPrefix function of byte array is slow. We use first 8 bytes as the prefix, > so at most we will call 8 times with `Platform.getByte` which is slower than > call once with `Platform.getInt` or `Platform.getLong`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray
XiDuo You created SPARK-36992: - Summary: Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray Key: SPARK-36992 URL: https://issues.apache.org/jira/browse/SPARK-36992 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You When executing the sort operator, we first compare the prefix. However, the getPrefix function for byte arrays is slow. We use the first 8 bytes as the prefix, so in the worst case we call `Platform.getByte` 8 times, which is slower than calling `Platform.getInt` or `Platform.getLong` once. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
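A conceptual sketch of the prefix idea only, not Spark's implementation: the sort prefix is the first 8 bytes of the array compared as one unsigned big-endian integer, so a single wide read (cf. Platform.getLong) can replace up to eight single-byte reads.
{code:python}
def sort_prefix(data: bytes) -> int:
    # Zero-pad arrays shorter than 8 bytes, then read all 8 bytes at once.
    head = data[:8].ljust(8, b"\x00")
    return int.from_bytes(head, "big")

# Prefix order agrees with unsigned byte-wise order on the first 8 bytes.
assert sort_prefix(b"ab") < sort_prefix(b"abc") < sort_prefix(b"abd")
{code}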
[jira] [Commented] (SPARK-36971) Query files directly with SQL is broken (with Glue)
[ https://issues.apache.org/jira/browse/SPARK-36971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427976#comment-17427976 ] Hyukjin Kwon commented on SPARK-36971: -- I suggest you do contact AWS or Databricks to follow up the issue. Databricks or AWS aren't Apache Spark. > Query files directly with SQL is broken (with Glue) > --- > > Key: SPARK-36971 > URL: https://issues.apache.org/jira/browse/SPARK-36971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Databricks Runtime 9.1 and 10.0 Beta >Reporter: Lauri Koobas >Priority: Major > > This is broken in DBR 9.1 (and 10.0 Beta): > {{ select * from json.`filename`}} > [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html] > I have tried with JSON and Parquet files. > The error: > {color:#FF}{{Error in SQL statement: SparkException: Unable to fetch > tables of db json}}{color} > Down in the stack trace this also exists: > {{{color:#FF}Caused by: NoSuchObjectException(message:Database json not > found. (Service: AWSGlue; Status Code: 400; Error Code: > EntityNotFoundException; ... )){color}}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36971) Query files directly with SQL is broken (with Glue)
[ https://issues.apache.org/jira/browse/SPARK-36971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36971. -- Resolution: Invalid > Query files directly with SQL is broken (with Glue) > --- > > Key: SPARK-36971 > URL: https://issues.apache.org/jira/browse/SPARK-36971 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Databricks Runtime 9.1 and 10.0 Beta >Reporter: Lauri Koobas >Priority: Major > > This is broken in DBR 9.1 (and 10.0 Beta): > {{ select * from json.`filename`}} > [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html] > I have tried with JSON and Parquet files. > The error: > {color:#FF}{{Error in SQL statement: SparkException: Unable to fetch > tables of db json}}{color} > Down in the stack trace this also exists: > {{{color:#FF}Caused by: NoSuchObjectException(message:Database json not > found. (Service: AWSGlue; Status Code: 400; Error Code: > EntityNotFoundException; ... )){color}}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
> ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. 
> Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
> {code:java}
> spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)
> folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
> file1_path = f'{folder_path}/file1.parquet'
> file1_schema = spark.read.parquet(file1_path).schema
> file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
> file_all_df.show(n=10)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Description: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
{code:java}
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
file1_path = f'{folder_path}/file1.parquet'
file1_schema = spark.read.parquet(file1_path).schema
file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
file_all_df.show(n=10)
{code}
was: Precondition: In folder A having two parquet files * File 1: have some columns and one of them is column X with data type Int * File 2: Same schema with File 1 except column X having data type String Read file 1 to get schema of file 1. Read folder A with schema of file 1. Expected: Read successfully, file 2 will be ignored as the data type of column X changed to string. Actual: File 2 seems to be not ignored and get error: `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. 
> Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)`
> {code:java}
> spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)
> folder_path = 's3://xxx-data-dev/test/ignore_corrupt_files'
> file1_path = f'{folder_path}/file1.parquet'
> file1_schema = spark.read.parquet(file1_path).schema
> file_all_df = spark.read.schema(file1_schema).parquet(folder_path)
> file_all_df.show(n=10)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36985. -- Fix Version/s: 3.3.0 Assignee: Takuya Ueshin Resolution: Fixed Fixed in https://github.com/apache/spark/pull/34266 > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 3.3.0 > > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Attachment: file2.parquet file1.parquet > ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > Attachments: file1.parquet, file2.parquet > > > Precondition: > In folder A having two parquet files > * File 1: have some columns and one of them is column X with data type Int > * File 2: Same schema with File 1 except column X having data type String > Read file 1 to get schema of file 1. > Read folder A with schema of file 1. > Expected: Read successfully, file 2 will be ignored as the data type of > column X changed to string. > Actual: File 2 seems to be not ignored and get error: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427970#comment-17427970 ] Hyukjin Kwon commented on SPARK-36989: -- Adding mypy tests would be super awesome! > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
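As a rough sketch of what such a data test looks like (the expected messages below are illustrative; the exact wording depends on the mypy version and on the actual pyspark annotations), the test feeds a snippet to mypy and asserts on the reported types and errors instead of executing the code:
{code:python}
# Sketch of a type-hint "data test": this snippet is meant to be checked by mypy,
# not executed (reveal_type is a mypy-only construct).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

reveal_type(df)  # expected note, roughly: Revealed type is "pyspark.sql.dataframe.DataFrame"
df.join(42)      # expected error: an int is not an acceptable `other` argument
{code}
An output matcher then compares mypy's actual messages against these expectations, which is why signature changes or mypy output changes surface as test failures.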
[jira] [Assigned] (SPARK-36961) Use PEP526 style variable type hints
[ https://issues.apache.org/jira/browse/SPARK-36961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36961: Assignee: Takuya Ueshin > Use PEP526 style variable type hints > > > Key: SPARK-36961 > URL: https://issues.apache.org/jira/browse/SPARK-36961 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > > Now that we have started using newer Python syntax in the code base, we should > use PEP526 style variable type hints. > https://www.python.org/dev/peps/pep-0526/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36961) Use PEP526 style variable type hints
[ https://issues.apache.org/jira/browse/SPARK-36961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36961. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34227 [https://github.com/apache/spark/pull/34227] > Use PEP526 style variable type hints > > > Key: SPARK-36961 > URL: https://issues.apache.org/jira/browse/SPARK-36961 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 3.3.0 > > > Now that we have started using newer Python syntax in the code base, we should > use PEP526 style variable type hints. > https://www.python.org/dev/peps/pep-0526/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
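For readers unfamiliar with the reference above, a small illustrative sketch (variable names are made up): PEP 526 replaces type-comment annotations on variables with inline annotation syntax.
{code:python}
from typing import Dict, List

# Pre-PEP 526 style: type comments
counts = {}  # type: Dict[str, int]

# PEP 526 style: inline variable annotations
totals: Dict[str, int] = {}
names: List[str] = []
{code}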
[jira] [Resolved] (SPARK-36981) Upgrade joda-time to 2.10.12
[ https://issues.apache.org/jira/browse/SPARK-36981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-36981. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34253 > Upgrade joda-time to 2.10.12 > > > Key: SPARK-36981 > URL: https://issues.apache.org/jira/browse/SPARK-36981 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > joda-time 2.10.12 seems to support an updated TZDB. > https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12 > https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427966#comment-17427966 ] Apache Spark commented on SPARK-36985: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/34266 > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36985: Assignee: (was: Apache Spark) > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36985: Assignee: Apache Spark > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36985) Future typing errors in pyspark.pandas
[ https://issues.apache.org/jira/browse/SPARK-36985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427967#comment-17427967 ] Apache Spark commented on SPARK-36985: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/34266 > Future typing errors in pyspark.pandas > -- > > Key: SPARK-36985 > URL: https://issues.apache.org/jira/browse/SPARK-36985 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > The following problems are detected on master with mypy 0.920 > {code:java} > $ git rev-parse HEAD > 36b3bbc0aa9f9c39677960cd93f32988c7d7aaca > $ mypy --version > mypy 0.920+dev.332b712df848cd242987864b38bd237364654532 > $ mypy --config-file mypy.ini pyspark > pyspark/pandas/indexes/base.py:184: error: Incompatible types in assignment > (expression has type "CategoricalIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:188: error: Incompatible types in assignment > (expression has type "Int64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:192: error: Incompatible types in assignment > (expression has type "Float64Index", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:197: error: Incompatible types in assignment > (expression has type "DatetimeIndex", variable has type "MultiIndex") > [assignment] > pyspark/pandas/indexes/base.py:199: error: Incompatible types in assignment > (expression has type "Index", variable has type "MultiIndex") [assignment] > pyspark/pandas/indexes/base.py:201: error: "MultiIndex" has no attribute > "_anchor" [attr-defined] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23626) DAGScheduler blocked due to JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-23626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427958#comment-17427958 ] Apache Spark commented on SPARK-23626: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/34265 > DAGScheduler blocked due to JobSubmitted event > --- > > Key: SPARK-23626 > URL: https://issues.apache.org/jira/browse/SPARK-23626 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.1, 2.3.3, 2.4.3, 3.0.0 >Reporter: Ajith S >Priority: Major > > DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted > events have to be processed, because DAGSchedulerEventProcessLoop is single threaded > and blocks other events in the queue, such as TaskCompletion. > The JobSubmitted event is time consuming depending on the nature of the job > (for example: calculating parent stage dependencies, shuffle dependencies, > partitions) and thus it blocks all subsequent events from being processed. > > I see multiple JIRAs referring to this behavior: > https://issues.apache.org/jira/browse/SPARK-2647 > https://issues.apache.org/jira/browse/SPARK-4961 > > Similarly, in my cluster the partition calculation of some jobs is time consuming > (similar to the stack in SPARK-2647), which slows down the > DAGSchedulerEventProcessLoop and causes user jobs to slow down, even if > their tasks finish within seconds, because TaskCompletion events are processed > at a slower rate due to the blockage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
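To make the bottleneck concrete, here is a toy sketch in plain Python (not Spark code): with a single consumer, one slow JobSubmitted handler delays every event queued behind it, including otherwise cheap TaskCompletion events.
{code:python}
import queue
import time

events: "queue.Queue[str]" = queue.Queue()

def handle(event: str) -> None:
    if event == "JobSubmitted":
        time.sleep(2.0)  # stand-in for expensive stage/dependency/partition computation
    print(f"handled {event} at +{time.time() - start:.1f}s")

for e in ["JobSubmitted", "TaskCompletion", "TaskCompletion"]:
    events.put(e)

start = time.time()
while not events.empty():
    handle(events.get())  # single consumer, analogous to DAGSchedulerEventProcessLoop
{code}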
[jira] [Assigned] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules
[ https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36979: - Assignee: XiDuo You > Add RewriteLateralSubquery rule into nonExcludableRules > --- > > Key: SPARK-36979 > URL: https://issues.apache.org/jira/browse/SPARK-36979 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.2.0 > > > Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if > we set > `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`, > the lateral join query will fail with: > {code:java} > java.lang.AssertionError: assertion failed: No plan for LateralJoin > lateral-subquery#218 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules
[ https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36979: -- Issue Type: Bug (was: Improvement) > Add RewriteLateralSubquery rule into nonExcludableRules > --- > > Key: SPARK-36979 > URL: https://issues.apache.org/jira/browse/SPARK-36979 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Fix For: 3.2.0 > > > Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if > we set > `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`, > the lateral join query will fail with: > {code:java} > java.lang.AssertionError: assertion failed: No plan for LateralJoin > lateral-subquery#218 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules
[ https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36979. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 34260 [https://github.com/apache/spark/pull/34260] > Add RewriteLateralSubquery rule into nonExcludableRules > --- > > Key: SPARK-36979 > URL: https://issues.apache.org/jira/browse/SPARK-36979 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Minor > Fix For: 3.2.0 > > > Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if > we set > `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`, > the lateral join query will fail with: > {code:java} > java.lang.AssertionError: assertion failed: No plan for LateralJoin > lateral-subquery#218 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
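A hedged repro sketch of the pre-fix behavior (view names are made up; LATERAL subqueries require Spark 3.2+): with the rule excluded, planning the lateral join is expected to fail with the "No plan for LateralJoin" assertion. After this fix the rule is non-excludable, so the setting below simply has no effect.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set(
    "spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery",
)

spark.range(5).createOrReplaceTempView("t1")
spark.range(5).createOrReplaceTempView("t2")

# Correlated LATERAL subquery; without RewriteLateralSubquery the physical planner
# has no strategy for the LateralJoin node.
spark.sql(
    "SELECT * FROM t1, LATERAL (SELECT t2.id AS id2 FROM t2 WHERE t2.id = t1.id)"
).show()
{code}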
[jira] [Commented] (SPARK-36991) Inline type hints for spark/python/pyspark/sql/streaming.py
[ https://issues.apache.org/jira/browse/SPARK-36991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427922#comment-17427922 ] Xinrong Meng commented on SPARK-36991: -- I am working on this. > Inline type hints for spark/python/pyspark/sql/streaming.py > --- > > Key: SPARK-36991 > URL: https://issues.apache.org/jira/browse/SPARK-36991 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Inline type hints for spark/python/pyspark/sql/streaming.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36991) Inline type hints for spark/python/pyspark/sql/streaming.py
Xinrong Meng created SPARK-36991: Summary: Inline type hints for spark/python/pyspark/sql/streaming.py Key: SPARK-36991 URL: https://issues.apache.org/jira/browse/SPARK-36991 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng Inline type hints for spark/python/pyspark/sql/streaming.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
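A brief sketch of what inlining means here (the signature below is simplified and hypothetical, not the real pyspark code): annotations that previously lived in a separate .pyi stub move into the .py source, so the type checker also sees the function bodies.
{code:python}
from typing import Optional

# Before (stub file streaming.pyi):
#     class DataStreamWriter:
#         def trigger(self, *, processingTime: Optional[str] = ...) -> "DataStreamWriter": ...

# After (annotations written inline in streaming.py):
class DataStreamWriter:
    def trigger(self, *, processingTime: Optional[str] = None) -> "DataStreamWriter":
        # the body is now type-checked as well
        self._processing_time = processingTime
        return self
{code}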
[jira] [Updated] (SPARK-36990) Long columns cannot read columns with INT32 type in the parquet file
[ https://issues.apache.org/jira/browse/SPARK-36990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Catalin Toda updated SPARK-36990: - Description: The code below does not work on both Spark 3.1 and Spark 3.2. Part of the issue is the fact that the fileSchema has logicalTypeAnnotation == null ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] which makes isUnsignedTypeMatched return false always: [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] I am not sure even if logicalTypeAnnotation would not be null if isUnsignedTypeMatched is supposed to return true for this use case. Python repro: {code:java} import os from pyspark.sql.functions import * from pyspark.sql import SparkSession from pyspark.sql.types import * spark = SparkSession.builder \ .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \ .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A") \ .getOrCreate() df = spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") df.write.mode("overwrite").parquet("s3://bucket/test") df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") df.show(1, False) {code} was: The code above does not work on both Spark 3.1 and Spark 3.2. Part of the issue is the fact that the fileSchema has logicalTypeAnnotation == null ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] which makes isUnsignedTypeMatched return false always: [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] I am not sure even if logicalTypeAnnotation would not be null if isUnsignedTypeMatched is supposed to return true for this use case. > Long columns cannot read columns with INT32 type in the parquet file > > > Key: SPARK-36990 > URL: https://issues.apache.org/jira/browse/SPARK-36990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Catalin Toda >Priority: Major > > The code below does not work on both Spark 3.1 and Spark 3.2. > Part of the issue is the fact that the fileSchema has logicalTypeAnnotation > == null > ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] > which makes isUnsignedTypeMatched return false always: > [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] > > I am not sure even if logicalTypeAnnotation would not be null if > isUnsignedTypeMatched is supposed to return true for this use case. 
> Python repro: > {code:java} > import os > from pyspark.sql.functions import * > from pyspark.sql import SparkSession > from pyspark.sql.types import * > spark = SparkSession.builder \ > .config("spark.hadoop.fs.s3.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") \ > .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", > "org.apache.hadoop.fs.s3a.S3A") \ > .getOrCreate() > df = > spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") > df.write.mode("overwrite").parquet("s3://bucket/test") > df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") > df.show(1, False) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
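The repro above assumes an S3 bucket and the S3A configuration; a local-filesystem variant (the path below is arbitrary) should exercise the same read path:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, LongType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 2), (2, 3)],
    StructType([StructField("id", IntegerType(), True),
                StructField("id2", IntegerType(), True)]),
).select("id")
df.write.mode("overwrite").parquet("/tmp/spark-36990")

# Reading INT32 parquet data back with a LongType read schema hits the
# isUnsignedTypeMatched path described above.
spark.read.schema(StructType([StructField("id", LongType(), True)])) \
    .parquet("/tmp/spark-36990").show(1, False)
{code}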
[jira] [Updated] (SPARK-36990) Long columns cannot read columns with INT32 type in the parquet file
[ https://issues.apache.org/jira/browse/SPARK-36990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Catalin Toda updated SPARK-36990: - Environment: (was: Python repro: {code:java} import os from pyspark.sql.functions import * from pyspark.sql import SparkSession from pyspark.sql.types import * spark = SparkSession.builder \ .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \ .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A") \ .getOrCreate() df = spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") df.write.mode("overwrite").parquet("s3://bucket/test") df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") df.show(1, False) {code}) > Long columns cannot read columns with INT32 type in the parquet file > > > Key: SPARK-36990 > URL: https://issues.apache.org/jira/browse/SPARK-36990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Catalin Toda >Priority: Major > > The code above does not work on both Spark 3.1 and Spark 3.2. > Part of the issue is the fact that the fileSchema has logicalTypeAnnotation > == null > ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] > which makes isUnsignedTypeMatched return false always: > [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] > > I am not sure even if logicalTypeAnnotation would not be null if > isUnsignedTypeMatched is supposed to return true for this use case. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36951) Inline type hints for python/pyspark/sql/column.py
[ https://issues.apache.org/jira/browse/SPARK-36951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36951. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34226 https://github.com/apache/spark/pull/34226 > Inline type hints for python/pyspark/sql/column.py > -- > > Key: SPARK-36951 > URL: https://issues.apache.org/jira/browse/SPARK-36951 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/column.py for type check of function > bodies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36990) Long columns cannot read columns with INT32 type in the parquet file
Catalin Toda created SPARK-36990: Summary: Long columns cannot read columns with INT32 type in the parquet file Key: SPARK-36990 URL: https://issues.apache.org/jira/browse/SPARK-36990 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.2.0 Environment: Python repro: {code:java} import os from pyspark.sql.functions import * from pyspark.sql import SparkSession from pyspark.sql.types import * spark = SparkSession.builder \ .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \ .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A") \ .getOrCreate() df = spark.createDataFrame([(1,2),(2,3)],StructType([StructField("id",IntegerType(),True),StructField("id2",IntegerType(),True)])).select("id") df.write.mode("overwrite").parquet("s3://bucket/test") df=spark.read.schema(StructType([StructField("id",LongType(),True)])).parquet("s3://bucket/test") df.show(1, False) {code} Reporter: Catalin Toda The code above does not work on both Spark 3.1 and Spark 3.2. Part of the issue is the fact that the fileSchema has logicalTypeAnnotation == null ([https://github.com/apache/spark/blob/5013171fd36e6221a540c801cb7fd9e298a6b5ba/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L92)] which makes isUnsignedTypeMatched return false always: [https://github.com/apache/spark/blob/5b2f1912280e7a5afb92a96b894a7bc5f263aa6e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java#L180] I am not sure even if logicalTypeAnnotation would not be null if isUnsignedTypeMatched is supposed to return true for this use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-36989: --- Description: Before the migration, {{pyspark-stubs}} contained a set of [data tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], modeled after, and using internal test utilities, of mypy. These were omitted during the migration for a few reasons: * Simplicity. * Relative slowness. * Dependence on non public API. Data tests are useful for a number of reasons: * Improve test coverage for type hints. * Checking if type checkers infer expected types. * Checking if type checkers reject incorrect code. * Detecting unusual errors with code that otherwise type checks, Especially, the last two functions are not fulfilled by simple validation of existing codebase. Data tests are not required for all annotations and can be restricted to code that has high possibility of failure: * Complex overloaded signatures. * Complex generics. * Generic {{self}} annotations * Code containing {{type: ignore}} The biggest risk, is that output matchers have to be updated when signature changes and / or mypy output changes. Example of problem detected with data tests can be found in SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]). was: Before the migration, {{pyspark-stubs}} contained a set of data tests, modeled after, and using internal test utilities, of mypy. These were omitted during the migration for a few reasons: * Simplicity. * Relative slowness. * Dependence on non public API. Data tests are useful for a number of reasons: * Improve test coverage for type hints. * Checking if type checkers infer expected types. * Checking if type checkers reject incorrect code. * Detecting unusual errors with code that otherwise type checks, Especially, the last two functions are not fulfilled by simple validation of existing codebase. Data tests are not required for all annotations and can be restricted to code that has high possibility of failure: * Complex overloaded signatures. * Complex generics. * Generic {{self}} annotations * Code containing {{type: ignore}} The biggest risk, is that output matchers have to be updated when signature changes and / or mypy output changes. Example of problem detected with data tests can be found in SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]). > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of [data > tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. 
> * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427889#comment-17427889 ] Maciej Szymkiewicz edited comment on SPARK-36989 at 10/12/21, 7:23 PM: --- FYI [~hyukjin.kwon], [~ueshin], [~XinrongM] was (Author: zero323): FYI [~hyukjin.kwon] [~XinrongM] [~ueshin] > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of data tests, > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427889#comment-17427889 ] Maciej Szymkiewicz commented on SPARK-36989: FYI [~hyukjin.kwon] [~XinrongM] [~ueshin] > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of data tests, > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427888#comment-17427888 ] Maciej Szymkiewicz commented on SPARK-36989: Currently I am working on [some fixes|https://github.com/typeddjango/pytest-mypy-plugins/commits?author=zero323] to [typeddjango/pytest-mypy-plugins|https://github.com/typeddjango/pytest-mypy-plugins] and I hope it will allow us to bring data test to Spark, without depending on internal mypy testing suite (which, adding to being internal, requires rather specific project layout). > Migrate type hint data tests > > > Key: SPARK-36989 > URL: https://issues.apache.org/jira/browse/SPARK-36989 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Before the migration, {{pyspark-stubs}} contained a set of data tests, > modeled after, and using internal test utilities, of mypy. > These were omitted during the migration for a few reasons: > * Simplicity. > * Relative slowness. > * Dependence on non public API. > > Data tests are useful for a number of reasons: > > * Improve test coverage for type hints. > * Checking if type checkers infer expected types. > * Checking if type checkers reject incorrect code. > * Detecting unusual errors with code that otherwise type checks, > > Especially, the last two functions are not fulfilled by simple validation of > existing codebase. > > Data tests are not required for all annotations and can be restricted to code > that has high possibility of failure: > * Complex overloaded signatures. > * Complex generics. > * Generic {{self}} annotations > * Code containing {{type: ignore}} > The biggest risk, is that output matchers have to be updated when signature > changes and / or mypy output changes. > Example of problem detected with data tests can be found in SPARK-36894 PR > ([https://github.com/apache/spark/pull/34146]). > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36462: Assignee: (was: Apache Spark) > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36462: Assignee: Apache Spark > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427886#comment-17427886 ] Apache Spark commented on SPARK-36462: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/34264 > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427884#comment-17427884 ] Apache Spark commented on SPARK-36462: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/34264 > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36989) Migrate type hint data tests
Maciej Szymkiewicz created SPARK-36989: -- Summary: Migrate type hint data tests Key: SPARK-36989 URL: https://issues.apache.org/jira/browse/SPARK-36989 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz Before the migration, {{pyspark-stubs}} contained a set of data tests, modeled after, and using internal test utilities, of mypy. These were omitted during the migration for a few reasons: * Simplicity. * Relative slowness. * Dependence on non public API. Data tests are useful for a number of reasons: * Improve test coverage for type hints. * Checking if type checkers infer expected types. * Checking if type checkers reject incorrect code. * Detecting unusual errors with code that otherwise type checks, Especially, the last two functions are not fulfilled by simple validation of existing codebase. Data tests are not required for all annotations and can be restricted to code that has high possibility of failure: * Complex overloaded signatures. * Complex generics. * Generic {{self}} annotations * Code containing {{type: ignore}} The biggest risk, is that output matchers have to be updated when signature changes and / or mypy output changes. Example of problem detected with data tests can be found in SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427852#comment-17427852 ] Apache Spark commented on SPARK-36978: -- User 'utkarsh39' has created a pull request for this issue: https://github.com/apache/spark/pull/34263 > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
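To observe the current behavior, a small hedged sketch (column names are made up; the exact plan text varies by Spark version) that filters on a nested field and prints the optimized plan:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([((1, 2),), ((3, 4),)], "structCol struct<a:int,b:int>")

# With the conservative rule, the optimized plan is expected to contain
# isnotnull(structCol) rather than isnotnull(structCol.b), which can prevent
# nested-column pruning of structCol.
df.filter(col("structCol.b") > 0).explain(True)
{code}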
[jira] [Assigned] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36978: Assignee: Apache Spark > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Assignee: Apache Spark >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36978: Assignee: (was: Apache Spark) > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427851#comment-17427851 ] Apache Spark commented on SPARK-36978: -- User 'utkarsh39' has created a pull request for this issue: https://github.com/apache/spark/pull/34263 > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. > This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > > The new constraints also create opportunities for nested pruning. Currently > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type
[ https://issues.apache.org/jira/browse/SPARK-36978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Utkarsh Agarwal updated SPARK-36978: Description: [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] optimization rule generates {{IsNotNull}} constraints corresponding to null intolerant predicates. The {{IsNotNull}} constraints are generated on the attribute inside the corresponding predicate. e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int column {{structCol.b}} where {{structCol}} is a struct column results in a constraint {{IsNotNull(structCol)}}. This generation of constraints on the root level nested type is extremely conservative as it could lead to materialization of the the entire struct. The constraint should instead be generated on the nested field being referenced by the predicate. In the above example, the constraint should be {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} The new constraints also create opportunities for nested pruning. Currently {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. However the constraint {{IsNotNull(structCol.b)}} could create opportunities to prune {{structCol}}. was: [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] optimization rule generates {{IsNotNull}} constraints corresponding to null intolerant predicates. The {{IsNotNull}} constraints are generated on the attribute inside the corresponding predicate. e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int column {{structCol.b}} where {{structCol}} is a struct column results in a constraint {{IsNotNull(structCol)}}. This generation of constraints on the root level nested type is extremely conservative as it could lead to materialization of the the entire struct. The constraint should instead be generated on the nested field being referenced by the predicate. In the above example, the constraint should be {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}} > InferConstraints rule should create IsNotNull constraints on the nested field > instead of the root nested type > -- > > Key: SPARK-36978 > URL: https://issues.apache.org/jira/browse/SPARK-36978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: Utkarsh Agarwal >Priority: Major > > [InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206] > optimization rule generates {{IsNotNull}} constraints corresponding to null > intolerant predicates. The {{IsNotNull}} constraints are generated on the > attribute inside the corresponding predicate. > e.g. A predicate {{a > 0}} on an integer column {{a}} will result in a > constraint {{IsNotNull(a)}}. On the other hand a predicate on a nested int > column {{structCol.b}} where {{structCol}} is a struct column results in a > constraint {{IsNotNull(structCol)}}. 
> This generation of constraints on the root level nested type is extremely > conservative as it could lead to materialization of the entire struct. > The constraint should instead be generated on the nested field being > referenced by the predicate. In the above example, the constraint should be > {{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}. > > The new constraints also create opportunities for nested pruning. Currently the > {{IsNotNull(structCol)}} constraint would preclude pruning of {{structCol}}. > However, the constraint {{IsNotNull(structCol.b)}} could create opportunities > to prune {{structCol}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
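To make the proposal concrete, here is a minimal sketch (hypothetical column names) that filters on a nested field and prints the optimized plan so the inferred {{IsNotNull}} constraint can be inspected. With the current rule the plan should contain {{isnotnull(structCol)}}; with the proposed change it would instead constrain {{structCol.b}} only.
{code:scala}
// Minimal sketch with made-up column names: build a struct column, filter on a
// nested field, and inspect which IsNotNull constraint the optimizer injected.
import org.apache.spark.sql.functions.{lit, struct}
import spark.implicits._

val df = spark.range(5)
  .select(struct($"id".as("b"), lit("x").as("c")).as("structCol"))

val q = df.filter($"structCol.b" > 0)

// Expect isnotnull(structCol) today; the proposal is isnotnull(structCol.b),
// which keeps nested-column pruning of structCol possible.
println(q.queryExecution.optimizedPlan.numberedTreeString)
{code}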
[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns
[ https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427825#comment-17427825 ] Shardul Mahadik commented on SPARK-36877: - {quote} Getting RDD means the physical plan is finalized. With AQE, finalizing the physical plan means running all the query stages except for the last stage.{quote} Ack! Makes sense. {quote}> shouldn't it reuse the result from previous stages? One DataFrame means one query, and today Spark can't reuse shuffle/broadcast/subquery across queries.{quote} But isn't this the same DF. I am calling {{df.rdd}} and then {{df.write}} where {{df}} is the same. So it is not across queries. > Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing > reruns > -- > > Key: SPARK-36877 > URL: https://issues.apache.org/jira/browse/SPARK-36877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1 >Reporter: Shardul Mahadik >Priority: Major > Attachments: Screen Shot 2021-09-28 at 09.32.20.png > > > In one of our jobs we perform the following operation: > {code:scala} > val df = /* some expensive multi-table/multi-stage join */ > val numPartitions = df.rdd.getNumPartitions > df.repartition(x).write. > {code} > With AQE enabled, we found that the expensive stages were being run twice > causing significant performance regression after enabling AQE; once when > calling {{df.rdd}} and again when calling {{df.write}}. > A more concrete example: > {code:scala} > scala> sql("SET spark.sql.adaptive.enabled=true") > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1") > res1: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> val df1 = spark.range(10).withColumn("id2", $"id") > df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), > "id").join(spark.range(10), "id") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df3 = df2.groupBy("id2").count() > df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint] > scala> df3.rdd.getNumPartitions > res2: Int = 10(0 + 16) / > 16] > scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1") > {code} > In the screenshot below, you can see that the first 3 stages (0 to 4) were > rerun again (5 to 9). > I have two questions: > 1) Should calling df.rdd trigger actual job execution when AQE is enabled? > 2) Should calling df.write later cause rerun of the stages? If df.rdd has > already partially executed the stages, shouldn't it reuse the result from > previous stages? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
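Until this is addressed, a sketch of a user-level mitigation (not the fix discussed in this ticket) is to persist the expensive DataFrame so the stages executed while finalizing the plan for {{df.rdd}} are not recomputed by the later write:
{code:scala}
// Workaround sketch: materialize the expensive result once, so that df.rdd and
// the subsequent write both read from the cache instead of re-running the joins.
import spark.implicits._

val expensive = spark.range(10).withColumn("id2", $"id")
  .join(spark.range(10), "id")
  .groupBy("id2").count()            // stand-in for the expensive multi-stage plan

expensive.persist()
expensive.count()                    // one full run populates the cache

val numPartitions = expensive.rdd.getNumPartitions   // scans the cached data
expensive.repartition(5).write.mode("overwrite").orc("/tmp/orc1")

expensive.unpersist()
{code}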
[jira] [Resolved] (SPARK-36970) Manual disabled format `B` for `date_format` function to compatibility with Java 8 behavior.
[ https://issues.apache.org/jira/browse/SPARK-36970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36970. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34237 [https://github.com/apache/spark/pull/34237] > Manual disabled format `B` for `date_format` function to compatibility with > Java 8 behavior. > > > Key: SPARK-36970 > URL: https://issues.apache.org/jira/browse/SPARK-36970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > The `date_format` function has some behavioral differences when using JDK 8 > and JDK 17 as following: > the result of {{select date_format('2018-11-17 13:33:33.333', 'B')}} in > {{datetime-formatting-invalid.sql}} with Java 8 is: > {code:java} > -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct<> > -- !query output > java.lang.IllegalArgumentException > Unknown pattern letter: B > {code} > and with Java 17 the result is: > {code:java} > - datetime-formatting-invalid.sql *** FAILED *** > datetime-formatting-invalid.sql > Expected "struct<[]>", but got "struct<[date_format(2018-11-17 > 13:33:33.333, B):string]>" Schema did not match for query #34 > select date_format('2018-11-17 13:33:33.333', 'B'): -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct > -- !query output > in the afternoon (SQLQueryTestSuite.scala:469) > {code} > > From the javadoc we can find that 'B' is used to represent `{{Pattern letters > to output a day period`}} in Java 17. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
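The underlying difference can be reproduced with {{java.time}} directly, independent of Spark, since {{date_format}} delegates to {{DateTimeFormatter}}; a small sketch (the exact output depends on the JDK version and default locale):
{code:scala}
// Pattern letter 'B' is rejected by java.time on Java 8 but formats a "day period"
// on Java 16+; this is what makes the SQL test output differ between JDKs.
import java.time.LocalTime
import java.time.format.DateTimeFormatter

try {
  val fmt = DateTimeFormatter.ofPattern("B")      // IllegalArgumentException on Java 8
  println(LocalTime.of(13, 33, 33).format(fmt))   // e.g. "in the afternoon" on Java 17
} catch {
  case e: IllegalArgumentException => println(s"Rejected on this JDK: ${e.getMessage}")
}
{code}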
[jira] [Assigned] (SPARK-36970) Manual disabled format `B` for `date_format` function to compatibility with Java 8 behavior.
[ https://issues.apache.org/jira/browse/SPARK-36970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-36970: Assignee: Yang Jie > Manual disabled format `B` for `date_format` function to compatibility with > Java 8 behavior. > > > Key: SPARK-36970 > URL: https://issues.apache.org/jira/browse/SPARK-36970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > The `date_format` function has some behavioral differences when using JDK 8 > and JDK 17 as following: > the result of {{select date_format('2018-11-17 13:33:33.333', 'B')}} in > {{datetime-formatting-invalid.sql}} with Java 8 is: > {code:java} > -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct<> > -- !query output > java.lang.IllegalArgumentException > Unknown pattern letter: B > {code} > and with Java 17 the result is: > {code:java} > - datetime-formatting-invalid.sql *** FAILED *** > datetime-formatting-invalid.sql > Expected "struct<[]>", but got "struct<[date_format(2018-11-17 > 13:33:33.333, B):string]>" Schema did not match for query #34 > select date_format('2018-11-17 13:33:33.333', 'B'): -- !query > select date_format('2018-11-17 13:33:33.333', 'B') > -- !query schema > struct > -- !query output > in the afternoon (SQLQueryTestSuite.scala:469) > {code} > > From the javadoc we can find that 'B' is used to represent `{{Pattern letters > to output a day period`}} in Java 17. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What ciphers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Summary: What ciphers spark support for internode communication? (was: What chipers spark support for internode communication?) > What ciphers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mentions this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, > but says nothing about the ciphers.}} > {{Does it support everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
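For what it is worth, {{spark.network.crypto.config.*}} only tunes commons-crypto; whether RPC encryption happens at all is governed by {{spark.network.crypto.enabled}}, so leaving the {{config.*}} keys unset means commons-crypto falls back to its own defaults rather than disabling encryption. A hedged sketch of such a configuration (the keys follow the commons-crypto property names per the quoted doc text, and the values are illustrative, not Spark defaults):
{code:scala}
// Hedged sketch (illustrative values, not Spark defaults): AES-based RPC encryption
// with commons-crypto settings passed through the spark.network.crypto.config.* prefix.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.authenticate", "true")                // AES-based RPC encryption requires auth
  .set("spark.network.crypto.enabled", "true")
  // forwarded to commons-crypto as commons.crypto.cipher.transformation
  .set("spark.network.crypto.config.cipher.transformation", "AES/CTR/NoPadding")
  // forwarded as commons.crypto.cipher.classes: JceCipher (JVM ciphers) or OpenSslCipher
  .set("spark.network.crypto.config.cipher.classes",
       "org.apache.commons.crypto.cipher.JceCipher")
{code}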
[jira] [Assigned] (SPARK-36867) Misleading Error Message with Invalid Column and Group By
[ https://issues.apache.org/jira/browse/SPARK-36867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36867: --- Assignee: Wenchen Fan > Misleading Error Message with Invalid Column and Group By > - > > Key: SPARK-36867 > URL: https://issues.apache.org/jira/browse/SPARK-36867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Alan Jackoway >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.0 > > > When you run a query with an invalid column that also does a group by on a > constructed column, the error message you get back references a missing > column for the group by rather than the invalid column. > You can reproduce this in pyspark in 3.1.2 with the following code: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Group By Issue").getOrCreate() > data = spark.createDataFrame( > [("2021-09-15", 1), ("2021-09-16", 2), ("2021-09-17", 10), ("2021-09-18", > 25), ("2021-09-19", 500), ("2021-09-20", 50), ("2021-09-21", 100)], > schema=["d", "v"] > ) > data.createOrReplaceTempView("data") > # This is valid > spark.sql("select sum(v) as value, date(date_trunc('week', d)) as week from > data group by week").show() > # This is invalid because val is the wrong variable > spark.sql("select sum(val) as value, date(date_trunc('week', d)) as week from > data group by week").show() > {code} > The error message for the second spark.sql line is > {quote} > pyspark.sql.utils.AnalysisException: cannot resolve '`week`' given input > columns: [data.d, data.v]; line 1 pos 81; > 'Aggregate ['week], ['sum('val) AS value#21, cast(date_trunc(week, cast(d#0 > as timestamp), Some(America/New_York)) as date) AS week#22] > +- SubqueryAlias data >+- LogicalRDD [d#0, v#1L], false > {quote} > but the actual problem is that I used the wrong variable name in a different > part of the query. Nothing is wrong with {{week}} in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
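As a side note, once the misspelled column is corrected the alias in the GROUP BY resolves fine; a quick check against the same temp view (shown here via {{spark.sql}}, which behaves the same from Scala or Python):
{code:scala}
// The alias-in-GROUP-BY form is valid; only the misspelled column (val -> v) was wrong.
spark.sql(
  """select sum(v) as value, date(date_trunc('week', d)) as week
    |from data group by week""".stripMargin).show()
{code}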
[jira] [Resolved] (SPARK-36867) Misleading Error Message with Invalid Column and Group By
[ https://issues.apache.org/jira/browse/SPARK-36867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36867. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34244 [https://github.com/apache/spark/pull/34244] > Misleading Error Message with Invalid Column and Group By > - > > Key: SPARK-36867 > URL: https://issues.apache.org/jira/browse/SPARK-36867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Alan Jackoway >Priority: Major > Fix For: 3.3.0 > > > When you run a query with an invalid column that also does a group by on a > constructed column, the error message you get back references a missing > column for the group by rather than the invalid column. > You can reproduce this in pyspark in 3.1.2 with the following code: > {code:python} > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName("Group By Issue").getOrCreate() > data = spark.createDataFrame( > [("2021-09-15", 1), ("2021-09-16", 2), ("2021-09-17", 10), ("2021-09-18", > 25), ("2021-09-19", 500), ("2021-09-20", 50), ("2021-09-21", 100)], > schema=["d", "v"] > ) > data.createOrReplaceTempView("data") > # This is valid > spark.sql("select sum(v) as value, date(date_trunc('week', d)) as week from > data group by week").show() > # This is invalid because val is the wrong variable > spark.sql("select sum(val) as value, date(date_trunc('week', d)) as week from > data group by week").show() > {code} > The error message for the second spark.sql line is > {quote} > pyspark.sql.utils.AnalysisException: cannot resolve '`week`' given input > columns: [data.d, data.v]; line 1 pos 81; > 'Aggregate ['week], ['sum('val) AS value#21, cast(date_trunc(week, cast(d#0 > as timestamp), Some(America/New_York)) as date) AS week#22] > +- SubqueryAlias data >+- LogicalRDD [d#0, v#1L], false > {quote} > but the actual problem is that I used the wrong variable name in a different > part of the query. Nothing is wrong with {{week}} in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36914) Implement dropIndex and listIndexes in JDBC (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-36914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36914. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34236 [https://github.com/apache/spark/pull/34236] > Implement dropIndex and listIndexes in JDBC (MySQL dialect) > --- > > Key: SPARK-36914 > URL: https://issues.apache.org/jira/browse/SPARK-36914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
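The ticket carries no description; for context, a hedged sketch of the kind of MySQL statements these two operations would plausibly map to (plain JDBC with placeholder table, index, and connection names, not a claim about the implementation in the linked PR):
{code:scala}
// Hedged sketch with placeholder names: the MySQL statements that dropIndex and
// listIndexes would plausibly correspond to on the database side.
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
try {
  val stmt = conn.createStatement()
  stmt.executeUpdate("DROP INDEX idx_people_name ON people")   // dropIndex
  val rs = stmt.executeQuery("SHOW INDEX FROM people")         // listIndexes
  while (rs.next()) println(rs.getString("Key_name"))
} finally conn.close()
{code}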
[jira] [Assigned] (SPARK-36914) Implement dropIndex and listIndexes in JDBC (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-36914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36914: --- Assignee: Huaxin Gao > Implement dropIndex and listIndexes in JDBC (MySQL dialect) > --- > > Key: SPARK-36914 > URL: https://issues.apache.org/jira/browse/SPARK-36914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mentions this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mentions this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, > but says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mentions this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Does it support everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mentions this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mentions this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, > but says nothing about the ciphers.}} > {{Does it support everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} \{{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} \{{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} was: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Description: {{Spark documentation mention this:}} {{[https://spark.apache.org/docs/3.0.0/security.html]}} \{{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} \{{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 > Environment: {{Spark documentation mention this:}} > {{https://spark.apache.org/docs/3.0.0/security.html}} > {{}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > {{}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > \{{}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? 
There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > \{{}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36988) What chipers spark support for internode communication?
[ https://issues.apache.org/jira/browse/SPARK-36988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zoli updated SPARK-36988: - Environment: (was: {{Spark documentation mention this:}} {{https://spark.apache.org/docs/3.0.0/security.html}} {{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}}) > What chipers spark support for internode communication? > --- > > Key: SPARK-36988 > URL: https://issues.apache.org/jira/browse/SPARK-36988 > Project: Spark > Issue Type: Question > Components: Security >Affects Versions: 3.1.2 >Reporter: zoli >Priority: Minor > > {{Spark documentation mention this:}} > {{[https://spark.apache.org/docs/3.0.0/security.html]}} > \{{}} > {code:java} > spark.network.crypto.config.* > "Configuration values for the commons-crypto library, such as which cipher > implementations to use. The config name should be the name of commons-crypto > configuration without the commons.crypto prefix."{code} > {{What this means?}} > {{If I leave it to None what will happen? There won't be any encryption used > or will it fallback to some default one?}} > {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but > says nothing about the ciphers.}} > {{Is it supports everything the given JVM does?}} > {{The documentation is vague on this.}} > {{However the spark ui part for the security is clear:}} > \{{}} > {code:java} > ${ns}.enabledAlgorithms > A comma-separated list of ciphers. The specified ciphers must be supported by > JVM. The reference list of protocols can be found in the "JSSE Cipher Suite > Names" section of the Java security guide. The list for Java 8 can be found > at this page. Note: If not set, the default cipher suite for the JRE will be > used.{code} > {{ }} > {{So what will happen if I leave spark.network.crypto.config.* to None?}} > {{And what ciphers are supported?}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36988) What chipers spark support for internode communication?
zoli created SPARK-36988: Summary: What chipers spark support for internode communication? Key: SPARK-36988 URL: https://issues.apache.org/jira/browse/SPARK-36988 Project: Spark Issue Type: Question Components: Security Affects Versions: 3.1.2 Environment: {{Spark documentation mention this:}} {{https://spark.apache.org/docs/3.0.0/security.html}} {{}} {code:java} spark.network.crypto.config.* "Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of commons-crypto configuration without the commons.crypto prefix."{code} {{What this means?}} {{If I leave it to None what will happen? There won't be any encryption used or will it fallback to some default one?}} {{The common-crypto mentions that it uses JCE or OPENSSL implementations, but says nothing about the ciphers.}} {{Is it supports everything the given JVM does?}} {{The documentation is vague on this.}} {{However the spark ui part for the security is clear:}} {{}} {code:java} ${ns}.enabledAlgorithms A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of protocols can be found in the "JSSE Cipher Suite Names" section of the Java security guide. The list for Java 8 can be found at this page. Note: If not set, the default cipher suite for the JRE will be used.{code} {{ }} {{So what will happen if I leave spark.network.crypto.config.* to None?}} {{And what ciphers are supported?}} Reporter: zoli -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36983) ignoreCorruptFiles does not work when schema change from int to string
[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mike updated SPARK-36983: - Summary: ignoreCorruptFiles does not work when schema change from int to string (was: ignoreCorruptFiles does work when schema change from int to string) > ignoreCorruptFiles does not work when schema change from int to string > -- > > Key: SPARK-36983 > URL: https://issues.apache.org/jira/browse/SPARK-36983 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.1.2 >Reporter: mike >Priority: Major > > Precondition: > Folder A contains two parquet files: > * File 1: has some columns, one of which is column X with data type Int > * File 2: same schema as File 1, except column X has data type String > Read file 1 to get its schema. > Read folder A with the schema of file 1. > Expected: the read succeeds and file 2 is ignored, since the data type of > column X changed to string. > Actual: File 2 is not ignored and the read fails with: > `WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 > executor driver): java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427672#comment-17427672 ] Apache Spark commented on SPARK-36987: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34261 > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
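For readers unfamiliar with the statement the doc will cover, Spark SQL accepts a Hive-style FROM-first form; a small hedged example (the table name is made up, and the exact scope of the new page is defined in the linked PR):
{code:scala}
// Hedged example of the FROM-first statement form with a made-up temp view.
spark.range(3).createOrReplaceTempView("t")
spark.sql("FROM t SELECT id WHERE id > 0").show()
{code}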
[jira] [Assigned] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36987: Assignee: Apache Spark > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36987: Assignee: (was: Apache Spark) > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36987) Add Doc about FROM statement
[ https://issues.apache.org/jira/browse/SPARK-36987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427673#comment-17427673 ] Apache Spark commented on SPARK-36987: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34261 > Add Doc about FROM statement > > > Key: SPARK-36987 > URL: https://issues.apache.org/jira/browse/SPARK-36987 > Project: Spark > Issue Type: Task > Components: docs >Affects Versions: 3.2.1 >Reporter: angerszhu >Priority: Major > > Add Doc about FROM statement -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org