[ https://issues.apache.org/jira/browse/SPARK-36983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
mike updated SPARK-36983:
-------------------------
Description:

Precondition: Spark 3.1 run locally on my MacBook Pro (16 GB RAM, i7, 2015).

Folder A contains two Parquet files:
 * File 1: has several columns, one of which, C1, has data type Int; the file holds only one record.
 * File 2: same schema as File 1, except that column C1 has data type String; the file holds >= X records.

X depends on the capacity of your computer; in my case it is 36. You can increase the number of rows to find X.

Read file 1 to get the schema of file 1, then read folder A with the schema of file 1.

Expected: the read succeeds and file 2 is ignored, because the data type of column C1 changed to String.

Actual: file 2 is not ignored and the read fails with:

{noformat}
WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (192.168.1.78 executor driver): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
	at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:45)
{noformat}

If I remove one record from file 2, it works well.

{code:python}
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Skip files that cannot be read instead of failing the whole query
spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)

# File 1: client_sk is an Int, single record
schema1 = StructType([
    StructField("program_sk", IntegerType(), True),
    StructField("client_sk", IntegerType(), True),
])
sample_data = [(1, 17)]
df1 = spark.createDataFrame(sample_data, schema1)

# File 2: same columns, but client_sk is a String, 36 records
schema2 = StructType([
    StructField("program_sk", IntegerType(), True),
    StructField("client_sk", StringType(), True),
])
sample_data = [(1, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (2, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (3, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (4, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (5, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (6, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (7, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (8, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               (9, "19999"), (2, "3332"), (3, "199999"), (4, "3333"),
               ]
df2 = spark.createDataFrame(sample_data, schema2)

file_save_path = 's3://xxx-data-dev/adp_data_lake/test_ignore_corrupt/'

df1.write \
    .mode('overwrite') \
    .format('parquet') \
    .save(f'{file_save_path}')

df2.write \
    .mode('append') \
    .format('parquet') \
    .save(f'{file_save_path}')

# Read the folder with File 1's schema; File 2 should be skipped as corrupt
df = spark.read.schema(schema1).parquet(file_save_path)
df.show()
{code}
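For the "read file 1 to get the schema of file 1" step, an equivalent minimal sketch is below; file1_path is a hypothetical path to whichever part file File 1 produced:

{code:python}
# Hypothetical: infer the schema from File 1 alone, then apply it to the folder
file1_schema = spark.read.parquet(file1_path).schema
df = spark.read.schema(file1_schema).parquet(file_save_path)
{code}

Since X varies with the machine (36 in my case), one way to find it is to grow file 2 until the read starts failing. The following is a minimal sketch of such a probe; find_threshold, the local path in the usage note, and the coalesce(1) calls (to force a single Parquet file per write) are illustrative assumptions, not part of the repro above:

{code:python}
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def find_threshold(spark, path, max_rows=200):
    """Return the smallest record count in file 2 that makes the read fail."""
    schema_int = StructType([
        StructField("program_sk", IntegerType(), True),
        StructField("client_sk", IntegerType(), True),
    ])
    schema_str = StructType([
        StructField("program_sk", IntegerType(), True),
        StructField("client_sk", StringType(), True),
    ])
    spark.conf.set('spark.sql.files.ignoreCorruptFiles', True)
    for n in range(1, max_rows + 1):
        # File 1: one record, client_sk as Int
        spark.createDataFrame([(1, 17)], schema_int) \
            .coalesce(1).write.mode('overwrite').parquet(path)
        # File 2: n records, client_sk as String, appended to the same folder
        spark.createDataFrame([(i, str(i)) for i in range(n)], schema_str) \
            .coalesce(1).write.mode('append').parquet(path)
        try:
            # Below the threshold this succeeds and file 2 is skipped
            spark.read.schema(schema_int).parquet(path).collect()
        except Exception:
            return n  # first record count at which the error appears
    return None
{code}

On my setup, find_threshold(spark, '/tmp/test_ignore_corrupt') would be expected to return a value around 36, but the exact number is machine-dependent, as noted above.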
> ignoreCorruptFiles does not work when the schema changes from int to string
> and a file has more than X records
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-36983
>                 URL: https://issues.apache.org/jira/browse/SPARK-36983
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: mike
>            Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org