[ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-34042:
---------------------------------
    Description: 
In PERMISSIVE mode:

Given a CSV whose rows contain multiple columns, if the file schema declares only a single 
column and you run a SQL SELECT with a condition like 
'<corrupt_record_field_name> is null', the row is marked as corrupted.

 

BUT if you add an extra column to the file schema and leave that column out of the 
SQL SELECT, the row is not marked as corrupted.
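
A minimal sketch of the two cases (the file name, the example row contents, and the 
session setup are my own illustration, not taken from the linked test; the expected 
results are the ones reported above):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PermissiveModeRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("SPARK-34042-repro")
                .getOrCreate();

        // Assumed input file: each data row carries more columns than the
        // declared data schema, e.g. a single line "1,john,extra".

        // Case 1: a single data column plus the corrupt-record column.
        StructType narrowSchema = new StructType()
                .add("id", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);
        Dataset<Row> narrow = spark.read()
                .schema(narrowSchema)
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .csv("input.csv");
        narrow.createOrReplaceTempView("narrow");
        // Reported result: _corrupt_record is populated, so this returns no rows.
        spark.sql("SELECT id FROM narrow WHERE _corrupt_record is null").show();

        // Case 2: same file, but the schema declares one extra column ("name")
        // that the SELECT below never references.
        StructType wideSchema = new StructType()
                .add("id", DataTypes.StringType)
                .add("name", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);
        Dataset<Row> wide = spark.read()
                .schema(wideSchema)
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .csv("input.csv");
        wide.createOrReplaceTempView("wide");
        // Reported result: the unused "name" column is pruned away and the
        // same row is no longer marked as corrupted.
        spark.sql("SELECT id FROM wide WHERE _corrupt_record is null").show();
    }
}
{code}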

 

PS. I don't know exactly what the right behavior should be; I could not find any 
documentation for PERMISSIVE mode.

What I did find is this passage in the Spark 2.4 migration guide (linked below): As an 
example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, the 
selection of the id column consists of a row with one column value 1234, but in Spark 
2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous 
behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
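
That flag can be set when the session is built; a minimal sketch, assuming a local 
session and assuming this flag is the relevant knob for the behavior described above:

{code:java}
import org.apache.spark.sql.SparkSession;

public class DisableCsvColumnPruning {
    public static void main(String[] args) {
        // Migration-guide workaround: disable CSV column pruning, which per the
        // guide restores the pre-2.4 behavior.
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .config("spark.sql.csv.parser.columnPruning.enabled", "false")
                .getOrCreate();

        // The same flag can also be flipped on an existing session:
        spark.conf().set("spark.sql.csv.parser.columnPruning.enabled", "false");
    }
}
{code}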

 

I wrote a "unit" test that demonstrates the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 

> Column pruning is not working as expected for PERMISSIVE mode
> -------------------------------------------------------------
>
>                 Key: SPARK-34042
>                 URL: https://issues.apache.org/jira/browse/SPARK-34042
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.7
>            Reporter: Marius Butan
>            Priority: Major
>



