Marius Butan created SPARK-34042:
------------------------------------

             Summary: Column pruning is not working as expected for PERMISSIVE 
mode
                 Key: SPARK-34042
                 URL: https://issues.apache.org/jira/browse/SPARK-34042
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.7
            Reporter: Marius Butan


In PERMISSIVE mode:

Given a CSV file whose rows contain multiple columns, if the schema declares a 
single data column and the SQL query filters on the corrupt-record column 
being null, the row is mapped as corrupted.

BUT if you add an extra column to the CSV schema and do not select that column 
in SQL, the row is no longer mapped as corrupted (see the sketch below).

 
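A minimal sketch of the scenario (the file name, column names, and the 
two-value data rows are my assumptions, not copied from the linked test):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ColumnPruningRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("column-pruning-repro")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input: each CSV row has two values, e.g. "a,b",
        // but the schema below declares only one data column.
        StructType schema = new StructType()
                .add("col1", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);

        Dataset<Row> df = spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(schema)
                .csv("input.csv");
        df.createOrReplaceTempView("t");

        // With the one-column schema, the rows come back with
        // _corrupt_record populated, so this query returns nothing.
        // Declaring a second data column in the schema, while still
        // selecting only col1 here, makes the same rows pass the filter.
        spark.sql("SELECT col1 FROM t WHERE _corrupt_record IS NULL").show();
    }
}
{code}

With the schema as written, the query returns no rows; adding an extra data 
column to the schema (without selecting it) makes the same rows come back as 
non-corrupted.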

PS. I don't know exactly what the right behaviour is; I couldn't find 
documentation for PERMISSIVE mode. What I did find is:

{quote}As an example, CSV file contains the "id,name" header and one row 
"1234". In Spark 2.4, selection of the id column consists of a row with one 
column value 1234 but in Spark 2.3 and earlier it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.{quote}

 

https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html
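The workaround mentioned in the migration guide can also be applied at 
runtime, e.g.:

{code:java}
// Restore the pre-2.4 column pruning behaviour described in the
// migration guide (assumes an existing SparkSession named "spark").
spark.conf().set("spark.sql.csv.parser.columnPruning.enabled", "false");
{code}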

 

I wrote a "unit" test that reproduces the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
