Marius Butan created SPARK-34042:
------------------------------------

             Summary: Column pruning is not working as expected for PERMISSIVE mode
                 Key: SPARK-34042
                 URL: https://issues.apache.org/jira/browse/SPARK-34042
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.7
            Reporter: Marius Butan
In PERMISSIVE mode:

Given a CSV with multiple columns, if the schema contains a single column and your SQL select filters on the corrupt-record column being null, the row is mapped as corrupted. But if you add an extra column to the CSV schema and do not select that column in SQL, the row is not marked as corrupted.

PS. I don't know exactly what the right behaviour is; I didn't find documentation for PERMISSIVE mode. What I did find is:

{quote}As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.{quote}

https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html

I made a "unit" test in order to exemplify the issue:
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
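To make the interaction concrete, here is a minimal toy model (plain Python, not Spark itself) of how column pruning can change whether a row counts as malformed: if a column is pruned it is never parsed, so a missing value in it can no longer flag the row. The `parse_row` helper and its behavior are illustrative assumptions sketching the idea from the migration-guide example above, not Spark's actual parser.

```python
def parse_row(line, schema, requested):
    """Parse a CSV line against `schema`, but only materialize the
    `requested` columns (mimicking column pruning).
    Returns (record, corrupt)."""
    tokens = line.split(",")
    record = {}
    corrupt = False
    for i, col in enumerate(schema):
        if col not in requested:
            continue  # pruned: never parsed, so it cannot flag the row
        if i < len(tokens) and tokens[i] != "":
            record[col] = tokens[i]
        else:
            record[col] = None
            corrupt = True  # missing value for a column we had to parse
    return record, corrupt

# CSV header "id,name", data row "1234": the `name` value is missing.
schema = ["id", "name"]
# Selecting both columns: the missing `name` marks the row corrupt.
_, corrupt_full = parse_row("1234", schema, requested={"id", "name"})
# Selecting only `id` (pruning `name`): the row parses cleanly.
_, corrupt_pruned = parse_row("1234", schema, requested={"id"})
print(corrupt_full, corrupt_pruned)  # True False
```

Under this model the same physical row is corrupt or clean depending purely on which columns the query touches, which matches the 2.3-vs-2.4 difference described in the migration guide.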