[ https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marius Butan updated SPARK-34042:
---------------------------------

Description:

In PERMISSIVE mode:

Given a CSV with multiple columns per row, if the file schema has a single column and you run a SQL SELECT with a condition like '<corrupt_record_column_name> is null', the row is marked as corrupted.

BUT if you add an extra column to the file schema and leave that column out of the SQL SELECT, the row is not marked as corrupted.

PS. I don't know exactly what the right behavior is; I could not find it documented for PERMISSIVE mode. What I found is this note in the 2.4 migration guide: As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, the selection of the id column consists of a row with one column value 1234, but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

I made a "unit" test to exemplify the issue (a minimal sketch of the same reproduction follows below):
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
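A minimal, self-contained sketch of the reproduction described above (not the reporter's actual test). It assumes a two-column input file /tmp/data.csv, e.g. a single row "1,foo"; the class name, file path, and column names are illustrative.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PermissiveModeRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("SPARK-34042 repro")
                .getOrCreate();

        // Case 1: schema declares a single data column plus the corrupt-record
        // column. Each file row has two tokens, so in PERMISSIVE mode the row
        // is treated as malformed and _corrupt_record is populated.
        StructType narrowSchema = new StructType()
                .add("a", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);

        Dataset<Row> narrow = spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(narrowSchema)
                .csv("/tmp/data.csv");
        narrow.createOrReplaceTempView("narrow");
        // Returns no rows: every row is flagged as corrupted.
        spark.sql("SELECT a FROM narrow WHERE _corrupt_record IS NULL").show();

        // Case 2: schema also declares the second column, but the query does
        // not select it. With column pruning enabled (the 2.4 default), the
        // parser only materializes column "a", the row parses cleanly, and
        // _corrupt_record stays null -- so the same rows now come back.
        StructType wideSchema = new StructType()
                .add("a", DataTypes.StringType)
                .add("b", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);

        Dataset<Row> wide = spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(wideSchema)
                .csv("/tmp/data.csv");
        wide.createOrReplaceTempView("wide");
        spark.sql("SELECT a FROM wide WHERE _corrupt_record IS NULL").show();

        spark.stop();
    }
}
{code}

Per the migration guide excerpt above, calling spark.conf().set("spark.sql.csv.parser.columnPruning.enabled", "false") before the second read should make both cases flag the row as corrupted again.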
> Column pruning is not working as expected for PERMISSIVE mode
> -------------------------------------------------------------
>
>                 Key: SPARK-34042
>                 URL: https://issues.apache.org/jira/browse/SPARK-34042
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.7
>            Reporter: Marius Butan
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org