[ https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265022#comment-17265022 ]
Marius Butan commented on SPARK-34042:
--------------------------------------

As I said in the description, I wrote some tests in [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java].

All the tests run on a file with 3 columns:

withPruningEnabledAndMapSingleColumn -> pruning enabled, a single column in the schema, and that column is used in the select query: the expected result is 2, but the query returns 0
withPruningEnabledAndMap2ColumnsButUse1InSql -> pruning enabled, 2 columns in the schema, and only 1 of them is used in the select: the expected result is 2, and it is correct
withPruningDisableAndMap2ColumnsButUse1InSql -> pruning disabled: it works correctly

> Column pruning is not working as expected for PERMISIVE mode
> ------------------------------------------------------------
>
>                 Key: SPARK-34042
>                 URL: https://issues.apache.org/jira/browse/SPARK-34042
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.7
>            Reporter: Marius Butan
>            Priority: Major
>
> In PERMISSIVE mode:
> Given a CSV file with multiple columns per row, if your file schema has a
> single column and you run a SQL SELECT with a condition like
> '<corrupt_record_column_name> is null', the row is marked as corrupted.
>
> BUT if you add an extra column to the file schema and do not use that
> column in the SQL SELECT, the row is not marked as corrupted.
>
> PS. I don't know exactly what the right behavior is; I didn't find it
> documented for PERMISSIVE mode.
> What I found is: As an example, CSV file contains the "id,name" header and
> one row "1234". In Spark 2.4, the selection of the id column consists of a
> row with one column value 1234 but in Spark 2.3 and earlier, it is empty in
> the DROPMALFORMED mode. To restore the previous behavior, set
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>
> I wrote a "unit" test to illustrate the issue:
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
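To make the migration guide example quoted in the issue description concrete, here is a small toy model in plain Java. It is not Spark code and does not reproduce Spark's parser: the class `PruningToyModel` and the helper `parse` are hypothetical names introduced only for illustration. It assumes that, with pruning enabled, only the selected columns are checked, and that, with pruning disabled, a row whose token count differs from the schema is treated as malformed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PruningToyModel {

    // Hypothetical helper, not a Spark API. Parses one CSV line against a schema
    // and returns the values of the selected columns, or null when the row is
    // treated as malformed (which DROPMALFORMED mode would drop).
    static List<String> parse(String line, List<String> schema,
                              List<String> selected, boolean pruning) {
        String[] tokens = line.split(",", -1);

        // Assumption: without pruning, the whole row is validated against the schema.
        if (!pruning && tokens.length != schema.size()) {
            return null;
        }

        // Assumption: with pruning, only the selected columns have to be present.
        List<String> out = new ArrayList<>();
        for (String col : selected) {
            int idx = schema.indexOf(col);
            if (idx < 0 || idx >= tokens.length) {
                return null; // a selected column is missing from the row
            }
            out.add(tokens[idx]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "name"); // header "id,name"
        List<String> selected = Arrays.asList("id");       // SELECT id

        // Pruning enabled (Spark 2.4 behavior per the guide): the row is kept.
        System.out.println(parse("1234", schema, selected, true));  // prints [1234]

        // Pruning disabled (Spark 2.3 behavior per the guide): the row is dropped.
        System.out.println(parse("1234", schema, selected, false)); // prints null
    }
}
```

The model only mirrors the DROPMALFORMED example from the guide. Whether PERMISSIVE mode should mark a row as corrupted under the same conditions is exactly the open question in this issue.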