[ https://issues.apache.org/jira/browse/SPARK-22580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262725#comment-16262725 ]
Hyukjin Kwon commented on SPARK-22580:
--------------------------------------

There was a limitation and discussion about it. I think it is fixed in https://github.com/apache/spark/pull/19199, which also documents a workaround. Please refer to the discussion in https://github.com/apache/spark/pull/18865. Let me leave this as a duplicate of SPARK-21610.

> Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-22580
>                 URL: https://issues.apache.org/jira/browse/SPARK-22580
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.0
>        Environment: Same behavior on Debian and MS Windows (8.1) systems. JRE 1.8
>            Reporter: Florian Kaspar
>
> It seems that filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid row that was routed to this special column.
> Filtering on members of the actual schema works fine and yields correct counts.
> Input CSV example:
> {noformat}
> val1, cat1, 1.337
> val2, cat1, 1.337
> val3, cat2, 42.0
> some, invalid, line
> {noformat}
> Code snippet:
> {code:java}
> StructType schema = new StructType(new StructField[] {
>         new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
>         new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())});
> Dataset<Row> csv = sqlContext.read()
>         .option("header", "false")
>         .option("parserLib", "univocity")
>         .option("mode", "PERMISSIVE")
>         .option("maxCharsPerColumn", 10000000)
>         .option("ignoreLeadingWhiteSpace", "false")
>         .option("ignoreTrailingWhiteSpace", "false")
>         .option("comment", null)
>         .option("columnNameOfCorruptRecord", "FALLBACK")
>         .schema(schema)
>         .csv(path/to/csv/file);
> long validCount = csv.filter("FALLBACK IS NULL").count();
> long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
> {code}
> Expected:
> validCount is 3
> invalidCount is 1
> Actual:
> validCount is 4
> invalidCount is 0
> Caching the CSV after load solves the problem and shows the correct counts.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
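The expected PERMISSIVE-mode semantics can be illustrated with a minimal, self-contained sketch. This is not Spark code: the `parse_permissive` function, the hard-coded schema (s1: string, s2: string, d1: double), and the in-memory CSV sample are all illustrative stand-ins for what the reporter's snippet does, kept stdlib-only so it runs without a Spark cluster.

```python
# Minimal stdlib sketch (NOT Spark): illustrates the PERMISSIVE-mode counts
# the reporter expects. Rows matching the 3-column schema parse normally;
# malformed rows land whole in a "FALLBACK" corrupt-record column,
# mirroring columnNameOfCorruptRecord.
import csv
import io

CSV_DATA = """\
val1, cat1, 1.337
val2, cat1, 1.337
val3, cat2, 42.0
some, invalid, line
"""

def parse_permissive(text):
    """Parse rows against an assumed schema (s1: str, s2: str, d1: float).

    Rows that fail to conform keep their raw line under 'FALLBACK'.
    """
    rows = []
    for line, fields in zip(text.splitlines(), csv.reader(io.StringIO(text))):
        fields = [f.strip() for f in fields]
        row = {"s1": None, "s2": None, "d1": None, "FALLBACK": None}
        try:
            if len(fields) != 3:
                raise ValueError("wrong number of columns")
            row.update(s1=fields[0], s2=fields[1], d1=float(fields[2]))
        except ValueError:
            row["FALLBACK"] = line  # corrupt record retains the raw input
        rows.append(row)
    return rows

rows = parse_permissive(CSV_DATA)
valid_count = sum(1 for r in rows if r["FALLBACK"] is None)
invalid_count = sum(1 for r in rows if r["FALLBACK"] is not None)
print(valid_count, invalid_count)  # prints: 3 1
```

Here the fourth line fails the `float("line")` conversion, so it is counted as invalid, matching the reporter's expectation (validCount 3, invalidCount 1) rather than the buggy uncached result (4 and 0).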