[ https://issues.apache.org/jira/browse/SPARK-22580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262725#comment-16262725 ]
Hyukjin Kwon commented on SPARK-22580:
--------------------------------------

There was a limitation and discussion about it. I think it is fixed in https://github.com/apache/spark/pull/19199, which also documents a workaround. Please refer to the discussion in https://github.com/apache/spark/pull/18865. Let me leave this as a duplicate of SPARK-21610.

> Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-22580
>                 URL: https://issues.apache.org/jira/browse/SPARK-22580
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.0
>        Environment: Same behavior on Debian and MS Windows (8.1) systems. JRE 1.8
>            Reporter: Florian Kaspar
>
> It seems that filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid row that was routed to this special column.
> Filtering on members of the actual schema works fine and yields correct counts.
> Input CSV example:
> {noformat}
> val1, cat1, 1.337
> val2, cat1, 1.337
> val3, cat2, 42.0
> some, invalid, line
> {noformat}
> Code snippet:
> {code:java}
> StructType schema = new StructType(new StructField[] {
>         new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
>         new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())});
> Dataset<Row> csv = sqlContext.read()
>         .option("header", "false")
>         .option("parserLib", "univocity")
>         .option("mode", "PERMISSIVE")
>         .option("maxCharsPerColumn", 10000000)
>         .option("ignoreLeadingWhiteSpace", "false")
>         .option("ignoreTrailingWhiteSpace", "false")
>         .option("comment", null)
>         .option("columnNameOfCorruptRecord", "FALLBACK")
>         .schema(schema)
>         .csv(path/to/csv/file);
> long validCount = csv.filter("FALLBACK IS NULL").count();
> long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
> {code}
> Expected:
> validCount is 3
> invalidCount is 1
> Actual:
> validCount is 4
> invalidCount is 0
> Caching the CSV after load solves the problem and shows the correct counts.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
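The expected PERMISSIVE-mode semantics can be illustrated with a minimal, self-contained sketch. This is not Spark code: the `parse_permissive` function, the hard-coded schema (s1: string, s2: string, d1: double), and the in-memory CSV sample are all illustrative stand-ins for what the reporter's snippet does, kept stdlib-only so it runs without a Spark cluster.

```python
# Minimal stdlib sketch (NOT Spark): illustrates the PERMISSIVE-mode counts
# the reporter expects. Rows matching the 3-column schema parse normally;
# malformed rows land whole in a "FALLBACK" corrupt-record column,
# mirroring columnNameOfCorruptRecord.
import csv
import io

CSV_DATA = """\
val1, cat1, 1.337
val2, cat1, 1.337
val3, cat2, 42.0
some, invalid, line
"""

def parse_permissive(text):
    """Parse rows against an assumed schema (s1: str, s2: str, d1: float).

    Rows that fail to conform keep their raw line under 'FALLBACK'.
    """
    rows = []
    for line, fields in zip(text.splitlines(), csv.reader(io.StringIO(text))):
        fields = [f.strip() for f in fields]
        row = {"s1": None, "s2": None, "d1": None, "FALLBACK": None}
        try:
            if len(fields) != 3:
                raise ValueError("wrong number of columns")
            row.update(s1=fields[0], s2=fields[1], d1=float(fields[2]))
        except ValueError:
            row["FALLBACK"] = line  # corrupt record retains the raw input
        rows.append(row)
    return rows

rows = parse_permissive(CSV_DATA)
valid_count = sum(1 for r in rows if r["FALLBACK"] is None)
invalid_count = sum(1 for r in rows if r["FALLBACK"] is not None)
print(valid_count, invalid_count)  # prints: 3 1
```

Here the fourth line fails the `float("line")` conversion, so it is counted as invalid, matching the reporter's expectation (validCount 3, invalidCount 1) rather than the buggy uncached result (4 and 0).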