[ https://issues.apache.org/jira/browse/SPARK-29068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929158#comment-16929158 ]
Sandeep Katta commented on SPARK-29068:
---------------------------------------

Looks similar to [SPARK-29058|https://issues.apache.org/jira/browse/SPARK-29058]; anyway, I will look into this issue.

> CSV read reports incorrect row count
> ------------------------------------
>
>                 Key: SPARK-29068
>                 URL: https://issues.apache.org/jira/browse/SPARK-29068
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Thomas Diesler
>            Priority: Major
>
> Reading the [SFNY example data|https://github.com/jadeyee/r2d3-part-1-data/blob/master/part_1_data.csv] in Java like this ...
> {code:java}
> Path srcdir = Paths.get("src/test/resources");
> Path inpath = srcdir.resolve("part_1_data.csv");
> SparkSession session = getOrCreateSession();
> Dataset<Row> dataset = session.read()
>         //.option("header", true)
>         .option("mode", "DROPMALFORMED")
>         .schema(new StructType()
>                 .add("insf", IntegerType, false)
>                 .add("beds", DoubleType, false)
>                 .add("baths", DoubleType, false)
>                 .add("price", IntegerType, false)
>                 .add("year", IntegerType, false)
>                 .add("sqft", IntegerType, false)
>                 .add("prcsqft", IntegerType, false)
>                 .add("elevation", IntegerType, false))
>         .csv(inpath.toString());
> {code}
> ... incorrectly reports 495 rows instead of 492. It seems to include the three header rows in the count.
> Also, without DROPMALFORMED it creates 495 rows, three of which consist of null values. This also seems incorrect, because the schema explicitly declares all fields as non-nullable.
> This code works fine with Spark 2.1.0.

--
This message was sent by Atlassian Jira (v8.3.2#803003)
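The behavior the reporter expects from DROPMALFORMED can be illustrated with a plain-Python sketch (no Spark): rows that fail to parse under the declared schema should be dropped from the count, so header lines must not be counted. The inline sample data below is hypothetical, mimicking the layout described in the report (three header/comment lines followed by data rows); `well_formed` is a helper invented for this sketch, not a Spark API.

```python
import csv
import io

# Hypothetical sample mimicking the reported file layout: three
# non-data header/comment lines, then rows matching the declared
# schema (int, double, double, int, int, int, int, int). The real
# file has 492 data rows; this sketch uses 4 for brevity.
sample = """\
Note: header line one
in_sf,beds,bath,price,year_built,sqft,price_per_sqft,elevation
another header line
0,2.0,1.0,999000,1960,1000,999,10
1,3.0,2.0,1250000,1980,1500,833,9
0,2.0,2.0,755000,2006,1045,722,106
1,1.0,1.0,450000,1993,850,529,76
"""

# Field parsers mirroring the StructType in the bug report.
parsers = [int, float, float, int, int, int, int, int]

def well_formed(row):
    """A row is well-formed if it has the right arity and every field
    parses under the schema -- what DROPMALFORMED should keep."""
    if len(row) != len(parsers):
        return False
    try:
        for parse, field in zip(parsers, row):
            parse(field)
        return True
    except ValueError:
        return False

rows = list(csv.reader(io.StringIO(sample)))
kept = [r for r in rows if well_formed(r)]

print(len(rows))  # 7: every line, headers included (the buggy count)
print(len(kept))  # 4: data rows only (the expected count)
```

Against the real file, the analogous counts would be 495 versus 492, matching the discrepancy reported above.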