[ https://issues.apache.org/jira/browse/SPARK-29068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929158#comment-16929158 ]
Sandeep Katta commented on SPARK-29068:
---------------------------------------

Looks similar to [SPARK-29058|https://issues.apache.org/jira/browse/SPARK-29058]; anyway, I will look into this issue.

> CSV read reports incorrect row count
> ------------------------------------
>
>                 Key: SPARK-29068
>                 URL: https://issues.apache.org/jira/browse/SPARK-29068
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Thomas Diesler
>            Priority: Major
>
> Reading the [SFNY example data|https://github.com/jadeyee/r2d3-part-1-data/blob/master/part_1_data.csv] in Java like this ...
> {code:java}
> Path srcdir = Paths.get("src/test/resources");
> Path inpath = srcdir.resolve("part_1_data.csv");
> SparkSession session = getOrCreateSession();
> Dataset<Row> dataset = session.read()
>         //.option("header", true)
>         .option("mode", "DROPMALFORMED")
>         .schema(new StructType()
>                 .add("insf", IntegerType, false)
>                 .add("beds", DoubleType, false)
>                 .add("baths", DoubleType, false)
>                 .add("price", IntegerType, false)
>                 .add("year", IntegerType, false)
>                 .add("sqft", IntegerType, false)
>                 .add("prcsqft", IntegerType, false)
>                 .add("elevation", IntegerType, false))
>         .csv(inpath.toString());
> {code}
> ... incorrectly reports 495 rows instead of 492. It seems to include the three header rows in the count.
> Also, without DROPMALFORMED it creates 495 rows, three of which consist of null values. This also seems incorrect, because the schema explicitly declares all fields as non-nullable.
> This code works fine with Spark 2.1.0.

--
This message was sent by Atlassian Jira (v8.3.2#803003)
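The behavior the reporter expects from DROPMALFORMED can be illustrated with a plain-Python sketch (no Spark): rows that fail to parse under the declared schema should be dropped from the count, so header lines must not be counted. The inline sample data below is hypothetical, mimicking the layout described in the report (three header/comment lines followed by data rows); `well_formed` is a helper invented for this sketch, not a Spark API.

```python
import csv
import io

# Hypothetical sample mimicking the reported file layout: three
# non-data header/comment lines, then rows matching the declared
# schema (int, double, double, int, int, int, int, int). The real
# file has 492 data rows; this sketch uses 4 for brevity.
sample = """\
Note: header line one
in_sf,beds,bath,price,year_built,sqft,price_per_sqft,elevation
another header line
0,2.0,1.0,999000,1960,1000,999,10
1,3.0,2.0,1250000,1980,1500,833,9
0,2.0,2.0,755000,2006,1045,722,106
1,1.0,1.0,450000,1993,850,529,76
"""

# Field parsers mirroring the StructType in the bug report.
parsers = [int, float, float, int, int, int, int, int]

def well_formed(row):
    """A row is well-formed if it has the right arity and every field
    parses under the schema -- what DROPMALFORMED should keep."""
    if len(row) != len(parsers):
        return False
    try:
        for parse, field in zip(parsers, row):
            parse(field)
        return True
    except ValueError:
        return False

rows = list(csv.reader(io.StringIO(sample)))
kept = [r for r in rows if well_formed(r)]

print(len(rows))  # 7: every line, headers included (the buggy count)
print(len(kept))  # 4: data rows only (the expected count)
```

Against the real file, the analogous counts would be 495 versus 492, matching the discrepancy reported above.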