[jira] [Commented] (SPARK-29068) CSV read reports incorrect row count
[ https://issues.apache.org/jira/browse/SPARK-29068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929158#comment-16929158 ]

Sandeep Katta commented on SPARK-29068:
---------------------------------------

Looks similar to [SPARK-29058|https://issues.apache.org/jira/browse/SPARK-29058]; anyway, I will look into this issue.

> CSV read reports incorrect row count
> ------------------------------------
>
>                 Key: SPARK-29068
>                 URL: https://issues.apache.org/jira/browse/SPARK-29068
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Thomas Diesler
>            Priority: Major
>
> Reading the [SFNY example data|https://github.com/jadeyee/r2d3-part-1-data/blob/master/part_1_data.csv] in Java like this ...
> {code:java}
> Path srcdir = Paths.get("src/test/resources");
> Path inpath = srcdir.resolve("part_1_data.csv");
> SparkSession session = getOrCreateSession();
> Dataset<Row> dataset = session.read()
>         //.option("header", true)
>         .option("mode", "DROPMALFORMED")
>         .schema(new StructType()
>                 .add("insf", IntegerType, false)
>                 .add("beds", DoubleType, false)
>                 .add("baths", DoubleType, false)
>                 .add("price", IntegerType, false)
>                 .add("year", IntegerType, false)
>                 .add("sqft", IntegerType, false)
>                 .add("prcsqft", IntegerType, false)
>                 .add("elevation", IntegerType, false))
>         .csv(inpath.toString());
> {code}
> ... incorrectly reports 495 instead of 492 rows. It seems to include the three header rows in the count.
> Also, without DROPMALFORMED it creates 495 rows, three of which are all-null rows. This also seems incorrect, because the schema explicitly declares every field as non-nullable.
> This code works fine with Spark 2.1.0.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
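The expected count of 492 can be sanity-checked without Spark: a row is "malformed" against the declared schema when any of its fields fails numeric conversion, which is exactly what happens to the three header/comment lines at the top of the file. A minimal stdlib-only sketch of that check (the sample lines below are hypothetical stand-ins mimicking the shape of part_1_data.csv, not its real contents, and `isWellFormed` is an illustrative helper, not Spark's parser):

```java
import java.util.Arrays;
import java.util.List;

public class CsvRowCount {

    // True if the line matches the declared schema:
    // int, double, double, int, int, int, int, int (8 columns).
    static boolean isWellFormed(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 8) return false;
        try {
            Integer.parseInt(f[0].trim());
            Double.parseDouble(f[1].trim());
            Double.parseDouble(f[2].trim());
            for (int i = 3; i < 8; i++) {
                Integer.parseInt(f[i].trim());
            }
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: two comment lines and a header line,
        // followed by data rows, as in the SFNY file layout.
        List<String> lines = Arrays.asList(
                "Data line one (comment)",
                "Data line two (comment)",
                "in_sf,beds,bath,price,year_built,sqft,price_per_sqft,elevation",
                "0,2.0,1.0,999000,1960,1000,999,10",
                "1,3.0,2.0,1250000,1980,1500,833,25");

        long wellFormed = lines.stream().filter(CsvRowCount::isWellFormed).count();
        System.out.println("well-formed rows: " + wellFormed); // → well-formed rows: 2
    }
}
```

Under these rules the three non-data lines fail conversion and are excluded, which matches the 495 − 3 = 492 arithmetic in the report; a DROPMALFORMED read that still returns 495 suggests those lines were not treated as malformed in 2.4.4.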
[jira] [Commented] (SPARK-29068) CSV read reports incorrect row count
[ https://issues.apache.org/jira/browse/SPARK-29068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928686#comment-16928686 ]

HondaWei commented on SPARK-29068:
----------------------------------

[~tdiesler], could you provide your test CSV data? Thanks!