[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields
[ https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1758#comment-1758 ]

Hyukjin Kwon commented on SPARK-25545:
--------------------------------------

See the discussion at https://github.com/apache/spark/pull/17293. Eventually these shouldn't be implicitly converted, or at the very least the conversion should be justified by a coherent reason.

> CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not
> conform to non-nullable schema fields
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-25545
>                 URL: https://issues.apache.org/jira/browse/SPARK-25545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2
>            Reporter: Steven Bakhtiari
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> I'm loading a CSV file into a dataframe using Spark. I have defined a schema
> and specified one of the fields as non-nullable.
> When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with
> missing (null) values for those columns to result in the whole row being
> dropped. At the moment, the CSV loader correctly drops rows that do not
> conform to the field type, but the nullable property is seemingly ignored.
>
> Example CSV input:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> 1,2,abc
> {code}
> Example Spark job:
> {code:java}
> val spark = SparkSession
>   .builder()
>   .appName("csv-test")
>   .master("local")
>   .getOrCreate()
>
> spark.read
>   .format("csv")
>   .schema(StructType(
>     StructField("col1", IntegerType, nullable = false) ::
>     StructField("col2", IntegerType, nullable = false) ::
>     StructField("col3", IntegerType, nullable = false) :: Nil))
>   .option("header", false)
>   .option("mode", "DROPMALFORMED")
>   .load("path/to/file.csv")
>   .coalesce(1)
>   .write
>   .format("csv")
>   .option("header", false)
>   .save("path/to/output")
> {code}
> The actual output will be:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> {code}
> Note that the row containing non-integer values has been dropped, as
> expected, but rows containing null values persist, despite the nullable
> property being set to false in the schema definition.
> My expected output is:
> {code:java}
> 1,2,3
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
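[Editor's note: until the reader honors non-nullable fields, one possible workaround, sketched here against the repro above and not part of the original report, is to declare the columns nullable and drop null-bearing rows explicitly after the read. {{na.drop}} with a column list is a public DataFrame API; the path and column names are taken from the example job.]

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().appName("csv-test").master("local").getOrCreate()

// Declare the columns nullable (Spark treats the supplied schema as nullable
// anyway), then drop any row with a null in one of the three columns.
val schema = StructType(
  StructField("col1", IntegerType, nullable = true) ::
  StructField("col2", IntegerType, nullable = true) ::
  StructField("col3", IntegerType, nullable = true) :: Nil)

spark.read
  .format("csv")
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
  .na.drop(Seq("col1", "col2", "col3"))  // removes the 1,,3 and ,2,3 rows
{code}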
[ https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630087#comment-16630087 ]

Steven Bakhtiari commented on SPARK-25545:
------------------------------------------

Hi [~hyukjin.kwon], is there any rationale for why the schema is converted into a nullable one? If there's a good reason, I'd like to understand what it is (assuming this is being done deliberately, as opposed to being an oversight or bug).
[ https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630050#comment-16630050 ]

Hyukjin Kwon commented on SPARK-25545:
--------------------------------------

The problem is that we convert the user-supplied schema into a nullable one. This is a duplicate of SPARK-20457.
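[Editor's note: the implicit conversion described in this comment amounts to Spark rebuilding the user-supplied schema with every field forced to nullable before the scan. The sketch below illustrates the effect using only the public {{StructType}} API; it is not the actual internal code path.]

{code:java}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val userSchema = StructType(
  StructField("col1", IntegerType, nullable = false) :: Nil)

// What the reader effectively works with: the same fields, nullability discarded.
val effectiveSchema = StructType(userSchema.fields.map(_.copy(nullable = true)))

// effectiveSchema("col1").nullable is now true, so a null cell in col1 is no
// longer treated as malformed and DROPMALFORMED keeps the row.
{code}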
[ https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628949#comment-16628949 ]

Steven Bakhtiari commented on SPARK-25545:
------------------------------------------

Somebody on Stack Overflow pointed me to an older ticket that appears to touch on the same issue: SPARK-10848.