[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields

2018-10-28 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1758#comment-1758
 ] 

Hyukjin Kwon commented on SPARK-25545:
--

See the discussion at https://github.com/apache/spark/pull/17293. Eventually, 
user-specified schemas shouldn't be implicitly converted to nullable, or at the 
very least the behaviour should be fixed with a coherent rationale.

> CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not 
> conform to non-nullable schema fields
> -
>
> Key: SPARK-25545
> URL: https://issues.apache.org/jira/browse/SPARK-25545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Steven Bakhtiari
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> I'm loading a CSV file into a DataFrame using Spark. I have defined a schema 
> and specified one of the fields as non-nullable.
> When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with 
> missing (null) values for those columns to result in the whole row being 
> dropped. At the moment, the CSV loader correctly drops rows that do not 
> conform to the field type, but the nullable property is seemingly ignored.
> Example CSV input:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> 1,2,abc
> {code}
> Example Spark job:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val spark = SparkSession
>   .builder()
>   .appName("csv-test")
>   .master("local")
>   .getOrCreate()
>
> spark.read
>   .format("csv")
>   .schema(StructType(
>     StructField("col1", IntegerType, nullable = false) ::
>     StructField("col2", IntegerType, nullable = false) ::
>     StructField("col3", IntegerType, nullable = false) :: Nil))
>   .option("header", false)
>   .option("mode", "DROPMALFORMED")
>   .load("path/to/file.csv")
>   .coalesce(1)
>   .write
>   .format("csv")
>   .option("header", false)
>   .save("path/to/output")
> {code}
> The actual output will be:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> {code}
> Note that the row containing non-integer values has been dropped, as 
> expected, but rows containing null values persist, despite the nullable 
> property being set to false in the schema definition.
> My expected output is:
> {code:java}
> 1,2,3
> {code}
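
Not part of the ticket, but a common workaround for the behaviour described above is to drop the null-containing rows explicitly after the read. A minimal sketch, assuming the {{spark}} session and a {{schema}} value holding the three-column schema from the example job (the names are illustrative, not from the report):

```scala
// Workaround sketch (assumes `spark` and `schema` from the example job).
// Because the reader treats the user-supplied schema as nullable,
// filter the null rows out explicitly after loading:
val df = spark.read
  .format("csv")
  .schema(schema)                   // the three non-nullable IntegerType columns
  .option("mode", "DROPMALFORMED")  // still drops the `1,2,abc` row
  .load("path/to/file.csv")
  .na.drop()                        // drops any row containing a null

df.write.format("csv").save("path/to/output")
```

{{DataFrameNaFunctions.drop()}} with no arguments removes every row that contains a null in any column, which matches the reporter's expectation of keeping only {{1,2,3}}.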



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields

2018-09-27 Thread Steven Bakhtiari (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630087#comment-16630087
 ] 

Steven Bakhtiari commented on SPARK-25545:
--

Hi [~hyukjin.kwon], is there any rationale for why it's converted into a 
nullable schema?

If there's a good reason, I'd like to understand what it is (assuming this is 
being done deliberately, as opposed to being an oversight or bug).







[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields

2018-09-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630050#comment-16630050
 ] 

Hyukjin Kwon commented on SPARK-25545:
--

The problem is that we convert the user-specified schema into a nullable 
schema. It's a duplicate of SPARK-20457.
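
The conversion can be observed directly. A hypothetical check, assuming the {{spark}} session and a {{schema}} value holding the non-nullable schema from the example job:

```scala
// Assumes `spark` and the user-specified non-nullable `schema` from the
// example job. On the affected versions, the read-back schema reports
// nullable = true for every field, per the conversion described above.
val df = spark.read
  .format("csv")
  .schema(schema)
  .load("path/to/file.csv")

df.schema.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
```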







[jira] [Commented] (SPARK-25545) CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not conform to non-nullable schema fields

2018-09-26 Thread Steven Bakhtiari (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628949#comment-16628949
 ] 

Steven Bakhtiari commented on SPARK-25545:
--

Somebody on Stack Overflow pointed me to an older ticket, SPARK-10848, that 
appears to touch on the same issue.



