[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851952#comment-16851952
 ] 

Liang-Chi Hsieh commented on SPARK-27873:
-

I can prepare a PR if neither Marcin nor Hyukjin Kwon plans to.

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> --
>
> Key: SPARK-27873
> URL: https://issues.apache.org/jira/browse/SPARK-27873
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.3
> Reporter: Marcin Mejran
> Priority: Major
>
> In the Spark CSV reader, if you use PERMISSIVE mode with a column for 
> storing corrupt records, you need to add a schema column 
> corresponding to columnNameOfCorruptRecord.
> However, if the file has a header row and enforceSchema=false, the schema vs. 
> header validation fails because the schema contains an extra column 
> corresponding to columnNameOfCorruptRecord.
> Since FAILFAST mode doesn't print informative error messages about which 
> rows failed to parse, there is no way to track down broken rows other than 
> setting a corrupt record column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851948#comment-16851948
 ] 

Liang-Chi Hsieh commented on SPARK-27873:
-

I guess what Marcin meant is:

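{code}
import org.apache.spark.sql.types.{StringType, StructType}

// Data schema plus one extra string column that receives unparsable rows.
val schema = StructType.fromDDL("a int, b date")
val columnNameOfCorruptRecord = "_unparsed"
val schemaWithCorrField1 = schema.add(columnNameOfCorruptRecord, StringType)
// testFile and valueMalformedWithHeaderFile are assumed to be test-suite
// helpers pointing at a malformed CSV file that has a header row.
val df = spark
  .read
  .option("mode", "Permissive")
  .option("header", "true")
  .option("enforceSchema", false)
  .option("columnNameOfCorruptRecord", columnNameOfCorruptRecord)
  .schema(schemaWithCorrField1)
  .csv(testFile(valueMalformedWithHeaderFile))
{code}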

To keep corrupt records, we add a new column to the schema. But this new 
column isn't in the CSV header. So if enforceSchema is disabled at the same 
time, CSVHeaderChecker throws an exception like:

{code}
[info]   Cause: java.lang.IllegalArgumentException: Number of column in CSV 
header is not equal to number of fields in the schema: 
[info]  Header length: 2, schema size: 3   
{code}

This is because CSVHeaderChecker currently doesn't take 
columnNameOfCorruptRecord into account.
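
One way the check could account for that column is sketched below. This is a simplified, standalone model and not Spark's actual CSVHeaderChecker: the object name HeaderCheckSketch, its methods, and the error strings are hypothetical, and the schema is reduced to a list of field names. It only illustrates dropping the corrupt-record column from the schema before comparing it with the header.

```scala
// Hypothetical model of a header check; NOT Spark's CSVHeaderChecker.
object HeaderCheckSketch {

  /** Field names the CSV header is expected to carry:
    * the user schema minus the corrupt-record column. */
  def dataFieldNames(
      schemaFieldNames: Seq[String],
      corruptRecordColumn: String): Seq[String] =
    schemaFieldNames.filterNot(_ == corruptRecordColumn)

  /** Returns Right(()) when the header matches the data fields,
    * Left(message) describing the first mismatch otherwise. */
  def checkHeader(
      header: Seq[String],
      schemaFieldNames: Seq[String],
      corruptRecordColumn: String): Either[String, Unit] = {
    val expected = dataFieldNames(schemaFieldNames, corruptRecordColumn)
    if (header.length != expected.length) {
      Left(s"Header length: ${header.length}, schema size: ${expected.length}")
    } else {
      // Compare names position by position (exact match in this sketch).
      header.zip(expected).collectFirst {
        case (h, e) if h != e =>
          s"Header column '$h' does not match schema field '$e'"
      }.toLeft(())
    }
  }
}
```

With the schema from the reproducer above, HeaderCheckSketch.checkHeader(Seq("a", "b"), Seq("a", "b", "_unparsed"), "_unparsed") compares two header columns against two data fields, so the extra column no longer causes a length mismatch.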




[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-05-29 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851534#comment-16851534
 ] 

Hyukjin Kwon commented on SPARK-27873:
--

Can you show a reproducer with output and expected output?
