[ 
https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Kemmer updated SPARK-25343:
---------------------------------
    Description: 
With the csv() method it is currently possible to create a DataFrame from a 
Dataset[String], where each string contains comma-separated values. This is 
really great.

But very often we have to parse files where the values of a line must be split 
by custom value separators and regular expressions. The result is a 
Dataset[List[String]]. This list corresponds to what you would get after 
splitting the values of a CSV string at its separators.

It would be great if the csv() method also accepted such a Dataset as input, 
especially given a target schema. The CSV parser usually casts the separated 
values against the schema and can sort out lines whose column values do not 
fit the schema.

This is especially interesting in PERMISSIVE mode with a column for corrupt 
records, which should then contain the input list of strings as a dumped JSON 
string.

This is the functionality I am looking for, and I think it is already 
implemented in the CSV parser.
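As a workaround under the current API, the existing csv(Dataset[String]) overload can be reused by re-joining each List[String] with a delimiter that is known not to occur in the data, so the parser's schema casting and PERMISSIVE corrupt-record handling still apply. This is only a sketch: the \u0001 delimiter, the example schema, and the sample data are assumptions for illustration, not part of the proposal.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types._

object CsvFromListsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-from-lists")
      .getOrCreate()
    import spark.implicits._

    // Lines already split by custom separators / regexes (assumed example data).
    val rows: Dataset[List[String]] = Seq(
      List("1", "alice", "3.5"),
      List("2", "bob", "not-a-number") // score fails the cast -> corrupt record
    ).toDS()

    // Target schema, including a column for corrupt records (PERMISSIVE mode).
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("score", DoubleType),
      StructField("_corrupt_record", StringType)
    ))

    // Workaround: re-join each list with a separator assumed absent from the
    // data, then hand the result to the existing csv(Dataset[String]) overload.
    val sep = "\u0001"
    val joined: Dataset[String] = rows.map(_.mkString(sep))

    val df = spark.read
      .schema(schema)
      .option("sep", sep)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv(joined)

    df.show(truncate = false)
    spark.stop()
  }
}
```

The obvious limitation, and the reason for this request, is that no delimiter choice is safe for arbitrary input; accepting Dataset[List[String]] directly would avoid the re-join entirely.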

  was:
With the cvs() method it is currenty possible to create a Dataframe from 
Dataset[String], where the given string contains comma separated values. This 
is really great.

But very often we have to parse files where we have to split the values of a 
line by very individual value separators and regular expressions. The result is 
a Dataset[List[String]]. This list corresponds to what you would get, after 
splitting the values of a CSV string.

It would be great, if the csv() method would also accept such a Dataset as 
input especially given a target schema. The csv parser usually casts the 
separated values against the schema and can sort out lines where the values of 
the columns do not fit with the schema.

This is especially interesting with PERMISSIVE mode and a column for corrupt 
records which then should contain the input list of strings as a dumped JSON 
string.

This is the functionality I am looking for and I think it is already 
implemented in the CSV parser.


> Extend CSV parsing to Dataset[List[String]]
> -------------------------------------------
>
>                 Key: SPARK-25343
>                 URL: https://issues.apache.org/jira/browse/SPARK-25343
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Frank Kemmer
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
