[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840018#comment-16840018 ]
Ruslan Dautkhanov edited comment on SPARK-15463 at 5/15/19 4:00 PM:
--------------------------------------------------------------------

[~hyukjin.kwon] would it be possible to make CSV parsing optional (and keep only the schema inference)? We have an RDD with the columns stored separately in a tuple, but all as strings. It would be great to infer the schema without parsing a single String as CSV.

Our current workaround is to glue all the strings together (with proper quoting, escaping, etc.) just so that Spark can split the columns back apart here [https://github.com/apache/spark/blob/3f42c4cc7b93e32cb8d4f2517987097b73e733fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L156] into a list of columns and finally run schema inference.

infer() already accepts `RDD[Array[String]]`, but the current public API only accepts `RDD[String]` (or `Dataset[String]`) [https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L51]. I don't think calling infer() directly from [https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L30] counts as public API? That would be a better workaround than collapsing all-string columns into a CSV line only for Spark to parse it back internally just to infer the data types of those columns. Thank you for any leads.

> Support for creating a dataframe from CSV in Dataset[String]
> ------------------------------------------------------------
>
>                 Key: SPARK-15463
>                 URL: https://issues.apache.org/jira/browse/SPARK-15463
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: PJ Fanning
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.2.0
>
>
> I currently use Databrick's spark-csv lib but some features don't work with
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV
> support into spark-sql directly, that spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't
> appear to support the creation of DataFrames based on loading from
> RDD[String].
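The quote-and-glue workaround described in the comment can be sketched with Python's stdlib csv module (a minimal illustration, not Spark code; `glue_columns` is a hypothetical helper name). It joins an all-string tuple of columns back into one CSV line with RFC 4180-style quoting and escaping, so the resulting strings could then be fed to a CSV reader with schema inference enabled:

```python
import csv
import io

def glue_columns(columns, delimiter=","):
    """Join an all-string tuple of columns into a single CSV line.

    csv.QUOTE_MINIMAL quotes only the fields that contain the delimiter,
    a quote character, or a newline, and escapes embedded quotes by
    doubling them -- the "proper quoting, escaping etc" step above.
    """
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter,
               quoting=csv.QUOTE_MINIMAL).writerow(columns)
    # writerow appends a line terminator; strip it for a bare line
    return buf.getvalue().rstrip("\r\n")

print(glue_columns(("plain", "has,comma", 'has "quote"')))
# -> plain,"has,comma","has ""quote"""
```

The round trip this enables (glue, then have the engine re-split and infer types) is exactly the wasted work the comment is asking to avoid by exposing inference over already-split columns.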
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org