[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2019-05-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840914#comment-16840914
 ] 

Hyukjin Kwon commented on SPARK-15463:
--

Please ask this question on the mailing list.

> Support for creating a dataframe from CSV in Dataset[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.0
>
>
> I currently use Databricks' spark-csv lib, but some features don't work with
> Apache Spark 2.0.0-SNAPSHOT. I understand that, with the addition of CSV
> support directly into spark-sql, spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]), but other formats don't
> appear to support creating DataFrames from an RDD[String].
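
For readers who land on this issue later: the Fix For field above is 2.2.0, and from that release DataFrameReader has a csv overload that takes a Dataset[String]. The following is only a minimal sketch, assuming Spark 2.2 or later with spark-sql on the classpath and a local master. The object name, the readCsv helper and the sample lines are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object CsvFromDatasetSketch {
  // Parse CSV lines that are already in memory rather than in files.
  def readCsv(spark: SparkSession, lines: Dataset[String]): DataFrame =
    spark.read
      .option("header", "true")      // treat the first line as column names
      .option("inferSchema", "true") // let Spark guess the column types
      .csv(lines)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-from-dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data; an existing RDD[String] can be wrapped with
    // spark.createDataset(rdd) before calling readCsv.
    val lines: Dataset[String] = Seq("id,name", "1,alice", "2,bob").toDS()

    val df = readCsv(spark, lines)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

The options are the usual CSV reader options, so the same call works without a header line by dropping the header option and supplying a schema instead.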



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2019-05-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840018#comment-16840018
 ] 

Ruslan Dautkhanov commented on SPARK-15463:
---

[~hyukjin.kwon] would it be possible to make csvParsing optional (and have only 
? 

We have an RDD whose columns are stored separately in a tuple, but all as
strings. It would be great to infer the schema without first parsing a single
String as CSV.

The current workaround is to glue all the strings together (with proper
quoting, escaping, etc.) only so that Spark can split them back into a list of
columns here

[https://github.com/apache/spark/blob/3f42c4cc7b93e32cb8d4f2517987097b73e733fd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L156]

and finally run inferSchema.

infer() already accepts `RDD[Array[String]]`, but the current API only accepts
`RDD[String]` (or `Dataset[String]`):

[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L51]

I don't think infer() in
[https://github.com/apache/spark/blob/a30983db575de5c87b3a4698b223229327fd65cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L30]
is a public API that can be called directly? That would be a better workaround
than collapsing the all-string columns into CSV and having Spark parse it
internally, only to infer the data types of those columns.

Thank you for any leads.
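
To make the request concrete, here is a rough sketch of inferring column types from rows whose fields are already split, with no CSV round trip. It is plain Scala and deliberately does not touch Spark's CSVInferSchema. ColType, TupleSchemaSketch and the three-type lattice (Int, Double, String) are hypothetical simplifications, and every row is assumed to have the same number of fields.

```scala
import scala.util.Try

// Hypothetical, simplified stand-in for CSV type inference. This is not
// Spark's CSVInferSchema and it only knows three types.
sealed trait ColType
case object IntCol extends ColType
case object DoubleCol extends ColType
case object StringCol extends ColType

object TupleSchemaSketch {
  // Narrowest type that can represent a single value.
  def typeOf(value: String): ColType =
    if (Try(value.trim.toInt).isSuccess) IntCol
    else if (Try(value.trim.toDouble).isSuccess) DoubleCol
    else StringCol

  // Widen two candidate types along Int < Double < String.
  def merge(a: ColType, b: ColType): ColType = (a, b) match {
    case (x, y) if x == y                => x
    case (StringCol, _) | (_, StringCol) => StringCol
    case _                               => DoubleCol
  }

  // Infer one type per column from rows that are already split into fields.
  // Rows are assumed to have equal length; zip would silently truncate otherwise.
  def infer(rows: Seq[Array[String]]): Seq[ColType] =
    rows.map(_.toSeq.map(typeOf))
      .reduceOption((l, r) => l.zip(r).map { case (a, b) => merge(a, b) })
      .getOrElse(Seq.empty)
}
```

A tuple RDD would first be mapped to arrays, for example rdd.map(t => Array(t._1, t._2, t._3)). Because merge is associative and commutative, the same per-row and pairwise steps could in principle run per partition and be combined afterwards. For a small local check:

```scala
import TupleSchemaSketch._

val rows = Seq(Array("1", "2.5", "a"), Array("2", "3", "b"))
assert(infer(rows) == Seq(IntCol, DoubleCol, StringCol))
```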

 




[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-03-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901026#comment-15901026
 ] 

Hyukjin Kwon commented on SPARK-15463:
--

Yes, I am less sure about this functionality. Maybe ask on the mailing list and
see whether many users want it, as was done for {{from_json}}?




[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-03-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900824#comment-15900824
 ] 

Takeshi Yamamuro commented on SPARK-15463:
--

Have you seen the related discussion in
https://github.com/apache/spark/pull/13300#issuecomment-261156734 ?
Currently, I think [~hyukjin.kwon]'s idea is preferable:
https://github.com/apache/spark/pull/16854#issue-206224691.




[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-03-07 Thread Jayesh lalwani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899665#comment-15899665
 ] 

Jayesh lalwani commented on SPARK-15463:


Does it make sense to have to_csv and from_csv functions that are modeled
after to_json and from_json?

The applications that we are supporting need inputs from a combination of 
sources and formats. Also, there is a combination of sinks and formats. For 
example, we might need
a) Files with CSV content
b) Files with JSON content
c) Kafka with CSV content
d) Kafka with JSON content
e) Parquet

Also, if the input has a nested structure (JSON/Parquet), we sometimes prefer
to keep the data in a StructType object, and sometimes prefer to flatten the
StructType object into a DataFrame.
For example, if we are getting data from Kafka as JSON, massaging it, and
writing JSON to Kafka, we would prefer to transform a StructType object and
not have to flatten it into a DataFrame.
Another example is data that we get as JSON and that needs to be stored in an
RDBMS. This requires us to flatten the data into a DataFrame before storing it
in the table.

So, this is what I was thinking. We should have the following functions:
1) from_json - Convert a DataFrame of String to a DataFrame of StructType
2) to_json - Convert a DataFrame of StructType to a DataFrame of String
3) from_csv - Convert a DataFrame of String to a DataFrame of StructType
4) to_csv - Convert a DataFrame of StructType to a DataFrame of String
5) flatten - Convert a DataFrame of StructType into a DataFrame that has the
same fields as the StructType


Essentially, the request in this change request can be met by calling
*flatten(from_csv())*.
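
As a rough illustration of the flatten(from_csv()) idea: functions named from_csv and to_csv were added to org.apache.spark.sql.functions in Spark 3.0, which is later than this comment, so the sketch below assumes Spark 3.0 or newer. The object name, the two-column schema and the parse/flatten/render helpers are hypothetical. "Flatten" is expressed here as a star-select on the struct column, which is one possible reading of the proposal and not an API from this thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_csv, to_csv}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object FromCsvSketch {
  // Illustrative schema for lines such as "1,alice".
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  // String column -> struct column (the proposed from_csv).
  def parse(lines: DataFrame): DataFrame =
    lines.select(from_csv(col("value"), schema, Map.empty[String, String]).as("rec"))

  // Struct column -> top-level columns (the proposed flatten).
  def flatten(parsed: DataFrame): DataFrame = parsed.select("rec.*")

  // Struct column -> CSV string column (the proposed to_csv).
  def render(parsed: DataFrame): DataFrame =
    parsed.select(to_csv(col("rec")).as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-csv-sketch")
      .getOrCreate()
    import spark.implicits._

    val lines = Seq("1,alice", "2,bob").toDF("value")
    val parsed = parse(lines)
    flatten(parsed).show() // columns id and name
    render(parsed).show()  // back to one CSV string per row
    spark.stop()
  }
}
```

Keeping the struct column around until the last step matches the Kafka-to-Kafka case described above, where the data never needs to be flattened.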




[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-02-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858101#comment-15858101
 ] 

Apache Spark commented on SPARK-15463:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16854
