[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source

2017-04-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954447#comment-15954447
 ] 

Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:47 AM:
--

Actually, on reflection, it seems we do not need this for now unless many 
users request it.

For now, the following workaround is available:

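{code}
val ds = Seq("a", "b", "c", "d").toDS
// Infer the schema from a 70% sample drawn without replacement.
val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema
// Read the full data with the sampled schema, so no inference runs over all rows.
spark.read.schema(sampledSchema).csv(ds)
{code}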

In fact, this allows more flexible options than a single ratio, e.g., sampling 
with or without replacement, filtering, or even just taking the first 100 rows 
with limit.

I will keep an eye on similar issues and reopen this if it seems many users 
want it.

Please reopen this if you, or anyone else, strongly feel this should be 
supported as an option.
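
As a rough sketch of the variants mentioned above (sampling with replacement, filtering, or a plain limit), something like the following might work. It assumes Spark 2.2 or later, where spark.read.csv accepts a Dataset[String], and a local SparkSession. The helper name inferCsvSchema and the in-memory rows are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.StructType

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("csv-schema-sampling").getOrCreate()
import spark.implicits._

// Illustrative in-memory CSV rows; real input would typically come from spark.read.textFile.
val lines: Dataset[String] = Seq("1,a", "2,b", "3,c", "4,d").toDS()

// Hypothetical helper: infer a CSV schema from whichever subset of lines is passed in.
def inferCsvSchema(subset: Dataset[String]): StructType =
  spark.read.option("inferSchema", true).csv(subset).schema

// Random sample with replacement (the result varies between runs).
val bySample = inferCsvSchema(lines.sample(withReplacement = true, fraction = 0.7))
// Only non-empty lines.
val byFilter = inferCsvSchema(lines.filter(line => line.nonEmpty))
// Only the first two rows.
val byLimit = inferCsvSchema(lines.limit(2))

// Read the full data with one of the inferred schemas.
val df = spark.read.schema(byLimit).csv(lines)
df.printSchema()
```

Note that a schema inferred from a subset may be too narrow for the full data, for example when a column looks numeric in the sample but contains text elsewhere, so the subset should be chosen with that in mind.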







> Support for sampling when inferring schema in CSV data source
> -------------------------------------------------------------
>
>                 Key: SPARK-14726
>                 URL: https://issues.apache.org/jira/browse/SPARK-14726
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Bomi Kim
>
> Currently, I am using the CSV data source and trying to get used to Spark 2.0 
> because it has a built-in CSV data source.
> I realized that the CSV data source infers the schema from all the data, 
> whereas the JSON data source supports a sampling ratio option.
> It would be great if the CSV data source had this option too (or is this 
> supported already?).
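
For comparison, the JSON sampling option the report refers to can be sketched as follows. This is a minimal illustration, not taken from the issue: it assumes Spark 2.2 or later (for the Dataset[String] overload of json), a local SparkSession, and made-up in-memory records.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-sampling-ratio").getOrCreate()
import spark.implicits._

// Illustrative in-memory JSON records, one object per line.
val jsonLines = Seq(
  """{"id": 1, "name": "a"}""",
  """{"id": 2, "name": "b"}""",
  """{"id": 3, "name": "c"}""",
  """{"id": 4, "name": "d"}"""
).toDS()

// samplingRatio sets the fraction of input records used for schema inference;
// all records are still read afterwards using the inferred schema.
val sampled = spark.read.option("samplingRatio", 0.5).json(jsonLines)
sampled.printSchema()
```

Because inference only sees the sampled records, a field that appears only in unsampled records would be missing from the resulting schema.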



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source

2017-04-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954447#comment-15954447
 ] 

Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:40 AM:
--

Actually, on reflection, it seems we do not need this for now unless many 
users request it.

A workaround is as follows:

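{code}
val ds = Seq("a", "b", "c", "d").toDS
// Infer the schema from a 70% sample drawn without replacement.
val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema
// Read the full data with the sampled schema, so no inference runs over all rows.
spark.read.schema(sampledSchema).csv(ds)
{code}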

In fact, this allows more flexible options than a single ratio, e.g., sampling 
with or without replacement, filtering, or even just taking the first 100 rows 
with limit.

I will keep an eye on similar issues and reopen this if it seems many users 
want it.

Please reopen this if you, or anyone else, strongly feel this should be 
supported as an option.




was (Author: hyukjin.kwon):
Actually, on reflection, it seems we do not need this for now unless many 
users request it.

A workaround is as follows:

{code}
val ds = Seq("a", "b", "c", "d").toDS.sample(false, 0.7)
val sampledSchema = spark.read.option("inferSchema", true).csv(ds).schema
spark.read.schema(sampledSchema).csv("/tmp/path")
{code}

In fact, this allows more flexible options than a single ratio, e.g., sampling 
with or without replacement, filtering, or even just taking the first 100 rows 
with limit.

I will keep an eye on similar issues and reopen this if it seems many users 
want it.

Please reopen this if you, or anyone else, strongly feel this should be 
supported as an option.






[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source

2017-03-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247378#comment-15247378
 ] 

Hyukjin Kwon edited comment on SPARK-14726 at 3/26/17 2:09 PM:
---

This is currently not supported. I will work on this if it is decided to be 
supported. [~rxin]


was (Author: hyukjin.kwon):
This is currently not supported. I can work on this, but I am a bit hesitant 
because I believe the CSV data source was ported mainly for the "small data 
world". That said, I believe there are a lot of users dealing with large CSV files. 
I will work on this if it is decided to be supported. [~rxin]
