[ 
https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14726.
----------------------------------
    Resolution: Won't Fix

Actually, after rethinking this, it seems we do not need it for now unless 
many users request it.

A workaround is as below:

{code}
// Read the file as plain text lines and sample them, so the schema is
// inferred from a subset of the actual data rather than the whole file.
val sampled = spark.read.textFile("/tmp/path").sample(withReplacement = false, fraction = 0.7)
val sampledSchema = spark.read.option("inferSchema", true).csv(sampled).schema
// Then read the full file with the schema inferred from the sample.
spark.read.schema(sampledSchema).csv("/tmp/path")
{code}

Actually, this workaround allows more dynamic sampling strategies, e.g., 
sampling with or without replacement, filtering, or even just taking the 
first 100 rows via limit.
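
As a sketch of those variants (assuming Spark 2.2+, where 
{{DataFrameReader.csv}} accepts a {{Dataset[String]}}, a running 
{{SparkSession}} named {{spark}}, and a hypothetical input file at 
{{/tmp/path}}):

{code}
import spark.implicits._

// Read the raw lines once.
val lines = spark.read.textFile("/tmp/path")

// Bernoulli sampling with replacement:
val resampled = lines.sample(withReplacement = true, fraction = 0.5)

// Just the first 100 lines:
val first100 = lines.limit(100)

// Arbitrary filtering, e.g. skipping blank lines:
val nonBlank = lines.filter(_.nonEmpty)

// Infer the schema from any of these subsets, then read the full file with it:
val schema = spark.read.option("inferSchema", true).csv(first100).schema
val df = spark.read.schema(schema).csv("/tmp/path")
{code}

Any transformation on {{Dataset[String]}} works here, since schema inference 
only needs some representative lines, not the whole file.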

I will keep an eye on similar issues and reopen this if it turns out many users want it.

Please reopen this if you (or anyone else) strongly feel this should be 
supported as an option.



> Support for sampling when inferring schema in CSV data source
> -------------------------------------------------------------
>
>                 Key: SPARK-14726
>                 URL: https://issues.apache.org/jira/browse/SPARK-14726
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Bomi Kim
>
> Currently, I am using the CSV data source and trying to get used to Spark 2.0 
> because it has a built-in CSV data source.
> I realized that the CSV data source infers the schema from all of the data, 
> while the JSON data source supports a sampling ratio option.
> It would be great if the CSV data source had this option too (or is this 
> supported already?).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
