[jira] [Updated] (SPARK-14274) Replaces inferSchema with prepareRead to collect necessary global information

Cheng Lian (JIRA) Wed, 30 Mar 2016 07:52:00 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Cheng Lian updated SPARK-14274:
-------------------------------
    Description: 
One problem with our newly introduced {{FileFormat.buildReader()}} method is 
that it only sees pieces of input files. On the other hand, data sources like 
CSV and LibSVM require some sort of global information:

- CSV: the content of the header line if the {{header}} option is set to true, 
so that we can filter out header lines within each input file. This is 
considered global information because the header may appear in the middle of 
a file, after blocks of comments and empty lines, although this is a 
rare/contrived corner case.
- LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset to 
infer the total number of features in order to construct the result 
{{LabeledPoint}} instances.
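
To illustrate the LibSVM case, here is a minimal sketch in plain Scala (no Spark dependency). The helper name {{inferNumFeatures}} and the in-memory line iterators are assumptions made for illustration, not Spark's implementation. It shows why a reader that sees only one piece of the input cannot determine the feature count on its own.

```scala
// Hypothetical helper (not Spark's code): infer the feature count from
// LibSVM-formatted lines of the form "label index:value index:value ...".
def inferNumFeatures(lines: Iterator[String]): Int =
  lines
    .map(_.trim)
    .filter(_.nonEmpty)
    .flatMap(_.split("\\s+").iterator.drop(1)) // drop the leading label
    .map(_.takeWhile(_ != ':').toInt)          // keep the 1-based feature index
    .foldLeft(0)((a, b) => math.max(a, b))     // largest index seen so far

// Two pieces of the same dataset, as two separate readers would see them.
val pieceA = Seq("1 1:0.5 3:1.0")
val pieceB = Seq("0 2:0.1 5:2.0")

// Each piece alone gives a different answer; only a scan over the whole
// dataset gives the value every reader should use.
val perPiece = Seq(pieceA, pieceB).map(p => inferNumFeatures(p.iterator))
val global = inferNumFeatures((pieceA ++ pieceB).iterator)
```

Here the two pieces disagree on the feature count, which is the reason the value has to be computed once, up front, and then shared with every reader.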

Unfortunately, with our current API, this kind of global information can't be 
gathered.

The solution proposed here is to add a {{prepareRead}} method, which accepts 
the same arguments as {{inferSchema}} but returns a {{ReadContext}} containing 
an {{Option\[StructType\]}} for the inferred schema and a 
{{Map\[String, Any\]}} for any gathered global information. This 
{{ReadContext}} is then passed to {{buildReader()}}. By default, 
{{prepareRead}} simply calls {{inferSchema}} (the inferred schema itself can 
be considered a sort of global information).
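
The proposal could be sketched roughly as follows, using the CSV header case. This is an illustration under stated assumptions rather than the actual Spark API: {{StructType}} is a local stand-in, {{FileFormatSketch}} and {{CsvSketch}} are hypothetical names, the method signatures are simplified, and files are modelled as in-memory lines.

```scala
// Stand-in for Spark's StructType so the sketch compiles without Spark.
final case class StructType(fieldNames: Seq[String])

// Shape described in the proposal: inferred schema plus free-form global info.
final case class ReadContext(
    inferredSchema: Option[StructType],
    globalInfo: Map[String, Any] = Map.empty)

// Hypothetical, simplified signatures; `files` holds each file's lines.
trait FileFormatSketch {
  def inferSchema(options: Map[String, String], files: Seq[Seq[String]]): Option[StructType]

  // Default behaviour from the proposal: just wrap the inferred schema.
  def prepareRead(options: Map[String, String], files: Seq[Seq[String]]): ReadContext =
    ReadContext(inferSchema(options, files))

  // Returns a function that reads one piece of an input file.
  def buildReader(context: ReadContext): Seq[String] => Seq[String]
}

object CsvSketch extends FileFormatSketch {
  // First line that is neither empty nor a '#' comment.
  private def firstDataLine(files: Seq[Seq[String]]): Option[String] =
    files.flatten.find(l => l.trim.nonEmpty && !l.startsWith("#"))

  // Simplification: column names are always taken from the first data line.
  def inferSchema(options: Map[String, String], files: Seq[Seq[String]]): Option[StructType] =
    firstDataLine(files).map(h => StructType(h.split(",").map(_.trim).toSeq))

  // Gather the header once, globally, when the `header` option is "true".
  override def prepareRead(options: Map[String, String], files: Seq[Seq[String]]): ReadContext = {
    val header = if (options.get("header").contains("true")) firstDataLine(files) else None
    val info = header.fold(Map.empty[String, Any])(h => Map("header" -> h))
    ReadContext(inferSchema(options, files), info)
  }

  // Each reader sees only a piece, but can still drop header lines.
  def buildReader(context: ReadContext): Seq[String] => Seq[String] = {
    val header = context.globalInfo.get("header")
    piece => piece.filterNot(line => header.contains(line))
  }
}
```

With this shape, a caller would invoke {{prepareRead}} once on the driver side and hand the resulting {{ReadContext}} to {{buildReader}}, so each per-piece reader can use the globally gathered header without rescanning the input.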

  was:
One problem of our newly introduced {{FileFormat.buildReader()}} method is that 
it only sees pieces of input files. On the other hand, data sources like CSV 
and LibSVM requires some sort of global information:

- CSV: the content of the header line if {{header}} option is set to true, so 
that we can filter out header lines within each input file. This is considered 
as a global information because it's possible that the header appears in the 
middle of a file after blocks of comments and empty lines, although this is 
just a rare/contrived corner case.
- LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset to 
infer the total number of features to construct result {{LabeledPoint}}s.

Unfortunately, with our current API, this kind of global information can't be 
gathered.

The solution proposed here is to add a {{prepareRead}} method, which accepts 
the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which 
contains an {{Option\[StructType\]}} for the inferred schema and a 
{{Map\[String, Any\]}} for any gathered global information. This 
{{ReadContext}} is then passed to {{buildReader()}}. By default, 
{{prepareRead}} simply calls {{inferSchema}} (actually the inferred schema 
itself can be considered as a sort of global information).


> Replaces inferSchema with prepareRead to collect necessary global information
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-14274
>                 URL: https://issues.apache.org/jira/browse/SPARK-14274
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 2.0.0



