[jira] [Updated] (SPARK-15982) Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc

Tathagata Das (JIRA) Thu, 16 Jun 2016 20:36:58 -0700

     [ https://issues.apache.org/jira/browse/SPARK-15982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tathagata Das updated SPARK-15982:
----------------------------------
    Description: 
Issues with the current reader behavior:

- `text()` without args returns an empty DF with no columns. This is 
inconsistent: `text()` is expected to always return a DF with a `value` string field.
- `textFile()` without args fails with an exception for the same reason: 
it expects the DF returned by `text()` to have a `value` field.
- `orc()` does not take varargs, which is inconsistent with the other methods.
- `json(single-arg)` was removed, but that caused source compatibility issues 
(SPARK-16009).

The solution I am implementing is the following:
1. For each format, there will be a single-argument method and a vararg 
method. For json, parquet, csv, and text, this means adding json(string), etc. 
For orc, this means adding orc(varargs).
2. Remove the special handling in text(), csv(), etc. that returns an empty 
DataFrame with no fields. Instead, pass the empty sequence of paths on to the 
data source and let each data source handle it correctly. For example, the text 
data source should return an empty DF with schema (value: string).
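
The two points above can be sketched with a toy model. This is not Spark code: `Field`, `Frame`, `Source`, `TextSource`, and `Reader` are hypothetical stand-ins invented for illustration, and `Frame` only records the schema and the paths it was given. It shows the single-argument/vararg method pair and a reader that forwards an empty path list to the source instead of special-casing it.

```scala
// Toy model only: every type here is a hypothetical stand-in, not a Spark class.
final case class Field(name: String, dataType: String)

// A "frame" that just remembers its schema and the paths it was built from.
final case class Frame(schema: Seq[Field], sourcePaths: Seq[String])

trait Source {
  // Each source decides for itself what an empty path list means.
  def load(paths: Seq[String]): Frame
}

object TextSource extends Source {
  // The text format always has exactly one string column named `value`.
  private val textSchema = Seq(Field("value", "string"))

  // No special case for empty input: the schema is the same either way.
  def load(paths: Seq[String]): Frame = Frame(textSchema, paths)
}

final class Reader {
  // Single-argument method, kept for source compatibility.
  def text(path: String): Frame = text(Seq(path): _*)

  // Vararg method; the annotation also emits a Java-friendly array overload.
  @scala.annotation.varargs
  def text(paths: String*): Frame = TextSource.load(paths)
}
```

In this sketch, `new Reader().text()` resolves to the vararg method with no paths and still yields a frame whose schema is `(value: string)`, so a `textFile()`-style caller that looks up the `value` field would not fail. The other formats would presumably follow the same pair of signatures.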


  was:
Issues with the current reader behavior:

- `text()` without args returns an empty DF with no columns. This is 
inconsistent: `text()` is expected to always return a DF with a `value` string field.
- `textFile()` without args fails with an exception for the same reason: 
it expects the DF returned by `text()` to have a `value` field.
- `orc()` does not take varargs, which is inconsistent with the other methods.
- `json(single-arg)` was removed, but that caused source compatibility issues - 


> Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc
> -------------------------------------------------------------------
>
>                 Key: SPARK-15982
>                 URL: https://issues.apache.org/jira/browse/SPARK-15982
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das


