[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs

Hossein Falaki (JIRA) Mon, 07 Jul 2014 17:46:59 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262 ]


Hossein Falaki edited comment on SPARK-2360 at 7/8/14 12:45 AM:
----------------------------------------------------------------

As a point of comparison, the interfaces in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
* header: a logical value indicating whether the file contains the names of the 
variables as its first line.
* sep: the field separator character. 
* quote: the set of quoting characters. To disable quoting altogether, use 
‘quote = ""’
* dec: the character used in the file for decimal points.
* fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.
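
To make these semantics concrete, here is a rough approximation using Python's standard csv module. It is only a sketch: the function name read_csv_like_r is hypothetical, the parameter names simply mirror read.csv, and it says nothing about how R parses files or how Spark should. Unlike R, it accepts a single quote character rather than a set, and it converts every numeric-looking field to a float.

```python
import csv
import io


def read_csv_like_r(text, header=True, sep=",", quote='"', dec=".", fill=True):
    """Parse CSV text into (column_names, rows), loosely mimicking read.csv.

    Hypothetical helper for illustration only.
    """
    # An empty `quote` disables quoting altogether, as with quote = "" in R.
    if quote:
        reader = csv.reader(io.StringIO(text), delimiter=sep, quotechar=quote)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=sep,
                            quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return [], []

    width = max(len(row) for row in rows)
    if fill:
        # Pad short rows with blank fields so all rows have equal length.
        rows = [row + [""] * (width - len(row)) for row in rows]
    elif any(len(row) != width for row in rows):
        raise ValueError("rows have unequal length and fill is False")

    if header:
        names, rows = rows[0], rows[1:]
    else:
        # Generate V1, V2, ... names, as R does for files without a header.
        names = ["V%d" % (i + 1) for i in range(width)]

    def convert(value):
        # Turn numeric-looking fields into floats, honoring `dec`.
        try:
            return float(value.replace(dec, "."))
        except ValueError:
            return value

    return names, [[convert(value) for value in row] for row in rows]


# A semicolon-separated file with decimal commas and one short row.
names, rows = read_csv_like_r('a;b;c\n1,5;x\n2;"y;z";3\n', sep=";", dec=",")
```

Here the short second line is padded with a blank field because fill is on, and the quoted field keeps its embedded separator.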

_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=',', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
{code}
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


> CSV import to SchemaRDDs
> ------------------------
>
>                 Key: SPARK-2360
>                 URL: https://issues.apache.org/jira/browse/SPARK-2360
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Michael Armbrust
>            Priority: Minor
>
> I think the first step is to design the interface that we want to present to 
> users.  Mostly this means defining the options available when importing.  Off 
> the top of my head:
> - What is the separator?
> - Should column names be provided, or inferred from the first row?
> - How should multiple files with possibly different schemas be handled?
> - Is there a way for users to specify the datatypes of the columns, or 
> are they all just strings?
> - What types of quoting / escaping do we want to support?
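
One way to make the questions above concrete is to collect them in a single options object. The sketch below is purely hypothetical: CsvOptions, load_csv and every field name are assumptions made for illustration, and it is written in plain Python rather than against any Spark API. It covers the separator, quoting and escaping, column names, and per-column types, and it leaves the multiple-files question open.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class CsvOptions:
    """Hypothetical bundle of import options (all names are assumptions)."""
    separator: str = ","
    quote: str = '"'
    escape: Optional[str] = None
    # None means: take the column names from the first row.
    column_names: Optional[List[str]] = None
    # Columns not listed here are kept as plain strings.
    column_types: Dict[str, Callable[[str], object]] = field(default_factory=dict)


def load_csv(text, options=None):
    """Return (column_names, records), where each record is a dict."""
    options = options or CsvOptions()
    reader = csv.reader(io.StringIO(text),
                        delimiter=options.separator,
                        quotechar=options.quote,
                        escapechar=options.escape)
    rows = [row for row in reader if row]  # skip blank lines

    if options.column_names is not None:
        names = list(options.column_names)
    elif rows:
        names, rows = rows[0], rows[1:]
    else:
        names = []

    records = []
    for row in rows:
        record = {}
        for name, value in zip(names, row):
            # Apply the user-supplied parser, defaulting to str.
            record[name] = options.column_types.get(name, str)(value)
        records.append(record)
    return names, records


# Names inferred from the first row, with one column typed as int.
names, records = load_csv("id|name\n1|Ann\n2|Bob\n",
                          CsvOptions(separator="|", column_types={"id": int}))
```

Defaulting untyped columns to strings keeps the no-configuration case predictable, while column_types lets a caller opt in to conversion one column at a time.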



--
This message was sent by Atlassian JIRA
(v6.2#6252)
