[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262 ]
Hossein Falaki commented on SPARK-2360:
---------------------------------------

As a point of comparison, the interfaces in some other popular packages are:

__R__:
```
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".",
         fill = TRUE, comment.char = "", ...)
```
Where:
- header: a logical value indicating whether the file contains the names of the variables as its first line.
- sep: the field separator character.
- quote: the set of quoting characters. To disable quoting altogether, use quote = "".
- dec: the character used in the file for decimal points.
- fill: if TRUE, then in case the rows have unequal length, blank fields are implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None,
    compression=None, doublequote=True, escapechar=None, quotechar='"',
    quoting=0, skipinitialspace=False, lineterminator=None, header='infer',
    index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None,
    skip_footer=0, na_values=None, na_fvalues=None, true_values=None,
    false_values=None, delimiter=None, converters=None, dtype=None,
    usecols=None, engine=None, delim_whitespace=False, as_recarray=False,
    na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True,
    buffer_lines=None, warn_bad_lines=True, error_bad_lines=True,
    keep_default_na=True, thousands=None, comment=None, decimal='.',
    parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None,
    memory_map=False, nrows=None, iterator=False, chunksize=None,
    verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True,
    tupleize_cols=False, infer_datetime_format=False)
```
The description of the fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

> CSV import to SchemaRDDs
> ------------------------
>
>                 Key: SPARK-2360
>                 URL: https://issues.apache.org/jira/browse/SPARK-2360
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Michael Armbrust
>            Priority: Minor
>
> I think the first step is to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head:
> - What is the separator?
> - Provide column names, or infer them from the first row?
> - How to handle multiple files with possibly different schemas?
> - Do we have a method to let users specify the datatypes of the columns, or are they just strings?
> - What types of quoting / escaping do we want to support?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
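As an illustration of the option surface being discussed (separator, header flag vs. generated column names, quoting), here is a minimal sketch in Python using only the standard-library `csv` module. The function name `read_csv` and the `_c0`-style generated column names are assumptions for this sketch, not the actual Spark API; all values are kept as strings, matching the open question above about column datatypes.

```python
import csv
import io

def read_csv(text, sep=",", header=True, quotechar='"'):
    """Sketch of a CSV-import interface with the options from the discussion:
    a field separator, a header flag (take column names from the first row,
    or generate placeholder names), and a quote character. Returns
    (column_names, rows); every cell stays a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep, quotechar=quotechar))
    if header:
        # First row supplies the column names.
        names, data = rows[0], rows[1:]
    else:
        # Generate placeholder names; "_c<i>" is a convention assumed here.
        names = ["_c%d" % i for i in range(len(rows[0]))]
        data = rows
    return names, data

# A quoted field may contain the separator without splitting the column.
names, data = read_csv('a|b\n1|"x|y"\n', sep="|")
# names == ["a", "b"]; data == [["1", "x|y"]]
```

The quoting behavior above is what R's `quote` and pandas' `quotechar` control; disabling quoting entirely (R's `quote = ""`) would correspond to passing `quoting=csv.QUOTE_NONE` to `csv.reader`.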