I've been doing a bunch of work with CSVs in Spark, mostly saving them as a
merged CSV (instead of the various part-nnnnn files). You might find the
following links useful:

- This article is about combining the part files and outputting a header as
the first line in the merged results:

http://java.dzone.com/articles/spark-write-csv-file-header

- This was my take on the previous author's article, but it doesn't yet
handle the header row (the merge step itself is sketched below):

http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
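
For reference, here's roughly what the merge step in both posts boils down
to, using Hadoop's FileUtil.copyMerge. Treat it as a sketch: the paths are
placeholders, and emitting the header row exactly once is the part that
needs the extra care described in the first article.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Concatenate the part-nnnnn files produced by saveAsTextFile into a
// single CSV file. Paths are hypothetical.
def mergeToSingleCsv(srcDir: String, dstFile: String): Unit = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  FileUtil.copyMerge(fs, new Path(srcDir), fs, new Path(dstFile),
    false /* deleteSource */, conf, null /* addString */)
}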

spark-csv helps with reading CSV data and mapping a schema onto it for
Spark SQL, but as of now it doesn't save CSV data.
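
Reading with it looks something like this (the path and column names are
made up, and the API has been changing, so check the project README for
the current signatures):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._  // adds csvFile to SQLContext

val sc = new SparkContext(new SparkConf().setAppName("csv-read"))
val sqlContext = new SQLContext(sc)

// Read a headered CSV; column names come from the header row.
val records = sqlContext.csvFile("/data/input.csv")
records.registerTempTable("records")

// Select a subset of columns with Spark SQL (names are hypothetical).
val subset = sqlContext.sql("SELECT name, price FROM records")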

On Fri Feb 06 2015 at 9:49:06 AM Sean Owen <so...@cloudera.com> wrote:

> You can do this manually without much trouble: get your files on a
> distributed store like HDFS, read them with textFile, filter out
> headers, parse with a CSV library like Commons CSV, select columns,
> format and store the result. That's tens of lines of code.
>
> However you probably want to start by looking at
> https://github.com/databricks/spark-csv which may make it even easier
> than that and give you a richer query syntax.
>
> On Fri, Feb 6, 2015 at 8:37 AM, Spico Florin <spicoflo...@gmail.com>
> wrote:
> > Hi!
> >   I'm new to Spark. I have a case study where the data is stored in CSV
> > files. These files have headers with more than 1000 columns. I would like
> > to know the best practices for parsing them, in particular the following
> > points:
> > 1. Getting and parsing all the files from a folder
> > 2. What CSV parser do you use?
> > 3. I would like to select just the columns whose names match a pattern,
> > pass the selected columns' values (plus the column names) on to the
> > processing, and save the output to a CSV (preserving the selected
> > columns).
> >
> > If you have any experience with the points above, it would be really
> > helpful (for me and for others who run into the same cases) if you could
> > share your thoughts.
> > Thanks.
> >   Regards,
> >  Florin
> >
>
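
P.S. For Florin's three points, Sean's manual recipe might look roughly
like the following. It's only a sketch: the folder, output path, and the
column-name pattern are invented, and the output lines aren't re-quoted
(use Commons CSV's CSVPrinter if your values can contain commas).

import org.apache.commons.csv.{CSVFormat, CSVParser}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

val sc = new SparkContext(new SparkConf().setAppName("csv-columns"))

// 1. Read every CSV in a folder; textFile accepts a directory or a glob.
val lines = sc.textFile("/data/csv-folder/*.csv")

// 2. Parse one line with Commons CSV so quoting/escaping is handled.
def parse(line: String): Array[String] =
  CSVParser.parse(line, CSVFormat.DEFAULT)
    .getRecords.asScala.head.iterator().asScala.toArray

// 3. Keep the indices of columns whose header name matches a pattern
//    ("sensor_.*" is a made-up example).
val headerLine = lines.first()
val header = parse(headerLine)
val keep = header.indices.filter(i => header(i).matches("sensor_.*"))

// Drop the header from every input file (assuming they're identical),
// project the selected columns, and write the result with its own header.
val rows = lines.filter(_ != headerLine)
  .map(parse)
  .map(row => keep.map(row(_)).mkString(","))
val outHeader = sc.parallelize(Seq(keep.map(header(_)).mkString(",")))
(outHeader ++ rows).saveAsTextFile("/data/output")

Running the copyMerge sketch from earlier over /data/output should then
give you one CSV with the header as the first line.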
