You can do this manually without much trouble: get your files on a
distributed store like HDFS, read them with textFile, filter out
headers, parse with a CSV library like Commons CSV, select columns,
format and store the result. That's tens of lines of code.
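
For reference, here is a minimal sketch of that manual route, assuming
the Scala RDD API plus Commons CSV; the HDFS paths and the "sensor_.*"
column-name pattern are placeholders, not anything from your data:

    import scala.collection.JavaConverters._
    import org.apache.commons.csv.{CSVFormat, CSVParser}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("csv-select"))

    // 1. Read every CSV file in the folder from HDFS.
    val lines = sc.textFile("hdfs:///data/input/*.csv")

    // Parse a single CSV line into its fields with Commons CSV.
    def parse(line: String): Vector[String] =
      CSVParser.parse(line, CSVFormat.DEFAULT)
        .getRecords.get(0).iterator().asScala.toVector

    // 2. Use the header row to find the indices of the wanted columns.
    val headerLine = lines.first()
    val header     = parse(headerLine)
    val keep = header.zipWithIndex.collect {
      case (name, i) if name.matches("sensor_.*") => i
    }

    // 3. Drop header rows, project the selected columns, write out.
    //    (mkString is naive; quote/escape properly for real data.)
    lines.filter(_ != headerLine)
      .map(parse)
      .map(fields => keep.map(fields(_)).mkString(","))
      .saveAsTextFile("hdfs:///data/output")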

However, you probably want to start by looking at
https://github.com/databricks/spark-csv, which may make it even easier
than that and give you a richer query syntax.
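
For comparison, a rough sketch of the same job with spark-csv. It
assumes a Spark version with the DataFrame read/write interface
(sqlContext.read / df.write); the paths and the column-name pattern are
placeholders, same as above:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.col

    val sqlContext = new SQLContext(sc)

    // Read the whole folder; header = true picks up column names.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///data/input/*.csv")

    // Keep only the columns whose names match a pattern.
    val wanted = df.columns.filter(_.matches("sensor_.*")).map(col)
    val projected = df.select(wanted: _*)

    // Write the result back as CSV, header included.
    projected.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///data/output")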

On Fri, Feb 6, 2015 at 8:37 AM, Spico Florin <spicoflo...@gmail.com> wrote:
> Hi!
>   I'm new to Spark. I have a case study where the data is stored in CSV
> files. These files have headers with more than 1000 columns. I would like
> to know the best practices for parsing them, in particular the following
> points:
> 1. Getting and parsing all the files from a folder
> 2. What CSV parser do you use?
> 3. I would like to select just the columns whose names match a pattern,
> then pass the selected columns' values (plus the column names) to the
> processing, and save the output to a CSV (preserving the selected columns).
>
> If you have any experience with the points above, it would be really helpful
> (for me and for others who encounter the same cases) if you could share
> your thoughts.
> Thanks.
>   Regards,
>  Florin
>
