Parsing CSV files in Spark

2015-02-06 Thread Spico Florin
Hi! I'm new to Spark. I have a case study where the data is stored in CSV files. These files have headers with more than 1000 columns. I would like to know the best practices for parsing them, in particular the following points: 1. Getting and parsing all the files from a folder 2.
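For point 1, SparkContext.textFile accepts a directory or a glob, so a whole folder of CSVs can be read as one RDD of lines. A minimal sketch, assuming an HDFS path and that every file shares the same header row (both are placeholders, not from the original question):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("csv-folder"))

// textFile accepts a directory or a glob, so every file under the
// folder becomes part of a single RDD of lines (path is a placeholder)
val lines = sc.textFile("hdfs:///data/csv-input/*.csv")

// drop header lines, assuming all files carry an identical header
val header = lines.first()
val rows = lines.filter(_ != header)

println(s"data rows: ${rows.count()}")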

Re: Parsing CSV files in Spark

2015-02-06 Thread Charles Feduke
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a merged CSV (instead of the various part-n files). You might find the following links useful: - This article is about combining the part files and outputting a header as the first line in the merged results:
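One way to get a merged CSV with the header first, sketched here rather than taken from the article: put the header in its own single-partition RDD, union it ahead of the data, and coalesce to one partition. The header string and paths are made up:

// the header goes in a single-partition RDD so it comes first
val header = sc.parallelize(Seq("id,name,value"), 1)  // hypothetical header
val merged = header.union(dataRows)

// coalesce(1) without shuffle concatenates partitions in order,
// so the header line precedes the data in the lone part file
merged.coalesce(1).saveAsTextFile("hdfs:///out/merged-csv")

The output is still a directory containing a single part-00000 file; hadoop fs -getmerge (or Hadoop's FileUtil.copyMerge) can collapse it into one plain file.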

Re: Parsing CSV files in Spark

2015-02-06 Thread Sean Owen
You can do this manually without much trouble: get your files on a distributed store like HDFS, read them with textFile, filter out headers, parse with a CSV library like Commons CSV, select columns, format and store the result. That's tens of lines of code. However, you probably want to start by
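A sketch of that manual pipeline, assuming Commons CSV is on the classpath; the paths and the selected column indices are placeholders:

import java.io.StringReader
import org.apache.commons.csv.CSVFormat
import scala.collection.JavaConverters._

val lines = sc.textFile("hdfs:///data/wide-csv/")

// filter out header lines (all files assumed to share one header)
val header = lines.first()
val parsed = lines.filter(_ != header).map { line =>
  // Commons CSV handles quoting and embedded commas that a naive
  // split(",") would break on; note that line-at-a-time parsing
  // assumes no newlines embedded inside quoted fields
  val rec = CSVFormat.DEFAULT.parse(new StringReader(line)).getRecords.asScala.head
  (rec.get(0), rec.get(5))  // select two columns by index
}

// format and store the result
parsed.map { case (a, b) => s"$a,$b" }.saveAsTextFile("hdfs:///out/selected")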

Re: Parsing CSV files in Spark

2015-02-06 Thread Mohit Jaggi
As Sean said, this is just a few lines of code. You can see an example here: https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DF.scala#L660 On Feb 6, 2015,
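Without reproducing the code at that link, a common few-lines variant drops the header by partition index instead of by equality. A sketch, valid when the input is a single file whose first partition starts with the header:

// drop the first line of the first partition; with one input file that
// line is the header (multi-file folders are better served by the
// filter-by-equality approach shown earlier in the thread)
val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}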