Hi!
I'm new to Spark. I have a case study where the data is stored in CSV
files. The files have headers and more than 1000 columns. I would like
to know the best practices for parsing them, in particular the
following points:
1. Getting and parsing all the files from a folder
2.
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a
merged CSV (instead of the various part-n files). You might find the
following links useful:
- This article shows how to combine the part files and output a header as
the first line of the merged result:
You can do this manually without much trouble: get your files onto a
distributed store like HDFS, read them with textFile, filter out the
headers, parse the lines with a CSV library like Commons CSV, select
the columns you need, then format and store the result. That's tens of
lines of code.
However you probably want to start by
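The steps Sean describes can be sketched roughly as below. This is a
minimal, untested sketch against Spark's RDD API; the paths and column
names ("id", "value") are hypothetical, and the naive comma split should
be replaced with a real parser such as Commons CSV if fields can contain
quoted commas:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParseCsvFolder {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-parse"))

    // textFile accepts a directory or glob, so this reads every file in the folder
    val lines = sc.textFile("hdfs:///data/csv-folder/*.csv")

    // Take the header once, so we can drop it from every file
    // and map column names to positions
    val header = lines.first()
    val colIndex = header.split(",").zipWithIndex.toMap

    val rows = lines
      .filter(_ != header)     // drop the repeated header lines
      .map(_.split(",", -1))   // naive split; use Commons CSV for quoted fields

    // Select a couple of columns by name (hypothetical names)
    val wanted = Seq("id", "value").map(colIndex)
    val selected = rows.map(r => wanted.map(r(_)).mkString(","))

    selected.saveAsTextFile("hdfs:///data/out")
    sc.stop()
  }
}
```

With over 1000 columns, the zipWithIndex map keeps column selection by
name cheap regardless of how wide the files are.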
As Sean said, this is just a few lines of code. You can see an example here:
https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DF.scala#L660
On Feb 6, 2015,