I've been doing a bunch of work with CSVs in Spark, mostly saving them as a merged CSV (instead of the various part-nnnnn files). You might find the following links useful:
- This article is about combining the part files and outputting a header as the first line in the merged results: http://java.dzone.com/articles/spark-write-csv-file-header
- This was my take on the previous author's original article, but it doesn't yet handle the header row: http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/

spark-csv helps with reading CSV data and mapping a schema for Spark SQL, but as of now doesn't save CSV data.

On Fri Feb 06 2015 at 9:49:06 AM Sean Owen <so...@cloudera.com> wrote:

> You can do this manually without much trouble: get your files on a
> distributed store like HDFS, read them with textFile, filter out
> headers, parse with a CSV library like Commons CSV, select columns,
> format and store the result. That's tens of lines of code.
>
> However, you probably want to start by looking at
> https://github.com/databricks/spark-csv which may make it even easier
> than that and give you a richer query syntax.
>
> On Fri, Feb 6, 2015 at 8:37 AM, Spico Florin <spicoflo...@gmail.com> wrote:
> > Hi!
> > I'm new to Spark. I have a case study where the data is stored in CSV
> > files. These files have headers with more than 1000 columns. I would like
> > to know the best practices for parsing them, in particular the
> > following points:
> > 1. Getting and parsing all the files from a folder
> > 2. What CSV parser do you use?
> > 3. I would like to select just some columns whose names match a pattern,
> > and then pass the selected columns' values (plus the column names) to the
> > processing and save the output to a CSV (preserving the selected columns).
> >
> > If you have any experience with the points above, it would be really helpful
> > (for me and for the others who will encounter the same cases) if you can
> > share your thoughts.
> > Thanks.
> > Regards,
> > Florin
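For what it's worth, the driver-side merge described in the first link above boils down to writing the header once and then concatenating the part files in order. Here is a minimal sketch in plain Python, assuming the part files have already been copied down to a local directory (on HDFS you would typically use Hadoop's FileUtil.copyMerge instead); the paths and header string are made up for illustration:

```python
import glob
import os

def merge_part_files(parts_dir, out_path, header):
    """Write `header` once, then append every part-NNNNN file in order."""
    # part files sort lexicographically (part-00000, part-00001, ...),
    # which preserves the original partition order
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(out_path, "w") as out:
        out.write(header + "\n")
        for part in part_files:
            with open(part) as f:
                for line in f:
                    out.write(line)
```

The same idea applies on HDFS: write the header to the target first, then stream each part file into it.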
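And to make Sean's "tens of lines of code" concrete for point 3: the per-record logic of parsing rows and keeping only columns whose header matches a pattern is ordinary CSV handling. A sketch using Python's csv module (column names and the pattern are hypothetical; in a real Spark job each step would run over an RDD via textFile/map rather than over an in-memory string):

```python
import csv
import io
import re

def select_columns(csv_text, pattern):
    """Parse CSV text, keep only the columns whose header name matches
    `pattern`, and return the reduced rows (header row first)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # indices of the columns we want to preserve
    keep = [i for i, name in enumerate(header) if re.search(pattern, name)]
    return [[row[i] for i in keep] for row in [header] + data]
```

With 1000+ columns, computing `keep` once from the header and broadcasting it to the workers avoids re-matching the pattern on every row.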