You can do this manually without much trouble: put your files on a distributed store like HDFS, read them with textFile, filter out the header lines, parse each line with a CSV library like Commons CSV, select the columns you need, then format and store the result. That's tens of lines of code.
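A minimal sketch of that manual route, assuming the RDD API with Commons CSV on the classpath. The paths and the column pattern ("col_.*") are illustrative placeholders, and it assumes every file shares the same header line and no field contains an embedded newline:

    import org.apache.commons.csv.{CSVFormat, CSVParser}
    import scala.collection.JavaConverters._

    // sc: an existing SparkContext (e.g. from spark-shell)
    val lines  = sc.textFile("hdfs:///data/in/*.csv")   // every CSV in the folder
    val header = lines.first()

    // Parse one physical line with Commons CSV.
    def fields(line: String): IndexedSeq[String] =
      CSVParser.parse(line, CSVFormat.DEFAULT)
        .getRecords.asScala.head.asScala.toIndexedSeq

    val names = fields(header)
    val keep  = names.indices.filter(i => names(i).matches("col_.*"))

    val selected = lines
      .filter(_ != header)                              // drop each file's header line
      .map { l => val f = fields(l); keep.map(f).mkString(",") }  // naive quoting on output

    // Prepend the selected header row and store the result.
    (sc.parallelize(Seq(keep.map(names).mkString(","))) ++ selected)
      .saveAsTextFile("hdfs:///data/out")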
However, you probably want to start by looking at https://github.com/databricks/spark-csv, which may make it even easier than that and gives you a richer query syntax; a sketch of that route follows the quoted message below.

On Fri, Feb 6, 2015 at 8:37 AM, Spico Florin <spicoflo...@gmail.com> wrote:

> Hi!
> I'm new to Spark. I have a case study where the data is stored in CSV
> files. These files have headers with more than 1000 columns. I would like
> to know the best practices for parsing them, in particular on the
> following points:
> 1. Getting and parsing all the files from a folder
> 2. What CSV parser do you use?
> 3. I would like to select just the columns whose names match a pattern
> and then pass the selected columns' values (plus the column names) on to
> the processing and save the output to a CSV (preserving the selected
> columns).
>
> If you have any experience with the points above, it would be really
> helpful (for me and for others who encounter the same cases) if you
> could share your thoughts.
> Thanks.
> Regards,
> Florin
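If you do go the spark-csv route, the whole job collapses to a few DataFrame calls. A minimal sketch, assuming the Spark 1.4+ reader/writer API and the same illustrative paths and column pattern as in the manual version above:

    import org.apache.spark.sql.SQLContext

    // sc: an existing SparkContext (e.g. from spark-shell)
    val sqlContext = new SQLContext(sc)

    // Read every CSV in the folder, treating the first line as the header.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///data/in/*.csv")

    // Keep only the columns whose names match the (illustrative) pattern.
    val selected = df.select(df.columns.filter(_.matches("col_.*")).map(df.col): _*)

    // Write the result back out as CSV, headers included.
    selected.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///data/out")

On older Spark releases the package documents equivalent load helpers; the idea is the same either way.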