Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning.

Thanks
Best Regards

On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote:
> Thanks very much, Akhil. That solved my problem.
> Best, Rex

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this?

val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)

It's Scala btw; the Python API should be similar.
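[A usage note, not part of the original message: join on pair RDDs keys by the first tuple element and yields (key, (left, right)) pairs, so the joined result above could be unpacked like this minimal sketch:]

// join yields (key, (valueFromHugeData, lineFromGenderData)) pairs
joined_data.take(5).foreach { case (key, (hugeVal, genderLine)) =>
  println(s"$key -> $hugeVal | $genderLine")
}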

How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-12 Thread Rex X
To be concrete, say we have a folder with thousands of tab-delimited csv files with the following attribute format (each csv file is about 10GB):

id    name    address    city    ...
1     Matt    add1       LA      ...
2     Will    add2       LA      ...
3     Lucy    add3       SF      ...
...

And we have a
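[For reference, a minimal sketch of the flow the subject line asks about: reading all csv files under one folder, keeping N columns, and taking the top M rows. Not part of the original thread; the folder path and column indices are hypothetical.]

import org.apache.spark.{SparkConf, SparkContext}

// Local context for the sketch; in spark-shell, `sc` already exists.
val sc = new SparkContext(new SparkConf().setAppName("filter-top-m").setMaster("local[*]"))

// A glob path reads every .csv under the folder as one RDD of lines.
val rows = sc.textFile("/path/to/folder/*.csv")

// Keep only the wanted columns, e.g. id, name, city (indices are assumptions).
val projected = rows.map(_.split("\t")).map(f => (f(0), f(1), f(3)))

// Pull the first M rows back to the driver.
val M = 100
projected.take(M).foreach(println)

sc.stop()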