You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.
Thanks
Best Regards
On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote:
Thanks very much, Akhil.
That solved my problem.
Best,
Rex
On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das wrote:
Something like this?
val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)
It's Scala btw, the Python API should be similar.
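For anyone following along in Python, here is a minimal, Spark-free sketch of the same keyed join. The sample rows are made up for illustration (not the real files), and plain lists plus a dict stand in for RDDs, mimicking the (key, (left, right)) shape that RDD.join produces:

```python
# Plain-Python sketch of the keyed join above (sample rows are made up).
huge_rows = ["1\tMatt", "2\tWill", "3\tLucy"]    # id \t name
gender_rows = ["Matt\tM", "Will\tM", "Lucy\tF"]  # name \t gender

# Key each dataset the way the Scala snippet does: (name, id) and (name, line).
huge_data = [(r.split("\t")[1], r.split("\t")[0]) for r in huge_rows]
gender_data = [(r.split("\t")[0], r) for r in gender_rows]

# Hash join on the key, mimicking RDD.join's (key, (left, right)) output.
gender_by_name = dict(gender_data)
joined = [(name, (id_, gender_by_name[name]))
          for name, id_ in huge_data if name in gender_by_name]
print(joined[0])  # ('Matt', ('1', 'Matt\tM'))
```

In real Spark the join is distributed by key across the cluster, but the output pairs have this same shape.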
To be concrete, say we have a folder with thousands of tab-delimited csv
files with the following attribute format (each csv file is about 10GB):
id    name    address    city    ...
1     Matt    add1       LA      ...
2     Will    add2       LA      ...
3     Lucy    add3       SF      ...
...
And we have a