You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning.
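For instance, one knob that guide covers is switching the serializer to Kryo. A minimal, untested sketch of how that could look in PySpark (the app name and setting here are placeholders for illustration, not a recommendation):

from pyspark import SparkConf, SparkContext

# Hypothetical config sketch: Kryo serialization is one of the options
# discussed in the Spark tuning guide linked above.
conf = (SparkConf()
        .setAppName("csv-join")  # placeholder app name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)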
Thanks
Best Regards

On Mon, Jun 15, 2015 at 10:28 PM, Rex X <dnsr...@gmail.com> wrote:

> Thanks very much, Akhil.
>
> That solved my problem.
>
> Best,
> Rex
>
> On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> Something like this?
>>
>> val huge_data = sc.textFile("/path/to/first.csv").map(x =>
>>   (x.split("\t")(1), x.split("\t")(0)))
>> val gender_data = sc.textFile("/path/to/second.csv").map(x =>
>>   (x.split("\t")(0), x))
>>
>> val joined_data = huge_data.join(gender_data)
>>
>> joined_data.take(1000)
>>
>> It's Scala, btw; the Python API should be similar.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Jun 13, 2015 at 12:16 AM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> To be concrete, say we have a folder with thousands of tab-delimited
>>> csv files with the following attribute format (each csv file is about
>>> 10GB):
>>>
>>> id    name    address    city    ...
>>> 1     Matt    add1       LA      ...
>>> 2     Will    add2       LA      ...
>>> 3     Lucy    add3       SF      ...
>>> ...
>>>
>>> And we have a lookup table based on "name" above:
>>>
>>> name    gender
>>> Matt    M
>>> Lucy    F
>>> ...
>>>
>>> Now we want to output the top 1000 rows of each csv file in the
>>> following format:
>>>
>>> id    name    gender
>>> 1     Matt    M
>>> ...
>>>
>>> Can we use pyspark to handle this efficiently?
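Since the original question asked about pyspark, here is a rough, untested PySpark sketch of Akhil's Scala snippet above, using the same placeholder paths and column positions from the thread:

# (name, id) pairs from the big tab-delimited files:
# name is column 1, id is column 0
huge_data = sc.textFile("/path/to/first.csv") \
    .map(lambda x: (x.split("\t")[1], x.split("\t")[0]))

# (name, full row) pairs from the lookup file, keyed on column 0
gender_data = sc.textFile("/path/to/second.csv") \
    .map(lambda x: (x.split("\t")[0], x))

# join on name, then pull the first 1000 results back to the driver
joined_data = huge_data.join(gender_data)
joined_data.take(1000)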