Something like this?

val huge_data = sc.textFile("/path/to/first.csv")
  .map(x => (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv")
  .map(x => (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)

It's Scala, by the way; the Python API should be similar.

Thanks
Best Regards

On Sat, Jun 13, 2015 at 12:16 AM, Rex X <dnsr...@gmail.com> wrote:
> To be concrete, say we have a folder with thousands of tab-delimited csv
> files with the following attribute format (each csv file is about 10GB):
>
> id    name    address    city...
> 1     Matt    add1       LA...
> 2     Will    add2       LA...
> 3     Lucy    add3       SF...
> ...
>
> And we have a lookup table based on "name" above:
>
> name    gender
> Matt    M
> Lucy    F
> ...
>
> Now we want to output the top 1000 rows of each csv file in the
> following format:
>
> id    name    gender
> 1     Matt    M
> ...
>
> Can we use pyspark to handle this efficiently?
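To make the join's output shape concrete, here is a plain-Python sketch of the same keying and join logic on two tiny in-memory samples (the sample rows are made up for illustration; in Spark, RDD.join pairs each key with a tuple of the left and right values in the same way):

```python
# Tiny in-memory stand-ins for the two files.
# Big file rows: id \t name \t address \t city
huge_rows = ["1\tMatt\tadd1\tLA", "2\tWill\tadd2\tLA", "3\tLucy\tadd3\tSF"]
# Lookup rows: name \t gender
gender_rows = ["Matt\tM", "Lucy\tF"]

# Key each big-file row by name, with the id as the value
# (mirrors the first .map above).
huge_data = [(r.split("\t")[1], r.split("\t")[0]) for r in huge_rows]
# Key each lookup row by name, keeping the whole line as the value
# (mirrors the second .map above).
gender_data = [(r.split("\t")[0], r) for r in gender_rows]

# Inner join on the name key, like RDD.join: each result is
# (key, (left_value, right_value)); unmatched keys ("Will") drop out.
lookup = dict(gender_data)
joined = [(name, (id_, lookup[name]))
          for name, id_ in huge_data if name in lookup]
print(joined)  # [('Matt', ('1', 'Matt\tM')), ('Lucy', ('3', 'Lucy\tF'))]
```

From there, one more map over the joined pairs can pull out the id, name, and gender columns into the requested output format.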