Something like this?

val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))   // key by name, value = id
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x.split("\t")(1)))   // key by name, value = gender

val joined_data = huge_data.join(gender_data)   // records are (name, (id, gender))

joined_data.take(1000)


It's Scala, btw; the Python API should be similar.
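Since you asked about PySpark, here is a rough equivalent using the RDD API (a sketch under the same assumptions: tab-delimited files, name in column 1 of the big file and column 0 of the lookup table, and sc is an existing SparkContext):

# key the big file by name (column 1), keep id (column 0) as the value
huge_data = sc.textFile("/path/to/first.csv") \
    .map(lambda x: (x.split("\t")[1], x.split("\t")[0]))

# key the lookup table by name (column 0), keep gender (column 1)
gender_data = sc.textFile("/path/to/second.csv") \
    .map(lambda x: (x.split("\t")[0], x.split("\t")[1]))

# records come out as (name, (id, gender))
joined_data = huge_data.join(gender_data)

joined_data.take(1000)

If the lookup table is small enough to fit in memory, collecting it into a dict and broadcasting it with sc.broadcast would avoid shuffling the large files; the plain join above is just the simplest starting point.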

Thanks
Best Regards

On Sat, Jun 13, 2015 at 12:16 AM, Rex X <dnsr...@gmail.com> wrote:

> To be concrete, say we have a folder with thousands of tab-delimited csv
> files with the following attribute format (each csv file is about 10GB):
>
>     id    name    address    city...
>     1    Matt    add1    LA...
>     2    Will    add2    LA...
>     3    Lucy    add3    SF...
>     ...
>
> And we have a lookup table based on "name" above
>
>     name    gender
>     Matt    M
>     Lucy    F
>     ...
>
> Now we are interested in outputting the top 1000 rows of each csv file
> in the following format:
>
>     id    name    gender
>     1    Matt    M
>     ...
>
> Can we use pyspark to efficiently handle this?
>