You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.
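
For the PySpark side of the original question, a rough equivalent of the
Scala snippet quoted below would be something like this (an untested
sketch; the paths and tab-delimited column positions are assumptions):

    huge_data = sc.textFile("/path/to/first.csv") \
        .map(lambda x: (x.split("\t")[1], x.split("\t")[0]))  # (name, id)
    gender_data = sc.textFile("/path/to/second.csv") \
        .map(lambda x: (x.split("\t")[0], x))                 # (name, full line)

    joined_data = huge_data.join(gender_data)  # (name, (id, full line))
    joined_data.take(1000)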

Thanks
Best Regards

On Mon, Jun 15, 2015 at 10:28 PM, Rex X <dnsr...@gmail.com> wrote:

> Thanks very much, Akhil.
>
> That solved my problem.
>
> Best,
> Rex
>
>
>
> On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Something like this?
>>
>> val huge_data = sc.textFile("/path/to/first.csv").map(x =>
>>   (x.split("\t")(1), x.split("\t")(0)))
>> val gender_data = sc.textFile("/path/to/second.csv").map(x =>
>>   (x.split("\t")(0), x))
>>
>> val joined_data = huge_data.join(gender_data)
>>
>> joined_data.take(1000)
>>
>>
>> It's Scala btw; the Python API should be similar.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Jun 13, 2015 at 12:16 AM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> To be concrete, say we have a folder with thousands of tab-delimited csv
>>> files with the following format (each csv file is about 10GB):
>>>
>>>     id    name    address    city...
>>>     1    Matt    add1    LA...
>>>     2    Will    add2    LA...
>>>     3    Lucy    add3    SF...
>>>     ...
>>>
>>> And we have a lookup table based on "name" above
>>>
>>>     name    gender
>>>     Matt    M
>>>     Lucy    F
>>>     ...
>>>
>>> Now we want to output the top 1000 rows of each csv file in the
>>> following format:
>>>
>>>     id    name    gender
>>>     1    Matt    M
>>>     ...
>>>
>>> Can we use pyspark to efficiently handle this?
>>>
>>>
>>>
>>
>
