Thanks, Don! Does the SQL implementation of Spark do parallel processing
on records by default?

-Rex


On Sat, Jun 13, 2015 at 10:13 AM, Don Drake <dondr...@gmail.com> wrote:

> Take a look at https://github.com/databricks/spark-csv to read in the
> tab-delimited files (change the default delimiter from a comma to a
> tab), and once you have them as a DataFrame, SQL can do the rest.
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html
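>
> Something along these lines should work in PySpark (a rough, untested
> sketch: it assumes Spark 1.4+ with the spark-csv package available,
> e.g. via --packages com.databricks:spark-csv_2.10:1.0.3, and the paths
> below are placeholders):
>
>     from pyspark import SparkContext
>     from pyspark.sql import SQLContext
>
>     sc = SparkContext(appName="csv-join")
>     sqlContext = SQLContext(sc)
>
>     # Read every tab-delimited file under the folder into one
>     # DataFrame; spark-csv's "delimiter" option overrides the
>     # default comma.
>     people = (sqlContext.read
>               .format("com.databricks.spark.csv")
>               .option("header", "true")
>               .option("delimiter", "\t")
>               .load("/data/people/*.csv"))
>
>     lookup = (sqlContext.read
>               .format("com.databricks.spark.csv")
>               .option("header", "true")
>               .option("delimiter", "\t")
>               .load("/data/lookup.csv"))
>
>     # Register both as temp tables so the projection and the join
>     # against the gender lookup can be written as plain SQL.
>     people.registerTempTable("people")
>     lookup.registerTempTable("lookup")
>
>     result = sqlContext.sql("""
>         SELECT p.id, p.name, l.gender
>         FROM people p
>         JOIN lookup l ON p.name = l.name
>         LIMIT 100000
>     """)
>     result.show()
>
> One caveat: LIMIT 100000 caps the combined result, not each input
> file; taking the top 100K rows per file would need extra bookkeeping.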
>
> -Don
>
>
> On Fri, Jun 12, 2015 at 8:46 PM, Rex X <dnsr...@gmail.com> wrote:
>
>> Hi,
>>
>> I want to use Spark to select N columns and the top M rows of all CSV
>> files under a folder.
>>
>> To be concrete, say we have a folder with thousands of tab-delimited
>> CSV files in the following format (each file is about 10 GB):
>>
>>     id    name    address    city...
>>     1    Matt    add1    LA...
>>     2    Will    add2    LA...
>>     3    Lucy    add3    SF...
>>     ...
>>
>> And we have a lookup table keyed on the "name" column above:
>>
>>     name    gender
>>     Matt    M
>>     Lucy    F
>>     ...
>>
>> Now we want to take the top 100K rows of each CSV file and output
>> them in the following format:
>>
>>     id    name    gender
>>     1    Matt    M
>>     ...
>>
>> Can we use PySpark to handle this efficiently?
>>
>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> http://www.MailLaunder.com/
> 800-733-2143
>
