Thanks, Don! Does Spark's SQL implementation do parallel processing on records by default?
-Rex

On Sat, Jun 13, 2015 at 10:13 AM, Don Drake <dondr...@gmail.com> wrote:
> Take a look at https://github.com/databricks/spark-csv to read in the
> tab-delimited file (change the default delimiter),
> and once you have that as a DataFrame, SQL can do the rest.
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> -Don
>
> On Fri, Jun 12, 2015 at 8:46 PM, Rex X <dnsr...@gmail.com> wrote:
>> Hi,
>>
>> I want to use Spark to select N columns and the top M rows of all CSV
>> files under a folder.
>>
>> To be concrete, say we have a folder with thousands of tab-delimited
>> CSV files in the following attribute format (each file is about 10 GB):
>>
>> id    name    address    city ...
>> 1     Matt    add1       LA ...
>> 2     Will    add2       LA ...
>> 3     Lucy    add3       SF ...
>> ...
>>
>> And we have a lookup table based on "name" above:
>>
>> name    gender
>> Matt    M
>> Lucy    F
>> ...
>>
>> Now we want to output, from the top 100K rows of each CSV file, the
>> following format:
>>
>> id    name    gender
>> 1     Matt    M
>> ...
>>
>> Can we use PySpark to handle this efficiently?
>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> http://www.MailLaunder.com/
> 800-733-2143
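For reference, the transformation Don describes (take the top M rows of a file, inner-join on "name" against the lookup table, and keep id/name/gender) can be sketched in plain single-machine Python on toy data; this is exactly the logic a spark-csv DataFrame plus a SQL join would distribute across the cluster. The sample rows, the lookup values, the M=2 cutoff, and the helper name `top_m_with_gender` are illustrative, not from the thread.

```python
import csv
import io

# Toy stand-ins for one tab-delimited CSV file and the name -> gender lookup.
csv_text = (
    "id\tname\taddress\tcity\n"
    "1\tMatt\tadd1\tLA\n"
    "2\tWill\tadd2\tLA\n"
    "3\tLucy\tadd3\tSF\n"
)
lookup = {"Matt": "M", "Lucy": "F"}

def top_m_with_gender(text, m):
    """Take the first m data rows, inner-join on name, keep id/name/gender."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    out = []
    for i, row in enumerate(reader):
        if i >= m:
            break
        # Inner join: rows whose name is missing from the lookup are dropped.
        if row["name"] in lookup:
            out.append({"id": row["id"], "name": row["name"],
                        "gender": lookup[row["name"]]})
    return out

print(top_m_with_gender(csv_text, 2))
# → [{'id': '1', 'name': 'Matt', 'gender': 'M'}]  (Will has no lookup entry)
```

In Spark the same steps map onto DataFrame operations: load with the delimiter option set to tab, `limit()` for the row cutoff, and a join against the lookup table; note that a `limit` on a DataFrame loaded from a whole folder applies globally, so a strict "top 100K per file" would need the files processed individually.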