RE: Processing multiple columns in parallel

2015-05-18 Thread Needham, Guy
From: ayan guha [mailto:guha.a...@gmail.com] Sent: 18 May 2015 15:46 To: Laeeq Ahmed Cc: user@spark.apache.org Subject: Re: Processing multiple columns in parallel My first thought would be creating 10 RDDs and running your word count on each of them. I think the Spark scheduler is going to resolve the dependencies in parallel and launch 10 jobs.

Re: Processing multiple columns in parallel

2015-05-18 Thread ayan guha
My first thought would be creating 10 RDDs and running your word count on each of them. I think the Spark scheduler is going to resolve the dependencies in parallel and launch 10 jobs. Best Ayan On 18 May 2015 23:41, "Laeeq Ahmed" wrote: > Hi, > > Consider I have a tab-delimited text file with 10 columns. Each column is a set of text.
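A minimal sketch of this approach, assuming Spark 1.x Scala on the driver; the cache() call, the parallel collection, the output paths, and the wordCountForColumn helper (the pipeline itself is spelled out under the original message below) are my additions, not from the thread:

    import org.apache.spark.rdd.RDD

    // Hypothetical helper: the per-column word count pipeline, one
    // saveAsTextFile action (i.e. one Spark job) per column.
    def wordCountForColumn(data: RDD[String], i: Int): Unit =
      data.map(_.split("\t")(i))
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile(s"hdfs://namenode/wordcount_col_$i") // hypothetical output path

    val data = sc.textFile("hdfs://namenode/data.txt").cache() // share one input RDD across all 10 jobs

    // Actions submitted from a single driver thread run one after another.
    // Submitting them from separate threads (here via a Scala parallel
    // collection) lets the scheduler run the 10 jobs concurrently.
    (0 until 10).par.foreach(i => wordCountForColumn(data, i))

Spark's scheduler is thread-safe, so submitting jobs from multiple driver threads is a supported pattern; whether the 10 jobs actually overlap depends on the executors and scheduler configuration available.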

Processing multiple columns in parallel

2015-05-18 Thread Laeeq Ahmed
Hi, Consider I have a tab-delimited text file with 10 columns. Each column is a set of text. I would like to do a word count for each column. In Scala, I would do the following RDD transformation and action:

    val data = sc.textFile("hdfs://namenode/data.txt")
    for (i <- 0 until 9) {
        data.map
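The snippet is cut off by the archive's preview; a hedged completion of the intended loop, where the split characters and output path are my assumptions, and the bound is widened to 0 until 10 since 0 until 9 only covers indices 0 to 8, i.e. 9 of the 10 columns:

    val data = sc.textFile("hdfs://namenode/data.txt")
    for (i <- 0 until 10) {                        // one pass per column, indices 0..9
      data.map(_.split("\t")(i))                   // pick column i of the tab-delimited line
          .flatMap(_.split(" "))                   // split that column's text into words
          .map(word => (word, 1))                  // classic word count
          .reduceByKey(_ + _)
          .saveAsTextFile(s"hdfs://namenode/wordcount_col_$i") // hypothetical output path
    }

Written this way, the driver submits the 10 jobs sequentially; getting them to run in parallel is what ayan's reply above addresses.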