OK, this is how I solved it. It's not elegant at all, but it works and I need to move ahead at this time. Converting to a pair RDD is now not required.
reacRdd.map(line => line.split(','))
  .map(fields => {
    if (fields.length >= 10 && !fields(0).contains("VAERS_ID")) {
      fields(0) + "," + fields(1) + "\t" +
        fields(0) + "," + fields(3) + "\t" +
        fields(0) + "," + fields(5) + "\t" +
        fields(0) + "," + fields(7) + "\t" +
        fields(0) + "," + fields(9)
    } else {
      ""
    }
  })
  .flatMap(str => str.split('\t'))
  .filter(line => line.length > 0)
  .saveAsTextFile("/data/vaers/msfx/reac/" + outFile)
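(For comparison, a minimal sketch, assuming the same comma-delimited input and field positions as above, that emits the "id,term" records from a single flatMap so the tab-join and second split are not needed; this is an illustration, not the code actually used.)

// Sketch only: emit "VAERS_ID,term" records directly from one flatMap.
// Field positions (1, 3, 5, 7, 9) mirror the snippet above.
reacRdd.map(line => line.split(','))
  .flatMap { fields =>
    if (fields.length >= 10 && !fields(0).contains("VAERS_ID"))
      Seq(fields(1), fields(3), fields(5), fields(7), fields(9))
        .map(term => fields(0) + "," + term)
    else
      Seq.empty[String]
  }
  .saveAsTextFile("/data/vaers/msfx/reac/" + outFile)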

From: Sanjay Subramanian <sanjaysubraman...@yahoo.com.INVALID>
To: Hitesh Khamesra <hiteshk...@gmail.com>
Cc: Kapil Malik <kma...@adobe.com>; Sean Owen <so...@cloudera.com>; "user@spark.apache.org" <user@spark.apache.org>
Sent: Thursday, January 1, 2015 12:39 PM
Subject: Re: FlatMapValues

thanks let me try that out
 

From: Hitesh Khamesra <hiteshk...@gmail.com>
To: Sanjay Subramanian <sanjaysubraman...@yahoo.com>
Cc: Kapil Malik <kma...@adobe.com>; Sean Owen <so...@cloudera.com>; "user@spark.apache.org" <user@spark.apache.org>
Sent: Thursday, January 1, 2015 9:46 AM
Subject: Re: FlatMapValues

How about this: apply flatMap per line, and in that function, parse each line and return all the columns you need.
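(A minimal sketch of that suggestion, assuming the VAERS-style layout quoted further down, where the id is the first field and the reaction terms sit in columns 1, 3, 5, 7 and 9; the names here are illustrative only.)

// Sketch: one flatMap that parses each line and emits all (id, term) pairs.
reacRdd.flatMap { line =>
  val fields = line.split(',')
  if (fields.length >= 10 && !fields(0).contains("VAERS_ID"))
    Seq(1, 3, 5, 7, 9).map(i => (fields(0), fields(i)))
  else
    Seq.empty[(String, String)]
}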


On Wed, Dec 31, 2014 at 10:16 AM, Sanjay Subramanian 
<sanjaysubraman...@yahoo.com.invalid> wrote:

Hey guys,
Some of you may care :-) but this is just to give you some background on where I am going with this. I have an iOS medical side effects app called MedicalSideFx. I built the entire underlying data layer aggregation using Hadoop, and currently the search is based on Lucene. I am re-architecting the data layer by replacing Hadoop with Spark and integrating FDA data, Canadian side effects data and vaccine side effects data.
@Kapil, sorry, but flatMapValues is being reported as undefined.
To give you a complete picture of the code (it's inside IntelliJ, but that's only for testing... the real code runs in spark-shell on my cluster):
https://github.com/sanjaysubramanian/msfx_scala/blob/master/src/main/scala/org/medicalsidefx/common/utils/AersReacColumnExtractor.scala

If you were to assume the dataset is
025003,Delirium,8.10,Hypokinesia,8.10,Hypotonia,8.10,,,,
025005,Arthritis,8.10,Injection site oedema,8.10,Injection site reaction,8.10,,,,

With the present version of the code, the flatMap works but only gives me the values
Delirium
Hypokinesia
Hypotonia
Arthritis
Injection site oedema
Injection site reaction


What I need is
025003,Delirium
025003,Hypokinesia
025003,Hypotonia
025005,Arthritis
025005,Injection site oedema
025005,Injection site reaction

thanks
sanjay
From: Kapil Malik <kma...@adobe.com>
To: Sean Owen <so...@cloudera.com>; Sanjay Subramanian <sanjaysubraman...@yahoo.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Sent: Wednesday, December 31, 2014 9:35 AM
Subject: RE: FlatMapValues

Hi Sanjay,

Oh yes, on flatMapValues: it's defined in PairRDDFunctions, and you need to import org.apache.spark.SparkContext._ to use it
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions).
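(A minimal sketch of that import plus a flatMapValues call, assuming Spark 1.x as in this thread; the pairs value below is an illustrative name, not code from this thread.)

// Spark 1.x: this import brings in the implicit conversion to PairRDDFunctions,
// which is where flatMapValues is defined.
import org.apache.spark.SparkContext._

// pairs: RDD[(String, String)] whose value is a tab-joined list of terms.
// flatMapValues splits each value and keeps the key, yielding one (id, term) per term.
val exploded = pairs.flatMapValues(v => v.split('\t'))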

@Sean, yes indeed flatMap / flatMapValues both can be used.

Regards,

Kapil 



-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: 31 December 2014 21:16
To: Sanjay Subramanian
Cc: user@spark.apache.org
Subject: Re: FlatMapValues

From the clarification below, the problem is that you are calling flatMapValues, which is only available on an RDD of key-value tuples. Your map function returns a tuple in one case but a String in the other, so your RDD is a bunch of Any, which is not at all what you want. You need to return a tuple in both cases, which is what Kapil pointed out.

However, it's still not quite what you want. Your input is basically [key value1 value2 value3], so you want to flatMap that to (key,value1), (key,value2), (key,value3). flatMapValues does not come into play.
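(A minimal sketch of that plain-flatMap shape, using an illustrative inputRdd of lines like "key,value1,value2,value3"; the identifiers here are assumptions, not code from this thread.)

// Each "key,value1,value2,..." line becomes (key,value1), (key,value2), ...
// Malformed lines produce no output instead of an empty string.
val pairs = inputRdd.flatMap { line =>
  val fields = line.split(',')
  if (fields.length >= 2) fields.tail.toSeq.map(v => (fields(0), v))
  else Seq.empty[(String, String)]
}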

On Wed, Dec 31, 2014 at 3:25 PM, Sanjay Subramanian 
<sanjaysubraman...@yahoo.com> wrote:
> My understanding is as follows
>
> STEP 1 (This would create a pair RDD)
> =======
>
> reacRdd.map(line => line.split(','))
>   .map(fields => {
>     if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
>       (fields(0), fields(1) + "\t" + fields(3) + "\t" + fields(5) + "\t" + fields(7) + "\t" + fields(9))
>     } else {
>       ""
>     }
>   })
>
> STEP 2
> =======
> Since the previous step created a pair RDD, I thought the flatMapValues
> method would be applicable.
> But the code does not even compile, saying that flatMapValues is not
> applicable to the RDD :-(
>
>
> reacRdd.map(line => line.split(','))
>   .map(fields => {
>     if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
>       (fields(0), fields(1) + "\t" + fields(3) + "\t" + fields(5) + "\t" + fields(7) + "\t" + fields(9))
>     } else {
>       ""
>     }
>   })
>   .flatMapValues(skus => skus.split('\t'))
>   .saveAsTextFile("/data/vaers/msfx/reac/" + outFile)
>
>
> SUMMARY
> =======
> when a dataset looks like the following
>
> 1,red,blue,green
> 2,yellow,violet,pink
>
> I want to output the following, and I am asking how do I do that?
> Perhaps my code is 100% wrong. Please correct me and educate me :-)
>
> 1,red
> 1,blue
> 1,green
> 2,yellow
> 2,violet
> 2,pink
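(For that toy dataset, a minimal sketch that produces exactly this output; the RDD name and paths are illustrative only, not from this thread.)

// Sketch: for lines like "1,red,blue,green", emit "1,red", "1,blue", "1,green".
val colorsRdd = sc.textFile("/path/to/colors.csv")   // illustrative input path
colorsRdd.flatMap { line =>
  val fields = line.split(',')
  fields.tail.toSeq.map(v => fields(0) + "," + v)
}.saveAsTextFile("/path/to/output")                  // illustrative output path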

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

