How about this: apply flatMap once per line, and in that function parse each line and return all the columns you need.
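A minimal sketch of that approach, shown on a plain Scala Seq so it runs without a Spark cluster (on an RDD the same flatMap closure applies); the sample rows are taken from the dataset further down the thread:

```scala
// Each input line looks like: key,reaction1,score1,reaction2,score2,...,,,
val lines = Seq(
  "025003,Delirium,8.10,Hypokinesia,8.10,Hypotonia,8.10,,,,",
  "025005,Arthritis,8.10,Injection site oedema,8.10,Injection site reaction,8.10,,,,"
)

// One flatMap per line: parse out the key, then emit one "key,reaction"
// string for every non-empty reaction column (the fields after the key,
// taking the first of each (reaction, score) pair).
val pairs = lines.flatMap { line =>
  val fields = line.split(',') // trailing empty fields are dropped by split
  val key = fields(0)
  fields.drop(1).grouped(2).map(_.head).filter(_.nonEmpty).map(r => s"$key,$r")
}

pairs.foreach(println)
// 025003,Delirium
// 025003,Hypokinesia
// ...
```

On the real data the same closure would be passed to `reacRdd.flatMap`, and the result written out with `saveAsTextFile`.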
On Wed, Dec 31, 2014 at 10:16 AM, Sanjay Subramanian
<sanjaysubraman...@yahoo.com.invalid> wrote:

> hey guys
>
> Some of u may care :-) but this is just to give u a background on where I
> am going with this. I have an iOS medical side effects app called
> MedicalSideFx. I built the entire underlying data layer aggregation using
> Hadoop, and currently the search is based on Lucene. I am re-architecting
> the data layer by replacing Hadoop with Spark and integrating FDA data,
> Canadian side effects data and vaccine side effects data.
>
> @Kapil, sorry, but flatMapValues is being reported as undefined.
>
> To give u a complete picture of the code (it's inside IntelliJ, but that's
> only for testing... the real code runs in spark-shell on my cluster):
>
> https://github.com/sanjaysubramanian/msfx_scala/blob/master/src/main/scala/org/medicalsidefx/common/utils/AersReacColumnExtractor.scala
>
> If u were to assume the dataset is
>
> 025003,Delirium,8.10,Hypokinesia,8.10,Hypotonia,8.10,,,,
> 025005,Arthritis,8.10,Injection site oedema,8.10,Injection site reaction,8.10,,,,
>
> the flatMap in the present version of the code works, but only gives me
> the values:
>
> Delirium
> Hypokinesia
> Hypotonia
> Arthritis
> Injection site oedema
> Injection site reaction
>
> What I need is:
>
> 025003,Delirium
> 025003,Hypokinesia
> 025003,Hypotonia
> 025005,Arthritis
> 025005,Injection site oedema
> 025005,Injection site reaction
>
> thanks
>
> sanjay
>
> ------------------------------
> *From:* Kapil Malik <kma...@adobe.com>
> *To:* Sean Owen <so...@cloudera.com>; Sanjay Subramanian
> <sanjaysubraman...@yahoo.com>
> *Cc:* "user@spark.apache.org" <user@spark.apache.org>
> *Sent:* Wednesday, December 31, 2014 9:35 AM
> *Subject:* RE: FlatMapValues
>
> Hi Sanjay,
>
> Oh yes ..
> on flatMapValues: it's defined in PairRDDFunctions, and you need to
> import org.apache.spark.SparkContext._ to use it
> (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions).
>
> @Sean, yes indeed, flatMap / flatMapValues can both be used.
>
> Regards,
>
> Kapil
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: 31 December 2014 21:16
> To: Sanjay Subramanian
> Cc: user@spark.apache.org
> Subject: Re: FlatMapValues
>
> From the clarification below, the problem is that you are calling
> flatMapValues, which is only available on an RDD of key-value tuples.
> Your map function returns a tuple in one case but a String in the other,
> so your RDD is a bunch of Any, which is not at all what you want. You
> need to return a tuple in both cases, which is what Kapil pointed out.
>
> However, it's still not quite what you want. Your input is basically
> [key value1 value2 value3], so you want to flatMap that to (key,value1)
> (key,value2) (key,value3). flatMapValues does not come into play.
>
> On Wed, Dec 31, 2014 at 3:25 PM, Sanjay Subramanian
> <sanjaysubraman...@yahoo.com> wrote:
> > My understanding is as follows.
> >
> > STEP 1 (this would create a pair RDD)
> > =======
> >
> > reacRdd.map(line => line.split(',')).map(fields => {
> >   if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
> >     (fields(0), (fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t"+fields(9)))
> >   }
> >   else {
> >     ""
> >   }
> > })
> >
> > STEP 2
> > =======
> > Since the previous step created a pair RDD, I thought the flatMapValues
> > method would be applicable.
> > But the code does not even compile, saying that flatMapValues is not
> > applicable to RDD :-(
> >
> > reacRdd.map(line => line.split(',')).map(fields => {
> >   if (fields.length >= 11 && !fields(0).contains("VAERS_ID")) {
> >     (fields(0), (fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t"+fields(9)))
> >   }
> >   else {
> >     ""
> >   }
> > }).flatMapValues(skus => skus.split('\t'))
> >   .saveAsTextFile("/data/vaers/msfx/reac/" + outFile)
> >
> > SUMMARY
> > =======
> > When a dataset looks like the following:
> >
> > 1,red,blue,green
> > 2,yellow,violet,pink
> >
> > I want to output the following, and I am asking how do I do that?
> > Perhaps my code is 100% wrong. Please correct me and educate me :-)
> >
> > 1,red
> > 1,blue
> > 1,green
> > 2,yellow
> > 2,violet
> > 2,pink
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
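Putting Sean's suggestion into code: a sketch of the fix for the [key,value1,value2,value3] case from the SUMMARY, shown on a plain Scala Seq so it runs without a cluster (on the real reacRdd the same flatMap chain would precede saveAsTextFile):

```scala
// Sample rows in the key,value1,value2,value3 shape from the SUMMARY.
val reacLines = Seq("1,red,blue,green", "2,yellow,violet,pink")

// A single flatMap replaces the map + flatMapValues attempt. Every branch
// returns a Seq of (key, value) tuples, so the element type stays
// (String, String) instead of collapsing to Any.
val result = reacLines.flatMap { line =>
  val fields = line.split(',')
  if (fields.length >= 2 && !fields(0).contains("VAERS_ID"))
    fields.drop(1).toSeq.map(v => (fields(0), v))
  else
    Seq.empty[(String, String)] // skip header / malformed lines, emit nothing
}

result.foreach { case (k, v) => println(s"$k,$v") }
// 1,red
// 1,blue
// ...
```

If flatMapValues is preferred instead, both branches of the map must return a (key, values) tuple, e.g. `(fields(0), "")` rather than a bare `""` in the else branch, and on Spark 1.x `import org.apache.spark.SparkContext._` is needed to bring PairRDDFunctions into scope on the resulting pair RDD.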