Hmm, that sounds like it could be done in a custom OutputFormat, but I'm not familiar enough with custom OutputFormats to say that's the right thing to do.
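For what it's worth, a minimal sketch of that idea using Hadoop's MultipleTextOutputFormat (the subclass name, the helper object, and the key-per-directory naming scheme here are illustrative, not an established recipe):

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Hypothetical naming scheme: one subdirectory per key, kept in a plain
// object so the logic is visible (and testable) without a Hadoop cluster.
object KeyFileNaming {
  def fileNameFor(key: String, leafName: String): String = key + "/" + leafName
}

// Hypothetical OutputFormat: routes each record to a file named after its key.
class KeyBasedTextOutputFormat extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    KeyFileNaming.fileNameFor(key, name)

  // Return null so only the value (not the key) is written into each file.
  override def generateActualKey(key: String, value: String): String = null
}

// Usage (sketch), assuming pairRDD: RDD[(String, String)]:
// pairRDD.saveAsHadoopFile("/out/path", classOf[String], classOf[String],
//   classOf[KeyBasedTextOutputFormat])
```

The appeal over splitting the RDD is that this writes all the per-key files in a single pass over the data.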
On Tue, Jun 3, 2014 at 10:23 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi Andrew,
>
> Thanks for your answer.
>
> The reason for the question: I've been trying to contribute to the
> community by helping answer Spark-related questions on Stack Overflow.
>
> (A note on that: given the growing volume on the user list lately, I think
> it will need to scale out to other venues, so helping at SO will further
> contribute to the mainstream adoption of Spark.)
>
> I came across this question [1] on how to save parts of an RDD to
> different HDFS files. I looked into the implementation of saveAsTextFile.
> The delegation path terminates in PairRDD.saveAsHadoopDataset, and it
> looks like the implementation is tightly coupled to the RDD data, so the
> easiest way to solve the problem at hand would be to create several RDDs
> from the original RDD.
>
> The issue I see is that 'sc.makeRDD(v.toSeq)' will potentially blow up
> when trying to materialize the iterator into a seq. I also don't know what
> the behaviour of that call to SparkContext will be on a remote worker.
>
> My current conclusion is that the best option would be to roll my own
> saveHdfsFile(...)
>
> Would you agree?
>
> -greetz, Gerard.
>
> [1]
> http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
>
>
> On Mon, Jun 2, 2014 at 11:44 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Hi Gerard,
>>
>> Usually when I want to split one RDD into several, I'm better off
>> re-thinking the algorithm to do all the computation at once. Example:
>>
>> Suppose you had a dataset of tuples (URL, webserver, pageSizeBytes), and
>> you wanted to find the average page size that each webserver (e.g.
>> Apache, nginx, IIS, etc.) served.
>> Rather than splitting your allPagesRDD into an RDD for each webserver,
>> like nginxRDD, apacheRDD, IISRDD, it's probably better to do the average
>> computation over all of them at once, like this:
>>
>> // allPagesRDD is (URL, webserver, pageSizeBytes)
>> allPagesRDD.keyBy(getWebserver)
>>   .map { case (ws, page) => (ws, (page.pageSizeBytes, 1L)) }
>>   .reduceByKey( (a, b) => (a._1 + b._1, a._2 + b._2) )
>>   .mapValues( v => v._1 / v._2 )
>>
>> For this example you could use something like Summingbird to keep from
>> doing the average tracking yourself.
>>
>> Can you go into more detail about why you want to split one RDD into
>> several?
>>
>>
>> On Mon, Jun 2, 2014 at 1:13 PM, Gerard Maas <gerard.m...@gmail.com>
>> wrote:
>>
>>> The RDD API has functions to join multiple RDDs, such as PairRDD.join
>>> or PairRDD.cogroup, that take another RDD as input, e.g.
>>> firstRDD.join(secondRDD)
>>>
>>> I'm looking for ways to do the opposite: split an existing RDD. What is
>>> the right way to create derived RDDs from an existing RDD?
>>>
>>> e.g. imagine I have a collection of pairs as input: colRDD =
>>> (k1->v1)...(kx->vy)...
>>> I could do:
>>> val byKey = colRDD.groupByKey() = (k1->(k1->v1, ... k1->vn)), ...
>>> (kn->(kn->vy, ...))
>>>
>>> Now, I'd like to create an RDD from the values to have something like:
>>>
>>> val groupedRDDs = (k1->RDD(k1->v1, ... k1->vn), kn->RDD(kn->vy, ...))
>>>
>>> In this example, there's an f(byKey) = groupedRDDs. What's that f(x)?
>>>
>>> Would byKey.map { case (k, v) => k -> sc.makeRDD(v.toSeq) } be the
>>> right/recommended way to do this? Any other options?
>>>
>>> Thanks,
>>>
>>> Gerard.
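If you really do need separate RDDs per key (rather than computing everything at once), one sketch that avoids calling sc.makeRDD on the workers is to collect only the distinct keys to the driver and build one filter transformation per key (the splitByKey name is mine; only the key list is materialized, the per-key RDDs stay lazy):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch: split a pair RDD into one lazily-evaluated RDD per distinct key.
// Only the key list is collected to the driver; each per-key RDD is just a
// filter over the source, so nothing else is materialized up front.
def splitByKey[K: ClassTag, V: ClassTag](pairs: RDD[(K, V)]): Map[K, RDD[(K, V)]] = {
  val keys = pairs.keys.distinct().collect()
  keys.map(k => k -> pairs.filter { case (key, _) => key == k }).toMap
}
```

Caveat: each downstream action re-scans the source RDD, so you'd want to cache it first, and with many keys this is still O(#keys) passes over the data, which is why the single-pass custom OutputFormat route mentioned at the top of the thread scales better for the save-to-many-files case.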