Re: How to create RDDs from another RDD?

2014-06-03 Thread Andrew Ash
Hmm, that sounds like it could be done in a custom OutputFormat, but I'm not familiar enough with custom OutputFormats to say that's the right thing to do.

On Tue, Jun 3, 2014 at 10:23 AM, Gerard Maas wrote:
> Hi Andrew,
>
> Thanks for your answer.
>
> The reason for the question: I've been tryin

Re: How to create RDDs from another RDD?

2014-06-03 Thread Gerard Maas
Hi Andrew, Thanks for your answer. The reason for the question: I've been trying to contribute to the community by helping answer Spark-related questions on Stack Overflow. (Note on that: given the growing volume on the user list lately, I think it will need to scale out to other venues, so he

Re: How to create RDDs from another RDD?

2014-06-02 Thread Andrew Ash
Hi Gerard, Usually when I want to split one RDD into several, I'm better off rethinking the algorithm to do all the computation at once. Example: suppose you had a dataset of tuples (URL, webserver, pageSizeBytes), and you wanted to find out the average page size that each webserver (e
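
The message is cut off in this archive view, so what follows is only a rough sketch of the "do all the computation at once" idea, not Andrew's actual code. It uses plain Python with no Spark dependency; the function name `average_page_size_by_server` and the sample records are hypothetical. Instead of splitting the data into one collection per webserver, a single pass keeps a running (sum, count) per key. In Spark the same shape would typically be a map to (webserver, (size, 1)) followed by `reduceByKey`.

```python
from collections import defaultdict


def average_page_size_by_server(records):
    """Compute the mean page size per webserver in a single pass.

    `records` is an iterable of (url, webserver, page_size_bytes) tuples.
    This is a hypothetical helper for illustration, not part of any Spark API.
    """
    # webserver -> [sum of page sizes, number of pages seen]
    totals = defaultdict(lambda: [0, 0])
    for _url, webserver, page_size_bytes in records:
        acc = totals[webserver]
        acc[0] += page_size_bytes
        acc[1] += 1
    # Turn each (sum, count) accumulator into an average.
    return {server: total / count for server, (total, count) in totals.items()}


# Made-up sample data, for illustration only.
sample = [
    ("http://a.example/1", "apache", 100),
    ("http://a.example/2", "apache", 300),
    ("http://b.example/1", "nginx", 50),
]
averages = average_page_size_by_server(sample)
```

The point of the design is that no per-webserver collection is ever materialised: every record is visited once, and only one small accumulator per key is kept.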

How to create RDDs from another RDD?

2014-06-02 Thread Gerard Maas
The RDD API has functions to join multiple RDDs, such as PairRDD.join or PairRDD.cogroup, that take another RDD as input, e.g. firstRDD.join(secondRDD). I'm looking for ways to do the opposite: split an existing RDD. What is the right way to create derived RDDs from an existing RDD? e.g. imagine