Hi Matei,
I ran into this behaviour in a test case for a moving average operation on a
DStream.
I create RDDs and add them to a queue, from which a DStream is created using a
time window.
I've changed the collection to an Array now, so no conversion is done on it.
My input collection:
val rdd1Data = Array(
  ("purchaseHonda", dateToLong(0, 1, 1, 1, 1, 2010), 1),
  ("purchaseHonda", dateToLong(0, 1, 1, 1, 1, 2010), 1),
  ("purchaseHonda", dateToLong(0, 1, 1, 1, 1, 2010), 1),
  ("purchaseFord", dateToLong(0, 1, 1, 1, 1, 2010), 1))
val rdd2Data = ...
Then later on I place them into the queue like this:
rddQueue += ssc.sparkContext.makeRDD(rdd1Data)
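For context, the stream is wired up roughly like this (a minimal sketch, not
the exact test; rddQueue, inputStream, and windowed are my names, the window
durations are placeholders, and I'm assuming dateToLong returns a Long):

import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds

// queue of hand-built RDDs feeding the test stream
val rddQueue = new Queue[RDD[(String, Long, Int)]]()
// DStream over the queue, windowed before any processing
val inputStream = ssc.queueStream(rddQueue)
val windowed = inputStream.window(Seconds(4), Seconds(2))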
I traverse the reduceByKey'd DStream (the key is formed from the first 2
fields and the 3rd field is a count); on each RDD I do a collect, traverse the
resulting collection as well, and add the numbers to a moving average
function. (I could just print them instead.) The numbers come in the wrong
order for all RDDs.
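That traversal looks roughly like this (again a sketch; movingAverage.add
stands in for my moving average function, which isn't shown here):

// key on (name, date), keep the count as the value, reduce per window
val reduced = windowed
  .map { case (name, date, count) => ((name, date), count) }
  .reduceByKey(_ + _)

reduced.foreachRDD { rdd =>
  // collect each RDD back to the driver and feed the counts through
  rdd.collect().foreach { case ((name, date), count) =>
    movingAverage.add(count) // could just println(count) instead
  }
}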
-A
From: Matei Zaharia [mailto:[email protected]]
Sent: April-23-14 2:58 PM
To: [email protected]
Subject: Re: default spark partitioner
It should keep them in order, but what kind of collection do you have? Maybe
toArray changes the order.
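(For example, hash-based collections make no insertion-order guarantee, so
toArray on one can come back reordered:)

import scala.collection.immutable.HashMap

val m = HashMap("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4, "e" -> 5)
println(m.toArray.mkString(", ")) // not necessarily in insertion order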
Matei
On Apr 23, 2014, at 8:21 AM, Adrian Mocanu
<[email protected]<mailto:[email protected]>> wrote:
How does the default Spark partitioner partition RDD data? Does it keep the
data in order?
I'm asking because I'm generating an RDD by hand via
`ssc.sparkContext.makeRDD(collection.toArray)`, and when I collect and iterate
over the result, the data is in a different order than in the initial
collection the RDD was created from.
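Reduced to a minimal sketch (collection here is a stand-in for my real data):

// hand-built RDD from a local collection, collected straight back
val collection = List(1, 2, 3, 4, 5)
val rdd = ssc.sparkContext.makeRDD(collection.toArray)
println(rdd.collect().mkString(", ")) // I expected 1, 2, 3, 4, 5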
-Adrian