Hi Matei,
I got this behaviour in a test case. I was testing a moving average operation
on a DStream.
I create RDDs and add them to a queue, from which a DStream is created using a
time window.

I have changed the collection to an Array now, so no conversion is done on it.

My input collection:
val rdd1Data = Array(
  ("purchaseHonda", dateToLong(0, 1, 1, 1, 1, 2010), 1),
  ("purchaseHonda", dateToLong(0, 1, 1, 1, 1, 2010), 1),
  ("purchaseHonda", dateToLong(0, 1, 1, 1, 1, 2010), 1),
  ("purchaseFord", dateToLong(0, 1, 1, 1, 1, 2010), 1))
val rdd2Data = ...

Then later on I place them into the queue like this:
    rddQueue += ssc.sparkContext.makeRDD(rdd1Data)
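
For context, here is roughly how the queue and the DStream are wired up (a
simplified sketch; the variable names are illustrative and ssc is the
StreamingContext from the test setup):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Queue of RDDs backing the test DStream; each queued RDD is emitted
    // as one batch by the streaming context.
    val rddQueue = new mutable.Queue[RDD[(String, Long, Int)]]()
    val stream = ssc.queueStream(rddQueue)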

I traverse the reduceByKey'd DStream (the key is formed from the first two
fields; the third field is a count). On each RDD I call collect, traverse the
resulting collection as well, and feed the numbers into a moving average
function (I could just print them instead); see the sketch below. The numbers
come in the wrong order for all RDDs.
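
A minimal sketch of that traversal, assuming the keying shown here and a
hypothetical movingAvg accumulator (not the exact test code):

    // Pair-DStream operations (reduceByKey) come from this import on the
    // Spark version I am using.
    import org.apache.spark.streaming.StreamingContext._

    // Key on (name, date); sum the counts in the third field.
    val reduced = stream
      .map { case (name, date, count) => ((name, date), count) }
      .reduceByKey(_ + _)

    reduced.foreachRDD { rdd =>
      // collect() concatenates the partitions in partition order; it is the
      // order of these elements that differs from the input order.
      for (((name, date), count) <- rdd.collect())
        movingAvg.add(count)  // movingAvg: hypothetical moving-average helper
    }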

-A

From: Matei Zaharia [mailto:[email protected]]
Sent: April-23-14 2:58 PM
To: [email protected]
Subject: Re: default spark partitioner

It should keep them in order, but what kind of collection do you have? Maybe 
toArray changes the order.
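
For instance, a hash-based collection gives no ordering guarantee, so calling
toArray on one can return the elements in a different order (illustrative
sketch):

    // HashSet does not preserve insertion order, so toArray may yield the
    // elements in a different order than they were added in.
    val s = scala.collection.mutable.HashSet("c", "a", "b")
    println(s.toArray.toSeq)  // element order here is unspecified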

Matei

On Apr 23, 2014, at 8:21 AM, Adrian Mocanu <[email protected]> wrote:


How does the default Spark partitioner partition RDD data? Does it keep the
data in order?

I'm asking because I'm generating an RDD by hand via
`ssc.sparkContext.makeRDD(collection.toArray)`, then I collect and iterate
over the result, but the data is in a different order than in the initial
collection the RDD was created from.
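
Reduced to a minimal sketch (the data here is illustrative):

    // makeRDD splits the array across partitions; collect() should then
    // concatenate those partitions back in partition order.
    val data = Array("a", "b", "c", "d")
    val rdd = ssc.sparkContext.makeRDD(data)
    rdd.collect().foreach(println)  // I expected a, b, c, d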

-Adrian
