In several situations I would like to zip RDDs, knowing that their order 
matches. In particular, I’m using an MLlib KMeansModel on an RDD of Vectors, so I 
would like to do:

myData.zip(myModel.predict(myData))

Also, the first column in my RDD is a timestamp, which I don’t want to be part 
of the model, so in fact I would like to split the first column out of my RDD 
and then do:

myData.zip(myModel.predict(myData.map(dropTimestamp)))

Moreover, I’d like my data to be scaled and to go through a principal component 
analysis first, so the main steps would look like this:

val noTs = myData.map(dropTimestamp)
val scaled = scaler.transform(noTs)
val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
val clusters = myModel.predict(projected)
val result = myData.zip(clusters)

Do you think there’s a chance that the 4 transformations above would preserve 
order so the zip at the end would be correct?
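
One way to sidestep the ordering question is to run every step per record inside 
a single map, so each prediction stays attached to its original row and no zip is 
needed. This is only a sketch under assumptions: the wrapper function name is 
hypothetical, myData is assumed to be an RDD[Vector], scaler a 
StandardScalerModel, and principalComponents a local Matrix as in the steps 
above. Multiplying a row by principalComponents is written here as the 
transposed matrix times the vector, which is what RowMatrix.multiply computes 
for each row.

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.feature.StandardScalerModel
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.rdd.RDD

// Hypothetical helper: the parameter types are assumptions, not taken from the thread.
def clusterWithoutZip(
    myData: RDD[Vector],
    dropTimestamp: Vector => Vector,
    scaler: StandardScalerModel,
    principalComponents: Matrix,
    myModel: KMeansModel): RDD[(Vector, Int)] = {
  // row * PC (what RowMatrix.multiply does per row) equals PC^T * row.
  val pcT = principalComponents.transpose
  myData.map { row =>
    val scaled = scaler.transform(dropTimestamp(row)) // per-vector scaling
    val projected: Vector = pcT.multiply(scaled)      // per-vector projection
    (row, myModel.predict(projected))                 // original row stays with its cluster
  }
}
```

Because each output pair is built from one input record, the result does not 
depend on partition order at all.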


On 2017-09-13 19:51 CEST, lucas.g...@gmail.com wrote:

I'm wondering why you need the order preserved. We've had situations where keeping 
the source as an artificial field in the dataset was important, and I had to go 
through contortions to inject it (in that case the data source had no unique key).

Is this similar?
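
A minimal sketch of the artificial-key idea, assuming the same myData, 
dropTimestamp and myModel as in the original question (the wrapper function 
name is hypothetical): each row gets a synthetic Long id via zipWithIndex, the 
id travels with the record through the transformations, and the results are 
brought back together with a join on that id instead of a positional zip.

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: the parameter types are assumptions, not taken from the thread.
def clusterByKey(
    myData: RDD[Vector],
    dropTimestamp: Vector => Vector,
    myModel: KMeansModel): RDD[(Vector, Int)] = {
  // Attach a synthetic id to every row; cached so both branches below
  // read the same materialised ids where possible.
  val keyed: RDD[(Long, Vector)] = myData.zipWithIndex().map(_.swap).cache()

  // mapValues keeps the id untouched while the payload is transformed.
  val clusters: RDD[(Long, Int)] =
    keyed.mapValues(dropTimestamp).mapValues(v => myModel.predict(v))

  // Re-associate by id rather than by position.
  keyed.join(clusters).values
}
```

The join costs a shuffle, so this trades some performance for not relying on 
element order. Note that the ids are only as stable as the ordering of myData 
itself when it is recomputed.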

On 13 September 2017 at 10:46, Suzen, Mehmet <su...@acm.org> wrote:
But what happens if one of the partitions fails? How does fault tolerance recover 
the elements in the other partitions?

On 13 Sep 2017 18:39, "Ankit Maloo" <ankitmaloo1...@gmail.com> wrote:
AFAIK, the order of an RDD is maintained within a partition for map operations. 
There is no way a map operation can change the sequence within a partition, as 
the partition is local and computation happens one record at a time.

On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" <su...@acm.org> wrote:
I think order has no meaning in RDDs. See this post, especially regarding the zip method:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
