RE: RDD order preservation through transformations

2017-09-15 Thread johan.grande.ext
Well, the dataframes make it easier to work on some columns of the data only 
and to store results in new columns, removing the need to zip it all back 
together and thus to preserve order.


On 2017-09-05 14:04 CEST, mehmet.su...@gmail.com wrote:

Hi Johan,
DataFrames are built on top of RDDs, so I am not sure the ordering issues are 
different there. Maybe you could create a simulated data set that is just 
large enough, plus an example series of transformations, to experiment on.
Best,
-m

Mehmet Süzen, MSc, PhD




On 15 September 2017 at 09:44,   wrote:
> Thanks all for your answers. After reading the provided links I am still 
> uncertain of the details of what I'd need to do to get my calculations right 
> with RDDs. However I discovered DataFrames and Pipelines on the "ML" side of 
> the libs and I think they'll be better suited to my needs.
>
> Best,
> Johan Grande

_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou 
falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.



Re: RDD order preservation through transformations

2017-09-15 Thread Suzen, Mehmet
Hi Johan,
DataFrames are built on top of RDDs, so I am not sure the ordering
issues are different there. Maybe you could create a simulated data
set that is just large enough, plus an example series of
transformations, to experiment on.
Best,
-m

Mehmet Süzen, MSc, PhD


| PRIVILEGED AND CONFIDENTIAL COMMUNICATION This e-mail transmission,
and any documents, files or previous e-mail messages attached to it,
may contain confidential information that is legally privileged. If
you are not the intended recipient or a person responsible for
delivering it to the intended recipient, you are hereby notified that
any disclosure, copying, distribution or use of any of the information
contained in or attached to this transmission is STRICTLY PROHIBITED
within the applicable law. If you have received this transmission in
error, please: (1) immediately notify me by reply e-mail to
su...@acm.org,  and (2) destroy the original transmission and its
attachments without reading or saving in any manner. |


On 15 September 2017 at 09:44,   wrote:
> Thanks all for your answers. After reading the provided links I am still 
> uncertain of the details of what I'd need to do to get my calculations right 
> with RDDs. However I discovered DataFrames and Pipelines on the "ML" side of 
> the libs and I think they'll be better suited to my needs.
>
> Best,
> Johan Grande

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: RDD order preservation through transformations

2017-09-15 Thread johan.grande.ext
Thanks all for your answers. After reading the provided links I am still 
uncertain of the details of what I'd need to do to get my calculations right 
with RDDs. However I discovered DataFrames and Pipelines on the "ML" side of 
the libs and I think they'll be better suited to my needs.

Best,
Johan Grande





Re: RDD order preservation through transformations

2017-09-14 Thread Suzen, Mehmet
On 14 September 2017 at 10:42,   wrote:
> val noTs = myData.map(dropTimestamp)
>
> val scaled = scaler.transform(noTs)
>
> val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
>
> val clusters = myModel.predict(projected)
>
> val result = myData.zip(clusters)
>
>
>
> Do you think there’s a chance that the 4 transformations above would
> preserve order so the zip at the end would be correct?

AFAIK, no. The sequence of transformations you have is not guaranteed
to preserve order.
Apply zip first, then keep track of the indices through the subsequent
transformations with `_2`, as zip returns tuples.
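
One possible reading of this advice, as a minimal sketch rather than a definitive implementation: attach an index once with zipWithIndex, carry it as the key through the later steps, and join on it at the end instead of relying on zip. The data, the object name IndexedJoinSketch and the function toyCluster are hypothetical; toyCluster only stands in for the scaling, projection and prediction steps of the original question.

```scala
import org.apache.spark.sql.SparkSession

object IndexedJoinSketch {
  // Hypothetical stand-in for scaling + PCA projection + KMeans prediction.
  def toyCluster(features: Array[Double]): Int =
    if (features.sum > 5.0) 1 else 0

  def run(spark: SparkSession): Seq[(Long, Int)] = {
    val sc = spark.sparkContext

    // Hypothetical input rows: (timestamp, features).
    val myData = sc.parallelize(Seq(
      (1000L, Array(1.0, 2.0)),
      (2000L, Array(3.0, 4.0)),
      (3000L, Array(5.0, 6.0))
    ), numSlices = 2)

    // zipWithIndex yields (row, index); the index is `_2`. Make it the key.
    val indexed = myData.zipWithIndex().map(pair => (pair._2, pair._1))

    // mapValues keeps the key, so each cluster id stays attached to its index.
    val clusters = indexed.mapValues { case (_, features) => toyCluster(features) }

    // Join on the index instead of zipping; row order no longer matters.
    val result = indexed.join(clusters).map {
      case (_, ((timestamp, _), cluster)) => (timestamp, cluster)
    }

    // Sort on the driver only to make the printed output stable.
    result.collect().sortBy(_._1).toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("indexed-join-sketch").getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```

The trade-off is that the join shuffles data, whereas zip does not; in exchange the pairing depends on the explicit index rather than on element order.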




Re: RDD order preservation through transformations

2017-09-14 Thread Georg Heiler
Usually Spark ML models specify the columns they use for training, i.e. you
would select only your feature columns (X) for model training, but metadata
such as target labels (y) or your date column would still be present for each
row.
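
As a rough sketch of what this could look like for the steps in the question (scaling, PCA, k-means), assuming a DataFrame with a timestamp column and three numeric feature columns: the column names, the data and the object name ColumnPipelineSketch are all made up for illustration. Each stage reads one column and writes a new one, so the timestamp stays on the same row throughout and no zip is needed.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ColumnPipelineSketch {
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical data: one timestamp column plus three feature columns.
    val df = Seq(
      (1000L, 1.0, 2.0, 0.5),
      (2000L, 1.2, 1.9, 0.7),
      (3000L, 8.0, 9.1, 4.0),
      (4000L, 7.9, 9.3, 4.2),
      (5000L, 1.1, 2.2, 0.4),
      (6000L, 8.2, 8.8, 3.9)
    ).toDF("timestamp", "x1", "x2", "x3")

    // Only the feature columns go into the vector; the timestamp is left out.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))
      .setOutputCol("raw")
    val scaler = new StandardScaler()
      .setInputCol("raw").setOutputCol("scaled")
      .setWithMean(true).setWithStd(true)
    val pca = new PCA()
      .setInputCol("scaled").setOutputCol("projected").setK(2)
    val kmeans = new KMeans()
      .setFeaturesCol("projected").setPredictionCol("cluster")
      .setK(2).setSeed(1L)

    val stages: Array[PipelineStage] = Array(assembler, scaler, pca, kmeans)
    val model = new Pipeline().setStages(stages).fit(df)

    // Every output row still carries its own timestamp next to the cluster id.
    model.transform(df).select("timestamp", "cluster")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("column-pipeline-sketch").getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```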

 wrote on Thu, 14 Sep 2017 at 10:42:

> In several situations I would like to zip RDDs knowing that their order
> matches. In particular I’m using an MLlib KMeansModel on an RDD of Vectors,
> so I would like to do:
>
>
>
> myData.zip(myModel.predict(myData))
>
>
>
> Also the first column in my RDD is a timestamp which I don’t want to be a
> part of the model, so in fact I would like to split the first column out of
> my RDD, then do:
>
>
>
> myData.zip(myModel.predict(myData.map(dropTimestamp)))
>
>
>
> Moreover I’d like my data to be scaled and go through a principal
> component analysis first, so the main steps would be like:
>
>
>
> val noTs = myData.map(dropTimestamp)
>
> val scaled = scaler.transform(noTs)
>
> val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
>
> val clusters = myModel.predict(projected)
>
> val result = myData.zip(clusters)
>
>
>
> Do you think there’s a chance that the 4 transformations above would
> preserve order so the zip at the end would be correct?
>
>
>
>
>
> On 2017-09-13 19:51 CEST, lucas.g...@gmail.com wrote :
>
>
>
> I'm wondering why you need order preserved. We've had situations where
> keeping the source as an artificial field in the dataset was important, and
> I had to go through contortions to inject that (in this case the data source
> had no unique key).
>
>
>
> Is this similar?
>
>
>
> On 13 September 2017 at 10:46, Suzen, Mehmet  wrote:
>
> But what happens if one of the partitions fails? How does fault tolerance
> recover elements in other partitions?
>
>
>
> On 13 Sep 2017 18:39, "Ankit Maloo"  wrote:
>
> AFAIK, the order of an RDD is maintained within a partition for map
> operations. A map operation cannot change the sequence within a partition,
> as a partition is local and computation happens one record at a time.
>
>
>
> On 13-Sep-2017 9:54 PM, "Suzen, Mehmet"  wrote:
>
> I think order has no meaning in RDDs; see this post, especially regarding
> the zip methods:
> https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
>


RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
In several situations I would like to zip RDDs knowing that their order 
matches. In particular I’m using an MLlib KMeansModel on an RDD of Vectors, so 
I would like to do:

myData.zip(myModel.predict(myData))

Also the first column in my RDD is a timestamp which I don’t want to be a part 
of the model, so in fact I would like to split the first column out of my RDD, 
then do:

myData.zip(myModel.predict(myData.map(dropTimestamp)))

Moreover I’d like my data to be scaled and go through a principal component 
analysis first, so the main steps would be like:

val noTs = myData.map(dropTimestamp)
val scaled = scaler.transform(noTs)
val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
val clusters = myModel.predict(projected)
val result = myData.zip(clusters)

Do you think there’s a chance that the 4 transformations above would preserve 
order so the zip at the end would be correct?


On 2017-09-13 19:51 CEST, lucas.g...@gmail.com wrote :

I'm wondering why you need order preserved. We've had situations where keeping 
the source as an artificial field in the dataset was important, and I had to go 
through contortions to inject that (in this case the data source had no unique 
key).

Is this similar?

On 13 September 2017 at 10:46, Suzen, Mehmet wrote:
But what happens if one of the partitions fails? How does fault tolerance 
recover elements in other partitions?

On 13 Sep 2017 18:39, "Ankit Maloo" wrote:
AFAIK, the order of an RDD is maintained within a partition for map operations. 
A map operation cannot change the sequence within a partition, as a partition 
is local and computation happens one record at a time.

On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" wrote:
I think order has no meaning in RDDs; see this post, especially regarding the 
zip methods:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method




RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
(Sorry Mehmet, I'm only now seeing your first reply with the link to SO; it had 
initially gone to my spam folder :-/ )


On 2017-09-14 10:02 CEST, GRANDE Johan Ext DTSI/DSI wrote:

Well if the order cannot be guaranteed in case of a failure (or at all since 
failure can happen transparently), what does it mean to sort an RDD (method 
sortBy)?


On 2017-09-14 03:36 CEST mehmet.su...@gmail.com wrote:

I think it is one of the conceptual differences in Spark compared to other 
languages: there is no indexing in plain RDDs. This was the thread with Ankit:

Yes. So order preservation cannot be guaranteed in the case of failure. I am 
also not sure whether partitions are ordered. Can you get the same sequence of 
partitions in mapPartitions?

On 13 Sep 2017 19:54, "Ankit Maloo"  wrote:
>
> RDDs are fault tolerant, as they can be recomputed using the DAG without 
> storing the intermediate RDDs.
>
> On 13-Sep-2017 11:16 PM, "Suzen, Mehmet"  wrote:
>>
>> But what happens if one of the partitions fails? How does fault tolerance 
>> recover elements in other partitions?




RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
Well if the order cannot be guaranteed in case of a failure (or at all since 
failure can happen transparently), what does it mean to sort an RDD (method 
sortBy)?
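
A small sketch of what sorting gives you in practice, with made-up data and a hypothetical object name SortBySketch. After sortBy, collecting the RDD returns the elements in sorted order, and zipWithIndex numbers them by partition index first and by position within each partition second. My understanding is that a partition lost to a failure is recomputed from its lineage, so the result should be the same as long as every step is deterministic; the sketch does not demonstrate that part.

```scala
import org.apache.spark.sql.SparkSession

object SortBySketch {
  def run(spark: SparkSession): Seq[(Int, Long)] = {
    val sc = spark.sparkContext

    // Hypothetical unsorted input spread over three partitions.
    val unsorted = sc.parallelize(Seq(5, 3, 9, 1, 7), numSlices = 3)

    // After sortBy, each partition holds a sorted range of the elements.
    val sorted = unsorted.sortBy(x => x)

    // The index follows partition order, then order within each partition,
    // so here it is the rank of each element.
    sorted.zipWithIndex().collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("sortby-sketch").getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```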


On 2017-09-14 03:36 CEST mehmet.su...@gmail.com wrote:

I think it is one of the conceptual differences in Spark compared to other 
languages: there is no indexing in plain RDDs. This was the thread with Ankit:

Yes. So order preservation cannot be guaranteed in the case of failure. I am 
also not sure whether partitions are ordered. Can you get the same sequence of 
partitions in mapPartitions?

On 13 Sep 2017 19:54, "Ankit Maloo"  wrote:
>
> RDDs are fault tolerant, as they can be recomputed using the DAG without 
> storing the intermediate RDDs.
>
> On 13-Sep-2017 11:16 PM, "Suzen, Mehmet"  wrote:
>>
>> But what happens if one of the partitions fails? How does fault tolerance 
>> recover elements in other partitions?




Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think it is one of the conceptual differences in Spark compared to
other languages: there is no indexing in plain RDDs. This was the
thread with Ankit:

Yes. So order preservation cannot be guaranteed in the case of
failure. I am also not sure whether partitions are ordered. Can you get
the same sequence of partitions in mapPartitions?
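
One way to look into the partition question is mapPartitionsWithIndex, which passes each partition's index to the function alongside its records. The following is only an exploratory sketch with made-up data and a hypothetical object name PartitionOrderSketch; it shows which partition each record lands in for a parallelized range, not a guarantee about arbitrary lineages.

```scala
import org.apache.spark.sql.SparkSession

object PartitionOrderSketch {
  def run(spark: SparkSession): Seq[(Int, Int)] = {
    val sc = spark.sparkContext

    // Hypothetical input: six numbers split into three partitions.
    val numbers = sc.parallelize(1 to 6, numSlices = 3)

    // Tag every record with the index of the partition that holds it.
    val tagged = numbers.mapPartitionsWithIndex { (partitionId, records) =>
      records.map(value => (partitionId, value))
    }

    // collect() concatenates the partitions in index order.
    tagged.collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("partition-order-sketch").getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```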

On 13 Sep 2017 19:54, "Ankit Maloo"  wrote:
>
> RDDs are fault tolerant, as they can be recomputed using the DAG without 
> storing the intermediate RDDs.
>
> On 13-Sep-2017 11:16 PM, "Suzen, Mehmet"  wrote:
>>
>> But what happens if one of the partitions fails? How does fault tolerance 
>> recover elements in other partitions?




Re: RDD order preservation through transformations

2017-09-13 Thread lucas.g...@gmail.com
I'm wondering why you need order preserved. We've had situations where
keeping the source as an artificial field in the dataset was important, and
I had to go through contortions to inject that (in this case the data source
had no unique key).

Is this similar?

On 13 September 2017 at 10:46, Suzen, Mehmet  wrote:

> But what happens if one of the partitions fails? How does fault tolerance
> recover elements in other partitions?
>
> On 13 Sep 2017 18:39, "Ankit Maloo"  wrote:
>
>> AFAIK, the order of an RDD is maintained within a partition for map
>> operations. A map operation cannot change the sequence within a partition,
>> as a partition is local and computation happens one record at a time.
>>
>> On 13-Sep-2017 9:54 PM, "Suzen, Mehmet"  wrote:
>>
>> I think order has no meaning in RDDs; see this post, especially regarding
>> the zip methods:
>> https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
>>


Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
But what happens if one of the partitions fails? How does fault tolerance
recover elements in other partitions?

On 13 Sep 2017 18:39, "Ankit Maloo"  wrote:

> AFAIK, the order of an RDD is maintained within a partition for map
> operations. A map operation cannot change the sequence within a partition,
> as a partition is local and computation happens one record at a time.
>
> On 13-Sep-2017 9:54 PM, "Suzen, Mehmet"  wrote:
>
> I think order has no meaning in RDDs; see this post, especially regarding
> the zip methods:
> https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
>


Re: RDD order preservation through transformations

2017-09-13 Thread Ankit Maloo
AFAIK, the order of an RDD is maintained within a partition for map
operations. A map operation cannot change the sequence within a partition,
as a partition is local and computation happens one record at a time.
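
A minimal sketch of this point, with made-up data and a hypothetical object name MapZipSketch. RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition; an RDD produced by a single map on the other one meets both conditions, so the pairs line up. This says nothing about pipelines that include a shuffle or a non-deterministic step.

```scala
import org.apache.spark.sql.SparkSession

object MapZipSketch {
  def run(spark: SparkSession): Seq[(String, Int)] = {
    val sc = spark.sparkContext

    // Hypothetical input spread over two partitions.
    val words = sc.parallelize(Seq("a", "bb", "ccc", "dddd"), numSlices = 2)

    // map processes each partition record by record, keeping their order.
    val lengths = words.map(_.length)

    // Same partition count and same element count per partition, so zip
    // pairs each word with its own length.
    words.zip(lengths).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("map-zip-sketch").getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```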

On 13-Sep-2017 9:54 PM, "Suzen, Mehmet"  wrote:

I think order has no meaning in RDDs; see this post, especially regarding
the zip methods:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method



Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think order has no meaning in RDDs; see this post, especially regarding the
zip methods:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
