RE: RDD order preservation through transformations

2017-09-15 Thread johan.grande.ext
Well, DataFrames make it easier to work on only some columns of the data and to store results in new columns, removing the need to zip it all back together and thus to preserve order. On 2017-09-05 14:04 CEST, mehmet.su...@gmail.com wrote: Hi Johan, DataFrames are built on top of

Re: RDD order preservation through transformations

2017-09-15 Thread Suzen, Mehmet
Hi Johan, DataFrames are built on top of RDDs; I'm not sure whether the ordering issues are different there. Maybe you could create a simulated data set that is just large enough, plus an example series of transformations, to experiment on. Best, -m Mehmet Süzen, MSc, PhD

RE: RDD order preservation through transformations

2017-09-15 Thread johan.grande.ext
Thanks all for your answers. After reading the provided links I am still uncertain of the details of what I'd need to do to get my calculations right with RDDs. However I discovered DataFrames and Pipelines on the "ML" side of the libs and I think they'll be better suited to my needs. Best,

Re: RDD order preservation through transformations

2017-09-14 Thread Suzen, Mehmet
On 14 September 2017 at 10:42, wrote:
> val noTs = myData.map(dropTimestamp)
> val scaled = scaler.transform(noTs)
> val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
> val clusters = myModel.predict(projected)
> val result =

Re: RDD order preservation through transformations

2017-09-14 Thread Georg Heiler
Usually Spark ML models specify the columns they use for training, i.e. you would select only your feature columns (X) for model training, but metadata, i.e. target labels or your date column (y), would still be present for each row. wrote on Thu, 14 Sep 2017 at 10:42:

RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
In several situations I would like to zip RDDs knowing that their order matches. In particular, I’m using an MLlib KMeansModel on an RDD of Vectors, so I would like to do: myData.zip(myModel.predict(myData)) Also, the first column in my RDD is a timestamp which I don’t want to be a part of the
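
One way to sidestep the zip question is sketched below, under assumptions. `KMeansModel` also has a `predict` overload that takes a single `Vector`, so the cluster id can be computed inside one `mapValues` and never leaves its timestamp. The object and function names, the sample rows, `k = 2`, the iteration count and the local master are all made up for illustration; this is not the poster's actual pipeline.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object PredictWithoutZip {
  // Hypothetical helper: returns (timestamp, cluster id) pairs
  // without zipping two separately computed RDDs.
  def clusterByTimestamp(rows: RDD[(Long, Vector)],
                         k: Int,
                         maxIterations: Int): RDD[(Long, Int)] = {
    // Train on the features only; the timestamp is not part of the model input.
    val features = rows.values.cache()
    val model = KMeans.train(features, k, maxIterations)
    // Predict per element, so each cluster id stays paired with its timestamp.
    rows.mapValues(v => model.predict(v))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("predict-without-zip").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumed sample input: (timestamp, feature vector) pairs.
      val rows: RDD[(Long, Vector)] = sc.parallelize(Seq(
        (1505290000L, Vectors.dense(0.0, 0.1)),
        (1505290060L, Vectors.dense(0.2, 0.0)),
        (1505290120L, Vectors.dense(9.0, 9.1)),
        (1505290180L, Vectors.dense(9.2, 8.9))
      ))
      clusterByTimestamp(rows, 2, 10).collect().foreach {
        case (ts, cluster) => println(s"$ts -> $cluster")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Because the pairing happens inside a single transformation, the result does not depend on two RDDs having matching order or partitioning.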

RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
(Sorry Mehmet, I only just saw your first reply with the link to SO; it had initially gone to my spam folder :-/ ) On 2017-09-14 10:02 CEST, GRANDE Johan Ext DTSI/DSI wrote: Well, if the order cannot be guaranteed in case of a failure (or at all, since failure can happen transparently), what

RE: RDD order preservation through transformations

2017-09-14 Thread johan.grande.ext
Well, if the order cannot be guaranteed in case of a failure (or at all, since failure can happen transparently), what does it mean to sort an RDD (method sortBy)? On 2017-09-14 03:36 CEST mehmet.su...@gmail.com wrote: I think it is one of the conceptual differences in Spark compared to other

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think it is one of the conceptual differences in Spark compared to other languages: there is no indexing in plain RDDs. This was the thread with Ankit: Yes. So order preservation cannot be guaranteed in the case of failure. I'm also not sure whether partitions are ordered. Can you get the same sequence
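
A common workaround for the lack of indexing, sketched here under assumptions: attach an explicit key once with `zipWithIndex`, carry it through every later transformation, and recombine with `join` on that key instead of `zip`. The object and function names, the sample strings and the local master are hypothetical. The index reflects whatever order the RDD has at the moment `zipWithIndex` runs, so it is a label to carry forward, not a guarantee about the source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ExplicitIndex {
  // Hypothetical helper: pairs every string with a derived value (its length)
  // by joining on an explicit index rather than relying on zip.
  def lengthsByIndex(data: RDD[String]): RDD[(Long, (String, Int))] = {
    // zipWithIndex yields (value, index); swap so the index becomes the key.
    val byIndex = data.zipWithIndex().map { case (value, idx) => (idx, value) }.cache()
    // Any derived RDD keeps the same key as long as only the values change.
    val lengths = byIndex.mapValues(_.length)
    // join matches records by key, whatever order the partitions arrive in.
    byIndex.join(lengths).sortByKey()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("explicit-index").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(Seq("a", "bb", "ccc", "dddd"), 2)
      lengthsByIndex(data).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The trade-off is a shuffle for the `join`, in exchange for not depending on element order or matching partition sizes.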

Re: RDD order preservation through transformations

2017-09-13 Thread lucas.g...@gmail.com
I'm wondering why you need order preserved. We've had situations where keeping the source as an artificial field in the dataset was important, and I had to go through contortions to inject it (in this case the data source had no unique key). Is this similar? On 13 September 2017 at 10:46, Suzen, Mehmet

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
But what happens if one of the partitions fails? How does fault tolerance recover elements in other partitions? On 13 Sep 2017 18:39, "Ankit Maloo" wrote: > AFAIK, the order of an RDD is maintained across a partition for map > operations. There is no way a map operation can

Re: RDD order preservation through transformations

2017-09-13 Thread Ankit Maloo
AFAIK, the order of an RDD is maintained across a partition for map operations. There is no way a map operation can change the sequence across a partition, as a partition is local and computation happens one record at a time. On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" wrote: I think the

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think order has no meaning in RDDs; see this post, especially regarding the zip methods: https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method

RDD order preservation through transformations

2017-09-13 Thread johan.grande.ext
Hi, I'm a beginner using Spark with Scala and I'm having trouble understanding ordering in RDDs. I understand that RDDs are ordered (as they can be sorted) but that some transformations don't preserve order. How can I know which transformations preserve order and which don't? Regarding map,