Re: Alternative for numpy in Spark Mlib

2018-05-23 Thread Suzen, Mehmet
You can use Breeze, which is part of the Spark distribution: https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra Check out the modules under import breeze._ On 23 May 2018 at 07:04, umargeek wrote: > Hi Folks, > > I am planning to rewrite one of my python

Re: RDD order preservation through transformations

2017-09-15 Thread Suzen, Mehmet
Hi Johan, DataFrames are built on top of RDDs; I am not sure whether the ordering issues are different there. Maybe you could create a minimal but sufficiently large simulated data set and an example series of transformations to experiment on. Best, -m Mehmet Süzen, MSc, PhD

Re: RDD order preservation through transformations

2017-09-14 Thread Suzen, Mehmet
On 14 September 2017 at 10:42, wrote: > val noTs = myData.map(dropTimestamp) > > val scaled = scaler.transform(noTs) > > val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows > > val clusters = myModel.predict(projected) > > val result =

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
of partitions in mapPartition? On 13 Sep 2017 19:54, "Ankit Maloo" <ankitmaloo1...@gmail.com> wrote: > > Rdd are fault tolerant as it can be recomputed using DAG without storing the > intermediate RDDs. > > On 13-Sep-2017 11:16 PM, "Suzen, Mehmet" <

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
y a map operation can change sequence across a > partition as partition is local and computation happens one record at a > time. > > On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" <su...@acm.org> wrote: > > I think the order has no meaning in RDDs see this post, specia

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think the order has no meaning in RDDs; see this post, especially the zip methods: https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
On Wed, Aug 23, 2017 at 2:59 PM, Suzen, Mehmet <su...@acm.org> wrote: > >> It depends on what model you would like to train but models requiring >> optimisation could use SGD with mini batches. See: >> https://spark.apache.org/docs/latest/

Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
It depends on what model you would like to train but models requiring optimisation could use SGD with mini batches. See: https://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd On 23 August 2017 at 14:27, Sea aj wrote: > Hi, > > I am
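
For readers unfamiliar with the idea, the following is a minimal sketch of mini-batch SGD in plain Python. It is not Spark code and not the MLlib implementation; the function name `minibatch_sgd`, the linear model, and the learning rate, batch size and epoch count are all assumptions chosen for illustration. The point is only that each update touches a small batch of records, so the full data set never has to be processed in a single step.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on the mean squared error.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # Visit the data in a new random order every epoch.
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            # Step along the negative gradient averaged over this batch only.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Noise-free synthetic data generated from y = 3x + 1.
xs = [i / 50.0 for i in range(100)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

On this synthetic data the fitted `w` and `b` should end up close to 3 and 1. In a distributed setting the per-batch gradient sum is the part that would be computed across partitions.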

Re: How can i remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 03:00, Vadim Semenov wrote: > `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it > just saves data to some destination. Yes, that's what I thought, so the statement "..otherwise saving it on a file will require

Re: How can i remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 01:05, jeff saremi wrote: > Vadim: > > This is from the Mastering Spark book: > > "It is strongly recommended that a checkpointed RDD is persisted in memory, > otherwise saving it on a file will require recomputation." Is this really true? I had the

Re: A tool to generate simulation data

2017-07-27 Thread Suzen, Mehmet
I suggest the RandomRDDs API; it provides nice tools. Writing wrappers around it might be a good approach. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$

Re: Do we anything for Deep Learning in Spark?

2017-06-21 Thread Suzen, Mehmet
There is a BigDL project: https://github.com/intel-analytics/BigDL On 20 June 2017 at 16:17, Jules Damji wrote: > And we will having a webinar on July 27 going into some more details. Stay > tuned. > > Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :)

partition size inherited from parent: auto coalesce

2017-01-16 Thread Suzen, Mehmet
Hello List, I was wondering what the design principle is behind an RDD inheriting its partition size from its parent. See one simple example below [*]. 'ngauss_rdd2' has significantly less data; intuitively, in such cases, shouldn't Spark invoke coalesce automatically for performance? What would
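
The example referenced as [*] is not included in this archive snippet. As a stand-in, here is a toy model in plain Python, assuming a partitioned data set can be pictured as a list of lists. The helpers `filter_partitions` and `coalesce` are hypothetical and only mimic the behaviour under discussion; they are not the Spark API, and the round-robin merge is a simplification.

```python
def filter_partitions(partitions, predicate):
    """Filter each partition independently, keeping the partition count unchanged."""
    return [[x for x in part if predicate(x)] for part in partitions]


def coalesce(partitions, num_partitions):
    """Merge whole partitions into fewer groups (toy round-robin assignment)."""
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged


# Parent: 8 partitions holding 25 records each (200 records in total).
parent = [list(range(i * 25, (i + 1) * 25)) for i in range(8)]

# The child keeps 8 partitions even though only 8 records survive the filter.
child = filter_partitions(parent, lambda x: x % 25 == 0)

# An explicit coalesce compacts the sparse child into 2 partitions.
compacted = coalesce(child, 2)
```

In this toy model the filter works partition by partition and never looks at the total record count, which is one way to picture why the child keeps the parent's partitioning until coalesce is called explicitly.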
