Re: Alternative for numpy in Spark MLlib

2018-05-23 Thread Suzen, Mehmet
You can use Breeze, which is part of the Spark distribution:
https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra

Check out the modules under import breeze._
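
For the entropy use case, a rough sketch of the kind of thing I mean (the toy
data and the collect-to-driver step are my assumptions; Breeze only does the
final local vector work here):

import breeze.linalg.{DenseVector, sum}

val values = sc.parallelize(Seq("a", "b", "a", "c", "a", "b"))   // stands in for your column of symbols
val counts = values.map((_, 1L)).reduceByKey(_ + _)              // distributed histogram
val total  = counts.values.sum()

// the histogram itself is small, so finish locally with Breeze
val probs   = DenseVector(counts.values.collect().map(_ / total))
val entropy = -sum(probs.map(p => if (p > 0.0) p * math.log(p) else 0.0))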

On 23 May 2018 at 07:04, umargeek  wrote:
> Hi Folks,
>
> I am planning to rewrite one of my Python modules, written for entropy
> calculation using numpy, into Spark MLlib so that it can be processed in a
> distributed manner.
>
> Can you please advise on the feasibility of this approach, or any
> alternatives.
>
> Thanks,
> Umar
>
>
>



Re: RDD order preservation through transformations

2017-09-15 Thread Suzen, Mehmet
Hi Johan,
 DataFrames are built on top of RDDs, so I am not sure whether the ordering
issues are any different there. Maybe you could create a minimal but large
enough simulated dataset and an example series of transformations to
experiment on.
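
A hedged sketch of such an experiment (purely to observe behaviour on a real
cluster, not a guarantee of anything):

val xs = sc.parallelize(1 to 1000000, 8)        // simulated data over 8 partitions
val ys = xs.map(_ * 2)                          // a simple transformation

// zip pairs elements partition by partition; count positions that do not line up
val mismatches = xs.zip(ys).filter { case (x, y) => y != 2 * x }.count()
println(s"mismatched positions: $mismatches")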
Best,
-m

Mehmet Süzen, MSc, PhD




On 15 September 2017 at 09:44,   wrote:
> Thanks all for your answers. After reading the provided links I am still
> uncertain of the details of what I'd need to do to get my calculations right
> with RDDs. However, I discovered DataFrames and Pipelines on the "ML" side of
> the libs, and I think they'll be better suited to my needs.
>
> Best,
> Johan Grande
>
>




Re: RDD order preservation through transformations

2017-09-14 Thread Suzen, Mehmet
On 14 September 2017 at 10:42,   wrote:
> val noTs = myData.map(dropTimestamp)
>
> val scaled = scaler.transform(noTs)
>
> val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
>
> val clusters = myModel.predict(projected)
>
> val result = myData.zip(clusters)
>
>
>
> Do you think there’s a chance that the 4 transformations above would
> preserve order so the zip at the end would be correct?

AFAIK, no. The sequence of transformations you have will not guarantee
order preservation. First apply zip, then keep track of the indices in the
subsequent transformations, with `_2`, as zip returns tuples.
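
A rough sketch of that idea, reusing the names from your snippet (dropTimestamp,
scaler, principalComponents and myModel are assumed to behave as in your code;
the per-record scaler.transform call and the IndexedRowMatrix route are my
assumptions):

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// attach an explicit index first; it travels with every element as the key
val indexed = myData.zipWithIndex().map { case (row, i) => (i, row) }

// per-record transformations applied to the value only, so the key is untouched
val scaled = indexed.mapValues(row => scaler.transform(dropTimestamp(row)))

// IndexedRowMatrix carries the row index through the multiplication
val projected = new IndexedRowMatrix(scaled.map { case (i, v) => IndexedRow(i, v) })
  .multiply(principalComponents)
  .rows
  .map(r => (r.index, r.vector))

// predict per element, still keyed by index, then join back to the originals
val clusters = projected.mapValues(v => myModel.predict(v))
val result   = indexed.join(clusters)   // (index, (originalRecord, clusterId))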




Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think this is one of the conceptual differences in Spark compared to
other languages: there is no indexing in plain RDDs. This was the
thread with Ankit:

Yes. So order preservation cannot be guaranteed in the case of
failure. Also, I am not sure whether partitions are ordered. Can you get the
same sequence of partitions in mapPartitions?

On 13 Sep 2017 19:54, "Ankit Maloo" <ankitmaloo1...@gmail.com> wrote:
>
> RDDs are fault tolerant as they can be recomputed using the DAG without storing
> the intermediate RDDs.
>
> On 13-Sep-2017 11:16 PM, "Suzen, Mehmet" <su...@acm.org> wrote:
>>
>> But what happens if one of the partitions fails? How does fault tolerance
>> recover elements in other partitions?




Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
But what happens if one of the partitions fails? How does fault tolerance
recover elements in other partitions?

On 13 Sep 2017 18:39, "Ankit Maloo" <ankitmaloo1...@gmail.com> wrote:

> AFAIK, the order of an RDD is maintained within a partition for map
> operations. There is no way a map operation can change the sequence within a
> partition, as the partition is local and computation happens one record at a
> time.
>
> On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" <su...@acm.org> wrote:
>
> I think order has no meaning in RDDs; see this post, especially the zip
> methods:
> https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
>


Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think order has no meaning in RDDs; see this post, especially the zip methods:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method




Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
SGD is supported. I see, I had assumed you were using Scala. It looks like you
can do streaming regression, though I am not sure about the PySpark API:

https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
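
A minimal Scala sketch (the directory path and feature count are placeholders;
I have not checked the PySpark side):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// every new text file dropped into this directory becomes one mini-batch
val training = ssc.textFileStream("/path/to/training-dir").map(LabeledPoint.parse)

val numFeatures = 10                                  // placeholder
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(training)                               // weights are updated batch by batch
ssc.start()
ssc.awaitTermination()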

On 23 August 2017 at 18:22, Sea aj <saj3...@gmail.com> wrote:

> Thanks for the reply.
>
> As far as I understood, mini-batch is not yet supported in the ML library. As
> for the MLlib mini-batch, I could not find any PySpark API.
>
>
>
>
> On Wed, Aug 23, 2017 at 2:59 PM, Suzen, Mehmet <su...@acm.org> wrote:
>
>> It depends on what model you would like to train but models requiring
>> optimisation could use SGD with mini batches. See:
>> https://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
>>
>> On 23 August 2017 at 14:27, Sea aj <saj3...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to feed a huge dataframe to an ML algorithm in Spark, but it
>>> crashes due to a shortage of memory.
>>>
>>> Is there a way to train the model on a subset of the data in multiple
>>> steps?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>>
>> Mehmet Süzen, MSc, PhD
>> <su...@acm.org>
>>
>>
>
>


-- 

Mehmet Süzen, MSc, PhD
<su...@acm.org>



Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Suzen, Mehmet
It depends on what model you would like to train, but models requiring
optimisation could use SGD with mini-batches. See:
https://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
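
A hedged sketch with MLlib's LinearRegressionWithSGD (the input path and the
parameter values are placeholders):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val trainingData = MLUtils.loadLibSVMFile(sc, "/path/to/data.libsvm")

// numIterations = 100, stepSize = 0.01, miniBatchFraction = 0.1:
// each SGD iteration computes its gradient on roughly 10% of the records
val model = LinearRegressionWithSGD.train(trainingData, 100, 0.01, 0.1)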

On 23 August 2017 at 14:27, Sea aj  wrote:

> Hi,
>
> I am trying to feed a huge dataframe to an ML algorithm in Spark, but it
> crashes due to a shortage of memory.
>
> Is there a way to train the model on a subset of the data in multiple
> steps?
>
> Thanks
>
>
>
>



-- 

Mehmet Süzen, MSc, PhD




Re: How can I remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 03:00, Vadim Semenov  wrote:
> `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it
> just saves data to some destination.

Yes, that's what I thought, so the statement "...otherwise saving it on
a file will require recomputation." from the book is not entirely
true.
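
For what it is worth, a small sketch of what I mean (paths are placeholders):
saveAsObjectFile behaves like any other action, and reading the data back yields
a fresh RDD whose lineage starts at the file rather than at the original DAG.

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.saveAsObjectFile("/tmp/my-rdd")                 // plain action: computes and writes serialized data

val reloaded = sc.objectFile[Int]("/tmp/my-rdd")    // new RDD; its lineage is just "read this file"
reloaded.count()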




Re: How can I remove the need for calling cache

2017-08-02 Thread Suzen, Mehmet
On 3 August 2017 at 01:05, jeff saremi  wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."

Is this really true? I had the impression that the DAG would not be carried
along once the RDD is serialized to an external file, so does 'saveAsObjectFile'
save the DAG as well?




Re: A tool to generate simulation data

2017-07-27 Thread Suzen, Mehmet
I suggest the RandomRDDs API. It provides nice tools; writing wrappers
around it might be a good approach.

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$
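
A quick sketch of what it offers (sizes, partition counts and seeds below are
arbitrary):

import org.apache.spark.mllib.random.RandomRDDs

// one million standard-normal samples over 4 partitions, with a fixed seed
val gauss = RandomRDDs.normalRDD(sc, 1000000L, 4, 42L)

// Poisson samples with mean 3.0, and a 1,000,000 x 10 matrix of normals as an RDD[Vector]
val pois = RandomRDDs.poissonRDD(sc, 3.0, 1000000L, 4, 42L)
val mat  = RandomRDDs.normalVectorRDD(sc, 1000000L, 10, 4, 42L)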




Re: Do we anything for Deep Learning in Spark?

2017-06-21 Thread Suzen, Mehmet
There is a BigDL project:
https://github.com/intel-analytics/BigDL

On 20 June 2017 at 16:17, Jules Damji  wrote:
> And we will be having a webinar on July 27 going into some more details. Stay
> tuned.
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jun 20, 2017, at 7:00 AM, Michael Mior  wrote:
>
> It's still in the early stages, but check out Deep Learning Pipelines from
> Databricks
>
> https://github.com/databricks/spark-deep-learning
>
> --
> Michael Mior
> mm...@apache.org
>
> 2017-06-20 0:36 GMT-04:00 Gaurav1809 :
>>
>> Hi All,
>>
>> Similar to how we have a machine learning library called ML, do we have
>> anything for deep learning?
>> If yes, please share the details. If not then what should be the approach?
>>
>> Thanks and regards,
>> Gaurav Pandya
>>
>>
>>
>




partition size inherited from parent: auto coalesce

2017-01-16 Thread Suzen, Mehmet
Hello List,

 I was wondering what the design principle is behind an RDD inheriting its
number of partitions from its parent. See one simple example below [*].
'ngauss_rdd2' has significantly less data; intuitively, in such cases,
shouldn't Spark invoke coalesce automatically for performance? What would
be the configuration option for this, if there is any?

Best,
-m

[*]
// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(x=>Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count // 1 million
ngauss_rdd.partitions.size // 4
val ngauss_rdd2 = ngauss_rdd.filter(x=>x > 4.0)
ngauss_rdd2.count // 35
ngauss_rdd2.partitions.size // 4
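
// (For reference, the manual route I know of is an explicit coalesce after the
// heavy filter; whether Spark could reasonably do this on its own is exactly my question.)
val ngauss_rdd3 = ngauss_rdd2.coalesce(1)
ngauss_rdd3.partitions.size // 1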



