Re: Artificial Neural Network in Spark?

2014-06-30 Thread Debasish Das
I will let Xiangrui comment on the PR process to add the code in mllib,
but I would love to look into your initial version if you push it to
github...

As far as I remember, Quoc got his best ANN results using the back-propagation
algorithm solved with CG... do you have those features, or are you using an
SGD-style update?



On Mon, Jun 30, 2014 at 8:13 PM, Bert Greevenbosch <
bert.greevenbo...@huawei.com> wrote:

> Hi Debasish, Alexander, all,
>
> Indeed I found the OpenDL project through the Powered by Spark page. I'll
> need some time to look into the code, but at first sight it looks quite
> well-developed. I'll contact the author about this too.
>
> My own implementation (in Scala) works for multiple inputs and multiple
> outputs. It implements a single hidden layer; the number of nodes in it can
> be specified.
>
> The implementation is a general ANN implementation. As such, it should be
> usable for an autoencoder too, since that is just an ANN with some special
> input/output constraints.
>
> As said before, the implementation is built upon the linear regression
> model and gradient descent implementation. However, it did require some
> tweaks:
>
> - The linear regression model only supports a single output "label" (as a
> Double). Since the ANN can have multiple outputs, it ignores the "label"
> attribute, but for training divides the input vector into two parts: the
> first part being the genuine input vector, the second the target output
> vector.
>
> - The concatenation of input and target output vectors is internal only;
> the training function takes as input an RDD of tuples of two Vectors, one
> for the input and one for the output.
>
> - The GradientDescent optimizer is re-used without modification.
>
> - I have made an even simpler updater than the SimpleUpdater, leaving out
> the division by the square root of the number of iterations. The
> SimpleUpdater can also be used, but I created this simpler one because I
> like to plot the result every now and then, and then continue the
> calculations. For this, I also wrote a training function that takes as
> input the weights from the previous training session.
>
> - I created a ParallelANNModel similar to the LinearRegressionModel.
>
> - I created a new GeneralizedSteepestDescendAlgorithm class similar to the
> GeneralizedLinearAlgorithm class.
>
> - I created some example code to test with 2D (1 input, 1 output), 3D (2
> inputs, 1 output) and 4D (1 input, 3 outputs) functions.
>
> If there is interest, I would be happy to release the code. What would be
> the best way to do this? Is there some kind of review process?
>
> Best regards,
> Bert
>
>
> > -Original Message-
> > From: Debasish Das [mailto:debasish.da...@gmail.com]
> > Sent: 27 June 2014 14:02
> > To: dev@spark.apache.org
> > Subject: Re: Artificial Neural Network in Spark?
> >
> > Look into the Powered by Spark page... I found a project there which uses
> > autoencoder functions... It hasn't been updated for a long time now!
> >
> > On Thu, Jun 26, 2014 at 10:51 PM, Ulanov, Alexander
> > wrote:
> >
> > > Hi Bert,
> > >
> > > It would be extremely interesting. Do you plan to implement an
> > > autoencoder as well? It would be great to have deep learning in Spark.
> > >
> > > Best regards, Alexander
> > >
> > > On 27.06.2014, at 4:47, "Bert Greevenbosch"
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > I was wondering whether Spark/MLlib supports Artificial Neural
> > > > Networks (ANNs)?
> > > >
> > > > If not, I am currently working on an implementation of it. I re-use
> > > > the code for linear regression and gradient descent as much as possible.
> > > >
> > > > Would the community be interested in such an implementation? Or maybe
> > > > somebody is already working on it?
> > > >
> > > > Best regards,
> > > > Bert
> > >
>


Re: Contributing to MLlib on GLM

2014-06-30 Thread Gang Bai
Thanks Xiaokai,

I’ve created a pull request to merge the features from my PR into your repo.
Please review it here: https://github.com/xwei-datageek/spark/pull/2 .

As for GLMs, here at Sina we are solving the problem of predicting the number
of visitors who read a particular news article or watch an online sports live
stream in a particular period. I'm trying to improve the prediction results by
tuning features and incorporating new models, so I'll try Gamma regression
later. Thanks for the implementation.

Cheers,
-Gang

On Jun 29, 2014, at 8:17 AM, xwei  wrote:

> Hi Gang,
> 
> No worries! 
> 
> I agree LBFGS would converge faster and your test suite is more 
> comprehensive. I'd like to merge my branch with yours.
> 
> I also agree with your viewpoint on the redundancy issue. Different GLMs
> usually differ only in the gradient calculation, but the regression.scala
> files are quite similar. For example, linearRegressionSGD,
> logisticRegressionSGD, RidgeRegressionSGD and poissonRegressionSGD all share
> quite a bit of common code in their class implementations. Since such
> redundancy is already there in the legacy code, simply merging Poisson and
> Gamma does not seem to help much. So I suggest we just leave them as
> separate classes for the time being.
> 
> 
> Best regards,
> 
> Xiaokai
> 
> On Jun 27, 2014, at 6:45 PM, Gang Bai [via Apache Spark Developers List] 
> wrote:
> 
>> Hi Xiaokai, 
>> 
>> My bad. I didn't notice this before I created another PR for Poisson 
>> regression. The mails were buried in junk by the corporate mail server. Also, 
>> thanks for considering my comments and advice in your PR. 
>> 
>> Adding my two cents here: 
>> 
>> * PoissonRegressionModel and GammaRegressionModel have the same fields and 
>> prediction method. Shall we use one instead of two redundant classes? Say, a 
>> LogLinearModel. 
>> * The LBFGS optimizer takes fewer iterations and converges better than 
>> SGD. I implemented two GeneralizedLinearAlgorithm classes using LBFGS and 
>> SGD respectively. You may take a look at them. If it's OK with you, I'd be 
>> happy to send a PR to your branch. 
>> * In addition to the generated test data, we may use some real-world data 
>> for testing. In my implementation, I added the test data from 
>> https://onlinecourses.science.psu.edu/stat504/node/223. Please check my test 
>> suite. 
>> 
>> -Gang 
>> Sent from my iPad 
>> 
>>> On 27 June 2014, at 6:03 PM, "xwei" <[hidden email]> wrote: 
>>> 
>>> 
>>> Yes, that's what we did: adding two gradient functions to Gradient.scala 
>>> and creating PoissonRegression and GammaRegression using these gradients. 
>>> We made a PR on this. 
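A minimal sketch of such a gradient, say Poisson with a log link, written
against the two-method Gradient interface MLlib had at the time. The class
name and the details below are illustrative, not the code from the actual PR:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient

// Illustrative sketch, not the PR's code. Poisson regression with a log
// link models the mean as mu = exp(w . x); the per-example negative
// log-likelihood is mu - y * (w . x) up to the constant log(y!), and its
// gradient with respect to w is (mu - y) * x.
class PoissonGradient extends Gradient {

  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val x = data.toArray
    val w = weights.toArray
    var dot = 0.0
    var i = 0
    while (i < x.length) { dot += w(i) * x(i); i += 1 }
    val mu = math.exp(dot)                     // predicted mean under the log link
    val diff = mu - label
    val grad = Vectors.dense(x.map(_ * diff))  // (mu - y) * x
    val loss = mu - label * dot                // NLL, dropping the log(y!) term
    (grad, loss)
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    val (grad, loss) = compute(data, label, weights)
    // The optimizers pass a dense accumulator; for a DenseVector, toArray
    // exposes the backing array, so this accumulates in place.
    val cum = cumGradient.toArray
    val g = grad.toArray
    var i = 0
    while (i < g.length) { cum(i) += g(i); i += 1 }
    loss
  }
}

Either optimizer can then drive it, since both GradientDescent.runMiniBatchSGD
and LBFGS.runLBFGS accept a Gradient together with an Updater.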



RE: Artificial Neural Network in Spark?

2014-06-30 Thread Bert Greevenbosch
Hi Debasish, Alexander, all,

Indeed I found the OpenDL project through the Powered by Spark page. I'll need 
some time to look into the code, but at first sight it looks quite 
well-developed. I'll contact the author about this too.

My own implementation (in Scala) works for multiple inputs and multiple 
outputs. It implements a single hidden layer; the number of nodes in it can be 
specified.

The implementation is a general ANN implementation. As such, it should be 
usable for an autoencoder too, since that is just an ANN with some special 
input/output constraints.

As said before, the implementation is built upon the linear regression model 
and gradient descent implementation. However, it did require some tweaks:

- The linear regression model only supports a single output "label" (as a 
Double). Since the ANN can have multiple outputs, it ignores the "label" 
attribute, but for training divides the input vector into two parts: the first 
part being the genuine input vector, the second the target output vector.

- The concatenation of input and target output vectors is internal only; the 
training function takes as input an RDD of tuples of two Vectors, one for the 
input and one for the output.

- The GradientDescent optimizer is re-used without modification.

- I have made an even simpler updater than the SimpleUpdater, leaving out the 
division by the square root of the number of iterations (see the sketch after 
this list). The SimpleUpdater can also be used, but I created this simpler one 
because I like to plot the result every now and then, and then continue the 
calculations. For this, I also wrote a training function that takes as input 
the weights from the previous training session.

- I created a ParallelANNModel similar to the LinearRegressionModel.

- I created a new GeneralizedSteepestDescendAlgorithm class similar to the 
GeneralizedLinearAlgorithm class.

- I created some example code to test with 2D (1 input, 1 output), 3D (2 
inputs, 1 output) and 4D (1 input, 3 outputs) functions.
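Concretely, the updater tweak and the training entry points might look like
the following sketch, assuming MLlib's Updater contract; ConstantStepUpdater
and the train signatures are illustrative names rather than the actual code:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Updater

// Sketch of the simpler updater: like SimpleUpdater but applying the raw
// step size, with no stepSize / sqrt(iter) decay, so that a resumed
// training session steps exactly like a fresh one.
class ConstantStepUpdater extends Updater {
  override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    val w = weightsOld.toArray.clone()  // copy so the caller's weights stay intact
    val g = gradient.toArray
    var i = 0
    while (i < w.length) {
      w(i) -= stepSize * g(i)           // no division by math.sqrt(iter)
      i += 1
    }
    (Vectors.dense(w), 0.0)             // no regularization penalty
  }
}

// Plausible shapes for the training entry points (illustrative only):
//   def train(input: RDD[(Vector, Vector)], hiddenNodes: Int,
//             numIterations: Int, stepSize: Double): ParallelANNModel
//   def train(input: RDD[(Vector, Vector)], start: ParallelANNModel,
//             numIterations: Int, stepSize: Double): ParallelANNModel  // resume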

If there is interest, I would be happy to release the code. What would be the 
best way to do this? Is there some kind of review process?

Best regards,
Bert


> -Original Message-
> From: Debasish Das [mailto:debasish.da...@gmail.com]
> Sent: 27 June 2014 14:02
> To: dev@spark.apache.org
> Subject: Re: Artificial Neural Network in Spark?
> 
> Look into the Powered by Spark page... I found a project there which uses
> autoencoder functions... It hasn't been updated for a long time now!
> 
> On Thu, Jun 26, 2014 at 10:51 PM, Ulanov, Alexander
> wrote:
> 
> > Hi Bert,
> >
> > It would be extremely interesting. Do you plan to implement an
> > autoencoder as well? It would be great to have deep learning in Spark.
> >
> > Best regards, Alexander
> >
> > On 27.06.2014, at 4:47, "Bert Greevenbosch"
> > wrote:
> >
> > > Hello all,
> > >
> > > I was wondering whether Spark/MLlib supports Artificial Neural
> > > Networks (ANNs)?
> > >
> > > If not, I am currently working on an implementation of it. I re-use
> > > the code for linear regression and gradient descent as much as possible.
> > >
> > > Would the community be interested in such an implementation? Or maybe
> > > somebody is already working on it?
> > >
> > > Best regards,
> > > Bert
> >


Re: Eliminate copy while sending data : any Akka experts here ?

2014-06-30 Thread Aaron Davidson
I don't know of any way to avoid Akka doing a copy, but I would like to
mention that it's on the priority list to piggy-back only the map statuses
relevant to a particular map task on the task itself, thus reducing the
total amount of data sent over the wire by a factor of N for N physical
machines in your cluster. Ideally we would also avoid Akka entirely when
sending the tasks, as these can get somewhat large and Akka doesn't work
well with large messages.

Do note that your solution of using broadcast to send the map statuses is very
similar to how the executor returns the result of a task when it's too big
for Akka. We were thinking of refactoring this too, as using the block
manager has much higher latency than a direct TCP send.


On Mon, Jun 30, 2014 at 12:13 PM, Mridul Muralidharan 
wrote:

> Our current hack is to use Broadcast variables when the serialized
> statuses are above some (configurable) size, and have the workers
> directly pull them from the master.
> This is a workaround, so it would be great if there were a
> better/more principled solution.
>
> Please note that the responses are going to different workers
> requesting the output statuses for shuffle (after the map phase), so I'm
> not sure if back-pressure buffers, etc. would help.
>
>
> Regards,
> Mridul
>
>
> On Mon, Jun 30, 2014 at 11:07 PM, Mridul Muralidharan 
> wrote:
> > Hi,
> >
> >   While sending the map output tracker result, the same serialized byte
> > array is sent multiple times, but the Akka implementation copies it
> > to a private byte array within ByteString for each send.
> > Caching a ByteString instead of an Array[Byte] did not help, since Akka
> > does not support special-casing ByteString: it serializes the
> > ByteString and copies the result out to an array before creating a
> > ByteString out of it (serializing an Array[Byte] thankfully simply
> > returns the same array, so there is only one copy).
> >
> >
> > Given the need to send immutable data a large number of times, is there
> > any way to do it in Akka without Akka copying internally?
> >
> >
> > To see how expensive it is: for 200 nodes with a large number of
> > mappers and reducers, the status becomes something like 30 MB for us,
> > and pulling this about 200 to 300 times results in OOM due to the
> > large number of copies sent out.
> >
> >
> > Thanks,
> > Mridul
>


Re: Eliminate copy while sending data : any Akka experts here ?

2014-06-30 Thread Mridul Muralidharan
Our current hack is to use Broadcast variables when the serialized
statuses are above some (configurable) size, and have the workers
directly pull them from the master.
This is a workaround, so it would be great if there were a
better/more principled solution.
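The shape of that workaround, sketched with hypothetical names (this is not
the actual code):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical sketch: below the threshold the serialized statuses travel
// inline in the Akka reply; above it, only a small Broadcast handle is
// sent and the workers pull the bytes through the broadcast machinery.
sealed trait StatusResponse extends Serializable
case class InlineStatuses(bytes: Array[Byte]) extends StatusResponse
case class BroadcastStatuses(handle: Broadcast[Array[Byte]]) extends StatusResponse

object StatusShipping {
  def packageStatuses(
      sc: SparkContext,
      serialized: Array[Byte],
      thresholdBytes: Int): StatusResponse = {
    if (serialized.length <= thresholdBytes) {
      InlineStatuses(serialized)
    } else {
      BroadcastStatuses(sc.broadcast(serialized))
    }
  }
}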

Please note that the responses are going to different workers
requesting the output statuses for shuffle (after the map phase), so I'm not
sure if back-pressure buffers, etc. would help.


Regards,
Mridul


On Mon, Jun 30, 2014 at 11:07 PM, Mridul Muralidharan  wrote:
> Hi,
>
>   While sending the map output tracker result, the same serialized byte
> array is sent multiple times, but the Akka implementation copies it
> to a private byte array within ByteString for each send.
> Caching a ByteString instead of an Array[Byte] did not help, since Akka
> does not support special-casing ByteString: it serializes the
> ByteString and copies the result out to an array before creating a
> ByteString out of it (serializing an Array[Byte] thankfully simply
> returns the same array, so there is only one copy).
>
>
> Given the need to send immutable data a large number of times, is there
> any way to do it in Akka without Akka copying internally?
>
>
> To see how expensive it is: for 200 nodes with a large number of
> mappers and reducers, the status becomes something like 30 MB for us,
> and pulling this about 200 to 300 times results in OOM due to the
> large number of copies sent out.
>
>
> Thanks,
> Mridul


Eliminate copy while sending data : any Akka experts here ?

2014-06-30 Thread Mridul Muralidharan
Hi,

  While sending the map output tracker result, the same serialized byte
array is sent multiple times, but the Akka implementation copies it
to a private byte array within ByteString for each send.
Caching a ByteString instead of an Array[Byte] did not help, since Akka
does not support special-casing ByteString: it serializes the
ByteString and copies the result out to an array before creating a
ByteString out of it (serializing an Array[Byte] thankfully simply
returns the same array, so there is only one copy).
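A tiny illustration of that copy, assuming Akka 2.x's akka.util.ByteString:

import akka.util.ByteString

object ByteStringCopyDemo extends App {
  // ByteString(array) makes a defensive copy so the result can stay
  // immutable; building (or sending) the same 30 MB payload 200-300
  // times therefore allocates 200-300 fresh 30 MB buffers.
  val payload = new Array[Byte](30 * 1024 * 1024)
  val a = ByteString(payload)  // copy #1
  val b = ByteString(payload)  // copy #2, no sharing with a
  println(s"${a.length} ${b.length}")
}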


Given the need to send immutable data a large number of times, is there
any way to do it in Akka without Akka copying internally?


To see how expensive it is: for 200 nodes with a large number of
mappers and reducers, the status becomes something like 30 MB for us,
and pulling this about 200 to 300 times results in OOM due to the
large number of copies sent out.


Thanks,
Mridul


Re: Application level progress monitoring and communication

2014-06-30 Thread Chester Chen
Reynold,
thanks for the reply. It's true, this is more about YARN communication
than about Spark.
But this is a general enough problem for all YARN_CLUSTER mode
applications, so I thought I would reach out to the community.

  If we choose the Akka solution, then this is related to Spark, as
there is only one Akka actor system per JVM.

  Thanks for the suggestion regarding passing the client IP address. I was
initially only thinking about how to find out the IP address
of the Spark driver node.

  Reporting progress is just one of the use cases; stopping a Spark job is
another, and we are also considering interactive query jobs.

This gives me something to start with. I will try Akka first and will
let the community know once we get somewhere.
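For now, a first cut of the Akka route might look like this sketch. The names
are illustrative, and the client's actor path is assumed to be handed to the
job up front, along the lines Reynold suggested:

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Illustrative sketch only. The driver-side job starts its own small
// ActorSystem (separate from Spark's internal one, which sidesteps the
// one-actor-system-per-JVM concern) and fires progress messages at the
// client application's actor, whose path was passed in on the command
// line, e.g. "akka.tcp://client-app@10.0.0.5:2552/user/progress-listener".
case class Progress(stage: String, iteration: Int, totalIterations: Int)

class ProgressReporter(clientPath: String) extends Actor {
  private val client = context.actorSelection(clientPath)
  def receive = {
    case p: Progress => client ! p  // fire-and-forget status update
  }
}

object ProgressReporter {
  def start(clientPath: String): (ActorSystem, ActorRef) = {
    val system = ActorSystem("app-progress")
    val ref = system.actorOf(Props(new ProgressReporter(clientPath)), "reporter")
    (system, ref)
  }
}

The training loop would then send reporter ! Progress("training", i, n) every
few iterations.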

thanks
Chester


On Sun, Jun 29, 2014 at 11:07 PM, Reynold Xin  wrote:

> This isn't exactly about Spark itself; it's more about how an application on
> YARN/Mesos can communicate with another one.
>
> How about your application launch program just takes in a parameter (or env
> variable or command-line argument) for the IP address of your client
> application, and just sends updates? You basically just want to send
> messages to report progress. You can do it in a lot of different ways,
> such as Akka, a custom REST API, Thrift... I think any of them will do.
>
>
>
>
> On Sun, Jun 29, 2014 at 7:57 PM, Chester Chen 
> wrote:
>
> > Hi Spark dev community:
> >
> > I have several questions regarding Application and Spark communication
> >
> > 1) Application Level Progress Monitoring
> >
> > Currently, our application runs Spark jobs in YARN_CLUSTER mode.
> > This works well so far, but we would like to monitor the application-level
> > progress (not Spark system-level progress).
> >
> > For example,
> > if we are doing machine learning training, I would like to send some
> > messages back to our application via an API: the current status of the
> > training, the number of iterations, etc.
> >
> > We can't use YARN_CLIENT mode for this purpose, as we are running the
> > Spark application in a servlet container (Tomcat/Jetty). If we run in
> > YARN_CLIENT mode, we will be limited to one SparkContext per JVM.
> >
> > So we are considering leveraging Akka messaging, essentially creating
> > another actor to send messages back to the client application.
> > Notice that Spark already has an Akka ActorSystem defined for each
> > executor. All we need is to find the actor address (host, port) of the
> > Spark driver executor.
> >
> > The trouble is that the driver's host and port are not known until later,
> > when the Resource Manager assigns the executor node. How do we communicate
> > the host and port info back to the client application?
> >
> > Maybe there is a YARN API to obtain this information from the YARN client.
> >
> >
> > 2) Application and Spark job communication in YARN cluster mode.
> >
> > There are several use cases that we think may require communication
> > between the client-side application and the running Spark job.
> >
> >  One example:
> >    * Try to stop a running job -- while the job is running, abort the
> > long-running job in YARN.
> >
> >   Again, we are thinking of using an Akka actor to send a STOP job message.
> >
> >
> >
> > So here are some questions:
> >
> > * Is there any work in this area in the community?
> >
> > * What do you think of the Akka approach? Alternatives?
> >
> > * Is there a way to get Spark's Akka host and port from the YARN Resource
> > Manager to the YARN client?
> >
> > Any suggestions welcome
> >
> > Thanks
> > Chester
> >
>


Contributing to MLlib

2014-06-30 Thread salexln
Hi guys,

I'm new to Spark & MLlib and this may be a dumb question, but still...

As part of my M.Sc. project, I'm working on an implementation of the Fuzzy
C-means (FCM) algorithm in MLlib.
FCM has many things in common with the K-means algorithm, which is already
implemented, and I wanted to know whether I should create some inheritance
between them (some base class that would hold all the common stuff).
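Purely to illustrate the kind of factoring in question (none of this exists in
MLlib; all names are made up):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Illustrative only. A base trait could hold what K-means and fuzzy
// C-means share (parameters, the convergence test), while each algorithm
// supplies its own center-update rule.
trait CentroidClustering extends Serializable {
  def k: Int
  def maxIterations: Int
  def epsilon: Double

  // Shared stopping rule: no center moved by more than epsilon.
  protected def converged(
      oldCenters: Array[Array[Double]],
      newCenters: Array[Array[Double]]): Boolean =
    oldCenters.zip(newCenters).forall { case (a, b) =>
      val sq = a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
      math.sqrt(sq) <= epsilon
    }

  // Algorithm-specific step: hard assignments for K-means,
  // membership-weighted means for fuzzy C-means.
  protected def updateCenters(
      data: RDD[Vector],
      centers: Array[Array[Double]]): Array[Array[Double]]
}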

I could not find an answer to that in the "Spark Code Style Guide"
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide).


Appreciate your help

thanks,
Alex 



