RE: Data and Model Parallelism in MLPC

2016-01-04 Thread Ulanov, Alexander
Hi Disha,

Data is stacked into matrices so that the computation becomes matrix-matrix
multiplication (instead of matrix-vector), which is handled by native BLAS and
gives a speed-up. You can refer to https://github.com/fommil/netlib-java for benchmarks.
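
To make the idea concrete, here is a minimal sketch using Breeze (which delegates
dense matrix products to netlib-java BLAS). The dimensions and block size are made
up for illustration; this is not the actual MLPC code:

import breeze.linalg.{DenseMatrix, DenseVector}

object StackingSketch {
  def main(args: Array[String]): Unit = {
    val numFeatures = 4
    val numOutputs  = 8
    val blockSize   = 128
    // hypothetical layer weights
    val weights = DenseMatrix.rand(numOutputs, numFeatures)

    // One input vector at a time: blockSize separate matrix-vector products (gemv).
    val inputs   = Seq.fill(blockSize)(DenseVector.rand(numFeatures))
    val oneByOne = inputs.map(x => weights * x)

    // Stacked: the same inputs as columns of one matrix, so a single
    // matrix-matrix product (gemm) produces all the outputs at once.
    val stacked   = DenseMatrix.horzcat(inputs.map(_.toDenseMatrix.t): _*)
    val allAtOnce = weights * stacked   // numOutputs x blockSize

    println(s"${oneByOne.size} gemv calls vs one gemm of size ${allAtOnce.rows} x ${allAtOnce.cols}")
  }
}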

With regards to your second question, data parallelism is handled by the Spark RDD:
each worker processes its subset of the data partitions, and the master (driver)
serves the role of a parameter server.
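
As a rough illustration of that pattern (a hedged sketch only, not the actual MLPC
internals; the localGradient function is hypothetical), the per-partition gradients
can be summed on the workers and combined on the driver with treeAggregate:

import org.apache.spark.rdd.RDD

object DataParallelSketch {
  // Each worker accumulates a gradient over its partitions; the driver
  // (playing the "parameter server" role) receives the combined sum and
  // would then update the weights and broadcast them for the next iteration.
  def aggregateGradient(data: RDD[Array[Double]],
                        weights: Array[Double],
                        localGradient: (Array[Double], Array[Double]) => Array[Double])
      : Array[Double] = {
    data.treeAggregate(Array.fill(weights.length)(0.0))(
      seqOp = (acc, point) => {
        // add this point's gradient into the partition-local accumulator
        val g = localGradient(weights, point)
        var i = 0
        while (i < acc.length) { acc(i) += g(i); i += 1 }
        acc
      },
      combOp = (a, b) => {
        // merge two partial gradients
        var i = 0
        while (i < a.length) { a(i) += b(i); i += 1 }
        a
      }
    )
  }
}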

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Wednesday, December 30, 2015 4:03 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Data and Model Parallelism in MLPC

Hi,
I went through the code of the MLPC implementation and couldn't understand why
the stacking/unstacking of the input data is done. The description says:
"Block size for stacking input data in matrices to speed up the computation.
Data is stacked within partitions. If block size is more than remaining data in
a partition then it is adjusted to the size of this data. Recommended size is
between 10 and 1000. Default: 128". I am not quite sure what this means or how
it speeds up the computation.
Also, I couldn't find exactly how data parallelism, as depicted in
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf,
is incorporated in the existing code. There seems to be no notion of a parameter
server, and the optimization routine is plain L-BFGS, not Sandblaster L-BFGS. The
only parallelism seems to come from the way the input data is read and stored.
Please correct me if I am wrong and clarify my doubt.
Thanks and Regards,
Disha

On Tue, Dec 29, 2015 at 5:40 PM, Disha Shrivastava 
<dishu@gmail.com> wrote:
Hi Alexander,
Thanks a lot for your response. Yes, I am considering the use case when the
weight matrix is too large to fit into the main memory of a single machine.
Can you tell me about ways of dividing the weight matrix? According to my
investigations so far, we can do this in two ways:

1. Parallelize the weight matrix as an RDD using sc.parallelize and then use
suitable map functions in the forward and backward passes.
2. Represent the weight matrix as a RowMatrix / BlockMatrix and do the
calculations on it (see the sketch just below this list).
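
For option 2, a minimal sketch of what I have in mind (the block sizes and
dimensions are made up, and the activation function is omitted) would be
something like:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

object BlockMatrixSketch {
  // Weights (numHidden x numInput) and a stacked input batch (numInput x batchSize)
  // are both represented as BlockMatrix, so the forward pass of one layer becomes
  // a distributed matrix-matrix multiply.
  def forwardPass(sc: SparkContext, block: Int = 1024): BlockMatrix = {
    val weightBlocks = sc.parallelize(Seq(
      ((0, 0), DenseMatrix.ones(block, block): Matrix)))
    val inputBlocks = sc.parallelize(Seq(
      ((0, 0), DenseMatrix.ones(block, block): Matrix)))

    val weights = new BlockMatrix(weightBlocks, block, block)
    val inputs  = new BlockMatrix(inputBlocks, block, block)

    weights.multiply(inputs)   // activations would still have to be applied per block
  }
}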
Which of these two methods would be more efficient to use? Also, I came across an
implementation using Akka where layer-by-layer partitioning of the network has
been done
(http://alexminnaar.com/implementing-the-distbelief-deep-neural-network-training-framework-with-akka.html),
which I believe is model parallelism in the true sense.
Please suggest any other ways/implementations that could help. I would love to
hear your remarks on the above.
Thanks and Regards,
Disha

On Wed, Dec 9, 2015 at 1:29 AM, Ulanov, Alexander 
<alexander.ula...@hpe.com> wrote:
Hi Disha,

Which use case do you have in mind that would require model parallelism? It would
need such a large number of weights that they could not fit into the memory of a
single machine. For example, multilayer perceptron topologies used for speech
recognition have up to 100M weights. Present hardware is capable of accommodating
this in main memory. That might be a problem for GPUs, but that is a different
topic.

The straightforward way of model parallelism for fully connected neural
networks is to distribute horizontal (or vertical) blocks of the weight matrices
across several nodes. That means the input data has to be replicated on all
these nodes. The forward and backward passes will require re-assembling the
outputs and the errors on each of the nodes after each layer, because each node
holds only a part of the weights and can therefore produce only partial results.
According to my estimates, this is inefficient due to the large intermediate
traffic between the nodes and should be used only if the model does not fit in
the memory of a single machine. Another way of doing model parallelism would be
to represent the network as a graph and use GraphX to write the forward and back
propagation. However, this option does not seem very practical to me.
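
A rough sketch of the horizontal-block scheme (only an illustration of the idea,
not existing Spark code; the slicing of the weight matrix is assumed to have been
done beforehand):

import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.SparkContext

object ModelParallelSketch {
  // Each worker holds one horizontal slice of the layer's weight matrix, the
  // input is replicated (broadcast) to every worker, and the partial outputs
  // are re-assembled on the driver after the layer -- the intermediate traffic
  // mentioned above.
  def distributedLayerForward(sc: SparkContext,
                              weightSlices: Seq[DenseMatrix[Double]],
                              input: DenseVector[Double]): DenseVector[Double] = {
    val inputBc = sc.broadcast(input)
    val partials = sc.parallelize(weightSlices.zipWithIndex)
      .map { case (w, idx) => (idx, w * inputBc.value) } // partial rows of the output
      .collect()
      .sortBy(_._1)
      .map(_._2)
    DenseVector.vertcat(partials: _*)                    // re-assemble the full output
  }
}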

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 11:19 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Data and Model Parallelism in MLPC

Hi Alexander,
Thanks for your response. Can you suggest ways to incorporate model parallelism
in MLPC? I am trying to do the same in Spark. I got hold of your post
http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html
where you divided the weight matrix across different worker machines. I
have two basic questions in this regard:
1. How to actually visualize/analyze and control how the nodes of the neural
network/weights are divided across different workers?
2. Is there any alternate way to achieve model parallelism for MLPC in Spark? I
believe we need some kind of synchronization and control for the updating of
weights shared across different workers during backpropagation.

Re: Data and Model Parallelism in MLPC

2015-12-30 Thread Disha Shrivastava
Hi,

I went through the code of the MLPC implementation and couldn't understand
why the stacking/unstacking of the input data is done. The description
says: "Block size for stacking input data in matrices to speed up the
computation. Data is stacked within partitions. If block size is more than
remaining data in a partition then it is adjusted to the size of this
data. Recommended size is between 10 and 1000. Default: 128". I am not
quite sure what this means or how it speeds up the computation.

Also, I couldn't find exactly how data parallelism, as depicted in
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf,
is incorporated in the existing code. There seems to be no notion of a
parameter server, and the optimization routine is plain L-BFGS, not
Sandblaster L-BFGS. The only parallelism seems to come from the way the
input data is read and stored.

Please correct me if I am wrong and clarify my doubt.

Thanks and Regards,
Disha

On Tue, Dec 29, 2015 at 5:40 PM, Disha Shrivastava <dishu@gmail.com>
wrote:

> Hi Alexander,
>
> Thanks a lot for your response. Yes, I am considering the use case when the
> weight matrix is too large to fit into the main memory of a single machine.
>
> Can you tell me about ways of dividing the weight matrix? According to my
> investigations so far, we can do this in two ways:
>
> 1. Parallelize the weight matrix as an RDD using sc.parallelize and then
> use suitable map functions in the forward and backward passes.
> 2. Represent the weight matrix as a RowMatrix / BlockMatrix and do the
> calculations on it.
>
> Which of these two methods would be more efficient to use? Also, I came across an
> implementation using Akka where layer-by-layer partitioning of the network
> has been done (
> http://alexminnaar.com/implementing-the-distbelief-deep-neural-network-training-framework-with-akka.html),
> which I believe is model parallelism in the true sense.
>
> Please suggest any other ways/implementations that could help. I would love
> to hear your remarks on the above.
>
> Thanks and Regards,
> Disha
>
> On Wed, Dec 9, 2015 at 1:29 AM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
>> Hi Disha,
>>
>>
>>
>> Which use case do you have in mind that would require model parallelism?
>> It would need such a large number of weights that they could not fit into
>> the memory of a single machine. For example, multilayer perceptron topologies
>> used for speech recognition have up to 100M weights. Present hardware is
>> capable of accommodating this in main memory. That might be a problem for
>> GPUs, but that is a different topic.
>>
>>
>>
>> The straightforward way of model parallelism for fully connected neural
>> networks is to distribute horizontal (or vertical) blocks of the weight
>> matrices across several nodes. That means the input data has to be
>> replicated on all these nodes. The forward and backward passes will
>> require re-assembling the outputs and the errors on each of the nodes after
>> each layer, because each node holds only a part of the weights and can
>> therefore produce only partial results. According to my estimates, this is
>> inefficient due to the large intermediate traffic between the nodes and should
>> be used only if the model does not fit in the memory of a single machine.
>> Another way of doing model parallelism would be to represent the network as a
>> graph and use GraphX to write the forward and back propagation. However, this
>> option does not seem very practical to me.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Disha Shrivastava [mailto:dishu@gmail.com]
>> *Sent:* Tuesday, December 08, 2015 11:19 AM
>> *To:* Ulanov, Alexander
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: Data and Model Parallelism in MLPC
>>
>>
>>
>> Hi Alexander,
>>
>> Thanks for your response. Can you suggest ways to incorporate model
>> parallelism in MLPC? I am trying to do the same in Spark. I got hold of
>> your post
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html
>> where you divided the weight matrix across different worker machines. I
>> have two basic questions in this regard:
>>
>> 1. How to actually visualize/analyze and control how the nodes of the neural
>> network/weights are divided across different workers?
>>
>> 2. Is there any alternate way to achieve model parallelism for MLPC in
>> Spark? I believe we need some kind of synchronization and control
>> for the updating of weights shared across different workers during
>> backpropagation.

RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha,

Multilayer perceptron classifier in Spark implements data parallelism.
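
For reference, a minimal usage sketch (the topology and parameter values here are
arbitrary, and `train` is assumed to be a DataFrame with "features" and "label"
columns). Training is data-parallel over the partitions of the input DataFrame,
and blockSize is the stacking parameter discussed elsewhere in this thread:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.DataFrame

object MlpcUsageSketch {
  def trainModel(train: DataFrame) = {
    val layers = Array(4, 5, 3) // hypothetical topology: 4 inputs, 1 hidden layer, 3 classes
    val trainer = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128) // rows stacked into matrices within each partition
      .setMaxIter(100)
      .setSeed(1234L)
    trainer.fit(train)   // gradients are computed on the workers, combined on the driver
  }
}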

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 12:43 AM
To: dev@spark.apache.org; Ulanov, Alexander
Subject: Data and Model Parallelism in MLPC

Hi,
I would like to know if the implementation of MLPC in the latest released
version of Spark (1.5.2) implements model parallelism and data parallelism as
done in the DistBelief model implemented by Google:
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf
Thanks and Regards,
Disha


Re: Data and Model Parallelism in MLPC

2015-12-08 Thread Disha Shrivastava
Hi Alexander,

Thanks for your response. Can you suggest ways to incorporate model
parallelism in MLPC? I am trying to do the same in Spark. I got hold of
your post
http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html
where you divided the weight matrix across different worker machines. I
have two basic questions in this regard:

1. How to actually visualize/analyze and control how the nodes of the neural
network/weights are divided across different workers?

2. Is there any alternate way to achieve model parallelism for MLPC in
Spark? I believe we need some kind of synchronization and control
for the updating of weights shared across different workers during
backpropagation.

Looking forward to your views on this.

Thanks and Regards,
Disha

On Wed, Dec 9, 2015 at 12:36 AM, Ulanov, Alexander  wrote:

> Hi Disha,
>
>
>
> Multilayer perceptron classifier in Spark implements data parallelism.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Disha Shrivastava [mailto:dishu@gmail.com]
> *Sent:* Tuesday, December 08, 2015 12:43 AM
> *To:* dev@spark.apache.org; Ulanov, Alexander
> *Subject:* Data and Model Parallelism in MLPC
>
>
>
> Hi,
>
> I would like to know if the implementation of MLPC in the latest released
> version of Spark (1.5.2) implements model parallelism and data
> parallelism as done in the DistBelief model implemented by Google:
> http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf
> 
>
>
> Thanks and Regards,
>
> Disha
>


RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha,

Which use case do you have in mind that would require model parallelism? It would
need such a large number of weights that they could not fit into the memory of a
single machine. For example, multilayer perceptron topologies used for speech
recognition have up to 100M weights. Present hardware is capable of accommodating
this in main memory. That might be a problem for GPUs, but that is a different
topic.

The straightforward way of model parallelism for fully connected neural
networks is to distribute horizontal (or vertical) blocks of the weight matrices
across several nodes. That means the input data has to be replicated on all
these nodes. The forward and backward passes will require re-assembling the
outputs and the errors on each of the nodes after each layer, because each node
holds only a part of the weights and can therefore produce only partial results.
According to my estimates, this is inefficient due to the large intermediate
traffic between the nodes and should be used only if the model does not fit in
the memory of a single machine. Another way of doing model parallelism would be
to represent the network as a graph and use GraphX to write the forward and back
propagation. However, this option does not seem very practical to me.

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 11:19 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Data and Model Parallelism in MLPC

Hi Alexander,
Thanks for your response. Can you suggest ways to incorporate model parallelism
in MLPC? I am trying to do the same in Spark. I got hold of your post
http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html
where you divided the weight matrix across different worker machines. I
have two basic questions in this regard:
1. How to actually visualize/analyze and control how the nodes of the neural
network/weights are divided across different workers?
2. Is there any alternate way to achieve model parallelism for MLPC in Spark? I
believe we need some kind of synchronization and control for the updating of
weights shared across different workers during backpropagation.
Looking forward to your views on this.
Thanks and Regards,
Disha

On Wed, Dec 9, 2015 at 12:36 AM, Ulanov, Alexander 
<alexander.ula...@hpe.com> wrote:
Hi Disha,

Multilayer perceptron classifier in Spark implements data parallelism.

Best regards, Alexander

From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 12:43 AM
To: dev@spark.apache.org; Ulanov, Alexander
Subject: Data and Model Parallelism in MLPC

Hi,
I would like to know if the implementation of MLPC in the latest released
version of Spark (1.5.2) implements model parallelism and data parallelism as
done in the DistBelief model implemented by Google:
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf
Thanks and Regards,
Disha