No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-11 Thread Disha Shrivastava
Dear Spark developers,

I am trying to study the effect of increasing the number of cores (CPUs) on
speedup and accuracy (i.e., the scalability of Spark's ANN) for the MNIST
dataset, using the ANN implementation provided in the latest Spark release.

I have formed a cluster of 5 machines with 88 cores in total. The thing that
is troubling me is that even if I have more than 2 workers in my Spark
cluster, the job gets divided among only the 2 workers (executors) that Spark
takes by default, and hence it takes the same time. I know we can set the
number of partitions manually, e.g. with sc.parallelize(train_data, 10),
which divides the data into 10 partitions so that all the workers are
involved in the computation. I am using the code below:


import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

// Load training data
val data = MLUtils.loadLibSVMFile(sc, "data/1_libsvm").toDF()
// Split the data into train and test sets
val splits = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
val train = splits(0)
val test = splits(1)
// val tr = train.repartition(10) // manual repartitioning; for a DataFrame,
// repartition is the analogue of sc.parallelize(train_data, 10)
// Specify layers for the neural network: input layer of size 784 (features),
// one hidden layer of size 160, and output layer of size 10 (classes)
val layers = Array[Int](784, 160, 10)
// Create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
// Train the model
val model = trainer.fit(train)
// Compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("precision")
println("Precision: " + evaluator.evaluate(predictionAndLabels))

Can you please suggest how I can ensure that the data/tasks are divided
equally among all the worker machines?

Thanks and Regards,
Disha Shrivastava
Master's student, IIT Delhi


Re: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-11 Thread Mike Hynes
Having only 2 workers for 5 machines would be your problem: you
probably want 1 worker per physical machine, which entails running the
spark-daemon.sh script to start a worker on those machines.
The partitioning is agnostic to how many executors are available for
running the tasks, so you can't do scalability tests in the manner
you're thinking by changing the partitioning.
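
To double-check how many executors have actually registered, a minimal sketch
from the Spark shell (note that getExecutorMemoryStatus counts the driver as
well, hence the -1):

// Count the executors currently registered with the driver.
// getExecutorMemoryStatus includes the driver itself, so subtract 1.
val executorCount = sc.getExecutorMemoryStatus.size - 1
println(s"Registered executors: $executorCount")
println(s"Default parallelism: ${sc.defaultParallelism}")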



-- 
Thanks,
Mike




Re: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-11 Thread Disha Shrivastava
Actually, I have 5 workers running (1 per physical machine), as displayed by
the Spark UI at spark://IP_of_the_master:7077. I have entered all the
physical machines' IPs in a file named slaves in the spark/conf directory and
am using the start-all.sh script to start the cluster.
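
For reference, the slaves file is just one worker hostname or IP per line,
e.g. (hypothetical addresses):

# spark/conf/slaves -- one worker host per line
192.168.1.101
192.168.1.102
192.168.1.103
192.168.1.104
192.168.1.105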

My question is: is there a way to control how the tasks are distributed among
different workers? To my knowledge, it is done by Spark automatically and is
not in our control.



RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-12 Thread Ulanov, Alexander
Hi Disha,

The problem might be as follows. The data that you have might physically
reside on only two nodes, and Spark launches data-local tasks; as a result,
only two workers are used. You might want to force Spark to distribute the
data across all nodes; however, it does not seem worthwhile for this rather
small dataset.
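
If you do want to force an even spread, a minimal sketch (assuming the train
DataFrame and trainer from your earlier code) is to repartition before
fitting:

// Force an even spread of the training data across the cluster;
// repartition() triggers a full shuffle.
val trainSpread = train.repartition(sc.defaultParallelism)
val model = trainer.fit(trainSpread)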

Best regards, Alexander




Re: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-15 Thread Disha Shrivastava
Hi Alexander,

Thanks for your reply. Actually, I am working with a modified version of the
actual MNIST dataset (maximum samples = 8.2M):
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html. I
have been running different sized versions (1, 10, 50, 1M, 8M samples) on
different numbers of workers (1, 2, 3, 4, 5) and obtaining results. I have
observed that when I specify partitions manually, the cluster actually shows
scalable performance: the time taken decreases as the number of cores
increases. With default settings, Spark automatically divides the data into
partitions (I guess based on data size, etc.), and this number is fixed
irrespective of the actual number of workers present in the cluster.
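
To see what Spark chose by default, the partition count can be inspected
directly (a minimal sketch, assuming the train DataFrame from the earlier
code):

// Inspect how many partitions back the training DataFrame.
println(s"Training partitions: ${train.rdd.partitions.length}")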

As far as the data residing on two machines is concerned, I am reading the
data from HDFS (a multi-node Hadoop cluster set up across all worker
machines). With the default number of partitions, Spark gives better results
(less time and better accuracy) than when I manually set the number of
partitions; but the problem is that I can't observe the effect of
scalability.

My question is: if I have to obtain both scalability and optimality, how
should I go about it in Spark? Clearly, in my case, the scalable
implementation is not necessarily optimal. Here, by scalability I mean that
if I increase the number of worker machines, I should get better performance
(less time taken).

Thanks and Regards
Disha


RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-15 Thread Ulanov, Alexander
Hi Disha,

This is a good question. We plan to elaborate on it in our talk at the
upcoming Spark Summit. Fewer workers mean less compute power; more workers
mean more communication overhead. So there exists an optimal number of
workers for solving an optimization problem with batch gradient descent,
given the size of the data and the model. Also, you have to make sure that
all workers own local data; that is a separate issue from the number of
partitions.
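
To locate that optimum empirically, one option is a simple timing sweep over
partition counts after forcing the data onto all nodes (a minimal sketch; the
candidate counts are arbitrary and assume the train DataFrame and trainer
from the earlier code):

// Time training at several partition counts to find the sweet spot
// between compute power and communication overhead.
for (numPartitions <- Seq(5, 10, 20, 40)) {
  val trainRep = train.repartition(numPartitions)
  trainRep.cache().count() // materialize so the shuffle is not timed
  val start = System.nanoTime()
  trainer.fit(trainRep)
  val elapsedSec = (System.nanoTime() - start) / 1e9
  println(s"$numPartitions partitions: $elapsedSec s")
  trainRep.unpersist()
}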

Best regards, Alexander
