Hi,

I am working on artificial neural networks for Spark. Training is done with
gradient descent: at each step the data is read, a sum of gradients is computed
for each data partition (on each worker), aggregated (on the driver), and
broadcast back. I noticed that the gradient computation takes several times
less time than the whole step. To narrow down my observation, I ran the
gradient on a single machine with a single partition of data of size 100MB that
I persist (data.persist()). This should at least minimize the aggregation
overhead, but the gradient computation still takes much less time than the
whole step. Just in case: the data is loaded with MLUtils.loadLibSVMFile into
an RDD[LabeledPoint]. This is my code:

    val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
    val train = MLUtils.loadLibSVMFile(new SparkContext(conf),
      "/data/mnist/mnist.scale").repartition(1).persist()
    // training data, batch size, hidden layer size, iterations, LBFGS tolerance
    val model = ANN2Classifier.train(train, 1000, Array[Int](32), 10, 1e-4)
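
For context, the per-step pattern I described above (per-partition gradient
sums combined on the driver) roughly corresponds to the sketch below. The
gradient function and vector types here are hypothetical stand-ins for
illustration, not the actual ANN2Classifier internals:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import breeze.linalg.DenseVector

// One gradient-descent step: broadcast the current weights, sum the
// per-example gradients within each partition on the workers, then
// combine the partial sums on the driver via treeAggregate.
// `localGradient` is a hypothetical placeholder for the per-example
// gradient of the network.
def gradientStep(
    data: RDD[LabeledPoint],
    weights: DenseVector[Double],
    localGradient: (LabeledPoint, DenseVector[Double]) => DenseVector[Double])
  : DenseVector[Double] = {
  val bcWeights = data.context.broadcast(weights)
  data.treeAggregate(DenseVector.zeros[Double](weights.length))(
    seqOp = (acc, point) => acc + localGradient(point, bcWeights.value),
    combOp = (a, b) => a + b)
}
```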

The profiler shows two threads: one is doing the gradient computation, and I
don't know what the other is doing. The gradient takes only about 10% of its
thread's time; almost all the rest is spent in MemoryStore. Below is the
screenshot (first thread):
https://drive.google.com/file/d/0BzYMzvDiCep5bGp2S2F6eE9TRlk/view?usp=sharing
Second thread:
https://drive.google.com/file/d/0BzYMzvDiCep5OHA0WUtQbXd3WmM/view?usp=sharing

Could Spark developers please elaborate on what's going on in MemoryStore? It
seems to do some string operations (parsing the libsvm file? Why on every
step?) and a lot of InputStream reading. The overall time seems to depend on
the size of the data batch (or the size of the vectors) I am processing;
however, it does not look linear to me.

Also, I would like to know how to speed up these operations.
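
One thing I am considering trying (this is an assumption on my side, not a
confirmed diagnosis): if the cached partition does not fully fit in memory
under the default deserialized storage level, Spark recomputes it, which would
mean re-parsing the libsvm file on every iteration and could explain the
string operations in the profile. A serialized or disk-backed storage level
can be tried like this:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.util.MLUtils

// Sketch: cache the training set in serialized form (smaller memory
// footprint), so a partition that would otherwise be evicted is not
// re-parsed from the libsvm file on every gradient step.
def loadTrain(sc: SparkContext, path: String) =
  MLUtils.loadLibSVMFile(sc, path)
    .repartition(1)
    .persist(StorageLevel.MEMORY_ONLY_SER)  // or MEMORY_AND_DISK
```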

Best regards, Alexander
