[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-29 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-68241121
  
I have compared the ANN with Support Vector Machine (SVM) and Logistic 
Regression.

I have tested using a master "local[5]" configuration, and applied the 
MNIST dataset, using 60,000 training examples and 10,000 test examples.

Since SVM and Logistic Regression are binary classifiers, I applied two 
methods to turn them into multiclass classifiers: majority vote and an ad-hoc 
tree.

For the majority vote, I trained 10 different models, each to distinguish a 
single class from the rest. The classification was done by looking at which 
model gives the highest positive output. I performed 100 iterations per class, 
leading to 1000 iterations in total.
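
For concreteness, here is a hedged sketch of that one-vs-rest voting scheme 
(illustrative code, not necessarily the exact code behind these numbers; shown 
for SVM, Logistic Regression is analogous):

```
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train one binary model per digit: class k versus the rest.
def trainOneVsRest(data: RDD[LabeledPoint], numClasses: Int): Seq[SVMModel] =
  (0 until numClasses).map { k =>
    val binary = data.map(p => LabeledPoint(if (p.label == k) 1.0 else 0.0, p.features))
    val model = SVMWithSGD.train(binary, 100) // 100 iterations per class
    model.clearThreshold()                    // output the raw margin instead of 0/1
    model
  }

// Classify by picking the model that gives the highest output.
def predictByVote(models: Seq[SVMModel], v: Vector): Int =
  models.zipWithIndex.maxBy { case (m, _) => m.predict(v) }._2
```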

For ANN, I used a single hidden layer with 32 nodes (not counting the bias 
nodes). I performed 100 iterations.

For LBFGS I used tolerance 1e-5.

Because of the poor performance of SVM+SGD, I re-ran it with 1000 
iterations per class (10,000 in total). The performance was similar.

I found the following results for the test set:

```
+-----------------------------+----------+--------+-----------+-------------+
| Algorithm                   | Accuracy | Time   | # correct | # incorrect |
+-----------------------------+----------+--------+-----------+-------------+
| ANN (LBFGS)                 |    95.1% |   665s |      9510 |         490 |
| Logistic Regression (SGD)   |    72.0% |  1325s |      7202 |        2798 |
| Logistic Regression (LBFGS) |    86.6% |  1635s |      8658 |        1342 |
| SVM (SGD)                   |    18.6% |  1294s |      1860 |        8140 |
| (SVM (SGD) 1000 iterations) |    18.5% | 12658s |      1850 |        8150 |
| SVM (LBFGS)                 |    86.2% |  1453s |      8622 |        1378 |
+-----------------------------+----------+--------+-----------+-------------+
```

I also created an ad-hoc tree model. This separates the collection of 
training examples into two partitions of approximately equal size, where I 
tried to separate the numbers based on how different they look. I continued 
partitioning recursively, until each output class corresponded to a single 
number.

The partitioning choice was made manually and intuitively, as follows:

0123456789 -> (04689, 12357)
04689 -> (068, 49)
068 -> (0, 68)
68 -> (6, 8)
49 -> (4, 9)
12357 -> (17, 235)
17 -> (1, 7)
235 -> (2, 35)
35 -> (3, 5)

Notice that this leads to only nine classification runs, not ten as in the 
voting scheme.
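
A hedged sketch of how such a tree classifies (illustrative only; `goLeft` 
stands for one trained binary model, e.g. an SVM margin test):

```
import org.apache.spark.mllib.linalg.Vector

sealed trait DigitTree
case class Leaf(digit: Int) extends DigitTree
case class Split(goLeft: Vector => Boolean, left: DigitTree, right: DigitTree)
  extends DigitTree

// Walk from the root to a leaf; each Split is one of the nine binary models.
def classify(tree: DigitTree, v: Vector): Int = tree match {
  case Leaf(d)           => d
  case Split(test, l, r) => if (test(v)) classify(l, v) else classify(r, v)
}
```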

After training, I used the trained models to classify the test set. I got 
the following results (same parameters as with the voting scheme):

```
+-----------------------------+----------+--------+-----------+-------------+
| Algorithm                   | Accuracy | Time   | # correct | # incorrect |
+-----------------------------+----------+--------+-----------+-------------+
| ANN (LBFGS)                 |    95.1% |   665s |      9510 |         490 |
| Logistic Regression (SGD)   |    82.3% |  1146s |      8228 |        1772 |
| Logistic Regression (LBFGS) |    87.2% |  1273s |      8719 |        1281 |
| SVM (SGD)                   |    61.1% |  1148s |      6113 |        3887 |
| SVM (LBFGS)                 |    87.5% |  1182s |      8753 |        1247 |
+-----------------------------+----------+--------+-----------+-------------+
```

Notice that I left ANN in the table because the goal is to compare ANN with 
the other algorithms. Since ANN is a multiclass classifier by nature, it did 
not use the ad-hoc tree, so its results are identical to the first table.

It would be great if someone could verify my results. I am particularly 
surprised by the low performance of SVM+SGD with voting, and the improvement 
with the ad-hoc tree. I used the same code for SGD and LBFGS, changing only 
the optimiser and related parameters.


[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-22 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-67915148
  
@jkbradley @avulanov 

Agree that we should refrain from adding too many options at this point in 
time, and keep the implementation simple but robust.

Concerning interchangeable optimisers: I am developing a preference for 
using the case classes as discussed before. This would also get rid of the 
plurality of training functions, since the case class instance carries either 
the default parameters or values changed by the application. Whether the 
values are default or customised, the case class instance can be the input 
to a single train function.

When to do this is the question though, especially since such a solution 
could be useful for other learning algorithms as well. However, if we don't do 
it now, we will have to accept keeping the different training functions around 
for backward compatibility reasons, at least for some time.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-18 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-67585200
  
Addendum: notice that the ANNClassifier.train function has several 
overloads, and the number of nodes in the hidden layer(s) is quite critical. 
Hence I would prefer using:
```
randomWeights(data: RDD[LabeledPoint], hiddenLayersTopology: Array[Int])
```
where hiddenLayersTopology indeed only indicates the nodes in the hidden 
layers, as the numbers of input and output nodes are deduced from the data.
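
A hypothetical call matching that signature (the hidden layer size is 
illustrative): for MNIST, the 784 input and 10 output nodes would be read 
off `data`, so only the hidden layer sizes need to be passed:

```
val initialWeights = ANNClassifier.randomWeights(data, Array(32))
```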





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-18 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-67584041
  
> @bgreeven I have cloned your branch and am trying to run the MNIST 
dataset.
> I can't quite understand how to set the number of output neurons though.
> The topology array seems to only apply to the hidden layers. I have seen
> some tests of MNIST on your code though, so I was curious how this was
> done?

The ANN learns the number of input nodes and the number of output nodes from 
a sample of the data. This is done automatically when invoking the train 
function.

The ArtificialNeuralNetwork object/class only does approximation, not 
classification. For classification, it is best to use the ANNClassifier 
object/class.

The simplest call to ANNClassifier is the following:
```
val model = ANNClassifier.train(data)
```
where `data: RDD[LabeledPoint]`

In this case, the features of each LabeledPoint are the bitmap values, 
whereas the label could be one of the values 0, 1, ..., 9.

Subsequent prediction can be done by
```
val predictedLabel = model.predict(v)
```

where v is a vector of features, and the output is one of the labels 0, 1, 
..., 9.
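
Putting the pieces together, a minimal end-to-end sketch under the 
assumptions above (ANNClassifier is this PR's class; the LibSVM-formatted 
MNIST path is hypothetical):

```
import org.apache.spark.mllib.util.MLUtils

// Labels are 0.0 ... 9.0; features are the 784 pixel values.
val data = MLUtils.loadLibSVMFile(sc, "data/mnist.libsvm")
val model = ANNClassifier.train(data)

// Predict the digit for a single feature vector.
val predictedLabel = model.predict(data.first().features)
```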






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-17 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-67422794
  
@jkbradley @Lewuathe 

Indeed, I have been thinking about such an interface as well. I quite like 
it, but:

GradientDescent is private in MLlib, so you can't create a GradientDescent 
object from the application:

```
class GradientDescent private[mllib] (private var gradient: Gradient, private var updater: Updater)
  extends Optimizer with Logging
```

That can be easily resolved by removing `private[mllib]` though.

What worries me more is that both the GradientDescent and LBFGS classes 
require the gradient and updater as parameters. However, we define a 
customised gradient and updater for ANN, which may be best kept private to 
the ANN.
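
For illustration, this is roughly what application-side construction would 
look like if `private[mllib]` were dropped (LeastSquaresGradientANN and 
ANNUpdater are this PR's custom gradient and updater; their constructor 
arguments here are assumptions):

```
val optimizer = new GradientDescent(new LeastSquaresGradientANN(topology), new ANNUpdater())
  .setNumIterations(1000)
  .setStepSize(1.0)
  .setMiniBatchFraction(1.0)
```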

What do you think?





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-16 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-67281152
  
So you mean something like the following?

```
abstract class OptimizerInfo

case class OptimizerInfoSGD(var noIterations: Int = 1000,
                            var batchFrac: Double = 1.0) extends OptimizerInfo

case class OptimizerInfoLBFGS(var maxIterations: Int = 1000,
                              var tolerance: Double = 1e-4) extends OptimizerInfo
```

And in the training function:

```
def train(..., oi: OptimizerInfo): ArtificialNeuralNetworkModel = {
  oi match {
    case lbfgs: OptimizerInfoLBFGS =>
      // perform LBFGS optimisation
    case sgd: OptimizerInfoSGD =>
      // perform SGD optimisation
  }
}
```
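
A self-contained, runnable reduction of this sketch (all names and defaults 
are illustrative, trimmed to show just the defaults and the dispatch):

```
sealed abstract class OptimizerInfo
case class OptimizerInfoSGD(noIterations: Int = 1000, batchFrac: Double = 1.0)
  extends OptimizerInfo
case class OptimizerInfoLBFGS(maxIterations: Int = 1000, tolerance: Double = 1e-4)
  extends OptimizerInfo

// One train entry point; the case class carries default or customised values.
def describeTraining(oi: OptimizerInfo): String = oi match {
  case OptimizerInfoLBFGS(maxIter, tol) => s"LBFGS: maxIterations=$maxIter, tolerance=$tol"
  case OptimizerInfoSGD(iter, frac)     => s"SGD: noIterations=$iter, batchFrac=$frac"
}

describeTraining(OptimizerInfoLBFGS())                  // all defaults
describeTraining(OptimizerInfoSGD(noIterations = 500))  // one value overridden
```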





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-16 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-67267487
  
@avulanov @jkbradley 

An advantage of the string is that you can pass it as an opaque token from 
the ANNClassifier class to the ArtificialNeuralNetwork class; i.e. the 
ANNClassifier class will work with whatever optimiser the string specifies, 
without actually knowing about it.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...

2014-12-16 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-67267236
  
@avulanov @jkbradley

The issue is that some optimisers use different parameters than others. 
For example, LBFGS uses a tolerance, whereas SGD has a miniBatchFraction and 
stepSize. So if you use a single train function, you need some mechanism to 
convey these parameters within that string (or maybe another string). Maybe 
something like "SGD,miniBatchFraction=1.0,stepSize=1.0"? That seems a bit of 
an artificial workaround though. However, you could define default values, 
making the call simpler, and indeed removing the need for separate functions 
for different sets of default values.
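
To make the awkwardness concrete, a small sketch of what the string-based 
variant would entail (illustrative only; the thread later leans toward case 
classes instead):

```
// Parse "SGD,miniBatchFraction=1.0,stepSize=1.0" into a name and parameters.
def parseOptimizer(spec: String): (String, Map[String, Double]) = {
  val parts = spec.split(",")
  val params = parts.tail.map { kv =>
    val Array(key, value) = kv.split("=")
    key -> value.toDouble
  }.toMap
  (parts.head, params)
}

parseOptimizer("SGD,miniBatchFraction=1.0,stepSize=1.0")
// ("SGD", Map(miniBatchFraction -> 1.0, stepSize -> 1.0))
```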

With different train functions in a single class, you can define the 
parameters on a per-training-function basis. You would have to create the 
optimizer objects when invoking the training function, but that may not be 
too big an issue.

Creating a new class, e.g. ArtificialNeuralNetworkWithSGD, is yet another 
possibility. It has the disadvantage of duplicated code though, which is 
especially bothersome when some of the core code changes. That issue could 
be reduced by moving as much as possible into Scala traits.

Just philosophising here, needs some more thought...





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-12-10 Thread bgreeven
Github user bgreeven commented on a diff in the pull request:

https://github.com/apache/spark/pull/1290#discussion_r21603916
  
--- Diff: docs/mllib-ann.md ---
@@ -0,0 +1,239 @@
+---
+layout: global
+title: Artificial Neural Networks - MLlib
+displayTitle: MLlib - Artificial Neural Networks
+---
+
+# Introduction
+
+This document describes MLlib's Artificial Neural Network (ANN) implementation.
+
+The implementation currently consists of the following files:
+
+* 'ArtificialNeuralNetwork.scala': implements the ANN
+* 'ANNSuite': implements automated tests for the ANN and its gradient
+* 'ANNDemo': a demo that approximates three functions and shows a 
graphical representation of
+the result
+
+# Summary of usage
+
+The "ArtificialNeuralNetwork" object is used as an interface to the neural 
network. It is
+called as follows:
+
+```
+val annModel = ArtificialNeuralNetwork.train(rdd, hiddenLayersTopology, 
maxNumIterations)
+```
+
+where
+
+* `rdd` is an RDD of type (Vector,Vector), the first element containing 
the input vector and
+the second the associated output vector.
+* `hiddenLayersTopology` is an array of integers (Array[Int]), which 
contains the number of
+nodes per hidden layer, starting with the layer that takes inputs from the 
input layer, and
+finishing with the layer that outputs to the output layer. The bias nodes 
are not counted.
+* `maxNumIterations` is an upper bound to the number of iterations to be 
performed.
+* `annModel` contains the trained ANN parameters, and can be used to calculate the ANN's
+approximation for arbitrary input values.
+
+The approximations can be calculated as follows:
+
+```
+val v_out = annModel.predict(v_in)
+```
+
+where v_in is either a Vector or an RDD of Vectors, and v_out respectively 
a Vector or RDD of
+(Vector,Vector) pairs, corresponding to input and output values.
+
+Further details and other calling options will be elaborated upon below.
+
+# Architecture and Notation
+
+The file ArtificialNeuralNetwork.scala implements the ANN. The following 
picture shows the
+architecture of a 3-layer ANN:
+
+```
+ +-------+
+ | N_0,0 |---.        +-------+
+ +-------+    \       | N_0,1 |---.       +-------+
+               \      +-------+    \      | N_0,2 |
+ +-------+ Wij1 \                   \     +-------+
+ | N_1,0 |------->+-------+   Wjk2   \
+ +-------+       /| N_1,1 |----------->+-------+
+     :          / +-------+            | N_1,2 |
+     :         /      :                +-------+
+     :        /       :                    :
+ +-------+   /    +-------+            +-------+
+ |N_I-1,0|--'     |N_J-1,1|            |N_K-1,2|
+ +-------+        +-------+            +-------+
+
+ +-------+        +-------+
+ |  -1   |        |  -1   |  (bias nodes)
+ +-------+        +-------+
+
+ INPUT LAYER      HIDDEN LAYER         OUTPUT LAYER
+```
+
+The i-th node in layer l is denoted by N_{i,l}, both i and l starting with 
0. The weight
+between node i in layer l-1 and node j in layer l is denoted by Wijl. 
Layer 0 is the input
+layer, whereas layer L is the output layer.
+
+The ANN also implements bias units. These are nodes that always output the value -1.
+The bias units are present in all layers except the output layer. They act similarly
+to other nodes, but do not have input.
+
+The value of node N_{j,l} is calculated as follows:
+
+`$N_{j,l} = g( \sum_{i=0}^{topology_{l-1}} W_{i,j,l} N_{i,l-1} )$`
+
+Where g is the sigmoid function
+
+`$g(t) = \frac{e^{\beta t} }{1+e^{\beta t}}$`
+
+# LBFGS
+
+MLlib's ANN implementation uses the LBFGS optimisation algorithm for 
training. It minimises the
+following error function:
+
+`$E = \sum_{k=0}^{K-1} (N_{k,L} - Y_k)^2$`
+
+where Y_k is the target output given inputs N_{0,0} ... N_{I-1,0}.
+
+# Implementation Details
+
+## The "ArtificialNeuralNetwork" class
+
+The "ArtificialNeuralNetwork" class has the foll

[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-11-19 Thread bgreeven
Github user bgreeven commented on a diff in the pull request:

https://github.com/apache/spark/pull/1290#discussion_r20620263
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/ann/ArtificialNeuralNetwork.scala 
---
@@ -0,0 +1,528 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.ann
+
+import breeze.linalg.{DenseVector, Vector => BV, axpy => brzAxpy}
+
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.optimization._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/*
+ * Implements an Artificial Neural Network (ANN)
+ *
+ * The data consists of an input vector and an output vector, combined 
into a single vector
+ * as follows:
+ *
+ * [ ---input--- ---output--- ]
+ *
+ * NOTE: output values should be in the range [0,1]
+ *
+ * For a network of H hidden layers:
+ *
+ * hiddenLayersTopology(h) indicates the number of nodes in hidden layer 
h, excluding the bias
+ * node. h counts from 0 (first hidden layer, taking inputs from input 
layer) to H - 1 (last
+ * hidden layer, sending outputs to the output layer).
+ *
+ * hiddenLayersTopology is converted internally to topology, which adds 
the number of nodes
+ * in the input and output layers.
+ *
+ * noInput = topology(0), the number of input nodes
+ * noOutput = topology(L-1), the number of output nodes
+ *
+ * input = data( 0 to noInput-1 )
+ * output = data( noInput to noInput + noOutput - 1 )
+ *
+ * W_ijl is the weight from node i in layer l-1 to node j in layer l
+ * W_ijl goes to position ofsWeight(l) + j*(topology(l-1)+1) + i in the 
weights vector
+ *
+ * B_jl is the bias input of node j in layer l
+ * B_jl goes to position ofsWeight(l) + j*(topology(l-1)+1) + 
topology(l-1) in the weights vector
+ *
+ * error function: E( O, Y ) = sum( (O_j - Y_j)^2 )
+ * (with O = (O_0, ..., O_(noOutput-1)) the output of the ANN,
+ * and (Y_0, ..., Y_(noOutput-1)) the target output)
+ *
+ * node_jl is node j in layer l
+ * node_jl goes to position ofsNode(l) + j
+ *
+ * The weights gradient is defined as dE/dW_ijl and dE/dB_jl
+ * It has same mapping as W_ijl and B_jl
+ *
+ * For back propagation:
+ * delta_jl = dE/dS_jl, where S_jl the output of node_jl, but before 
applying the sigmoid
+ * delta_jl has the same mapping as node_jl
+ *
+ * Where E = ((estOutput-output),(estOutput-output)),
+ * the inner product of the difference between estimation and target 
output with itself.
+ *
+ */
+
+/**
+ * Artificial neural network (ANN) model
+ *
+ * @param weights the weights between the neurons in the ANN.
+ * @param topology array containing the number of nodes per layer in the 
network, including
+ * the nodes in the input and output layer, but excluding the bias nodes.
+ */
+class ArtificialNeuralNetworkModel private[mllib](val weights: Vector, val topology: Array[Int])
+  extends Serializable with ANNHelper {
+
+  /**
+   * Predicts values for a single data point using the trained model.
+   *
+   * @param testData represents a single data point.
+   * @return prediction using the trained model.
+   */
+  def predict(testData: Vector): Vector = {
+    Vectors.dense(computeValues(testData.toArray, weights.toArray))
+  }
+
+  /**
+   * Predict values for an RDD of data points using the trained model.
+   *
+   * @param testDataRDD RDD representing the input vectors.
+   * @return RDD with predictions using the trained model as (input, 
output) pairs.
+   */
+  def predict(testDataRDD: RDD[Vector]): RDD[(Vector,Vector)] = {
+    testDataRDD.map(T => (T, predict(T)))
+  }
+
+  private def computeValues(arrData: Array[Double], arrWeights: Array[Double]): Array[Double] = {
+    val arrNodes = forwardRun(arrData, arrWeights)
+    arrNodes.sli

[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-11-04 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-61749598
  
Let's discuss a bit more about making the optimiser, updater, gradient, and 
error function customisable.

Notice that for the current LBFGS algorithm, the error function determines 
the gradient (as the error function is minimised), and the gradient in turn 
is used by the updater and optimizer. Hence for a pluggable error function, 
the gradient needs to be pluggable too.

I think there would be value in making the "updater" and "optimizer" 
pluggable too. For the optimizers we have already seen the candidates LBFGS 
and SGD, both with their pros and cons. Also, there may be other optimizers 
that use something other than the gradient. Since the updater currently 
depends on the gradient, I suggest making it pluggable too. (I played around 
a bit with a genetic optimizer - it doesn't work very well, but it is an 
example of an optimizer that doesn't use the gradient.)

Maybe we can start by making the "optimizer", "gradient" and "updater" in 
the ArtificialNeuralNetwork class vars instead of vals. Then we can create a 
different ANN object for each "optimizer", "gradient" and "updater" 
combination, e.g. "ArtificialNeuralNetworkWithLBFGS". We also need to remove 
the convergenceTol parameter from the ArtificialNeuralNetwork constructor, 
since it is LBFGS-specific.
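
A rough sketch of that direction (heavily hedged: it assumes the 
optimizer/gradient/updater fields become protected vars and the constructor 
becomes accessible, neither of which is in the current code):

```
// Hypothetical per-combination subclass; LeastSquaresGradientANN and
// ANNUpdater are this PR's classes, LBFGS is MLlib's optimizer.
class ArtificialNeuralNetworkWithLBFGS(topology: Array[Int])
  extends ArtificialNeuralNetwork(topology) {
  gradient  = new LeastSquaresGradientANN(topology)
  updater   = new ANNUpdater()
  optimizer = new LBFGS(gradient, updater).setConvergenceTol(1e-4)
}
```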





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-11-03 Thread bgreeven
Github user bgreeven commented on a diff in the pull request:

https://github.com/apache/spark/pull/1290#discussion_r19733442
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/ann/ArtificialNeuralNetwork.scala 
---
(quotes the same ArtificialNeuralNetwork.scala excerpt as the 2014-11-19 
message above; the archived message is truncated before the review comment)

[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-11-03 Thread bgreeven
Github user bgreeven commented on a diff in the pull request:

https://github.com/apache/spark/pull/1290#discussion_r19733377
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/ann/ArtificialNeuralNetwork.scala 
---
(quotes the same ArtificialNeuralNetwork.scala excerpt as the 2014-11-19 
message above; the archived message is truncated before the review comment)

[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-11-03 Thread bgreeven
Github user bgreeven commented on a diff in the pull request:

https://github.com/apache/spark/pull/1290#discussion_r19723239
  
--- Diff: docs/mllib-ann.md ---
@@ -0,0 +1,223 @@
+---
+layout: global
+title: Artificial Neural Networks - MLlib
+displayTitle: MLlib - Artificial Neural Networks
+---
+
+# Introduction
+
+This document describes the MLlib's Artificial Neural Network (ANN) 
implementation.
+
+The implementation currently consists of the following files:
+
+* 'ArtificialNeuralNetwork.scala': implements the ANN
+* 'ANNSuite': implements automated tests for the ANN and its gradient
+* 'ANNDemo': a demo that approximates three functions and shows a 
graphical representation of
+the result
+
+# Summary of usage
+
+The "ArtificialNeuralNetwork" object is used as an interface to the neural 
network. It is
+called as follows:
+
+```
+val annModel = ArtificialNeuralNetwork.train(rdd, hiddenLayersTopology, 
maxNumIterations)
--- End diff --

@manishamde: In most cases, one hidden layer is enough. For some special 
functions two hidden layers are needed. This is a nice text about the choice of 
number of layers and number of nodes per layer:
http://www.heatonresearch.com/node/707
Especially the number of nodes per layer depends heavily on the particular 
problem.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-11-03 Thread bgreeven
Github user bgreeven commented on a diff in the pull request:

https://github.com/apache/spark/pull/1290#discussion_r19722711
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/ann/ArtificialNeuralNetwork.scala 
---
(quotes the same ArtificialNeuralNetwork.scala excerpt as the 2014-11-19 
message above; the archived message is truncated before the review comment)

[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-09-22 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-56344396
  
I also needed to change the demo, as the fast convergence no longer gives an 
interesting convergence graph. I moved the demo to the examples directory, 
but we can consider whether we want to keep it at all.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-09-22 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-56343909
  
Changed the optimiser to LBFGS. It works much faster, but has the 
disadvantage (due to the increased convergence speed per iteration) that it 
also starts to exhibit overfitting earlier (after far fewer iterations).





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-09-08 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-54927726
  
Thanks for your feedback. Your points are very helpful indeed.

Here is my response:

  1.  The user guide is for normal users and it should focus on how to use 
ANN. If we want to leave some notes for developers, we can append a section at 
the end.
[bgreeven]: Sure. I think the user guide needs a lot of revision anyway, 
but as you said, it is better to wait until the code is more stable before 
updating the user guide.

  2.  We don't ask users to treat unit tests as demos or examples. Instead, 
we put a short code snippet in the user guide and put a complete example under 
examples/.
[bgreeven]: OK, I'll see how to convert the demo into a unit test.

  3.  GeneralizedModel and GeneralizedAlgorithm are definitely out of the 
scope of this PR and they should not live under mllib.ann. We can make a 
separate JIRA to discuss the APIs. Could you remove them in this PR?
  4.  predict outputs the prediction for the first node. Would the first 
node be the only special node? How about having predict(v) output the full 
prediction and predict(v, i) output the prediction for the i-th node?

    [bgreeven]: I certainly understand your concerns on points 3 and 4. My 
reasons for adding GeneralizedModel and GeneralizedAlgorithm were that I see 
more uses for ANNs than classification only. A LabeledPoint implementation 
would restrict the output to an essentially one-dimensional value. If you want 
to learn e.g. a multidimensional function (such as in the demo), then you need 
something more general than LabeledPoint.

The architecture of taking only the first element of an output vector is 
there for legacy reasons. GeneralizedLinearModel (on which GeneralizedModel 
was modelled) as well as ClassificationModel only output a one-dimensional 
value, hence I made the interface of predict(v) the same and created a 
separate function predictV(v) to output the multidimensional result.

I think we can indeed open a second JIRA to discuss this, since there can 
also be other uses for multidimensional output than just classification.

  5.  Could you try to use LBFGS instead of GradientDescent?
    [bgreeven]: Tried it, and it works too. Actually, I would like to make the 
code more flexible, to allow for replacing the optimisation function. There is 
a lot of research in (parallelisation of) training ANNs, so the future may 
bring better optimisation strategies, and it should be easy to plug those into 
the existing code.

  6.  Please replace for loops by while loops. The latter are faster in 
Scala.
    [bgreeven]: Makes sense. Will do so. (See the sketch after this list.)

  7.  Please follow the Spark Code Style 
Guide<https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide> 
and update the style, e.g.: a. remove spaces after ( and before ); b. add 
ScalaDoc for all public classes and methods; c. line width should be smaller 
than 100 chars (in both main and test); d. some verification code is left in 
comments, please find a way to move it to unit tests; e. organize imports into 
groups and order them alphabetically within each group; f. do not add return or 
; unless they are necessary.
[bgreeven]: OK, I can do that. By the way, it seems that the Spark Code Style 
Guide is missing some rules. I would be happy to volunteer to expand the Style 
Guide, also since "sbt/sbt scalastyle" enforces some rules (such as mandatory 
spaces before and after '+') that are not mentioned in the Style Guide.

  8.  Please use existing unit tests as templates. For example, please 
rename TestParallelANN to ParallelANNSuite and use LocalSparkContext and 
FunSuite for Spark setup and test assertions. Remove main() from unit tests.
[bgreeven]: OK, I will look at this and see how to convert the demo to a 
unit test.

  9.  Is "Parallel" necessary in the name ParallelANN?
[bgreeven]: Not really. Better naming is desirable indeed.

  10.  What methods and classes are public? Could they be private or package 
private? Please generate the API doc and check the public ones.
[bgreeven]: Yes, I found out about this too. Some classes and methods need to 
be made public, as they currently cannot be accessed from outside. Maybe adding a 
Scala object as an interface (as is done in Alexander's code) is indeed better.
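
On point 6, a minimal illustration of the for-to-while rewrite mentioned 
above (self-contained, not the PR's code): in 2014-era Scala, a for 
comprehension over a range invokes a closure per element, which shows up in 
hot numeric loops.

```
val arr = Array(1.0, 2.0, 3.0)

// for-comprehension style (one closure call per element):
var sumFor = 0.0
for (i <- arr.indices) { sumFor += arr(i) }

// while-loop style preferred in MLlib inner loops:
var sumWhile = 0.0
var i = 0
while (i < arr.length) {
  sumWhile += arr(i)
  i += 1
}
```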






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-09-01 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-54102428
  
Now updated such that the code supports true back-propagation.

Thanks to Alexander Ulanov (avulanov) for implementing true 
back-propagation in his repository first. This code borrows extensively from 
his code, and uses the same back-propagation algorithm (save for using arrays 
rather than matrices/vectors) and "layers" vector (here called "topology").

Looking forward to continue our collaboration!





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-22 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-53037400
  
Joining efforts / cooperation is always good of course. :-)

Let me have a closer look at your code first, and see how it differs from 
mine. I'll try it with my data and see its outcome, usability and speed.

Since the optimisation, my code also works with 1024 input nodes, 512 
hidden nodes and 26 output nodes. I still need to play a bit more to find the 
optimal parameters for this particular problem though.






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-22 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-53031845
  
Added documentation





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-21 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-52889153
  
I have updated the code. Indeed the LeastSquaresGradientANN.compute 
function was the culprit.

I removed the Breeze instructions and replaced them with simple 
Array[Double] operations. I think especially the removal of taking a Breeze 
subvector helps.

In addition, some values can be re-used and don't need re-calculation in 
every loop iteration of the LeastSquaresGradientANN.compute function. So I 
have changed the loop order and moved some computations up the loop 
hierarchy.

There is considerable speed-up. It now works well with a test set size of 
256 input, 128 hidden and 26 output nodes (letter classifier).
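
A sketch of the kind of rewrite described (illustrative, not the PR's actual 
code): instead of materialising a Breeze subvector of the weights for every 
node, index the flat array directly.

```
import breeze.linalg.DenseVector

val weights = Array.fill(1000)(0.5)

// Before: copies a subvector of the weights on every call.
def nodeInputSlow(ofs: Int, nodes: Array[Double]): Double =
  DenseVector(weights.slice(ofs, ofs + nodes.length)) dot DenseVector(nodes)

// After: no allocation, direct indexing into the flat weights array.
def nodeInputFast(ofs: Int, nodes: Array[Double]): Double = {
  var sum = 0.0
  var i = 0
  while (i < nodes.length) {
    sum += weights(ofs + i) * nodes(i)
    i += 1
  }
  sum
}
```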


From: Alexander Ulanov [mailto:notificati...@github.com]
Sent: 18 August 2014 16:41
To: apache/spark
Cc: Bert Greevenbosch
Subject: Re: [spark] [MLLIB] [spark-2352] Implementation of an 1-hidden 
layer Artificial Neural Network (ANN) (#1290)


@bgreeven<https://github.com/bgreeven> I've looked at your code and the 
algorithm seems to be implemented correctly to the best of my knowledge. 
Probably, copying of the array of weights harms the performance. I played with 
single threaded implementation of perceptron in Scala and it works fine for my 
size of data (i.e. around few minutes).

—
Reply to this email directly or view it on 
GitHub<https://github.com/apache/spark/pull/1290#issuecomment-52465553>.






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-17 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-52456503
  
Thanks for your feedback. I'll write some documentation, and also add some 
comments. I'll try with similarly sized data.

The internal data structure of the weights (and gradient) would have a 
dimension of (1001*500)+(501*18) = 509518 floats, i.e. (1000 inputs + 1 bias) 
times 500 hidden nodes, plus (500 hidden + 1 bias) times 18 output nodes. The 
weights are stored in a dense (non-sparse) vector, sometimes converted to 
Breeze. There may be an issue with that for this size of data. It should be 
possible though, so it is worth having a look at how to fix it.






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-12 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-51997639
  
The ANN uses the existing GradientDescent from mllib.optimization for back 
propagation. It uses the gradient from the new LeastSquaresGradientANN class, 
and updates using the new ANNUpdater class.

This line in ANNUpdater.compute is the backbone of the back propagation:

brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights)
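
For readers unfamiliar with the primitive: axpy computes y := a*x + y, so the 
quoted line performs the gradient step weights := weights - stepSize * gradient 
in place. A self-contained illustration:

```
import breeze.linalg.{DenseVector, axpy => brzAxpy}

val brzWeights = DenseVector(0.1, 0.2, 0.3)
val gradient = DenseVector(1.0, -1.0, 0.5)
val thisIterStepSize = 0.01

brzAxpy(-thisIterStepSize, gradient, brzWeights)
// brzWeights is now DenseVector(0.09, 0.21, 0.295)
```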






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-11 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-51875281
  
SteepestDescend -> SteepestDescent can be changed. Thanks for noticing.

Hung Pham, did it work out for you now?





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-31 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50851021
  
Thanks a lot! I have added the extension now.




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-29 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50569408
  
I updated the two sources to comply with "sbt/sbt scalastyle". Maybe retry 
the unit tests with the new modifications?




[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-28 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50436968
  
Jenkins, retest this please.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-28 Thread bgreeven
Github user bgreeven commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50421747
  
Hi Matthew,

Sure, I can. I was on holiday during the last two weeks, but now back in 
office. I'll update the code this week.

Best regards,
Bert


Bert Greevenbosch
Huawei Technologies Co., Ltd.

bert.greevenbo...@huawei.com

Huawei Industrial Base F1-8
Bantian, Longgang District
Shenzhen 518129
P.R. China

http://www.huawei.com

This e-mail and its attachments contain confidential information from 
HUAWEI, which
is intended only for the person or entity whose address is listed above. 
Any use of the
information contained herein in any way (including, but not limited to, 
total or partial
disclosure, reproduction, or dissemination) by persons other than the 
intended
recipient(s) is prohibited. If you receive this e-mail in error, please 
notify the sender by
phone or email immediately and delete it!

From: Matthew Burke [mailto:notificati...@github.com]
Sent: 20 July 2014 06:46
To: apache/spark
Cc: Bert Greevenbosch
Subject: Re: [spark] [MLLIB] [spark-2352] Implementation of an 1-hidden 
layer Artificial Neural Network (ANN) (#1290)


@bgreeven<https://github.com/bgreeven> Are you continuing work on this pull 
request so that it passes all unit tests?

—
Reply to this email directly or view it on 
GitHub<https://github.com/apache/spark/pull/1290#issuecomment-49531526>.





[GitHub] spark pull request: [spark-2352] Implementation of an 1-hidden lay...

2014-07-02 Thread bgreeven
GitHub user bgreeven opened a pull request:

https://github.com/apache/spark/pull/1290

[spark-2352] Implementation of an 1-hidden layer Artificial Neural Network 
(ANN)

The code contains a 1-hidden layer ANN, with a variable number of input, 
output and hidden nodes. It takes as input an RDD of vector pairs, 
corresponding to the training set with inputs and outputs.

A test program is also included, which contains a graphical representation 
that can be switched on using the "graph" parameter. Without it, the summed 
squared error over the testing set is displayed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bgreeven/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1290


commit e60d8786d16e55264a61d8afc8d67bf06068aaaf
Author: Bert Greevenbosch 
Date:   2014-07-03T03:33:22Z

Create ParallelANN.scala

This is the main ParallelANN class and associated Model

commit 52da23d54254b38eea7181872e4caa25981c028e
Author: Bert Greevenbosch 
Date:   2014-07-03T03:34:41Z

Create GeneralizedSteepestDescendAlgorithm

This is the general steepest descent model, with Vectors as inputs and 
Vectors or Doubles as outputs.

commit 152b8baf84742ecd0c622d41d3804eb74c0310a3
Author: Bert Greevenbosch 
Date:   2014-07-03T03:36:31Z

Create TestParallelANN.scala

This is a test program for parallel ANNs.

commit c8af840149ac8d1903afe4ac826a626d030bd385
Author: Bert Greevenbosch 
Date:   2014-07-03T03:42:29Z

Create TestParallelANNgraphics.scala

Visualisation tools; only used when "TestParallelANN" is given the "graph" 
parameter.



