Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
Hi,

i also had problems getting up to speed, but i blamed the cardinality of the
vectors for that. i didn't do the math exactly, but while streaming k-means
improves over regular k-means by using log(k) and (number of datapoints / k)
passes, the dimension parameter d from the original k*d*n stays untouched,
right?
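
roughly, one way to write that down (my sketch, assuming i iterations of
regular k-means and a sketch of about k*log(n) centroids with approximate
search):

  T_kmeans ≈ i * k * d * n    vs.    T_sketch ≈ n * d * log(k * log n)

so d is a common factor of both, and only the k- and pass-dependent terms
shrink.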

What is your vector's cardinality?


On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Ted,

 What were the CLI parameters when you ran this test for 1M points - no. of
 clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?







 On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 For reference, on a 16 core machine, I was able to run the sequential
 version of streaming k-means on 1,000,000 points, each with 10 dimensions
 in about 20 seconds.  The map-reduce versions are comparable subject to
 scaling except for startup time.



 On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
 wrote:

  That the algorithm runs a single reducer is expected. The algorithm
  creates a sketch of the data in parallel in the map-phase, which is
  collected by the reducer afterwards. The reducer then applies an
  expensive in-memory clustering algorithm to the sketch.
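 
  A minimal sketch of those two phases (assuming the 0.8 classes named
  elsewhere in this thread; constructors and package names are from memory,
  so treat this as an illustration rather than the exact driver code):

import java.util.List;

import org.apache.mahout.clustering.streaming.cluster.BallKMeans;
import org.apache.mahout.clustering.streaming.cluster.StreamingKMeans;
import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.neighborhood.BruteSearch;
import org.apache.mahout.math.neighborhood.FastProjectionSearch;

public class SketchThenCluster {

  // Map phase: each mapper compresses its split into a small sketch of
  // weighted centroids (roughly k * log n of them).
  static StreamingKMeans sketch(Iterable<Centroid> split, int sketchClusters,
                                double distanceCutoff) {
    StreamingKMeans sketcher = new StreamingKMeans(
        new FastProjectionSearch(new SquaredEuclideanDistanceMeasure(), 3, 2),
        sketchClusters, distanceCutoff);
    for (Centroid point : split) {
      sketcher.cluster(point);   // updates the sketch in place
    }
    return sketcher;             // iterable over the sketch centroids
  }

  // Reduce phase: the single reducer gathers every mapper's sketch and runs
  // the expensive in-memory ball k-means over the (much smaller) combined set.
  static void finalCluster(List<Centroid> allSketchCentroids, int k) {
    BallKMeans ballKMeans = new BallKMeans(
        new BruteSearch(new SquaredEuclideanDistanceMeasure()), k, 10);
    ballKMeans.cluster(allSketchCentroids);
  }
}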
 
  Which dataset are you using for testing? I can also do some tests on a
  cluster here.
 
  I can imagine two possible causes for the problems: Maybe there's a
  problem with the vectors and some calculations take very long because
  the wrong access pattern or implementation is chosen.
 
  Another problem could be that the mappers and reducers have too little
  memory and spend a lot of time running garbage collections.
 
  --sebastian
 
 
  On 23.12.2013 22:14, Suneel Marthi wrote:
   Has anyone been successful running Streaming KMeans clustering on a large
  dataset (> 100,000 points)?
  
  
   It just seems to take a very long time (> 4 hrs) for the mappers to
  finish on about 300K data points, and the reduce phase has only a single
  reducer running, which throws an OOM, failing the job several hours after the
  job has been kicked off.
  
   It's the same story when trying to run in sequential mode.
  
   Looking at the code, the bottleneck seems to be in
  StreamingKMeans.clusterInternal(); without understanding the behaviour of
  the algorithm, I am not sure if the sequence of steps in there is correct.
  
  
   There are a few calls that are invoked repeatedly, over and over again,
  like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
  
   We really need to have this working on datasets that are larger than the
  20K Reuters dataset.
  
   I am trying to run this on 300K vectors with k = 100, km = 1261 and
  FastProjectSearch.
  
 
 



Re: Streaming KMeans clustering

2013-12-25 Thread Sebastian Schelter
Hi Johannes,

can you share some details about the dataset that you ran streaming
k-means on (number of datapoints, cardinality, etc)?

@Ted/Suneel Shouldn't the approximate searching techniques (e.g.
projection search) help cope with high dimensional inputs?

--sebastian


On 25.12.2013 10:42, Johannes Schulte wrote:
 Hi,
 
 i also had problems getting up to speed, but i blamed the cardinality of the
 vectors for that. i didn't do the math exactly, but while streaming k-means
 improves over regular k-means by using log(k) and (number of datapoints / k)
 passes, the dimension parameter d from the original k*d*n stays untouched,
 right?
 
 What is your vector's cardinality?
 
 
 On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 
 Ted,

 What were the CLI parameters when you ran this test for 1M points - no. of
 clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?







 On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 For reference, on a 16 core machine, I was able to run the sequential
 version of streaming k-means on 1,000,000 points, each with 10 dimensions
 in about 20 seconds.  The map-reduce versions are comparable subject to
 scaling except for startup time.



 On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
 wrote:

 That the algorithm runs a single reducer is expected. The algorithm
 creates a sketch of the data in parallel in the map-phase, which is
 collected by the reducer afterwards. The reducer then applies an
 expensive in-memory clustering algorithm to the sketch.

 Which dataset are you using for testing? I can also do some tests on a
 cluster here.

 I can imagine two possible causes for the problems: Maybe there's a
 problem with the vectors and some calculations take very long because
 the wrong access pattern or implementation is chosen.

 Another problem could be that the mappers and reducers have too little
 memory and spend a lot of time running garbage collections.

 --sebastian


 On 23.12.2013 22:14, Suneel Marthi wrote:
 Has anyone been successful running Streaming KMeans clustering on a large
 dataset (> 100,000 points)?


 It just seems to take a very long time (> 4 hrs) for the mappers to
 finish on about 300K data points, and the reduce phase has only a single
 reducer running, which throws an OOM, failing the job several hours after the
 job has been kicked off.

 It's the same story when trying to run in sequential mode.

 Looking at the code, the bottleneck seems to be in
 StreamingKMeans.clusterInternal(); without understanding the behaviour of
 the algorithm, I am not sure if the sequence of steps in there is correct.


 There are a few calls that are invoked repeatedly, over and over again,
 like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

 We really need to have this working on datasets that are larger than the
 20K Reuters dataset.

 I am trying to run this on 300K vectors with k = 100, km = 1261 and
 FastProjectSearch.




 



Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
Hey Sebastian,

it was a text-like clustering problem with a dimensionality of 100,000; the
number of data points could have been millions, but i always cancelled
it after a while (i used the java classes, not the command line version, and
monitored the progress).

As for my statements above: they are possibly not quite correct. Sure, the
projection search reduces the amount of searching needed, but when i
looked into the code, i identified two problems, if i remember correctly:

- the searching of pending additions
- the projection itself


but i'll have to retry that and look into the code again. i ended up using
the old k-means code on a sample of the data...

cheers,

johannes


On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter s...@apache.org wrote:

 Hi Johannes,

 can you share some details about the dataset that you ran streaming
 k-means on (number of datapoints, cardinality, etc)?

 @Ted/Suneel Shouldn't the approximate searching techniques (e.g.
 projection search) help cope with high dimensional inputs?

 --sebastian


 On 25.12.2013 10:42, Johannes Schulte wrote:
  Hi,
 
  i also had problems getting up to speed, but i blamed the cardinality of the
  vectors for that. i didn't do the math exactly, but while streaming k-means
  improves over regular k-means by using log(k) and (number of datapoints / k)
  passes, the dimension parameter d from the original k*d*n stays untouched,
  right?
 
  What is your vector's cardinality?
 
 
  On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
  Ted,
 
  What were the CLI parameters when you ran this test for 1M points - no.
 of
  clusters k, km, distanceMeasure, projectionSearch,
 estimatedDistanceCutoff?
 
 
 
 
 
 
 
  On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
 ted.dunn...@gmail.com
  wrote:
 
  For reference, on a 16 core machine, I was able to run the sequential
  version of streaming k-means on 1,000,000 points, each with 10
 dimensions
  in about 20 seconds.  The map-reduce versions are comparable subject to
  scaling except for startup time.
 
 
 
  On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
  wrote:
 
  That the algorithm runs a single reducer is expected. The algorithm
  creates a sketch of the data in parallel in the map-phase, which is
  collected by the reducer afterwards. The reducer then applies an
  expensive in-memory clustering algorithm to the sketch.
 
  Which dataset are you using for testing? I can also do some tests on a
  cluster here.
 
  I can imagine two possible causes for the problems: Maybe there's a
  problem with the vectors and some calculations take very long because
  the wrong access pattern or implementation is chosen.
 
  Another problem could be that the mappers and reducers have too little
  memory and spend a lot of time running garbage collections.
 
  --sebastian
 
 
  On 23.12.2013 22:14, Suneel Marthi wrote:
  Has anyone been successful running Streaming KMeans clustering on a large
  dataset (> 100,000 points)?


  It just seems to take a very long time (> 4 hrs) for the mappers to
  finish on about 300K data points, and the reduce phase has only a single
  reducer running, which throws an OOM, failing the job several hours after
  the job has been kicked off.

  It's the same story when trying to run in sequential mode.

  Looking at the code, the bottleneck seems to be in
  StreamingKMeans.clusterInternal(); without understanding the behaviour of
  the algorithm, I am not sure if the sequence of steps in there is correct.


  There are a few calls that are invoked repeatedly, over and over again,
  like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

  We really need to have this working on datasets that are larger than the
  20K Reuters dataset.

  I am trying to run this on 300K vectors with k = 100, km = 1261 and
  FastProjectSearch.
 
 
 
 
 




Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi
@Johannes, reading your 2 emails I didn't quite get whether Streaming KMeans worked 
for you or not. What were the issues you identified with the pending additions 
and the projection?






On Wednesday, December 25, 2013 5:40 AM, Johannes Schulte 
johannes.schu...@gmail.com wrote:
 
Hey Sebastian,

it was a text-like clustering problem with a dimensionality of 100,000; the
number of data points could have been millions, but i always cancelled
it after a while (i used the java classes, not the command line version, and
monitored the progress).

As for my statements above: they are possibly not quite correct. Sure, the
projection search reduces the amount of searching needed, but when i
looked into the code, i identified two problems, if i remember correctly:

- the searching of pending additions
- the projection itself


but i'll have to retry that and look into the code again. i ended up using
the old k-means code on a sample of the data...

cheers,

johannes



On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter s...@apache.org wrote:

 Hi Johannes,

 can you share some details about the dataset that you ran streaming
 k-means on (number of datapoints, cardinality, etc)?

 @Ted/Suneel Shouldn't the approximate searching techniques (e.g.
 projection search) help cope with high dimensional inputs?

 --sebastian


 On 25.12.2013 10:42, Johannes Schulte wrote:
  Hi,
 
  i also had problems getting up to speed, but i blamed the cardinality of the
  vectors for that. i didn't do the math exactly, but while streaming k-means
  improves over regular k-means by using log(k) and (number of datapoints / k)
  passes, the dimension parameter d from the original k*d*n stays untouched,
  right?
 
  What is your vector's cardinality?
 
 
  On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
  Ted,
 
  What were the CLI parameters when you ran this test for 1M points - no.
 of
  clusters k, km, distanceMeasure, projectionSearch,
 estimatedDistanceCutoff?
 
 
 
 
 
 
 
  On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
 ted.dunn...@gmail.com
  wrote:
 
  For reference, on a 16 core machine, I was able to run the sequential
  version of streaming k-means on 1,000,000 points, each with 10
 dimensions
  in about 20 seconds.  The map-reduce versions are comparable subject to
  scaling except for startup time.
 
 
 
  On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
  wrote:
 
  That the algorithm runs a single reducer is expected. The algorithm
  creates a sketch of the data in parallel in the map-phase, which is
  collected by the reducer afterwards. The reducer then applies an
  expensive in-memory clustering algorithm to the sketch.
 
  Which dataset are you using for testing? I can also do some tests on a
  cluster here.
 
  I can imagine two possible causes for the problems: Maybe there's a
  problem with the vectors and some calculations take very long because
  the wrong access pattern or implementation is chosen.
 
  Another problem could be that the mappers and reducers have too little
  memory and spend a lot of time running garbage collections.
 
  --sebastian
 
 
  On 23.12.2013 22:14, Suneel Marthi wrote:
  Has anyone been successful running Streaming KMeans clustering on a large
  dataset (> 100,000 points)?


  It just seems to take a very long time (> 4 hrs) for the mappers to
  finish on about 300K data points, and the reduce phase has only a single
  reducer running, which throws an OOM, failing the job several hours after
  the job has been kicked off.

  It's the same story when trying to run in sequential mode.

  Looking at the code, the bottleneck seems to be in
  StreamingKMeans.clusterInternal(); without understanding the behaviour of
  the algorithm, I am not sure if the sequence of steps in there is correct.


  There are a few calls that are invoked repeatedly, over and over again,
  like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

  We really need to have this working on datasets that are larger than the
  20K Reuters dataset.

  I am trying to run this on 300K vectors with k = 100, km = 1261 and
  FastProjectSearch.
 
 
 
 
 



[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP

2013-12-25 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856592#comment-13856592
 ] 

Suneel Marthi commented on MAHOUT-1388:
---

[~yxjiang] Also, please provide adequate logging statements in your code. Would 
you only be supporting CSV as the input format? 

 Add command line support and logging for MLP
 

 Key: MAHOUT-1388
 URL: https://issues.apache.org/jira/browse/MAHOUT-1388
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 1.0
Reporter: Yexi Jiang
  Labels: mlp, sgd
 Fix For: 1.0


 The user should have the ability to run the Perceptron from the command line.
 There are two modes for the MLP, training and labeling: the first takes the 
 data as input and outputs the model; the second takes the model and unlabeled 
 data as input and outputs the results.
 The parameters are as follows:
 
 --mode -mo // train or label
 --input -i (input data)
 --model -mo  // in training mode, this is the location to store the model (if 
 the specified location has an existing model, it will update the model 
 through incremental learning), in labeling mode, this is the location to 
 store the result
 --output -o   // this is only useful in labeling mode
 --layersize -ls (no. of units per hidden layer) // use comma-separated numbers 
 to indicate the number of neurons in each layer (including the input layer and 
 the output layer)
 --momentum -m 
 --learningrate -l
 --regularizationweight -r
 --costfunction -cf   // the type of cost function,
 
 For example, to train a 3-layer (including input, hidden, and output) MLP with 
 the Minus_Square cost function, a 0.1 learning rate, 0.1 momentum rate, and 0.01 
 regularization weight, the parameters would be:
 mlp -mo train -i /tmp/training-data.csv -o /tmp/model.model -ls 5,3,1 -l 0.1 
 -m 0.1 -r 0.01 -cf minus_squared
 This command would read the training data from /tmp/training-data.csv and 
 write the trained model to /tmp/model.model.
 If a user needs to use an existing model, they would use the following command:
 mlp -mo label -i /tmp/unlabel-data.csv -m /tmp/model.model -o 
 /tmp/label-result
 Moreover, we should be providing default values if the user does not specify 
 any. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1358) StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true

2013-12-25 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1358:
--

Description: 
Running StreamingKMeans clustering with REDUCE_STREAMING_KMEANS = true, when 
no estimatedDistanceCutoff is specified, throws the following error:

{Code}

java.lang.IllegalArgumentException: Must have nonzero number of training and 
test vectors. Asked for %.1f %% of %d vectors for test [10.00149011612, 0]
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
at 
org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
at 
org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

{Code}

The issue is caused by the following code in StreamingKMeansThread.call()

{Code}
Iterator<Centroid> datapointsIterator = datapoints.iterator();
if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
  List<Centroid> estimatePoints =
      Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
  while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
    estimatePoints.add(datapointsIterator.next());
  }
  estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints,
      searcher.getDistanceMeasure());
}

StreamingKMeans clusterer =
    new StreamingKMeans(searcher, numClusters, estimateDistanceCutoff);
while (datapointsIterator.hasNext()) {
  clusterer.cluster(datapointsIterator.next());
}
{Code}

The code is using the same iterator twice, and it fails on the second use for 
obvious reasons.
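
One possible fix (a sketch only, not a committed patch) would be to cluster the 
buffered estimate points before draining the rest of the iterator; estimatePoints 
would also need to be hoisted out of the if-block above:

{Code}
StreamingKMeans clusterer =
    new StreamingKMeans(searcher, numClusters, estimateDistanceCutoff);
// The estimation pass consumed the first NUM_ESTIMATE_POINTS from the
// iterator, so feed the buffered estimate points to the clusterer first ...
for (Centroid datapoint : estimatePoints) {
  clusterer.cluster(datapoint);
}
// ... and then drain whatever the iterator has left.
while (datapointsIterator.hasNext()) {
  clusterer.cluster(datapointsIterator.next());
}
{Code}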


  was:
Running StreamingKMeans clustering with REDUCE_STREAMING_KMEANS = true throws 
the following error:

{Code}

java.lang.IllegalArgumentException: Must have nonzero number of training and 
test vectors. Asked for %.1f %% of %d vectors for test [10.00149011612, 0]
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
at 
org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
at 
org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

{Code}

The issue is caused by the following code in StreamingKMeansThread.call()

{Code}
Iterator<Centroid> datapointsIterator = datapoints.iterator();
if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
  List<Centroid> estimatePoints =
      Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
  while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
    estimatePoints.add(datapointsIterator.next());
  }
  estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints,
      searcher.getDistanceMeasure());
}

StreamingKMeans clusterer =
    new StreamingKMeans(searcher, numClusters, estimateDistanceCutoff);
while (datapointsIterator.hasNext()) {
  clusterer.cluster(datapointsIterator.next());
}
{Code}

The code is using the same iterator twice, and it fails on the second use for 
obvious reasons.



 StreamingKMeansThread throws IllegalArgumentException when 
 REDUCE_STREAMING_KMEANS is set to true
 -

 Key: MAHOUT-1358
 URL: https://issues.apache.org/jira/browse/MAHOUT-1358
 Project: Mahout
  Issue Type: Bug
  Components: 

Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi





On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com 
wrote:
 
For reference, on a 16 core machine, I was able to run the sequential
version of streaming k-means on 1,000,000 points, each with 10 dimensions
in about 20 seconds.  The map-reduce versions are comparable subject to
scaling except for startup time.

@Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? Not sure 
how this would have even worked for you in sequential mode in light of the issues 
reported against M-1314, M-1358 and M-1380 (all of which impact the sequential 
mode), unless you had fixed them locally.
What were your estimatedDistanceCutoff, number of clusters 'k' and projection 
search, and how much memory did you have to allocate to the single Reducer?




On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote:

 That the algorithm runs a single reducer is expected. The algorithm
 creates a sketch of
 the data in parallel in the map-phase, which is
 collected by the reducer afterwards. The reducer then applies an
 expensive in-memory clustering algorithm to the sketch.

 Which dataset are you using for testing? I can also do some tests on a
 cluster here.

 I can imagine two possible causes for the problems: Maybe there's a
 problem with the vectors and some calculations take very long because
 the wrong access pattern or implementation is chosen.

 Another problem could be that the mappers and reducers have too little
 memory and spend a lot of time running garbage collections.

 --sebastian


 On 23.12.2013 22:14,
 Suneel Marthi wrote:
  Has anyone been successful running Streaming KMeans clustering on a large
 dataset (> 100,000 points)?
 
 
  It just seems to take a very long time (> 4 hrs) for the mappers to
 finish on about 300K data points, and the reduce phase has only a single
 reducer running, which throws an OOM, failing the job several hours after the
 job has been kicked off.
 
  It's the same story when trying to run in sequential mode.
 
  Looking at the code, the bottleneck seems to be in
 StreamingKMeans.clusterInternal(); without understanding the behaviour of
 the algorithm, I am not sure if the sequence of steps in there is correct.
 
 
  There are a few calls that are invoked repeatedly, over and over again,
 like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
 
  We really need to have this working on datasets that are larger than the
 20K Reuters dataset.
 
  I am trying to run this on 300K vectors with k = 100, km = 1261 and
 FastProjectSearch.
 



Re: Streaming KMeans clustering

2013-12-25 Thread Sebastian Schelter
On 25.12.2013 14:19, Suneel Marthi wrote:
 
 
 
 
 
 On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com 
 wrote:
  
 For reference, on a 16 core machine, I was able to run the sequential
 version of streaming k-means on 1,000,000 points, each with 10 dimensions
 in about 20 seconds.  The map-reduce versions are comparable subject to
 scaling except for startup time.
 
 @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? Not sure 
 how this would have even worked for you in sequential mode in light of the 
 issues reported against M-1314, M-1358 and M-1380 (all of which impact the 
 sequential mode), unless you had fixed them locally.
 What were your estimatedDistanceCutoff, number of clusters 'k' and projection 
 search, and how much memory did you have to allocate to the single Reducer?

If I read the source code correctly, the final reducer clusters the
sketch which should contain m * k * log n intermediate centroids, where
k is the number of desired clusters, m is the number of mappers run and
n is the number of datapoints. Those centroids are expected to be dense,
so we can estimate the memory required for the final reducer using this
formula.
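
Plugging the numbers from this thread into that formula gives a feel for the
reducer's heap (illustrative only: n = 300K points and k = 100 from Suneel's
run, d = 100,000 from Johannes' dataset, and an assumed m = 10 mappers):

public class SketchMemoryEstimate {
  public static void main(String[] args) {
    int m = 10;             // number of mappers (assumed)
    int k = 100;            // desired clusters
    double n = 300000;      // datapoints
    long d = 100000;        // dimensionality; sketch centroids assumed dense
    double centroids = m * k * Math.log(n);                  // ~12,600 centroids
    double gb = centroids * d * 8 / (1024.0 * 1024 * 1024);  // 8 bytes per double
    System.out.printf("~%.0f sketch centroids, ~%.1f GB%n", centroids, gb);
  }
}

That lands near 10 GB of dense doubles in a single reducer, which no default
reducer heap will hold.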

 
 
 
 
 On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote:
 
 That the algorithm runs a single reducer is expected. The algorithm
 creates a sketch of
  the data in parallel in the map-phase, which is
 collected by the reducer afterwards. The reducer then applies an
 expensive in-memory clustering algorithm to the sketch.

 Which dataset are you using for testing? I can also do some tests on a
 cluster here.

 I can imagine two possible causes for the problems: Maybe there's a
 problem with the vectors and some calculations take very long because
 the wrong access pattern or implementation is chosen.

 Another problem could be that the mappers and reducers have too little
 memory and spend a lot of time running garbage collections.

 --sebastian


 On 23.12.2013 22:14,
  Suneel Marthi wrote:
 Has anyone been successful running Streaming KMeans clustering on a large
 dataset (> 100,000 points)?


 It just seems to take a very long time (> 4 hrs) for the mappers to
 finish on about 300K data points, and the reduce phase has only a single
 reducer running, which throws an OOM, failing the job several hours after the
 job has been kicked off.

 It's the same story when trying to run in sequential mode.

 Looking at the code, the bottleneck seems to be in
 StreamingKMeans.clusterInternal(); without understanding the behaviour of
 the algorithm, I am not sure if the sequence of steps in there is correct.


 There are a few calls that are invoked repeatedly, over and over again,
 like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

 We really need to have this working on datasets that are larger than the
 20K Reuters dataset.

 I am trying to run this on 300K vectors with k = 100, km = 1261 and
 FastProjectSearch.






Re: Streaming KMeans clustering

2013-12-25 Thread Suneel Marthi
Not sure how that would work in a corporate setting wherein there's a fixed 
systemwide setting that cannot be overridden. 

Sent from my iPhone

 On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org wrote:
 
 On 25.12.2013 14:19, Suneel Marthi wrote:
 
 
 
 
 
 On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com 
 wrote:
 
 For reference, on a 16 core machine, I was able to run the sequential
 version of streaming k-means on 1,000,000 points, each with 10 dimensions
 in about 20 seconds.  The map-reduce versions are comparable subject to
 scaling except for startup time.
 
 @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? Not 
 sure how this would have even worked for you in sequential mode in light of 
 the issues reported against M-1314, M-1358 and M-1380 (all of which impact the 
 sequential mode), unless you had fixed them locally.
 What were your estimatedDistanceCutoff, number of clusters 'k' and projection 
 search, and how much memory did you have to allocate to the single Reducer?
 
 If I read the source code correctly, the final reducer clusters the
 sketch which should contain m * k * log n intermediate centroids, where
 k is the number of desired clusters, m is the number of mappers run and
 n is the number of datapoints. Those centroids are expected to be dense,
 so we can estimate the memory required for the final reducer using this
 formula.
 
 
 
 
 
 On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org wrote:
 
 That the algorithm runs a single reducer is expected. The algorithm
 creates a sketch of
 the data in parallel in the map-phase, which is
 collected by the reducer afterwards. The reducer then applies an
 expensive in-memory clustering algorithm to the sketch.
 
 Which dataset are you using for testing? I can also do some tests on a
 cluster here.
 
 I can imagine two possible causes for the problems: Maybe there's a
 problem with the vectors and some calculations take very long because
 the wrong access pattern or implementation is chosen.
 
 Another problem could be that the mappers and reducers have too little
 memory and spend a lot of time running garbage collections.
 
 --sebastian
 
 
 On 23.12.2013 22:14,
 Suneel Marthi wrote:
 Has anyone been successful running Streaming KMeans clustering on a large
 dataset (> 100,000 points)?


 It just seems to take a very long time (> 4 hrs) for the mappers to
 finish on about 300K data points, and the reduce phase has only a single
 reducer running, which throws an OOM, failing the job several hours after the
 job has been kicked off.

 It's the same story when trying to run in sequential mode.

 Looking at the code, the bottleneck seems to be in
 StreamingKMeans.clusterInternal(); without understanding the behaviour of
 the algorithm, I am not sure if the sequence of steps in there is correct.


 There are a few calls that are invoked repeatedly, over and over again,
 like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

 We really need to have this working on datasets that are larger than the
 20K Reuters dataset.

 I am trying to run this on 300K vectors with k = 100, km = 1261 and
 FastProjectSearch.
 


Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
everybody should have the right to do

job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");

for that :)
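
(The exact property name varies with the Hadoop version; on newer releases the
equivalent key is mapreduce.reduce.java.opts, so it may need adjusting.)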


For my problems, i always felt the sketching took too long. i put up a
simple comparison here:

g...@github.com:baunz/cluster-comprarison.git

it generates some sample vectors and clusters them with regular k-means
and streaming k-means, both sequentially. i took 10 k-means iterations as a
benchmark and used the default values for FastProjectionSearch from the
kMeans driver class.

VisualVM tells me the most time is spent in FastProjectionSearch.remove().
This is called on every added datapoint.

Maybe i got something wrong, but for these sparse, high-dimensional vectors i
never got streaming k-means faster than the regular version.




On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Not sure how that would work in a corporate setting wherein there's a
 fixed systemwide setting that cannot be overridden.

 Sent from my iPhone

  On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org
 wrote:
 
  On 25.12.2013 14:19, Suneel Marthi wrote:
 
 
 
 
 
  On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
 ted.dunn...@gmail.com wrote:
 
  For reference, on a 16 core machine, I was able to run the sequential
  version of streaming k-means on 1,000,000 points, each with 10
 dimensions
  in about 20 seconds.  The map-reduce versions are comparable subject
 to
  scaling except for startup time.
 
  @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8?
 Not sure how this would have even worked for you in sequential mode in light
 of the issues reported against M-1314, M-1358 and M-1380 (all of which impact
 the sequential mode), unless you had fixed them locally.
  What were your estimatedDistanceCutoff, number of clusters 'k' and
 projection search, and how much memory did you have to allocate to the single
 Reducer?
 
  If I read the source code correctly, the final reducer clusters the
  sketch which should contain m * k * log n intermediate centroids, where
  k is the number of desired clusters, m is the number of mappers run and
  n is the number of datapoints. Those centroids are expected to be dense,
  so we can estimate the memory required for the final reducer using this
  formula.
 
 
 
 
 
  On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
 wrote:
 
  That the algorithm runs a single reducer is expected. The algorithm
  creates a sketch of
  the data in parallel in the map-phase, which is
  collected by the reducer afterwards. The reducer then applies an
  expensive in-memory clustering algorithm to the sketch.
 
  Which dataset are you using for testing? I can also do some tests on a
  cluster here.
 
  I can imagine two possible causes for the problems: Maybe there's a
  problem with the vectors and some calculations take very long because
  the wrong access pattern or implementation is chosen.
 
  Another problem could be that the mappers and reducers have too little
  memory and spend a lot of time running garbage collections.
 
  --sebastian
 
 
  On 23.12.2013 22:14,
  Suneel Marthi wrote:
  Has anyone been successful running Streaming KMeans clustering on a large
  dataset (> 100,000 points)?


  It just seems to take a very long time (> 4 hrs) for the mappers to
  finish on about 300K data points, and the reduce phase has only a single
  reducer running, which throws an OOM, failing the job several hours after
  the job has been kicked off.

  It's the same story when trying to run in sequential mode.

  Looking at the code, the bottleneck seems to be in
  StreamingKMeans.clusterInternal(); without understanding the behaviour of
  the algorithm, I am not sure if the sequence of steps in there is correct.


  There are a few calls that are invoked repeatedly, over and over again,
  like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

  We really need to have this working on datasets that are larger than the
  20K Reuters dataset.

  I am trying to run this on 300K vectors with k = 100, km = 1261 and
  FastProjectSearch.
 



[jira] [Commented] (MAHOUT-1388) Add command line support and logging for MLP

2013-12-25 Thread Yexi Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856671#comment-13856671
 ] 

Yexi Jiang commented on MAHOUT-1388:


[~smarthi] OK, I'll add it. Currently, it only supports CSV.



 Add command line support and logging for MLP
 

 Key: MAHOUT-1388
 URL: https://issues.apache.org/jira/browse/MAHOUT-1388
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 1.0
Reporter: Yexi Jiang
  Labels: mlp, sgd
 Fix For: 1.0


 The user should have the ability to run the Perceptron from the command line.
 There are two modes for the MLP, training and labeling: the first takes the 
 data as input and outputs the model; the second takes the model and unlabeled 
 data as input and outputs the results.
 The parameters are as follows:
 
 --mode -mo // train or label
 --input -i (input data)
 --model -mo  // in training mode, this is the location to store the model (if 
 the specified location has an existing model, it will update the model 
 through incremental learning), in labeling mode, this is the location to 
 store the result
 --output -o   // this is only useful in labeling mode
 --layersize -ls (no. of units per hidden layer) // use comma-separated numbers 
 to indicate the number of neurons in each layer (including the input layer and 
 the output layer)
 --momentum -m 
 --learningrate -l
 --regularizationweight -r
 --costfunction -cf   // the type of cost function,
 
 For example, to train a 3-layer (including input, hidden, and output) MLP with 
 the Minus_Square cost function, a 0.1 learning rate, 0.1 momentum rate, and 0.01 
 regularization weight, the parameters would be:
 mlp -mo train -i /tmp/training-data.csv -o /tmp/model.model -ls 5,3,1 -l 0.1 
 -m 0.1 -r 0.01 -cf minus_squared
 This command would read the training data from /tmp/training-data.csv and 
 write the trained model to /tmp/model.model.
 If a user needs to use an existing model, they would use the following command:
 mlp -mo label -i /tmp/unlabel-data.csv -m /tmp/model.model -o 
 /tmp/label-result
 Moreover, we should be providing default values if the user does not specify 
 any. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Happy Holidays!

2013-12-25 Thread Tharindu Rusira
Happy Holidays everyone !!! :)


On Wed, Dec 25, 2013 at 8:09 AM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:

 Merry Christmas and a Happy New Year!

  On Dec 24, 2013, at 3:36 PM, Stevo Slavić ssla...@gmail.com wrote:
 
  Happy Holidays Everyone!
 
 
  On Tue, Dec 24, 2013 at 12:28 PM, Frank Scholten fr...@frankscholten.nl
 wrote:
 
  Best wishes!
 
 
  On Tue, Dec 24, 2013 at 11:11 AM, Sebastian Schelter s...@apache.org
  wrote:
 
  ditto!
 
  On 24.12.2013 11:09, Isabel Drost-Fromm wrote:
 
  I'd like to take some time and wish everyone a Happy Holiday!
  Enjoy the time with your family and friends.
 
  Thank you all for your contributions and work on Mahout. Looking
  forward to an exciting 2014.
 
  Isabel
 




-- 
M.P. Tharindu Rusira Kumara

Department of Computer Science and Engineering,
University of Moratuwa,
Sri Lanka.
+94757033733
www.tharindu-rusira.blogspot.com


Re: Streaming KMeans clustering

2013-12-25 Thread Ted Dunning
Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.

You didn't run multiple sketching passes, did you?


Also, which version?  There was a horrendous clone in there at one time.




On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte 
johannes.schu...@gmail.com wrote:

 everybody should have the right to do

 job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");

 for that :)


 For my problems, i always felt the sketching took too long. i put up a
 simple comparison here:

 g...@github.com:baunz/cluster-comprarison.git

 it generates some sample vectors and clusters them with regular k-means
 and streaming k-means, both sequentially. i took 10 k-means iterations as a
 benchmark and used the default values for FastProjectionSearch from the
 kMeans driver class.

 VisualVM tells me the most time is spent in FastProjectionSearch.remove().
 This is called on every added datapoint.

 Maybe i got something wrong, but for these sparse, high-dimensional vectors i
 never got streaming k-means faster than the regular version.




 On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  Not sure how that would work in a corporate setting wherein there's a
  fixed systemwide setting that cannot be overridden.
 
  Sent from my iPhone
 
    On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org
  wrote:
  
   On 25.12.2013 14:19, Suneel Marthi wrote:
  
  
  
  
  
   On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
  ted.dunn...@gmail.com wrote:
  
   For reference, on a 16 core machine, I was able to run the
 sequential
   version of streaming k-means on 1,000,000 points, each with 10
  dimensions
   in about 20 seconds.  The map-reduce versions are comparable subject
  to
   scaling except for startup time.
  
   @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8?
  Not sure how this would have even worked for you in sequential mode in
  light of the issues reported against M-1314, M-1358 and M-1380 (all of
  which impact the sequential mode), unless you had fixed them locally.
   What were your estimatedDistanceCutoff, number of clusters 'k' and
  projection search, and how much memory did you have to allocate to the
  single Reducer?
  
   If I read the source code correctly, the final reducer clusters the
   sketch which should contain m * k * log n intermediate centroids, where
   k is the number of desired clusters, m is the number of mappers run and
   n is the number of datapoints. Those centroids are expected to be
 dense,
   so we can estimate the memory required for the final reducer using this
   formula.
  
  
  
  
  
   On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
  wrote:
  
   That the algorithm runs a single reducer is expected. The algorithm
   creates a sketch of
   the data in parallel in the map-phase, which is
   collected by the reducer afterwards. The reducer then applies an
   expensive in-memory clustering algorithm to the sketch.
  
   Which dataset are you using for testing? I can also do some tests on
 a
   cluster here.
  
   I can imagine two possible causes for the problems: Maybe there's a
   problem with the vectors and some calculations take very long because
   the wrong access pattern or implementation is chosen.
  
   Another problem could be that the mappers and reducers have too little
   memory and spend a lot of time running garbage collections.
  
   --sebastian
  
  
   On 23.12.2013 22:14,
   Suneel Marthi wrote:
   Has anyone been successful running Streaming KMeans clustering on a large
   dataset (> 100,000 points)?


   It just seems to take a very long time (> 4 hrs) for the mappers to
   finish on about 300K data points, and the reduce phase has only a single
   reducer running, which throws an OOM, failing the job several hours after
   the job has been kicked off.

   It's the same story when trying to run in sequential mode.

   Looking at the code, the bottleneck seems to be in
   StreamingKMeans.clusterInternal(); without understanding the behaviour of
   the algorithm, I am not sure if the sequence of steps in there is correct.


   There are a few calls that are invoked repeatedly, over and over again,
   like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

   We really need to have this working on datasets that are larger than the
   20K Reuters dataset.

   I am trying to run this on 300K vectors with k = 100, km = 1261 and
   FastProjectSearch.
  
 



Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
To be honest, i always cancelled the sketching after a while because i
wasn't satisfied with the points-per-second speed. The version used is the
0.8 release.

if i find the time, i'm gonna look at what is called when and where and how
often, and what the problem could be.


On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.

 You didn't run multiple sketching passes, did you?


 Also, which version?  There was a horrendous clone in there at one time.




 On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte 
 johannes.schu...@gmail.com wrote:

  everybody should have the right to do
 
  job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");
 
  for that :)
 
 
  For my problems, i always felt the sketching took too long. i put up a
  simple comparison here:
 
  g...@github.com:baunz/cluster-comprarison.git
 
  it generates some sample vectors and clusters them with regular k-means
  and streaming k-means, both sequentially. i took 10 k-means iterations as a
  benchmark and used the default values for FastProjectionSearch from the
  kMeans driver class.
 
  VisualVM tells me the most time is spent in FastProjectionSearch.remove().
  This is called on every added datapoint.
 
  Maybe i got something wrong, but for these sparse, high-dimensional vectors
  i never got streaming k-means faster than the regular version.
 
 
 
 
  On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.com
  wrote:
 
   Not sure how that would work in a corporate setting wherein there's a
   fixed systemwide setting that cannot be overridden.
  
   Sent from my iPhone
  
On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org
   wrote:
   
On 25.12.2013 14:19, Suneel Marthi wrote:
   
   
   
   
   
On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
   ted.dunn...@gmail.com wrote:
   
For reference, on a 16 core machine, I was able to run the
  sequential
version of streaming k-means on 1,000,000 points, each with 10
   dimensions
in about 20 seconds.  The map-reduce versions are comparable
 subject
   to
scaling except for startup time.
   
@Ted, were you working off the Streaming KMeans impl as in Mahout 0.8?
    Not sure how this would have even worked for you in sequential mode in
    light of the issues reported against M-1314, M-1358 and M-1380 (all of
    which impact the sequential mode), unless you had fixed them locally.
What were your estimatedDistanceCutoff, number of clusters 'k' and
    projection search, and how much memory did you have to allocate to the
    single Reducer?
   
If I read the source code correctly, the final reducer clusters the
sketch which should contain m * k * log n intermediate centroids,
 where
k is the number of desired clusters, m is the number of mappers run
 and
n is the number of datapoints. Those centroids are expected to be
  dense,
so we can estimate the memory required for the final reducer using
 this
formula.
   
   
   
   
   
On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter 
 s...@apache.org
   wrote:
   
That the algorithm runs a single reducer is expected. The algorithm
creates a sketch of
the data in parallel in the map-phase, which is
collected by the reducer afterwards. The reducer then applies an
expensive in-memory clustering algorithm to the sketch.
   
Which dataset are you using for testing? I can also do some tests
 on
  a
cluster here.
   
I can imagine two possible causes for the problems: Maybe there's a
problem with the vectors and some calculations take very long
 because
the wrong access pattern or implementation is chosen.
   
Another problem could be that the mappers and reducers have too little
memory and spend a lot of time running garbage collections.
   
--sebastian
   
   
On 23.12.2013 22:14,
Suneel Marthi wrote:
Has anyone been successful running Streaming KMeans clustering on a large
dataset (> 100,000 points)?

It just seems to take a very long time (> 4 hrs) for the mappers to
finish on about 300K data points, and the reduce phase has only a single
reducer running, which throws an OOM, failing the job several hours after
the job has been kicked off.

It's the same story when trying to run in sequential mode.

Looking at the code, the bottleneck seems to be in
StreamingKMeans.clusterInternal(); without understanding the behaviour of
the algorithm, I am not sure if the sequence of steps in there is correct.

There are a few calls that are invoked repeatedly, over and over again,
like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

We really need to have this working on datasets that are larger than the
20K Reuters dataset.

I am trying to run this on 300K vectors with k = 100, km = 1261 and
FastProjectSearch.