[jira] [Commented] (MAHOUT-1385) Caching Encoders don't cache

2014-02-24 Thread Johannes Schulte (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910299#comment-13910299
 ] 

Johannes Schulte commented on MAHOUT-1385:
--

Hi Manoj.

My point was that these classes are (hopefully) meant to be a performance
improvement, which would be the case if the string's hash code could be reused
(because the Java String object caches its own hash code).

Using the byte[] hash code is evil because it depends on the reference.
Using another library / hashing strategy for the values inside the byte array
is nonsense, because that computation is exactly what we are trying to cache,
if I understand the will of the creator correctly.

The more I think about it - was this ever correct? Using the string hash code
as a lookup key for the murmurHash-based location? There will be different
collisions, leading to different results than without caching, which should
be avoided.
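
To illustrate, a minimal sketch of the caching behaviour I mean (class, field
and method names are all hypothetical, not the actual Mahout code): keying the
cache by the String itself - rather than by its raw hash code - still reuses
the JDK's internally cached String.hashCode() for the map lookup, but avoids
the extra collisions described above:

import java.util.HashMap;
import java.util.Map;

public class CachingEncoderSketch {

  private static final int NUM_PROBES = 2; // assumed probe count, illustrative

  // term -> precomputed murmur-based vector locations, one per probe
  private final Map<String, int[]> probeCache = new HashMap<String, int[]>();

  int hashForProbe(String originalForm, int dataSize, String name, int probe) {
    int[] locations = probeCache.get(originalForm); // uses the cached String hash
    if (locations == null) {
      locations = new int[NUM_PROBES];
      for (int p = 0; p < NUM_PROBES; p++) {
        // stand-in for the expensive murmur-based hashing we want to cache
        locations[p] = expensiveHash(originalForm, name, p) % dataSize;
      }
      probeCache.put(originalForm, locations);
    }
    return locations[probe];
  }

  private int expensiveHash(String term, String name, int seed) {
    // placeholder only - the real code would call MurmurHash here
    return ((name + term).hashCode() * 31 + seed) & Integer.MAX_VALUE;
  }
}

With this shape, the full hash is computed once per distinct term; repeated
terms cost only a map lookup.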

 Caching Encoders don't cache
 

 Key: MAHOUT-1385
 URL: https://issues.apache.org/jira/browse/MAHOUT-1385
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Johannes Schulte
Priority: Minor
 Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch


 The Caching... line of encoders contains code for caching the hash codes of
 terms added to the vector. However, the method hashForProbe inside these
 classes is never called, as its signature has String for the parameter
 originalForm (instead of byte[] like the other encoders).
 Changing this to byte[], however, would lose the Java String's internal
 caching of its hash code, which is used as a key in the cache map, triggering
 another hash code calculation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Streaming KMeans clustering

2013-12-30 Thread Johannes Schulte
Right. Up until now I'm helping myself with some minDf truncation, since I
am using tf-idf weighted vectors anyway and have the idf counts at hand.
But having a true loss-driven sparsification might be better.


On Sun, Dec 29, 2013 at 11:43 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Johannes,

 One thing that might be of real interest is something that I haven't tried,
 nor read about.  It should still be interesting.

 The idea is L_1 regularized clustering.  This will give you (hopefully)
 sparse centroids that still are useful.  In your example, where you want
 defensible (aka salable) clusters, it would make the clusters much easier
 to understand.  It would also make them take less memory.
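
 A minimal sketch of one way to do that (soft-thresholding after each
 centroid update; all names are illustrative, this is not code from Mahout):

 public class SparseCentroids {

   /**
    * Soft-threshold operator, the proximal step for an L_1 penalty:
    * shrinks every coordinate towards zero by lambda and clips the small
    * ones to exactly zero, yielding a sparse centroid. Apply after each
    * centroid update.
    */
   static double[] softThreshold(double[] centroid, double lambda) {
     double[] sparse = new double[centroid.length];
     for (int i = 0; i < centroid.length; i++) {
       double v = centroid[i];
       if (v > lambda) {
         sparse[i] = v - lambda;
       } else if (v < -lambda) {
         sparse[i] = v + lambda;
       }
       // coordinates with |v| <= lambda stay exactly 0.0 -> sparsity
     }
     return sparse;
   }
 }

 A larger lambda gives sparser (and cheaper to store) centroids, at the cost
 of some distortion.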



 On Sun, Dec 29, 2013 at 1:55 PM, Johannes Schulte 
 johannes.schu...@gmail.com wrote:

  Ted, thanks for the long response! I agree with you on the benefit of a lot
  of clusters. The reason I chose 10 is not because I think there are truly
  10 clusters in the data (there are probably thousands; it is e-commerce
  site behavioural data), but for technical reasons:

  - the cluster distances are used in a probability prediction model with
  rare-events data, so I want every cluster to contain at least some positive
  examples
  - the cluster centroids need to be kept in memory online for real-time
  feature vector generation and OLR scoring, as the vectors are quite big
  and there are ~100 clients to handle simultaneously (100 clients * 10
  clusters * ~1 non-zero features per cluster centroid * 8 bytes per vector
  element)
  - when selling the clusters, visualized, it's easier to plot something
  with only 10 clusters

  I'll give it another try with more clusters in the sketch phase and see
  what I can achieve. Thanks for your help!
 
  i'll give it another try with more clusters on the sketch phase and see
  what i can achieve. thanks for your help!
 
 
  On Sun, Dec 29, 2013 at 1:32 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  
   On Sat, Dec 28, 2013 at 1:10 PM, Johannes Schulte 
   johannes.schu...@gmail.com wrote:
  
    Okay, understood. For a lot of clusters (which I don't necessarily
    attribute to big data problems, but definitely to nearest-neighbour-like
    usage of clusters), the "every cluster sees every point" approach doesn't
    scale well.
  
  
  
   Nearest neighbor sorts of problems.  Or market segmentation kinds of
   problems.  Or feature generation kinds of problems.  Or volume
  quantization
   kinds of problems.
  
   The point is that with more data, you can derive a more detailed model.
That means more clusters.
  
    The ideal case is that we have an infinite mixture model to describe the
    data distribution, but we haven't seen most of the mixture components
    yet. As we get more data, we can justify saying we have more components.
  
    Even if you think that there are *really* 10 clusters in the data, I can
    make the case that you are going to get a better unsupervised description
    of those clusters by using 100 or more clusters in the k-means algorithm.
    The idea is that each of your real clusters will be composed of several
    of the k-means clusters, and finding the attachments is much easier
    because 100 clusters is much smaller than the original number of samples.
  
    As a small data example, suppose you are looking at Anderson's Iris data,
    which is available built into R. If we plot the 150 data points against
    the features we have in pairs, we can see various patterns and see that a
    non-linear classifier should do quite well in separating the classes (I
    hope the image makes it through the emailer):
  
   [image: Inline image 1]
  
    But if we try to do k-means on these data with only 3 clusters, we get
    very poor assignment to the three species:

      k = kmeans(iris[,1:4], centers=3, nstart=10)
      table(iris$Species, k$cluster)

                  cluster
                   1  2  3
      setosa      50  0  0
      versicolor   0 48  2
      virginica    0 14 36
  
    Cluster 1 captured the isolated setosa species rather well, but
    versicolor and virginica are not well separated, because cluster 2 is
    about 80% versicolor and 20% virginica.
  
    On the other hand, if we use 7 clusters,

      k = kmeans(iris[,1:4], centers=7, nstart=10)
      table(iris$Species, k$cluster)

                  cluster
                   1  2  3  4  5  6  7
      setosa       0  0 28  0 22  0  0
      versicolor   0  7  0 20  0  0 23
      virginica   12  0  0  1  0 24 13
  
    Each cluster is now composed of almost exactly one species. Only cluster
    4 has any impurity, and it is 95% composed of just versicolor samples.

    What this means is that we can use the 7-cluster k-means results to
    build a classifier that has a 1-of-7 input feature (cluster id) instead
    of 4 real values. That is, we have compressed the 4 original continuous
    features down to about 2.7 bits on average, and this compressed
    representation actually makes building a classifier nearly trivial.
  
      -sum(table(k$cluster)/150 * log2(table(k$cluster)/150))

Re: Streaming KMeans clustering

2013-12-28 Thread Johannes Schulte
Okay, understood. For a lot of clusters (which I don't necessarily
attribute to big data problems, but definitely to nearest-neighbour-like
usage of clusters), the "every cluster sees every point" approach doesn't
scale well.

However, for 1000 (final) clusters I see around 200 distance measure
calculations per point - a lot better than 1000. And another thing that
makes me think is that the test runs A LOT faster than with 10 final
clusters.


On Sat, Dec 28, 2013 at 1:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Of course, this is just for one pass of k-means.  If you need 4 passes, you
 have break-even.

 More typically for big data problems, k=1000 or some such.  The total number
 of distance computations per point for streaming k-means will still be about
 40 (or adjust to the more theory-motivated value of log k + log log N = 10 +
 5, and then adjust with a bit of fudge for the real world).

 For k-means in that case, you still have 1000 distances to compute per pass
 and multiple passes to do.  That ratio then becomes something more like
 10,000 / 40 = 250.
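
 Spelled out with illustrative numbers (k = 1000, and N large enough that
 log2(k) ~ 10 and log2(log2(N)) ~ 5; the 10 passes are assumed, not from
 the thread):

   streaming k-means:  ~10 + 5 = 15 distance computations per point,
                       fudged up to ~40 in practice
   plain k-means:      1000 distances per point per pass; with ~10 passes
                       that is ~10,000 per point
   ratio:              10,000 / 40 = 250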



 On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte 
 johannes.schu...@gmail.com wrote:

  I updated the repository (with the typo)

  g...@github.com:baunz/cluster-comprarison.git

  to include more logging information about the number of times the distance
  measure calculation is triggered (which is the most expensive thing, imo).
  The factor of distance measure calculations per point seen is about 40 for
  streaming k-means and 10 for regular k-means (because there are 10
  clusters).

  This is of course dependent on the searchSize parameter, but I used the
  default value of 2.
 
 
 
  On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm isa...@apache.org
  wrote:
 
  
   Hi Dan,
  
  
   On Fri, 27 Dec 2013 14:13:51 +0200
   Dan Filimon dfili...@apache.org wrote:
Thoughts?
  
   First of all - good to see you back on dev@ :)
  
    Seems a few people have run into these issues. As currently there is no
    high-level documentation for the whole streaming k-means implementation
    - would you mind writing up the limitations and advice you have for users
    of this algorithm? It doesn't need to be anything fancy - essentially a
    "here's how you compute how much memory you need to run this, here are
    the limitations and the flags to deal with them, here are things that
    should be changed or fixed in a later iteration" - unless your previous
    mail covers all of this already. This could save people a few debugging
    cycles when getting started with this at scale.
  
   Feel free to get it into our web page (if you are short in time, just
   write something up using markdown, I can take over publishing it).
  
   Isabel
  
 



Re: Streaming KMeans clustering

2013-12-27 Thread Johannes Schulte
I updated the repository (with the typo)

g...@github.com:baunz/cluster-comprarison.git

to include more logging information about the number of times the distance
measure calculation is triggered (which is the most expensive thing, imo).
The factor of distance measure calculations per point seen is about 40 for
streaming k-means and 10 for regular k-means (because there are 10
clusters).

This is of course dependent on the searchSize parameter, but I used the
default value of 2.
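
If anyone wants to reproduce the counting, here is a sketch of how I'd
instrument it (assuming the Mahout 0.8 DistanceMeasure API; the counting
class itself is made up): subclass a concrete measure and count invocations.

import java.util.concurrent.atomic.AtomicLong;

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class CountingEuclideanDistanceMeasure extends EuclideanDistanceMeasure {

  // shared counter; divide by points seen to get the factor discussed above
  public static final AtomicLong CALLS = new AtomicLong();

  @Override
  public double distance(Vector v1, Vector v2) {
    CALLS.incrementAndGet();
    return super.distance(v1, v2);
  }

  @Override
  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    CALLS.incrementAndGet();
    return super.distance(centroidLengthSquare, centroid, v);
  }
}

Pass an instance of this measure to both clusterers and read CALLS.get()
afterwards.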



On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm isa...@apache.org wrote:


 Hi Dan,


 On Fri, 27 Dec 2013 14:13:51 +0200
 Dan Filimon dfili...@apache.org wrote:
  Thoughts?

 First of all - good to see you back on dev@ :)

 Seems a few people have run into these issues. As currently there is no
 high-level documentation for the whole streaming k-means implementation
 - would you mind writing up the limitations and advice you have for users
 of this algorithm? It doesn't need to be anything fancy - essentially a
 "here's how you compute how much memory you need to run this, here are the
 limitations and the flags to deal with them, here are things that should
 be changed or fixed in a later iteration" - unless your previous mail
 covers all of this already. This could save people a few debugging
 cycles when getting started with this at scale.

 Feel free to get it into our web page (if you are short in time, just
 write something up using markdown, I can take over publishing it).

 Isabel



Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
Hi,

I also had problems getting up to speed, but I blamed the cardinality of the
vectors for that. I didn't do the math exactly, but while streaming k-means
improves over regular k-means in using log(k) and (number of datapoints / k)
passes, the dimension parameter d from the original k*d*n stays untouched,
right?

What is your vectors' cardinality?


On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Ted,

 What were the CLI parameters when you ran this test for 1M points - no. of
 clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?







 On Tuesday, December 24, 2013 4:23 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 For reference, on a 16 core machine, I was able to run the sequential
 version of streaming k-means on 1,000,000 points, each with 10 dimensions
 in about 20 seconds.  The map-reduce versions are comparable subject to
 scaling except for startup time.



 On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
 wrote:

  That the algorithm runs a single reducer is expected. The algorithm
  creates a sketch of the data in parallel in the map-phase, which is
  collected by the reducer afterwards. The reducer then applies an
  expensive in-memory clustering algorithm to the sketch.
 
  Which dataset are you using for testing? I can also do some tests on a
  cluster here.
 
  I can imagine two possible causes for the problems: Maybe there's a
  problem with the vectors and some calculations take very long because
  the wrong access pattern or implementation is chosen.
 
  Another problem could be that the mappers and reducers have too little
  memory and spend a lot of time running garbage collections.
 
  --sebastian
 
 
  On 23.12.2013 22:14, Suneel Marthi wrote:
   Has anyone been successful running Streaming KMeans clustering on a large
  dataset (> 100,000 points)?


   It just seems to take a very long time (> 4 hrs) for the mappers to
  finish on about 300K data points, and the reduce phase has only a single
  reducer running, which throws an OOM, failing the job several hours after
  the job has been kicked off.

   It's the same story when trying to run in sequential mode.

   Looking at the code, the bottleneck seems to be in
  StreamingKMeans.clusterInternal(); without understanding the behaviour of
  the algorithm I am not sure if the sequence of steps in there is correct.


   There are a few calls that call themselves repeatedly over and over again,
  like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

   We really need to have this working on datasets that are larger than the
  20K Reuters dataset.

   I am trying to run this on 300K vectors with k = 100, km = 1261 and
  FastProjectSearch.
  
 
 



Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
Hey Sebastian,

it was a text-like clustering problem with a dimensionality of 100,000; the
number of data points could have been millions, but I always cancelled
it after a while (I used the Java classes, not the command line version, and
monitored the progress).

As for my statements above: they are possibly not quite correct. Sure, the
projection search reduces the amount of searching needed, but by the time I
looked into the code, I identified two problems, if I remember correctly:

- the searching of pending additions
- the projection itself


But I'll have to retry that and look into the code again. I ended up using
the old k-means code on a sample of the data...

cheers,

johannes


On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter s...@apache.org wrote:

 Hi Johannes,

 can you share some details about the dataset that you ran streaming
 k-means on (number of datapoints, cardinality, etc)?

 @Ted/Suneel Shouldn't the approximate searching techniques (e.g.
 projection search) help cope with high dimensional inputs?

 --sebastian


  On 25.12.2013 10:42, Johannes Schulte wrote:
   Hi,

   I also had problems getting up to speed, but I blamed the cardinality of
   the vectors for that. I didn't do the math exactly, but while streaming
   k-means improves over regular k-means in using log(k) and (number of
   datapoints / k) passes, the dimension parameter d from the original
   k*d*n stays untouched, right?

   What is your vectors' cardinality?
 
 
  On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
  Ted,
 
  What were the CLI parameters when you ran this test for 1M points - no.
 of
  clusters k, km, distanceMeasure, projectionSearch,
 estimatedDistanceCutoff?
 
 
 
 
 
 
 
  On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
 ted.dunn...@gmail.com
  wrote:
 
  For reference, on a 16 core machine, I was able to run the sequential
  version of streaming k-means on 1,000,000 points, each with 10
 dimensions
  in about 20 seconds.  The map-reduce versions are comparable subject to
  scaling except for startup time.
 
 
 
  On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
  wrote:
 
  That the algorithm runs a single reducer is expected. The algorithm
  creates a sketch of the data in parallel in the map-phase, which is
  collected by the reducer afterwards. The reducer then applies an
  expensive in-memory clustering algorithm to the sketch.
 
  Which dataset are you using for testing? I can also do some tests on a
  cluster here.
 
  I can imagine two possible causes for the problems: Maybe there's a
  problem with the vectors and some calculations take very long because
  the wrong access pattern or implementation is chosen.
 
  Another problem could be that the mappers and reducers have too little
  memory and spend a lot of time running garbage collections.
 
  --sebastian
 
 
  On 23.12.2013 22:14, Suneel Marthi wrote:
  Has anyone been successful running Streaming KMeans clustering on a large
  dataset (> 100,000 points)?


  It just seems to take a very long time (> 4 hrs) for the mappers to
  finish on about 300K data points, and the reduce phase has only a single
  reducer running, which throws an OOM, failing the job several hours after
  the job has been kicked off.

  It's the same story when trying to run in sequential mode.

  Looking at the code, the bottleneck seems to be in
  StreamingKMeans.clusterInternal(); without understanding the behaviour of
  the algorithm I am not sure if the sequence of steps in there is correct.


  There are a few calls that call themselves repeatedly over and over again,
  like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

  We really need to have this working on datasets that are larger than the
  20K Reuters dataset.

  I am trying to run this on 300K vectors with k = 100, km = 1261 and
  FastProjectSearch.
 
 
 
 
 




Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
Everybody should have the right to do

job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");

for that :)


For my problems, I always felt the sketching took too long. I put up a
simple comparison here:

g...@github.com:baunz/cluster-comprarison.git

It generates some sample vectors and clusters them with regular k-means
and streaming k-means, both sequentially. I took 10 k-means iterations as a
benchmark and used the default values for FastProjectionSearch from the
k-means driver class.

VisualVM tells me the most time is spent in FastProjectionSearch.remove(),
which is called for every added data point.

Maybe I got something wrong, but for these sparse, high-dimensional vectors
I never got streaming k-means faster than the regular version.




On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Not sure how that would work in a corporate setting wherein there's a
 fixed systemwide setting that cannot be overridden.

 Sent from my iPhone

  On Dec 25, 2013, at 9:44 AM, Sebastian Schelter s...@apache.org
 wrote:
 
  On 25.12.2013 14:19, Suneel Marthi wrote:
 
 
 
 
 
  On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
 ted.dunn...@gmail.com wrote:
 
  For reference, on a 16 core machine, I was able to run the sequential
  version of streaming k-means on 1,000,000 points, each with 10
 dimensions
  in about 20 seconds.  The map-reduce versions are comparable subject
 to
  scaling except for startup time.
 
  @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8?
 Not sure how this would have even worked for you in sequential mode in light
 of the issues reported against M-1314, M-1358, M-1380 (all of which impact
 the sequential mode); unless you had fixed them locally.
  What were your estimatedDistanceCutoff, number of clusters 'k', and
 projection search, and how much memory did you have to allocate to the
 single Reducer?
 
  If I read the source code correctly, the final reducer clusters the
  sketch, which should contain m * k * log n intermediate centroids, where
  k is the number of desired clusters, m is the number of mappers run, and
  n is the number of datapoints. Those centroids are expected to be dense,
  so we can estimate the memory required for the final reducer using this
  formula.
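
  Plugging in illustrative numbers (all assumed here, just to show the
  shape of the estimate): with m = 100 mappers, k = 1,000 clusters,
  n = 10^9 points (log2 n ~ 30), dense centroids of d = 1,000 dimensions,
  and 8 bytes per double:

    centroids ~ m * k * log2(n) = 100 * 1,000 * 30 = 3,000,000
    memory    ~ 3,000,000 * 1,000 * 8 bytes ~ 24 GB

  which already exceeds a typical reducer heap; with d = 100 it drops to
  ~2.4 GB.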
 
 
 
 
 
  On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter s...@apache.org
 wrote:
 
  That the algorithm runs a single reducer is expected. The algorithm
  creates a sketch of
  the data in parallel in the map-phase, which is
  collected by the reducer afterwards. The reducer then applies an
  expensive in-memory clustering algorithm to the sketch.
 
  Which dataset are you using for testing? I can also do some tests on a
  cluster here.
 
  I can imagine two possible causes for the problems: Maybe there's a
  problem with the vectors and some calculations take very long because
  the wrong access pattern or implementation is chosen.
 
   Another problem could be that the mappers and reducers have too little
   memory and spend a lot of time running garbage collections.
 
  --sebastian
 
 
   On 23.12.2013 22:14, Suneel Marthi wrote:
   Has anyone been successful running Streaming KMeans clustering on a large
   dataset (> 100,000 points)?


   It just seems to take a very long time (> 4 hrs) for the mappers to
   finish on about 300K data points, and the reduce phase has only a single
   reducer running, which throws an OOM, failing the job several hours after
   the job has been kicked off.

   It's the same story when trying to run in sequential mode.

   Looking at the code, the bottleneck seems to be in
   StreamingKMeans.clusterInternal(); without understanding the behaviour of
   the algorithm I am not sure if the sequence of steps in there is correct.


   There are a few calls that call themselves repeatedly over and over again,
   like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

   We really need to have this working on datasets that are larger than the
   20K Reuters dataset.

   I am trying to run this on 300K vectors with k = 100, km = 1261 and
   FastProjectSearch.
 



Re: Streaming KMeans clustering

2013-12-25 Thread Johannes Schulte
To be honest, I always cancelled the sketching after a while because I
wasn't satisfied with the points-per-second speed. The version used is the
0.8 release.

If I find the time, I'm going to look at what is called when, where and how
often, and what the problem could be.


On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.

 You didn't run multiple sketching passes, did you?


 Also, which version?  There was a horrendous clone in there at one time.




 On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte 
 johannes.schu...@gmail.com wrote:

  Everybody should have the right to do

  job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");

  for that :)
 
 
  For my problems, I always felt the sketching took too long. I put up a
  simple comparison here:

  g...@github.com:baunz/cluster-comprarison.git

  It generates some sample vectors and clusters them with regular k-means
  and streaming k-means, both sequentially. I took 10 k-means iterations as
  a benchmark and used the default values for FastProjectionSearch from the
  k-means driver class.

  VisualVM tells me the most time is spent in FastProjectionSearch.remove(),
  which is called for every added data point.

  Maybe I got something wrong, but for these sparse, high-dimensional
  vectors I never got streaming k-means faster than the regular version.
 
 
 
 
  On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi suneel_mar...@yahoo.com
  wrote:
 
   Not sure how that would work in a corporate setting wherein there's a
   fixed systemwide setting that cannot be overridden.
  
   Sent from my iPhone
  
On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter s...@apache.org
   wrote:
   
On 25.12.2013 14:19, Suneel Marthi wrote:
   
   
   
   
   
On Tuesday, December 24, 2013 4:23 PM, Ted Dunning 
   ted.dunn...@gmail.com wrote:
   
For reference, on a 16 core machine, I was able to run the
  sequential
version of streaming k-means on 1,000,000 points, each with 10
   dimensions
in about 20 seconds.  The map-reduce versions are comparable
 subject
   to
scaling except for startup time.
   
 @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8?
    Not sure how this would have even worked for you in sequential mode in
    light of the issues reported against M-1314, M-1358, M-1380 (all of
    which impact the sequential mode); unless you had fixed them locally.
 What were your estimatedDistanceCutoff, number of clusters 'k', and
    projection search, and how much memory did you have to allocate to the
    single Reducer?
   
 If I read the source code correctly, the final reducer clusters the
 sketch, which should contain m * k * log n intermediate centroids, where
 k is the number of desired clusters, m is the number of mappers run, and
 n is the number of datapoints. Those centroids are expected to be dense,
 so we can estimate the memory required for the final reducer using this
 formula.
   
   
   
   
   
On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter 
 s...@apache.org
   wrote:
   
That the algorithm runs a single reducer is expected. The algorithm
creates a sketch of
the data in parallel in the map-phase, which is
collected by the reducer afterwards. The reducer then applies an
expensive in-memory clustering algorithm to the sketch.
   
Which dataset are you using for testing? I can also do some tests
 on
  a
cluster here.
   
 I can imagine two possible causes for the problems: Maybe there's a
 problem with the vectors and some calculations take very long because
 the wrong access pattern or implementation is chosen.

 Another problem could be that the mappers and reducers have too little
 memory and spend a lot of time running garbage collections.
   
--sebastian
   
   
 On 23.12.2013 22:14, Suneel Marthi wrote:
 Has anyone been successful running Streaming KMeans clustering on a large
 dataset (> 100,000 points)?


 It just seems to take a very long time (> 4 hrs) for the mappers to
 finish on about 300K data points, and the reduce phase has only a single
 reducer running, which throws an OOM, failing the job several hours after
 the job has been kicked off.

 It's the same story when trying to run in sequential mode.

 Looking at the code, the bottleneck seems to be in
 StreamingKMeans.clusterInternal(); without understanding the behaviour of
 the algorithm I am not sure if the sequence of steps in there is correct.


 There are a few calls that call themselves repeatedly over and over again,
 like StreamingKMeans.clusterInternal() and Searcher.searchFirst().

 We really need to have this working on datasets that are larger than the
 20K Reuters dataset.

 I am trying to run this on 300K vectors with k = 100, km = 1261 and
 FastProjectSearch.
   
  
 



[jira] [Created] (MAHOUT-1385) Caching Encoders don't cache

2013-12-20 Thread Johannes Schulte (JIRA)
Johannes Schulte created MAHOUT-1385:


 Summary: Caching Encoders don't cache
 Key: MAHOUT-1385
 URL: https://issues.apache.org/jira/browse/MAHOUT-1385
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Johannes Schulte
Priority: Minor


The Caching... line of encoders contains code for caching the hash codes of
terms added to the vector. However, the method hashForProbe inside these
classes is never called, as its signature has String for the parameter
originalForm (instead of byte[] like the other encoders).

Changing this to byte[], however, would lose the Java String's internal
caching of its hash code, which is used as a key in the cache map, triggering
another hash code calculation.





--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (MAHOUT-1385) Caching Encoders don't cache

2013-12-20 Thread Johannes Schulte (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Schulte updated MAHOUT-1385:
-

Attachment: MAHOUT-1385-test.patch

No solution, but a demonstration of the defect.

 Caching Encoders don't cache
 

 Key: MAHOUT-1385
 URL: https://issues.apache.org/jira/browse/MAHOUT-1385
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Johannes Schulte
Priority: Minor
 Attachments: MAHOUT-1385-test.patch


 The Caching... line of encoders contains code for caching the hash codes of
 terms added to the vector. However, the method hashForProbe inside these
 classes is never called, as its signature has String for the parameter
 originalForm (instead of byte[] like the other encoders).
 Changing this to byte[], however, would lose the Java String's internal
 caching of its hash code, which is used as a key in the cache map, triggering
 another hash code calculation.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries

2013-11-15 Thread Johannes Schulte (JIRA)
Johannes Schulte created MAHOUT-1357:


 Summary: InteractionValueEncoder produces wrong traceDictionary 
entries
 Key: MAHOUT-1357
 URL: https://issues.apache.org/jira/browse/MAHOUT-1357
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.8
Reporter: Johannes Schulte
Priority: Minor


In the trace code, the byte values of the terms being hashed are not converted
back to a string but just concatenated in their raw form with Arrays.toString().

This makes the reverse engineering even harder!

The fix is to just create a new string; see the patch attached.
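
The idea of the fix, as a sketch (names illustrative, not the exact patch):
decode the term bytes back into a String before building the trace dictionary
key, instead of Arrays.toString(bytes), which renders as "[104, 101, ...]":

import java.nio.charset.StandardCharsets;

public class TraceKeys {

  // Illustrative only: build a readable trace key from the raw term bytes.
  static String traceKey(String name, byte[] term1, byte[] term2) {
    return name + ':'
        + new String(term1, StandardCharsets.UTF_8) + ':'
        + new String(term2, StandardCharsets.UTF_8);
  }
}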



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries

2013-11-15 Thread Johannes Schulte (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Schulte updated MAHOUT-1357:
-

Status: Patch Available  (was: Open)

 InteractionValueEncoder produces wrong traceDictionary entries
 --

 Key: MAHOUT-1357
 URL: https://issues.apache.org/jira/browse/MAHOUT-1357
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.8
Reporter: Johannes Schulte
Priority: Minor
 Attachments: MAHOUT-1357.patch


 In the trace code, the byte values of the terms being hashed are not converted
 back to a string but just concatenated in their raw form with Arrays.toString().
 This makes the reverse engineering even harder!
 The fix is to just create a new string; see the patch attached.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries

2013-11-15 Thread Johannes Schulte (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Schulte updated MAHOUT-1357:
-

Attachment: MAHOUT-1357.patch

 InteractionValueEncoder produces wrong traceDictionary entries
 --

 Key: MAHOUT-1357
 URL: https://issues.apache.org/jira/browse/MAHOUT-1357
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.8
Reporter: Johannes Schulte
Priority: Minor
 Attachments: MAHOUT-1357.patch


 In the trace code, the byte values of the terms being hashed are not converted
 back to a string but just concatenated in their raw form with Arrays.toString().
 This makes the reverse engineering even harder!
 The fix is to just create a new string; see the patch attached.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1357) InteractionValueEncoder produces wrong traceDictionary entries

2013-11-15 Thread Johannes Schulte (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823719#comment-13823719
 ] 

Johannes Schulte commented on MAHOUT-1357:
--

Thanks!

 InteractionValueEncoder produces wrong traceDictionary entries
 --

 Key: MAHOUT-1357
 URL: https://issues.apache.org/jira/browse/MAHOUT-1357
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.8
Reporter: Johannes Schulte
Assignee: Suneel Marthi
Priority: Minor
 Fix For: 0.9

 Attachments: MAHOUT-1357.patch


 In the trace code, the byte values of the terms being hashed are not converted
 back to a string but just concatenated in their raw form with Arrays.toString().
 This makes the reverse engineering even harder!
 The fix is to just create a new string; see the patch attached.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


FeatureVectorEncoder Extension

2013-05-31 Thread Johannes Schulte
Hi,

I created an extension of the feature vector encoder framework that allows
a byte array offset and length to be passed in. Some questions remain
before creating an issue and attaching a diff:


1. When using Sun conventions with 2 spaces (the link is broken, by the
way), which line length should I choose? I'm using Eclipse, and my code looks
somewhat different with the 80-character line length.

2. I extended all existing methods that take a byte[] array to also take an
offset and a length; the old byte[] methods stay the same. After implementing

public void addInteractionToVector(byte[] originalForm1, int offset1, int length1,
    byte[] originalForm2, int offset2, int length2, double weight, Vector data)

I thought it would maybe be smarter to use a ByteBuffer for passing in the
byte array, offset and length. A ByteBuffer is created EVERY time inside
the MurmurHash class anyway, so it wouldn't produce any more objects. Any
comments / wishes?
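
For illustration, the ByteBuffer variant could look like this (signature
hypothetical, nothing committed) - a ByteBuffer carries array, offset and
length in one object, and callers can wrap without copying:

import java.nio.ByteBuffer;

import org.apache.mahout.math.Vector;

interface InteractionEncoderSketch {

  // hypothetical replacement for the (byte[], offset, length) overloads
  void addInteractionToVector(ByteBuffer originalForm1,
                              ByteBuffer originalForm2,
                              double weight,
                              Vector data);
}

A caller would then pass ByteBuffer.wrap(bytes, offset, length), and the
murmur-based hashing could consume the buffer between position() and limit().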


Cheers,


Johannes