[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

2010-02-18 Thread zhao zhendong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao zhendong updated MAHOUT-232:
-

Attachment: SVMonMahout0.5.1.patch

MapReduce/MapReduceUtil.java
should have been mapreduce/MapReduceUtil.java - the folders are NOT in camel 
case, yet I still see camel casing everywhere.
>> Done. Changed MapReduce -> mapreduce, ParallelAlgorithms -> 
>> parallelalgorithms, and SequentialAlgorithms -> sequentialalgorithms

+  public static final String DEFAULT_HDFS_SERVER = "hdfs://localhost:12009";
+  // For HBASE
+  public static final String DEFAULT_HBASE_SERVER = "localhost:6";
These are read from the hadoop conf and hbase configuration files. Mahout 
shouldn't be doing any sort of configuration internally.
>> Done. The hard-coded Hadoop and HBase configuration has been removed. The 
>> default HDFS and HBase settings in SVMParameters are only runtime defaults 
>> for the MapReduce application.

No System.out.println; use the Logger instead
>> Done.

HDFSConfig.java, HDFSReader.java - do away with any hdfs configuration in the 
code. As I said, opening a FileSystem using the Configuration object would 
in turn decide between local fs or hdfs based on the execution context
>> Yes, the sequential algorithms follow the principle you mentioned: they 
>> choose the file system according to whether the "hdfs" parameter is given 
>> in the training and prediction procedures. HDFSReader serves only the 
>> sequential algorithms, not the parallel algorithms based on the 
>> Map/Reduce framework. 


> Implementation of sequential SVM solver based on Pegasos
> 
>
> Key: MAHOUT-232
> URL: https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.4
>Reporter: zhao zhendong
> Fix For: 0.4
>
> Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, 
> SequentialSVM_0.3.patch, SequentialSVM_0.4.patch, SVMDataset.patch, 
> SVMonMahout0.5.1.patch, SVMonMahout0.5.patch
>
>
> After discussing with people in this community, I decided to re-implement a 
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout 
> command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will 
> support HDFS. 
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
> ---
> Currently, this package provides (Features):
> ---
> 1. Sequential SVM linear solver, including training and testing.
> 2. Supports both the general file system and HDFS.
> 3. Supports large-scale data set training. Because Pegasos only needs to 
> sample a certain number of examples, this package pre-fetches a bounded 
> number of samples (e.g., the maximum iteration count) into memory.
> For example: if the data set has 100,000,000 samples, since the default 
> maximum iteration count is 10,000, the package randomly loads only 10,000 
> samples into memory.
> 4. Sequential data set testing, so the package supports large-scale data 
> sets for both training and testing.
> 5. Supports parallel classification (testing phase only) based on the 
> Map-Reduce framework.
> 6. Supports multi-classification based on the Map-Reduce framework (fully 
> parallelized version).
> 7. Supports regression.
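The bounded pre-fetch described in feature 3 above can be sketched as follows. This is an illustrative sketch only; the class and method names are invented, not the patch's actual code.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch: instead of loading all examples, choose a bounded
// number of distinct sample indices up front and read only those lines.
public class SampleIndexChooser {

  // Floyd's sampling algorithm: picks exactly sampleSize distinct indices
  // uniformly from [0, datasetSize) without materializing the whole range,
  // which matters when the data set has 10^8 lines.
  public static Set<Integer> choose(int datasetSize, int sampleSize, Random rnd) {
    Set<Integer> picked = new TreeSet<Integer>();
    for (int j = datasetSize - sampleSize; j < datasetSize; j++) {
      int t = rnd.nextInt(j + 1);
      // if t was already picked, j itself is guaranteed fresh
      picked.add(picked.contains(t) ? j : t);
    }
    return picked;
  }
}
```

The chosen index set can then drive a single sequential pass over the training file, keeping only the matching lines in memory.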
> ---
> TODO:
> ---
> 1. Multi-classification Probability Prediction
> 2. Performance Testing
> ---
> Usage:
> ---
> >>
> Classification:
> >>
> 
> @@ Training: @@
> 
> SVMPegasosTraining.java
> The default argument is:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m 
> ../examples/src/test/resources/svmdataset/SVM.model
> ~~
> @ For the case that the training data set is on HDFS: @
> ~~
> 1. Make sure your training data set has been uploaded to HDFS:
> hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2. Revise the arguments:
> -tr /user/hadoop/train.dat -m 
> ../examples/src/test/resources/svmdataset/SVM.model -hdfs 
> hdfs://localhost:12009
> ~~
> @ Multi-class Training [Based on MapReduce Framework]:@

Re: Welcome Drew Farris

2010-02-18 Thread deneche abdelhakim
Welcome Drew

=D

On Fri, Feb 19, 2010 at 5:02 AM, Grant Ingersoll  wrote:
>
> On Feb 18, 2010, at 8:32 PM, Drew Farris wrote:
>
>>  There's lots more stuff I'd like to get in there,
>> now I only need to figure how to squeeze 48 hours of consciousness
>> into a day.
>
> I believe there is a compression algorithm for that.
>


[jira] Updated: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable

2010-02-18 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-299:
---

Status: Patch Available  (was: Open)

> Collocations: improve performance by making Gram BinaryComparable
> -
>
> Key: MAHOUT-299
> URL: https://issues.apache.org/jira/browse/MAHOUT-299
> Project: Mahout
>  Issue Type: Improvement
>  Components: Utils
>Affects Versions: 0.3
>Reporter: Drew Farris
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-299.patch
>
>
> Robin's profiling indicated that a large portion of a run was spent in 
> readFields() in Gram due to the deserialization occurring as a part of Gram 
> comparisons for sorting. He pointed me to BinaryComparable and the 
> implementation in Text.
> Like Text, in this new implementation, Gram stores its string in binary form. 
> When encoding the string at construction time we allocate an extra 
> character's worth of data to hold the Gram type information. When sorting 
> Grams, the binary arrays are compared instead of deserializing and comparing 
> fields.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable

2010-02-18 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-299:
---

Attachment: MAHOUT-299.patch

Patch as described above:

Included other cleanups:

* Gram is no longer mutable, except in the case of readFields of course.
* Added explicit NGRAM type, remove constructors that implicitly set type.
* Added unit tests for constructors, writability. One should be added for 
sortability/comparison.
* Better unigram handling in the mappers/reducers (no need to setType on these 
anymore)
* Switched to adjustOrPutValue when accumulating frequencies in 
OpenObjectIntHashMaps
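The adjustOrPutValue change collapses a get-then-put round trip per gram into a single call. For contrast, the pattern being replaced looks like this with a plain java.util.HashMap (used here only so the sketch is self-contained; the patch itself uses Mahout's OpenObjectIntHashMap):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative frequency accumulation: the boxed HashMap version below needs
// two hash probes per gram (a get, then a put); adjustOrPutValue does the
// same accumulation in one call on a primitive-valued map.
public class FreqCount {

  public static Map<String, Integer> count(String[] grams) {
    Map<String, Integer> freq = new HashMap<String, Integer>();
    for (String g : grams) {
      Integer prev = freq.get(g);                // probe 1
      freq.put(g, prev == null ? 1 : prev + 1);  // probe 2, plus boxing
    }
    return freq;
  }
}
```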

Also, NGramCollector, NGramCollectorTest should be removed from the repo. They 
are no longer relevant. Applying this patch with -E will empty and erase these 
files, but it's up to svn to do the rest.



> Collocations: improve performance by making Gram BinaryComparable
> -
>
> Key: MAHOUT-299
> URL: https://issues.apache.org/jira/browse/MAHOUT-299
> Project: Mahout
>  Issue Type: Improvement
>  Components: Utils
>Affects Versions: 0.3
>Reporter: Drew Farris
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-299.patch
>
>
> Robin's profiling indicated that a large portion of a run was spent in 
> readFields() in Gram due to the deserialization occurring as a part of Gram 
> comparisons for sorting. He pointed me to BinaryComparable and the 
> implementation in Text.
> Like Text, in this new implementation, Gram stores its string in binary form. 
> When encoding the string at construction time we allocate an extra 
> character's worth of data to hold the Gram type information. When sorting 
> Grams, the binary arrays are compared instead of deserializing and comparing 
> fields.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable

2010-02-18 Thread Drew Farris (JIRA)
Collocations: improve performance by making Gram BinaryComparable
-

 Key: MAHOUT-299
 URL: https://issues.apache.org/jira/browse/MAHOUT-299
 Project: Mahout
  Issue Type: Improvement
  Components: Utils
Affects Versions: 0.3
Reporter: Drew Farris
Priority: Minor
 Fix For: 0.3


Robin's profiling indicated that a large portion of a run was spent in 
readFields() in Gram due to the deserialization occurring as a part of Gram 
comparisons for sorting. He pointed me to BinaryComparable and the 
implementation in Text.

Like Text, in this new implementation, Gram stores its string in binary form. 
When encoding the string at construction time we allocate an extra character's 
worth of data to hold the Gram type information. When sorting Grams, the binary 
arrays are compared instead of deserializing and comparing fields.
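The byte-level comparison strategy described above can be sketched as follows. All names are invented for illustration; the patch's actual Gram class will differ.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: the gram text is stored UTF-8 encoded with one extra
// trailing byte holding the gram type, so sorting can compare raw bytes
// without deserializing any fields.
public class GramSketch implements Comparable<GramSketch> {

  private final byte[] bytes;

  public GramSketch(String text, byte type) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    bytes = new byte[utf8.length + 1];  // one extra byte for the type
    System.arraycopy(utf8, 0, bytes, 0, utf8.length);
    bytes[utf8.length] = type;
  }

  // Lexicographic comparison of unsigned bytes, in the spirit of
  // Hadoop's WritableComparator.compareBytes.
  @Override
  public int compareTo(GramSketch other) {
    byte[] a = bytes;
    byte[] b = other.bytes;
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }
}
```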

 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Welcome Drew Farris

2010-02-18 Thread Grant Ingersoll

On Feb 18, 2010, at 8:32 PM, Drew Farris wrote:

>  There's lots more stuff I'd like to get in there,
> now I only need to figure how to squeeze 48 hours of consciousness
> into a day.

I believe there is a compression algorithm for that.


Re: Welcome Drew Farris

2010-02-18 Thread zhao zhendong
Hi Drew,

Congrats, and thank you for all the help with the dependencies stuff.

Cheers,
Zhendong

On Fri, Feb 19, 2010 at 6:27 AM, Drew Farris  wrote:

> Hi Grant, fellow Mahouts,
>
> Thanks for the chance to join the team. I really look forward to
> contributing my skills to the project and learning a great deal as
> well.
>
> So, a little bit about myself;
>
> It all started with an Apple //+ back in 1982. Growing up, I never
> thought I'd do something serious with computers. In college I studied
> Computer Graphics in the Art School, Architecture and ended up getting
> a Masters in Information Resource Management on top of that.
>
> Since then I've been a software developer who has brushed up
> against information retrieval, search and NLP for many years. I got my
> start in search and content management working as a web-developer for
> a newspaper in the early days of the Internet.
>
> As Grant mentioned, I've worked at TextWise for a number of years. The
> company grew out of a NLP-oriented research group headed by Liz Liddy
> at Syracuse University and continues to focus on the commercial
> applications of text-oriented technologies albeit with a more
> statistical orientation as of late.
>
> While at TextWise, I've worked on projects ranging from
> cross-language IR to contextual advertising. Mostly I've been involved
> in developing the glue that holds the core algorithms together,
> helping them scale and combining the various moving parts of a system
> into a cohesive whole. I've had a chance to do everything from web
> crawling, document processing, database, visualization, web-app and
> distributed systems work. To that end, I've worked on and off with
> Lucene, Nutch, and many other projects from the Apache ecosystem for
> years.
>
> Reading "Programming Collective Intelligence" a couple years back
> really solidified my interest in machine learning algorithms. After
> building a number of different systems to process large amounts of
> content, the ability to quickly and effortlessly scale things up with
> hadoop/mapreduce really appeals to me. The Mahout project is
> wonderful to me in that it combines the things I'm interested in
> personally, has relevance to the things I do for work and has a really
> outstanding group of people working on it.
>
> I'm looking forward to working with you all,
>
> Drew
>
> On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil  wrote:
> > Welcome Drew
> >
> > @Grant: No customary introduction? :)
> >
> > Robin
> >
> > On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll  >wrote:
> >
> >> On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the
> >> newest member of the Mahout committer family.  Drew has been
> contributing
> >> some really nice work to Mahout in recent months and I look forward to
> his
> >> continuing involvement with Mahout.
> >>
> >> Congrats, Drew!
> >>
> >>
> >> -Grant
> >
>



-- 
-

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>

Department of Computer Science
School of Computing
National University of Singapore

>>><><><><><><><><<><>><><<


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Ted Dunning
Can't remember.  I suppose I should look at the code.  :-)

On Thu, Feb 18, 2010 at 6:19 PM, Jake Mannix  wrote:

> Don't we already have generalized scalar aggregation?




-- 
Ted Dunning, CTO
DeepDyve


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
Don't we already have generalized scalar aggregation?  I thought I committed
that a while back.  Its very useful for inner products, distances, and
stats.

Vector accumulation using a BinaryFunction as a map just needs to be made
more efficient (sparsity and random accessibility taken into account), but
works.

The only remaining piece is something like accumulate(Vector v,
BinaryFunction map, BinaryFunction aggregator) - a method on Matrix, which
aggregates partial map() combinations of each row with the input Vector, and
returns a Vector.  This generalizes times(Vector).  I guess
Matrix.assign(Vector v, BinaryFunction map) could be useful for mutating a
matrix, but on HDFS would operate by making new sequencefiles.
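A rough sketch of the proposed accumulate, with plain arrays standing in for Mahout's Matrix and Vector (every name here is illustrative, not an actual Mahout API):

```java
// Illustrative sketch of the proposed Matrix.accumulate: combine each row's
// elements with the input vector via `map`, fold the per-element results with
// `aggregator`, and emit one output entry per row. With map = times and
// aggregator = plus this reduces to times(Vector), an ordinary matrix-vector
// product.
public class Accumulate {

  interface BinaryFunction {
    double apply(double a, double b);
  }

  static double[] accumulate(double[][] matrix, double[] v,
                             BinaryFunction map, BinaryFunction aggregator) {
    double[] out = new double[matrix.length];
    for (int row = 0; row < matrix.length; row++) {
      // seed with the first mapped pair; an empty row would need an
      // explicit unit for the aggregator
      double acc = map.apply(matrix[row][0], v[0]);
      for (int col = 1; col < v.length; col++) {
        acc = aggregator.apply(acc, map.apply(matrix[row][col], v[col]));
      }
      out[row] = acc;
    }
    return out;
  }
}
```

A sparse implementation would iterate non-zero elements of each row instead of every column.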

  -jake

On Feb 18, 2010 5:11 PM, "Ted Dunning"  wrote:

On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix  wrote:
> What would this metho...
This method would apply the mapFunction to each corresponding pair of
elements from the two vectors and then aggregate the results using the
aggregatorFunction.

The unit is the unit of the aggregator and would only be needed if the
vectors have no entries.  We could probably do without it.

This could be a static function or could be a method on vectorA.  Putting
the method on vectorA would probably be better because it could drive many
common optimizations.

Examples of this pattern include sum-squared-difference (agg = plus, map =
compose(sqr, minus)), dot (agg = plus, map = times).

This can be composed with a temporary output vector or sometimes by mutating
one of the operands.  This is not as desirable as just accumulating the
results on the fly, however.

The reason why we need a specialized function is to do things in a nicely >
mutating way: Hadoop M...
We definitely need that too.

> The only thing more we need than what we have now is in the assign method
> - > currently we ha...
That can work, but very often requires an extra copy of the vector as in the
distance case that Robin brought up.  The contract there says neither
operand can be changed which forces a vector copy in the current API.  A
mapReduce operation in addition to a map would allow us to avoid that
important case.


Re: Welcome Drew Farris

2010-02-18 Thread Drew Farris
On Thu, Feb 18, 2010 at 7:45 PM, Jake Mannix  wrote:
> Welcome Drew!  I've been using your excellent colloc code quite a bit
> in testing my svd stuff (produces nicely bigger vectors out of text!),
> looking
> forward to more cool stuff (NLP package!  Bring it on! :) ).
>

Heh, great to hear! There's lots more stuff I'd like to get in there,
now I only need to figure how to squeeze 48 hours of consciousness
into a day.


[jira] Created: (MAHOUT-298) 2 test case fails while trying to mvn clean install after checking out revision 911542 of trunk

2010-02-18 Thread Anish Shah (JIRA)
2 test case fails while trying to mvn clean install after checking out revision 
911542 of trunk
---

 Key: MAHOUT-298
 URL: https://issues.apache.org/jira/browse/MAHOUT-298
 Project: Mahout
  Issue Type: Test
  Components: Clustering
Affects Versions: 0.3
 Environment: Windows 7 with Cygwin
Reporter: Anish Shah
Priority: Minor


I checked out revision 911542 from trunk and seeing the following failed tests 
when I ran mvn clean install:

Results :

Failed tests:
  testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
  testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)

Tests run: 338, Failures: 2, Errors: 0, Skipped: 0

[INFO] 
[ERROR] BUILD FAILURE
[INFO] 
[INFO] There are test failures.

I looked in the surefire-reports and see the following details on the failures:

testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)  Time 
elapsed: 11.169 sec  <<< FAILURE!
junit.framework.AssertionFailedError: clusters[3] expected:<4> but was:<2>
at junit.framework.Assert.fail(Assert.java:47)
...

testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
  Time elapsed: 3.35 sec  <<< FAILURE!
junit.framework.AssertionFailedError: num points[0] expected:<4> but was:<1>
at junit.framework.Assert.fail(Assert.java:47)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Anish Shah
I have created https://issues.apache.org/jira/browse/MAHOUT-298 to track
this.

On Thu, Feb 18, 2010 at 6:59 PM, Ted Dunning  wrote:

> Darn.  That uses up all of my ideas.
>
> I would vote for a platform issue.
>
> On Thu, Feb 18, 2010 at 3:41 PM, Anish Shah  wrote:
>
> > I checked out revision 911542 (after removing the mahout-trunk from my
> > local
> > machine) and
> > tried again and still getting the same 2 failures upon running mvn clean
> > install!
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Ted Dunning
On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix  wrote:

> What would this method mean?  aggregatorUnit means what?  What would this
> be a method on?
>

This method would apply the mapFunction to each corresponding pair of
elements from the two vectors and then aggregate the results using the
aggregatorFunction.

The unit is the unit of the aggregator and would only be needed if the
vectors have no entries.  We could probably do without it.

This could be a static function or could be a method on vectorA.  Putting
the method on vectorA would probably be better because it could drive many
common optimizations.

Examples of this pattern include sum-squared-difference (agg = plus, map =
compose(sqr, minus)), dot (agg = plus, map = times).

This can be composed with a temporary output vector or sometimes by mutating
one of the operands.  This is not as desirable as just accumulating the
results on the fly, however.
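On plain double arrays, the map/aggregate pattern described above can be sketched as follows (illustrative only; Mahout's actual Vector API and function objects are not assumed):

```java
// Illustrative sketch: apply `map` to each corresponding pair of elements
// and fold the results with `agg`, never materializing an intermediate
// vector.
public class PairAggregate {

  interface BinaryFunction {
    double apply(double a, double b);
  }

  static double aggregate(double[] x, double[] y,
                          BinaryFunction agg, BinaryFunction map) {
    // seeding with the first pair plays the role of the aggregator's unit;
    // empty vectors would need the explicit unit discussed above
    double result = map.apply(x[0], y[0]);
    for (int i = 1; i < x.length; i++) {
      result = agg.apply(result, map.apply(x[i], y[i]));
    }
    return result;
  }
}
```

With agg = plus and map = times this computes dot; with map = (a, b) -> (a - b) * (a - b) it computes sum-squared-difference with no temporary vector.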

 The reason why we need a specialized function is to do things in a nicely
> mutating way: Hadoop M/R is functional in the Lispy sense: read-only
> immutable objects (once on the filesystem).
>

We definitely need that too.


>  The only thing more we need than what we have now is in the assign method
> -
> currently we have it with a map, with reduce being the identity (with
> replacement -
> the calling object becomes the output of the reduce -ie the output of the
> map):
>

That can work, but very often requires an extra copy of the vector as in the
distance case that Robin brought up.  The contract there says neither
operand can be changed which forces a vector copy in the current API.  A
mapReduce operation in addition to a map would allow us to avoid that
important case.


Re: Welcome Drew Farris

2010-02-18 Thread Jake Mannix
Welcome Drew!  I've been using your excellent colloc code quite a bit
in testing my svd stuff (produces nicely bigger vectors out of text!),
looking
forward to more cool stuff (NLP package!  Bring it on! :) ).

  -jake

On Thu, Feb 18, 2010 at 2:27 PM, Drew Farris  wrote:

> Hi Grant, fellow Mahouts,
>
> Thanks for the chance to join the team. I really look forward to
> contributing my skills to the project and learning a great deal as
> well.
>
> So, a little bit about myself;
>
> It all started with an Apple //+ back in 1982. Growing up, I never
> thought I'd do something serious with computers. In college I studied
> Computer Graphics in the Art School, Architecture and ended up getting
> a Masters in Information Resource Management on top of that.
>
> Since then I've been a software developer who has brushed up
> against information retrieval, search and NLP for many years. I got my
> start in search and content management working as a web-developer for
> a newspaper in the early days of the Internet.
>
> As Grant mentioned, I've worked at TextWise for a number of years. The
> company grew out of a NLP-oriented research group headed by Liz Liddy
> at Syracuse University and continues to focus on the commercial
> applications of text-oriented technologies albeit with a more
> statistical orientation as of late.
>
> While at TextWise, I've worked on projects ranging from
> cross-language IR to contextual advertising. Mostly I've been involved
> in developing the glue that holds the core algorithms together,
> helping them scale and combining the various moving parts of a system
> into a cohesive whole. I've had a chance to do everything from web
> crawling, document processing, database, visualization, web-app and
> distributed systems work. To that end, I've worked on and off with
> Lucene, Nutch, and many other projects from the Apache ecosystem for
> years.
>
> Reading "Programming Collective Intelligence" a couple years back
> really solidified my interest in machine learning algorithms. After
> building a number of different systems to process large amounts of
> content, the ability to quickly and effortlessly scale things up with
> hadoop/mapreduce really appeals to me. The Mahout project is
> wonderful to me in that it combines the things I'm interested in
> personally, has relevance to the things I do for work and has a really
> outstanding group of people working on it.
>
> I'm looking forward to working with you all,
>
> Drew
>
> On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil  wrote:
> > Welcome Drew
> >
> > @Grant: No customary introduction? :)
> >
> > Robin
> >
> > On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll  >wrote:
> >
> >> On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the
> >> newest member of the Mahout committer family.  Drew has been
> contributing
> >> some really nice work to Mahout in recent months and I look forward to
> his
> >> continuing involvement with Mahout.
> >>
> >> Congrats, Drew!
> >>
> >>
> >> -Grant
> >
>


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
On Thu, Feb 18, 2010 at 3:58 PM, Ted Dunning  wrote:

> Actually, this makes the case that we should have something like:
>
> microMapReduce(aggregatorFunction, aggregatorUnit, binaryMapFunction,
> vectorA, vectorB)
>

What would this method mean?  aggregatorUnit means what?  What would this
be a method on?

The reason why we need a specialized function is to do things in a nicely
mutating way: Hadoop M/R is functional in the Lispy sense: read-only
immutable
objects (once on the filesystem).

The only thing more we need than what we have now is in the assign method -
currently we have it with a map, with reduce being the identity (with
replacement -
the calling object becomes the output of the reduce -ie the output of the
map):

  Vector.assign(Vector other, BinaryFunction map) {
    // implemented effectively as follows in AbstractVector
    Iterator<Element> it = sparse ? other.iterateNonZero() : other.iterateAll();
    while (it.hasNext()) {
      Element e = it.next();
      int i = e.index();
      setQuick(i, map.apply(getQuick(i), e.get()));
    }
    // do stuff with the reduce - what exactly?
    return this;
  }

(is the reduce necessary?)


  -jake


Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Ted Dunning
Darn.  That uses up all of my ideas.

I would vote for a platform issue.

On Thu, Feb 18, 2010 at 3:41 PM, Anish Shah  wrote:

> I checked out revision 911542 (after removing the mahout-trunk from my
> local
> machine) and
> tried again and still getting the same 2 failures upon running mvn clean
> install!
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Ted Dunning
Actually, this makes the case that we should have something like:

 microMapReduce(aggregatorFunction, aggregatorUnit, binaryMapFunction,
vectorA, vectorB)

The name should be changed after its rhetorical effect has worn off.  As the
Chukwa guys tend to say, its turtles all the way down.  We can have
map-reduce inside map-reduce.

On Thu, Feb 18, 2010 at 3:41 PM, Robin Anil  wrote:

> TODO: sum of minus to be optimised without having to hold the intermediate
> vector.
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Ted Dunning
Yes.  addTo is just a specialization of a very, very common case.

On Thu, Feb 18, 2010 at 1:06 PM, Sean Owen  wrote:

> Isn't this basically what assign() is for?
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
TODO: sum of minus to be optimised without having to hold the intermediate
vector.


Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Anish Shah
Ted,

I checked out revision 911542 (after removing the mahout-trunk from my local
machine) and
tried again and still getting the same 2 failures upon running mvn clean
install!

Anish

On Thu, Feb 18, 2010 at 11:51 AM, Ted Dunning  wrote:

> Note the different version number here.
>
> I think that Anish has somehow gotten stuck on an old version.  Anish, can
> you do a clean checkout and build?
>
> On Thu, Feb 18, 2010 at 6:16 AM, Robin Anil  wrote:
>
> > I am building Revision: 911405 on a Mac, and things work fine for me. I
> am
> > assuming same is the case for sean(mac)
> >
> > ...
> >
> >
> > On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah  wrote:
> >
> > ...
> > > $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
> > > Checked out revision 911364.
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

The last one had a correctness problem in the Manhattan distance. This one fixes it.



> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> 
>
> Key: MAHOUT-297
> URL: https://issues.apache.org/jira/browse/MAHOUT-297
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.4
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, 
> MAHOUT-297.patch, MAHOUT-297.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Welcome Drew Farris

2010-02-18 Thread Ted Dunning
We have already enjoyed working with you and look forward to more of it.
Good to have you on board.

On Thu, Feb 18, 2010 at 2:27 PM, Drew Farris  wrote:

> I'm looking forward to working with you all,




-- 
Ted Dunning, CTO
DeepDyve


Re: Welcome Drew Farris

2010-02-18 Thread Drew Farris
Hi Grant, fellow Mahouts,

Thanks for the chance to join the team. I really look forward to
contributing my skills to the project and learning a great deal as
well.

So, a little bit about myself;

It all started with an Apple //+ back in 1982. Growing up, I never
thought I'd do something serious with computers. In college I studied
Computer Graphics in the Art School, Architecture and ended up getting
a Masters in Information Resource Management on top of that.

Since then I've been a software developer who has brushed up
against information retrieval, search and NLP for many years. I got my
start in search and content management working as a web-developer for
a newspaper in the early days of the Internet.

As Grant mentioned, I've worked at TextWise for a number of years. The
company grew out of a NLP-oriented research group headed by Liz Liddy
at Syracuse University and continues to focus on the commercial
applications of text-oriented technologies albeit with a more
statistical orientation as of late.

While at TextWise, I've worked on projects ranging from
cross-language IR to contextual advertising. Mostly I've been involved
in developing the glue that holds the core algorithms together,
helping them scale and combining the various moving parts of a system
into a cohesive whole. I've had a chance to do everything from web
crawling, document processing, database, visualization, web-app and
distributed systems work. To that end, I've worked on and off with
Lucene, Nutch, and many other projects from the Apache ecosystem for
years.

Reading "Programming Collective Intelligence" a couple years back
really solidified my interest in machine learning algorithms. After
building a number of different systems to process large amounts of
content, the ability to quickly and effortlessly scale things up with
hadoop/mapreduce really appeals to me. The Mahout project is
wonderful to me in that it combines the things I'm interested in
personally, has relevance to the things I do for work and has a really
outstanding group of people working on it.

I'm looking forward to working with you all,

Drew

On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil  wrote:
> Welcome Drew
>
> @Grant: No customary introduction? :)
>
> Robin
>
> On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll wrote:
>
>> On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the
>> newest member of the Mahout committer family.  Drew has been contributing
>> some really nice work to Mahout in recent months and I look forward to his
>> continuing involvement with Mahout.
>>
>> Congrats, Drew!
>>
>>
>> -Grant
>


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
I have made all the changes; take a look. The same could be done for fuzzy 
k-means, Dirichlet, and LDA. Haven't had time to look at the internals yet.



On Fri, Feb 19, 2010 at 3:35 AM, Robin Anil  wrote:

> 2 second canopy clustering over reuters :D
>
>
>
> On Fri, Feb 19, 2010 at 3:33 AM, Robin Anil  wrote:
>
>> This really doesn't work here; I can't modify any vectors inside a distance
>> measure. So I have written a subtract inside Manhattan distance itself. Works
>> great for now
>>
>>
>> On Fri, Feb 19, 2010 at 3:10 AM, Jake Mannix wrote:
>>
>>> currentVector.assign(otherVector, minus) takes the other vector, and
>>> subtracts
>>> it from currentVector, which mutates currentVector.  If currentVector is
>>> DenseVector,
>>> this is already optimized.  It could be optimized if currentVector is
>>> RandomAccessSparse.
>>>
>>>  -jake
>>>
>>> On Thu, Feb 18, 2010 at 1:29 PM, Robin Anil 
>>> wrote:
>>>
>>> > Just to be clear, this does:
>>> > currentVector-otherVector ?
>>> >
>>> > currentVector.assign(otherVector, Functions.minus);
>>> >
>>> >
>>> >
>>> > On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix 
>>> > wrote:
>>> >
>>> > > to do subtractFrom, you can instead just do
>>> > >
>>> > >  Vector.assign(otherVector, Functions.minus);
>>> > >
>>> > > The problem is that while DenseVector has an optimization here: if
>>> the
>>> > > BinaryFunction passed in is additive (it's an instance of PlusMult),
>>> > > sparse iteration over "otherVector" is executed, applying the binary
>>> > > function and mutating self.  AbstractVector should have this
>>> optimization
>>> > > in general, as it would be useful in RandomAccessSparseVector
>>> (although
>>> > > not terribly useful in SequentialAccessSparseVector, but still better
>>> > than
>>> > > current).
>>> > >
>>> > >  -jake
>>> > >
>>> > > On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil 
>>> > wrote:
>>> > >
>>> > > > I just had to change it at one place(and the tests pass, which is
>>> > scary).
>>> > > > Canopy is really fast now :). Still could be pushed
>>> > > > Now the bottleneck is minus
>>> > > >
>>> > > > maybe a subtractFrom on the lines of addTo? or a mutable negate
>>> > function
>>> > > > for
>>> > > > vector, before adding to
>>> > > >
>>> > > > Robin
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix <
>>> jake.man...@gmail.com>
>>> > > > wrote:
>>> > > >
>>> > > > > I use it (addTo) in decomposer, for exactly this performance
>>> issue.
>>> > > > > Changing
>>> > > > > plus into addTo requires care, because since plus() leaves
>>> arguments
>>> > > > > immutable,
>>> > > > > there may be code which *assumes* that this is the case, and
>>> doing
>>> > > > addTo()
>>> > > > > leaves side effects which might not be expected.  This bit me
>>> hard on
>>> > > svd
>>> > > > > migration, because I had other assumptions about mutability in
>>> there.
>>> > > > >
>>> > > > >  -jake
>>> > > > >
>>> > > > > On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil <
>>> robin.a...@gmail.com>
>>> > > > wrote:
>>> > > > >
>>> > > > > > ah! Its not being used anywhere :). Should we make that a big
>>> task
>>> > > > before
>>> > > > > > 0.3 ? Sweep through code(mainly clustering) and change all
>>> these
>>> > > > things.
>>> > > > > >
>>> > > > > > Robin
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen 
>>> > wrote:
>>> > > > > >
>>> > > > > > > Isn't this basically what assign() is for?
>>> > > > > > >
>>> > > > > > > On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil <
>>> > robin.a...@gmail.com>
>>> > > > > > wrote:
>>> > > > > > > > Now the big perf bottle neck is immutability
>>> > > > > > > >
>>> > > > > > > > Say for plus its doing vector.clone() before doing anything
>>> > else.
>>> > > > > > > > There should be both immutable and mutable plus functions
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>


[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

Changed the Euclidean distance measure to iterate over v2 and random-access v1
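The iterate-one-vector, access-the-other pattern from this patch comment can be sketched as follows. This is not the Mahout code itself; it is a minimal illustration (with an assumed index-to-value map standing in for a sparse vector) of computing Euclidean distance via d(v1, v2)^2 = |v1|^2 + |v2|^2 - 2 * (v1 . v2), so only v2's nonzero entries are visited for the dot product:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch (assumed shapes, not Mahout's classes) of the pattern in the patch
// comment: iterate v2's nonzero entries, random-access v1, and use
//   d(v1, v2)^2 = |v1|^2 + |v2|^2 - 2 * (v1 . v2)
// so the loop cost depends only on v2's sparsity.
class EuclideanSketch {
    static double distance(double[] v1, double v1NormSq,
                           Map<Integer, Double> v2, double v2NormSq) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : v2.entrySet()) {
            dot += v1[e.getKey()] * e.getValue(); // access v1, iterate v2
        }
        return Math.sqrt(v1NormSq + v2NormSq - 2.0 * dot);
    }

    public static void main(String[] args) {
        double[] v1 = {3, 0, 4};                 // |v1|^2 = 25
        Map<Integer, Double> v2 = new TreeMap<>();
        v2.put(0, 3.0);                          // v2 = (3, 0, 0), |v2|^2 = 9
        System.out.println(distance(v1, 25.0, v2, 9.0)); // 4.0
    }
}
```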

> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> 
>
> Key: MAHOUT-297
> URL: https://issues.apache.org/jira/browse/MAHOUT-297
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.4
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, 
> MAHOUT-297.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

Improvements in TanimotoDistanceMeasure

> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> 
>
> Key: MAHOUT-297
> URL: https://issues.apache.org/jira/browse/MAHOUT-297
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.4
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch
>
>





[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

Really fast now.

> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> 
>
> Key: MAHOUT-297
> URL: https://issues.apache.org/jira/browse/MAHOUT-297
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.4
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: MAHOUT-297.patch, MAHOUT-297.patch
>
>





Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
2-second canopy clustering over Reuters :D




Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
This really doesn't work here; I can't modify any vectors inside a distance
measure. So I have written a subtract inside Manhattan distance itself. Works
great for now.



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
currentVector.assign(otherVector, minus) takes the other vector and subtracts
it from currentVector, which mutates currentVector.  If currentVector is a
DenseVector, this is already optimized.  It could also be optimized when
currentVector is RandomAccessSparse.

  -jake
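The mutating-versus-immutable distinction above can be sketched in a few lines. This is not the Mahout Vector API; it is a minimal stand-in class showing the semantics: assign(other, minus) writes into the receiver, while minus(other) clones first and leaves the receiver untouched:

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

// Minimal sketch (NOT the real Mahout Vector API) of mutating assign()
// versus immutable minus().
class SketchVector {
    final double[] values;

    SketchVector(double... values) { this.values = values; }

    // Mutates this vector in place: this[i] = f(this[i], other[i]).
    SketchVector assign(SketchVector other, DoubleBinaryOperator f) {
        for (int i = 0; i < values.length; i++) {
            values[i] = f.applyAsDouble(values[i], other.values[i]);
        }
        return this;
    }

    // Immutable subtraction: clones first, so the receiver is untouched.
    SketchVector minus(SketchVector other) {
        SketchVector copy = new SketchVector(Arrays.copyOf(values, values.length));
        return copy.assign(other, (a, b) -> a - b);
    }

    public static void main(String[] args) {
        SketchVector current = new SketchVector(5, 3, 1);
        SketchVector other = new SketchVector(1, 1, 1);

        SketchVector diff = current.minus(other);            // clone + subtract
        System.out.println(Arrays.toString(diff.values));    // [4.0, 2.0, 0.0]
        System.out.println(Arrays.toString(current.values)); // [5.0, 3.0, 1.0]

        current.assign(other, (a, b) -> a - b);              // mutates current
        System.out.println(Arrays.toString(current.values)); // [4.0, 2.0, 0.0]
    }
}
```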



Re: Welcome Drew Farris

2010-02-18 Thread Grant Ingersoll

On Feb 18, 2010, at 4:05 PM, Robin Anil wrote:

> Welcome Drew
> 
> @Grant: No customary introduction? :)

Sorry, forgot that.  Drew, tradition is that new committers give a little
background on themselves.  I can add one tidbit: I worked w/ Drew way back when
at TextWise, so I'm glad he showed up here!

> 
> Robin




Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Just to be clear, this does:
currentVector - otherVector ?

currentVector.assign(otherVector, Functions.minus);





Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
To do subtractFrom, you can instead just do

  Vector.assign(otherVector, Functions.minus);

The problem is that only DenseVector has an optimization here: if the
BinaryFunction passed in is additive (it's an instance of PlusMult),
sparse iteration over "otherVector" is executed, applying the binary
function and mutating self.  AbstractVector should have this optimization
in general, as it would be useful in RandomAccessSparseVector (although
not terribly useful in SequentialAccessSparseVector, but still better than
the current behavior).

  -jake
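The additive-function optimization described above can be sketched as follows. This is not Mahout code; it assumes an index-to-value map as the sparse argument. The key property is that for an additive function f(x, 0) == x (true for plus and minus), so entries where the sparse argument is zero need never be touched:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

// Sketch (not Mahout code) of the optimization: when f(x, 0) == x,
// assign(other, f) only needs to visit other's nonzero entries instead of
// walking the full cardinality of the dense receiver.
class AdditiveAssign {
    // Dense receiver, sparse argument (index -> nonzero value).
    static void assignAdditive(double[] dense, Map<Integer, Double> sparseOther,
                               DoubleBinaryOperator f) {
        for (Map.Entry<Integer, Double> e : sparseOther.entrySet()) {
            int i = e.getKey();
            dense[i] = f.applyAsDouble(dense[i], e.getValue());
        }
        // Entries where sparseOther is zero stay untouched: f(x, 0) == x.
    }

    public static void main(String[] args) {
        double[] dense = {10, 10, 10, 10};
        Map<Integer, Double> sparse = new TreeMap<>();
        sparse.put(1, 3.0);
        sparse.put(3, 7.0);

        assignAdditive(dense, sparse, (a, b) -> a - b); // 2 updates, not 4
        System.out.println(java.util.Arrays.toString(dense)); // [10.0, 7.0, 10.0, 3.0]
    }
}
```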



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
I just had to change it in one place (and the tests pass, which is scary).
Canopy is really fast now :). It could still be pushed further; now the
bottleneck is minus.

Maybe a subtractFrom along the lines of addTo? Or a mutable negate function
for the vector, before adding to?

Robin





Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
I use it (addTo) in decomposer, for exactly this performance issue.  Changing
plus into addTo requires care: since plus() leaves its arguments immutable,
there may be code which *assumes* that this is the case, and doing addTo()
leaves side effects which might not be expected.  This bit me hard on the SVD
migration, because I had other assumptions about mutability in there.

  -jake



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Ah! It's not being used anywhere :). Should we make that a big task before
0.3? Sweep through the code (mainly clustering) and change all these things.

Robin





Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
addTo() is mutable plus.
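"addTo() is mutable plus" can be made concrete with a small sketch. These are not Mahout's actual methods; plain arrays stand in for vectors. plus() clones the receiver first (the cost Robin profiled), while addTo() writes this vector's entries straight into the target:

```java
import java.util.Arrays;

// Sketch (not the real Mahout API): plus() clones and leaves both arguments
// untouched; addTo() mutates the target with no allocation.
class AddToSketch {
    static double[] plus(double[] a, double[] b) {
        double[] result = Arrays.copyOf(a, a.length); // the clone cost
        for (int i = 0; i < b.length; i++) result[i] += b[i];
        return result;
    }

    static void addTo(double[] self, double[] target) {
        for (int i = 0; i < self.length; i++) target[i] += self[i]; // mutates target
    }

    public static void main(String[] args) {
        double[] a = {1, 2};
        double[] b = {10, 10};

        double[] sum = plus(a, b);
        System.out.println(Arrays.toString(sum)); // [11.0, 12.0]
        System.out.println(Arrays.toString(a));   // [1.0, 2.0] -- unchanged

        addTo(a, b);
        System.out.println(Arrays.toString(b));   // [11.0, 12.0] -- mutated
    }
}
```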



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Sean Owen
Isn't this basically what assign() is for?



Re: Welcome Drew Farris

2010-02-18 Thread Robin Anil
Welcome Drew

@Grant: No customary introduction? :)

Robin



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Grant Ingersoll
If it's as obvious a win as it sounds, I'd say 0.3.  We aren't in lockdown yet,
are we?

-Grant

On Feb 18, 2010, at 3:37 PM, Jake Mannix wrote:

> I dunno, we can file it for whenever, 0.4 and if it turns out it's a really
> easy
> change we can always commit it for 0.3.
> 
>  -jake
> 
> On Thu, Feb 18, 2010 at 12:29 PM, Robin Anil  wrote:
> 
>> File it for 0.3 ?
>> 
>> 
>> Robin
>> 
>> On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix 
>> wrote:
>> 
>>> On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil 
>> wrote:
>>> 
 I was trying out SeqAccessSparseVector on Canopy Clustering using
>>> Manhattan
 distance. I found performance to be really bad. So I profiled it with
 Yourkit(Thanks a lot for providing us free license)
 
 Since i was trying out manhattan distance, there were a lot of A-B
>> which
 created a lot of clone operation 5% of the total time
 there were also so many A+B for adding a point to the canopy to
>> average.
 this was also creating a lot of clone operations.  90% of the total
>> time
 
>>> 
>>> SequentialAccessSparseVector should only be used in a read-only fashion.
>>> If
>>> you are creating an average centroid which is sparse, but it is mutating,
>>> then it should be RandomAccessSparseVector.  The points which are being
>>> used
>>> to create it can be SequentialAccessSparseVector (if they themselves
>> never
>>> change), but then the method called should be
>>> SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
>>> exploits
>>> the fast sequential iteration of SeqAcc, and the fast random-access
>>> mutatability of RandAcc.
>>> 
>>> 
 
 So we definitely needs to improve that..
 
 For a small hack. I made the cluster centers RandomAccess Vector.
>> Things
 are fast again. I dont know whether to commit or not. But something to
>>> look
 into in 0.4?
 
>>> 
>>> Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so
>> we
>>> can see exactly what the change is?
>>> 
>>> -jake
>>> 
>> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
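The centroid pattern Jake describes in the quoted reply above can be sketched as follows. These are assumed stand-in types, not Mahout's classes: points stay read-only sparse (sequential iteration over nonzeros), the mutating center is a random-access structure, and each point adds itself into the center with no cloning:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (assumed types, not Mahout's) of the recommended centroid pattern:
// read-only sparse points, one random-access mutable center.
class CentroidSketch {
    // "addTo": sequential iteration over the point's nonzeros,
    // random-access mutation of the center.
    static void addTo(Map<Integer, Double> point, HashMap<Integer, Double> center) {
        for (Map.Entry<Integer, Double> e : point.entrySet()) {
            center.merge(e.getKey(), e.getValue(), Double::sum);
        }
    }

    static HashMap<Integer, Double> centroid(List<Map<Integer, Double>> points) {
        HashMap<Integer, Double> center = new HashMap<>();
        for (Map<Integer, Double> p : points) addTo(p, center); // no clones
        int n = points.size();
        center.replaceAll((k, v) -> v / n);
        return center;
    }

    public static void main(String[] args) {
        Map<Integer, Double> p1 = Map.of(0, 2.0, 5, 4.0);
        Map<Integer, Double> p2 = Map.of(0, 4.0);
        HashMap<Integer, Double> c = centroid(List.of(p1, p2));
        System.out.println(c.get(0) + " " + c.get(5)); // 3.0 2.0
    }
}
```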



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Now the big perf bottleneck is immutability.

Say, for plus, it's doing vector.clone() before doing anything else.
There should be both immutable and mutable plus functions.

Robin





Welcome Drew Farris

2010-02-18 Thread Grant Ingersoll
On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the newest 
member of the Mahout committer family.  Drew has been contributing some really 
nice work to Mahout in recent months and I look forward to his continuing 
involvement with Mahout.

Congrats, Drew!


-Grant

[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

Converts centers to random-access vectors on first creation

> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> 
>
> Key: MAHOUT-297
> URL: https://issues.apache.org/jira/browse/MAHOUT-297
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.4
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: MAHOUT-297.patch
>
>





[jira] Created: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)
Canopy and Kmeans clustering slows down on using SeqAccVector for center


 Key: MAHOUT-297
 URL: https://issues.apache.org/jira/browse/MAHOUT-297
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4
 Attachments: MAHOUT-297.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
I dunno, we can file it for whenever, say 0.4, and if it turns out it's a
really easy change we can always commit it for 0.3.

  -jake

On Thu, Feb 18, 2010 at 12:29 PM, Robin Anil  wrote:

> File it for 0.3 ?
>
>
> Robin
>
> On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix 
> wrote:
>
> > On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil 
> wrote:
> >
> > > I was trying out SeqAccessSparseVector on Canopy Clustering using
> > Manhattan
> > > distance. I found performance to be really bad, so I profiled it with
> > > YourKit (thanks a lot for providing us a free license).
> > >
> > > Since I was trying out Manhattan distance, there were a lot of A-B
> > > operations, which created a lot of clones (5% of the total time).
> > > There were also many A+B operations for adding a point to the canopy to
> > > compute the average; these also created a lot of clones (90% of the total time).
> > >
> >
> > SequentialAccessSparseVector should only be used in a read-only fashion.
> >  If
> > you are creating an average centroid which is sparse, but it is mutating,
> > then it should be RandomAccessSparseVector.  The points which are being
> > used
> > to create it can be SequentialAccessSparseVector (if they themselves
> never
> > change), but then the method called should be
> > SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
> > exploits
> > the fast sequential iteration of SeqAcc, and the fast random-access
> > mutability of RandAcc.
> >
> >
> > >
> > > So we definitely need to improve that.
> > >
> > > As a small hack, I made the cluster centers RandomAccess vectors. Things
> > > are fast again. I don't know whether to commit or not. But something to
> > > look into in 0.4?
> > >
> >
> > Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so
> we
> > can see exactly what the change is?
> >
> >  -jake
> >
>


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
File it for 0.3 ?


Robin

On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix  wrote:

> On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil  wrote:
>
> > I was trying out SeqAccessSparseVector on Canopy Clustering using
> Manhattan
> > distance. I found performance to be really bad, so I profiled it with
> > YourKit (thanks a lot for providing us a free license).
> >
> > Since I was trying out Manhattan distance, there were a lot of A-B
> > operations, which created a lot of clones (5% of the total time).
> > There were also many A+B operations for adding a point to the canopy to
> > compute the average; these also created a lot of clones (90% of the total time).
> >
>
> SequentialAccessSparseVector should only be used in a read-only fashion.
>  If
> you are creating an average centroid which is sparse, but it is mutating,
> then it should be RandomAccessSparseVector.  The points which are being
> used
> to create it can be SequentialAccessSparseVector (if they themselves never
> change), but then the method called should be
> SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
> exploits
> the fast sequential iteration of SeqAcc, and the fast random-access
> mutability of RandAcc.
>
>
> >
> > So we definitely need to improve that.
> >
> > As a small hack, I made the cluster centers RandomAccess vectors. Things
> > are fast again. I don't know whether to commit or not. But something to
> > look into in 0.4?
> >
>
> Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so we
> can see exactly what the change is?
>
>  -jake
>


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil  wrote:

> I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
> distance. I found performance to be really bad, so I profiled it with
> YourKit (thanks a lot for providing us a free license).
>
> Since I was trying out Manhattan distance, there were a lot of A-B
> operations, which created a lot of clones (5% of the total time).
> There were also many A+B operations for adding a point to the canopy to
> compute the average; these also created a lot of clones (90% of the total time).
>

SequentialAccessSparseVector should only be used in a read-only fashion.  If
you are creating an average centroid which is sparse, but it is mutating,
then it should be RandomAccessSparseVector.  The points which are being used
to create it can be SequentialAccessSparseVector (if they themselves never
change), but then the method called should be
SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this exploits
the fast sequential iteration of SeqAcc, and the fast random-access
mutability of RandAcc.


>
> So we definitely need to improve that.
>
> As a small hack, I made the cluster centers RandomAccess vectors. Things
> are fast again. I don't know whether to commit or not. But something to look
> into in 0.4?
>

Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so we
can see exactly what the change is?

  -jake
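[Editor's note] Jake's addTo pattern can be sketched with a toy model of the two access patterns. The classes below are hypothetical plain-Java stand-ins, not the actual Mahout 0.3 API: a sequential-access vector iterates its nonzeros cheaply but is treated as read-only, while a hash-backed random-access accumulator absorbs updates in expected O(1) per element, so streaming `seq.addTo(rand)` avoids the clones the profile showed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the two sparse vector flavors (NOT the real
// Mahout classes). SeqSparse: parallel sorted arrays, fast to iterate,
// treated as read-only. RandAccess: hash map, cheap to mutate.
final class SeqSparse {
    final int[] indices;    // sorted nonzero positions
    final double[] values;  // parallel values

    SeqSparse(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    // The addTo pattern: stream this vector's nonzeros into a mutable
    // random-access accumulator; neither operand is cloned.
    void addTo(RandAccess target) {
        for (int i = 0; i < indices.length; i++) {
            target.add(indices[i], values[i]);
        }
    }
}

final class RandAccess {
    final Map<Integer, Double> entries = new HashMap<>();

    void add(int index, double delta) {
        entries.merge(index, delta, Double::sum);  // expected O(1) update
    }

    double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }
}
```

Accumulating a centroid then becomes a loop of `point.addTo(center)` over read-only points into a single mutable `center`, rather than `center = center.plus(point)` with a clone per step.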


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Sorry about the attachment; see it here: http://yfrog.com/4epicture1pfp


On Fri, Feb 19, 2010 at 1:25 AM, Robin Anil  wrote:

> I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
> distance. I found performance to be really bad, so I profiled it with
> YourKit (thanks a lot for providing us a free license).
>
> Since I was trying out Manhattan distance, there were a lot of A-B
> operations, which created a lot of clones (5% of the total time).
> There were also many A+B operations for adding a point to the canopy to
> compute the average; these also created a lot of clones (90% of the total time).
>
> So we definitely need to improve that.
>
> As a small hack, I made the cluster centers RandomAccess vectors. Things
> are fast again. I don't know whether to commit or not. But something to look
> into in 0.4?
>
> Robin
>
>
>


Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
distance. I found performance to be really bad, so I profiled it with
YourKit (thanks a lot for providing us a free license).

Since I was trying out Manhattan distance, there were a lot of A-B
operations, which created a lot of clones (5% of the total time).
There were also many A+B operations for adding a point to the canopy to
compute the average; these also created a lot of clones (90% of the total time).

So we definitely need to improve that.

As a small hack, I made the cluster centers RandomAccess vectors. Things are
fast again. I don't know whether to commit or not. But something to look into
in 0.4?

Robin


Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Ted Dunning
Note the different version number here.

I think that Anish has somehow gotten stuck on an old version.  Anish, can
you do a clean checkout and build?

On Thu, Feb 18, 2010 at 6:16 AM, Robin Anil  wrote:

> I am building Revision: 911405 on a Mac, and things work fine for me. I am
> assuming the same is the case for Sean (Mac).
>
> ...
>
>
> On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah  wrote:
>
> ...
> > $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
> > Checked out revision 911364.
> >
>



-- 
Ted Dunning, CTO
DeepDyve


[jira] Closed: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil closed MAHOUT-296.
-


> TestClassifier takes correctLabel from filename instead of from the key
> ---
>
> Key: MAHOUT-296
> URL: https://issues.apache.org/jira/browse/MAHOUT-296
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-296.
---

Resolution: Fixed

> TestClassifier takes correctLabel from filename instead of from the key
> ---
>
> Key: MAHOUT-296
> URL: https://issues.apache.org/jira/browse/MAHOUT-296
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-296 started by Robin Anil.

> TestClassifier takes correctLabel from filename instead of from the key
> ---
>
> Key: MAHOUT-296
> URL: https://issues.apache.org/jira/browse/MAHOUT-296
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)
TestClassifier takes correctLabel from filename instead of from the key
---

 Key: MAHOUT-296
 URL: https://issues.apache.org/jira/browse/MAHOUT-296
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Robin Anil
I am building Revision: 911405 on a Mac, and things work fine for me. I am
assuming the same is the case for Sean (Mac).

One reason I assume an error could occur is that on Windows the output
directories don't get deleted if the filesystem locks them (for reasons I
cannot fathom). Could you try deleting the testdata and output directories
that get created and try again? Those directories could hold data used by the
KMeans test that is deleted before the canopy test.

Other than that, if you are building Mahout for usage, do: mvn clean
install -DskipTests=true

Robin


On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah  wrote:

> I tried again by first syncing from the trunk and running mvn install as
> shown below, and I am getting the same test failures! I am running this on
> a Windows 7 machine using Cygwin.
>
> $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
> Checked out revision 911364.
>
> $ mvn clean install
> .. lots of junit test successes and the following failure
>
> Results :
>
> Failed tests:
>  testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
>
>  
> testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmean
> sClustering)
>
> Tests run: 338, Failures: 2, Errors: 0, Skipped: 0
>
> On Thu, Feb 18, 2010 at 5:40 AM, Robin Anil  wrote:
>
> > Yeah me neither. Could you try syncing from the trunk
> >
> >
> >
> > On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen  wrote:
> >
> > > I'm not seeing any such failures myself, from head.
> > >
> > > In case Robin just fixed something, try again from head?
> > >
> > >
> > > On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah 
> wrote:
> > > > Hi,
> > > >
> > > > I am new to Mahout and going through the initial steps of setting the
> > > > development
> > > > environment on my machine. I checked out the latest code from trunk
> and
> > > > seeing
> > > > the following failed tests when I ran mvn clean install:
> > > >
> > >
> >
>


Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Anish Shah
I tried again by first syncing from the trunk and running mvn install as
shown below, and I am getting the same test failures! I am running this on a
Windows 7 machine using Cygwin.

$ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
Checked out revision 911364.

$ mvn clean install
.. lots of junit test successes and the following failure

Results :

Failed tests:
  testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
  testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmean
sClustering)

Tests run: 338, Failures: 2, Errors: 0, Skipped: 0

On Thu, Feb 18, 2010 at 5:40 AM, Robin Anil  wrote:

> Yeah me neither. Could you try syncing from the trunk
>
>
>
> On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen  wrote:
>
> > I'm not seeing any such failures myself, from head.
> >
> > In case Robin just fixed something, try again from head?
> >
> >
> > On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah  wrote:
> > > Hi,
> > >
> > > I am new to Mahout and going through the initial steps of setting the
> > > development
> > > environment on my machine. I checked out the latest code from trunk and
> > > seeing
> > > the following failed tests when I ran mvn clean install:
> > >
> >
>


Re: Fuzzy K Means

2010-02-18 Thread Robin Anil
Yeah, by killing the job in between, I find all the centers to be the same :(
which is the main problem.


On Thu, Feb 18, 2010 at 5:51 PM, Jeff Eastman wrote:

> It sounds like k-means is looping because point memberships are oscillating
> between two stable states. Try increasing the convergence delta and you will
> likely terminate.
>
>
> Robin Anil wrote:
>
>> Yeah, Canopy issue is sorted out. Was thinking of adding a flag to add a
>> point to a single canopy instead of adding it to all canopies. This would
>> help a lot on large datasets. There is no point in adding to all canopies;
>> you will get approximate clustering anyway.
>>
>> I have cleaned up most of SoftCluster. The error still exists. It seems to
>> be looping forever now. I will post a patch on the issue; please take a
>> look.
>>
>> Robin
>>
>> On Wed, Feb 17, 2010 at 3:35 PM, Jeff Eastman > >wrote:
>>
>>
>>
>>> Robin Anil wrote:
>>>
>>>
>>>
>>>>> Hadoop reuses the *same* instance whenever it uses readFields and I've
>>>>> been bitten more than once by assuming otherwise.
>>>>
>>>> Yep! That's our bug. Always assume mutability in Hadoop :). I will see
>>>> where the Writable is causing the error.
>>>> Best would be if we had some test data and a check to see whether the
>>>> algorithm is working.
>>> Good hunting. I notice that some of the code in the fuzzy MR unit test
>>> has
>>> been commented out but have not looked into it further.
>>>
>>> I assume also you have sorted out the canopy issue you were having?
>>>
>>> Jeff
>>>
>>>
>>>
>>
>>
>>
>
>

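[Editor's note] The reuse pitfall Robin and Jeff describe can be reproduced without Hadoop at all. The classes below are a hypothetical sketch, not Hadoop or Mahout code: the framework refills one Writable instance per record, so a consumer that keeps a reference sees every saved "center" collapse to the last record read (the all-centers-identical symptom above); a defensive copy of whatever outlives the current record fixes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of Hadoop's Writable reuse (not actual Hadoop code):
// the same backing array is refilled for every record read.
final class ReusedWritable {
    final double[] vector = new double[3];  // one buffer, reused per record

    void readFields(double[] next) {  // stand-in for Writable.readFields
        System.arraycopy(next, 0, vector, 0, vector.length);
    }
}

final class CenterCollector {
    final List<double[]> centers = new ArrayList<>();

    // Buggy: keeps a reference into the reused instance; every saved
    // center ends up aliasing the same array.
    void collectByReference(ReusedWritable w) {
        centers.add(w.vector);
    }

    // Correct: defensively copy what must outlive the current record.
    void collectByCopy(ReusedWritable w) {
        centers.add(w.vector.clone());
    }
}
```

With two records read, the by-reference collector holds two identical centers (both mutated to the last record), while the copying collector holds the two distinct values that were actually read.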

Re: Fuzzy K Means

2010-02-18 Thread Jeff Eastman
It sounds like k-means is looping because point memberships are 
oscillating between two stable states. Try increasing the convergence 
delta and you will likely terminate.


Robin Anil wrote:

> Yeah, Canopy issue is sorted out. Was thinking of adding a flag to add a
> point to a single canopy instead of adding it to all canopies. This would
> help a lot on large datasets. There is no point in adding to all canopies;
> you will get approximate clustering anyway.
>
> I have cleaned up most of SoftCluster. The error still exists. It seems to
> be looping forever now. I will post a patch on the issue; please take a
> look.
>
> Robin
>
> On Wed, Feb 17, 2010 at 3:35 PM, Jeff Eastman wrote:
>
>> Robin Anil wrote:
>>
>>>> Hadoop reuses the *same* instance whenever it uses readFields and I've
>>>> been bitten more than once by assuming otherwise.
>>>
>>> Yep! That's our bug. Always assume mutability in Hadoop :). I will see
>>> where the Writable is causing the error.
>>> Best would be if we had some test data and a check to see whether the
>>> algorithm is working.
>>
>> Good hunting. I notice that some of the code in the fuzzy MR unit test has
>> been commented out but have not looked into it further.
>>
>> I assume also you have sorted out the canopy issue you were having?
>>
>> Jeff



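[Editor's note] Jeff's suggestion rests on the usual k-means convergence rule: a cluster is converged once its center moves less than the convergence delta between iterations. A minimal sketch (plain Java with an assumed Euclidean distance, not the actual Mahout implementation) shows why raising the delta terminates an oscillation whose centers keep hopping by a small fixed amount.

```java
// Hypothetical convergence test in the spirit of k-means (not Mahout's
// actual code): converged when the center moves no more than `delta`.
// If memberships oscillate between two stable states, the center keeps
// moving by roughly the same small distance every iteration; a delta
// larger than that movement lets the loop terminate.
final class Convergence {
    static boolean converged(double[] oldCenter, double[] newCenter, double delta) {
        double sumSq = 0.0;
        for (int i = 0; i < oldCenter.length; i++) {
            double d = newCenter[i] - oldCenter[i];
            sumSq += d * d;
        }
        return Math.sqrt(sumSq) <= delta;  // Euclidean movement vs. threshold
    }
}
```

A center hopping by 0.2 per iteration never converges under delta = 0.1 but converges immediately under delta = 0.5.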

Re: Fuzzy K Means

2010-02-18 Thread Jeff Eastman

+1 Looks like what I did too.

Robin Anil wrote:

I am pasting the patch for SoftCluster here:

Index:
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java
===
---
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java
(revision
910924)
+++
core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java
(working
copy)
@@ -21,21 +21,14 @@
 import java.io.DataOutput;
 import java.io.IOException;

-import org.apache.hadoop.io.Writable;
+import org.apache.mahout.clustering.ClusterBase;
 import org.apache.mahout.math.AbstractVector;
-import org.apache.mahout.math.RandomAccessSparseVector;
 import org.apache.mahout.math.Vector;
 import org.apache.mahout.math.VectorWritable;
 import org.apache.mahout.math.function.SquareRootFunction;

-public class SoftCluster implements Writable {
-
-  // this cluster's clusterId
-  private int clusterId;
-
-  // the current center
-  private Vector center = new RandomAccessSparseVector(0);
-
+public class SoftCluster extends ClusterBase{
+
   // the current centroid is lazy evaluated and may be null
   private Vector centroid = null;

@@ -90,7 +83,7 @@

   @Override
   public void write(DataOutput out) throws IOException {
-out.writeInt(clusterId);
+out.writeInt(this.getId());
 out.writeBoolean(converged);
 Vector vector = computeCentroid();
 VectorWritable.writeVector(out, vector);
@@ -98,13 +91,13 @@

   @Override
   public void readFields(DataInput in) throws IOException {
-clusterId = in.readInt();
+this.setId(in.readInt());
 converged = in.readBoolean();
 VectorWritable temp = new VectorWritable();
 temp.readFields(in);
-center = temp.get();
+this.setCenter(temp.get());
 this.pointProbSum = 0;
-this.weightedPointTotal = center.like();
+this.weightedPointTotal = getCenter().like();
   }

   /**
@@ -112,6 +105,7 @@
*
* @return the new centroid
*/
+  @Override
   public Vector computeCentroid() {
 if (pointProbSum == 0) {
   return weightedPointTotal;
@@ -132,7 +126,7 @@
*  the center point
*/
   public SoftCluster(Vector center) {
-this.center = center;
+setCenter(center);
 this.pointProbSum = 0;

 this.weightedPointTotal = center.like();
@@ -145,8 +139,8 @@
*  the center point
*/
   public SoftCluster(Vector center, int clusterId) {
-this.clusterId = clusterId;
-this.center = center;
+this.setId(clusterId);
+this.setCenter(center);
 this.pointProbSum = 0;
 this.weightedPointTotal = center.like();
   }
@@ -154,7 +148,7 @@
   /** Construct a new softcluster with the given clusterID */
   public SoftCluster(String clusterId) {

-this.clusterId = Integer.parseInt(clusterId.substring(1));
+this.setId(Integer.parseInt(clusterId.substring(1)));
 this.pointProbSum = 0;
 // this.weightedPointTotal = center.like();
 this.converged = clusterId.charAt(0) == 'V';
@@ -162,14 +156,15 @@

   @Override
   public String toString() {
-return getIdentifier() + " - " + center.asFormatString();
+return getIdentifier() + " - " + getCenter().asFormatString();
   }

+  @Override
   public String getIdentifier() {
 if (converged) {
-  return "V" + clusterId;
+  return "V" + this.getId();
 } else {
-  return "C" + clusterId;
+  return "C" + this.getId();
 }
   }

@@ -212,7 +207,7 @@
 centroid = null;
 pointProbSum += ptProb;
 if (weightedPointTotal == null) {
-  weightedPointTotal = point.clone().times(ptProb);
+  weightedPointTotal = point.times(ptProb);
 } else {
   weightedPointTotal = weightedPointTotal.plus(point.times(ptProb));
 }
@@ -234,19 +229,15 @@
 }
   }

-  public Vector getCenter() {
-return center;
-  }
-
   public double getPointProbSum() {
 return pointProbSum;
   }

   /** Compute the centroid and set the center to it. */
   public void recomputeCenter() {
-center = computeCentroid();
+this.setCenter(computeCentroid());
 pointProbSum = 0;
-weightedPointTotal = center.like();
+weightedPointTotal = getCenter().like();
   }

   public Vector getWeightedPointTotal() {
@@ -265,8 +256,9 @@
 this.converged = converged;
   }

-  public int getClusterId() {
-return clusterId;
+  @Override
+  public String asFormatString() {
+return formatCluster(this);
   }

 }

  




Re: Fuzzy K Means

2010-02-18 Thread Jeff Eastman
Very similar, especially when you consider that k-means only adds the 
whole point value to the single, closest cluster (i.e. 
weightedPointTotal += 1), whereas fuzzy adds it partially to all. I 
don't think the other clustering routines require/expect numPoints to be
an integer, and the instance variable could probably be generalized to a
double, like weightedPointTotal, without impact.


Perhaps better to consider that change separately, as there are a number 
of tests which compare getNumPoints() with an integer value and would 
have to be adjusted. Likely it would be just adding an (int) cast as the 
values in non-fuzzy tests would always be whole numbers.



Pallavi Palleti wrote:

> Yes. But not the total number of points. So, the numPoints from
> ClusterBase will not be used in SoftCluster. numPoints is specific to
> KMeans, similar to weightedPointTotal for fuzzy k-means.
>
> Robin Anil wrote:
>
>> the center is still the averaged out centroid right?
>> weightedtotalvector/totalprobWeight
>>
>> On Wed, Feb 17, 2010 at 5:10 PM, Pallavi Palleti <
>> pallavi.pall...@corp.aol.com> wrote:
>>
>>> I haven't yet gone thru ClusterDumper. However, ClusterBase would be
>>> having a number of points to average out (pointTotal/numPoints as per
>>> kmeans), whereas SoftCluster will have a weighted point total. So, I am
>>> wondering how can we reuse ClusterBase here?
>>>
>>> Thanks
>>> Pallavi
>>>
>>> Robin Anil wrote:
>>>
>>>> yes. So that cluster dumper can print it out.
>>>>
>>>> On Wed, Feb 17, 2010 at 5:02 PM, Pallavi Palleti <
>>>> pallavi.pall...@corp.aol.com> wrote:
>>>>
>>>>> Hi Robin,
>>>>>
>>>>> when you meant by reusing ClusterBase, are you planning to extend
>>>>> ClusterBase in SoftCluster? For example, SoftCluster extends
>>>>> ClusterBase?
>>>>>
>>>>> Thanks
>>>>> Pallavi
>>>>>
>>>>> Robin Anil wrote:
>>>>>
>>>>>> I have been trying to convert FuzzyKMeans' SoftCluster (which should
>>>>>> ideally be named FuzzyKMeansCluster) to use the ClusterBase.
>>>>>>
>>>>>> I am getting *the same center* for all the clusters. To aid the
>>>>>> conversion, all I did was remove the center vector from the
>>>>>> SoftCluster class and reuse the one from ClusterBase. These are
>>>>>> essentially no changes in the tests, which pass correctly.
>>>>>>
>>>>>> So I am questioning whether the implementation keeps the average
>>>>>> center at all? Anyone who has used FuzzyKMeans experiencing this?
>>>>>>
>>>>>> Robin





Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Robin Anil
Yeah, me neither. Could you try syncing from the trunk?



On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen  wrote:

> I'm not seeing any such failures myself, from head.
>
> In case Robin just fixed something, try again from head?
>
>
> On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah  wrote:
> > Hi,
> >
> > I am new to Mahout and going through the initial steps of setting the
> > development
> > environment on my machine. I checked out the latest code from trunk and
> > seeing
> > the following failed tests when I ran mvn clean install:
> >
>


Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Sean Owen
I'm not seeing any such failures myself, from head.

In case Robin just fixed something, try again from head?


On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah  wrote:
> Hi,
>
> I am new to Mahout and going through the initial steps of setting the
> development
> environment on my machine. I checked out the latest code from trunk and
> seeing
> the following failed tests when I ran mvn clean install:
>