Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Sean Owen
I'm not seeing any such failures myself, from head.

In case Robin just fixed something, try again from head?


On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah avsha...@gmail.com wrote:
 Hi,

 I am new to Mahout and going through the initial steps of setting up the
 development environment on my machine. I checked out the latest code from
 trunk and am seeing the following failed tests when I ran mvn clean install:



Re: Fuzzy K Means

2010-02-18 Thread Jeff Eastman
Very similar, especially when you consider that k-means only adds the 
whole point value to the single closest cluster (i.e. 
weightedPointTotal += 1), whereas fuzzy adds it partially to all. I 
don't think the other clustering routines require/expect numPoints to be 
an integer, and the instance variable could probably be generalized to a 
double weightedPointTotal without impact.


Perhaps better to consider that change separately, as there are a number 
of tests which compare getNumPoints() with an integer value and would 
have to be adjusted. Likely it would just be adding an (int) cast, as the 
values in non-fuzzy tests would always be whole numbers.
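As a sketch of the generalization described above (hypothetical names, not Mahout's actual ClusterBase): with a double weightedPointTotal, k-means becomes the special case where each point contributes weight 1.0, fuzzy k-means contributes partial membership weights, and existing integer-based tests would only need an (int) cast.

```java
// Hypothetical sketch, not Mahout's actual ClusterBase: a double
// weightedPointTotal subsumes k-means' integer numPoints, since k-means
// adds weight 1.0 for the single closest cluster while fuzzy k-means
// adds a partial membership weight to every cluster.
class ClusterSketch {
    private double weightedPointTotal = 0.0;
    private double pointSum = 0.0; // 1-D for brevity; a Vector in practice

    void observe(double point, double weight) {
        pointSum += weight * point;
        weightedPointTotal += weight;
    }

    double center() {
        return pointSum / weightedPointTotal;
    }

    // Existing integer-based tests would only need an (int) cast here.
    int getNumPoints() {
        return (int) weightedPointTotal;
    }
}

public class WeightedTotalDemo {
    public static void main(String[] args) {
        // k-means style: whole point goes to the closest cluster.
        ClusterSketch kmeansStyle = new ClusterSketch();
        kmeansStyle.observe(2.0, 1.0);
        kmeansStyle.observe(4.0, 1.0);
        System.out.println(kmeansStyle.center()); // 3.0

        // fuzzy style: partial membership weights.
        ClusterSketch fuzzyStyle = new ClusterSketch();
        fuzzyStyle.observe(2.0, 0.25);
        fuzzyStyle.observe(4.0, 0.75);
        System.out.println(fuzzyStyle.center()); // 3.5
    }
}
```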



Pallavi Palleti wrote:
Yes, but not the total number of points. So the numPoints from 
ClusterBase will not be used in SoftCluster; numPoints is specific to 
k-means, similar to weightedPointTotal for fuzzy k-means.


Robin Anil wrote:

The center is still the averaged-out centroid, right?
weightedTotalVector / totalProbWeight



On Wed, Feb 17, 2010 at 5:10 PM, Pallavi Palleti 
pallavi.pall...@corp.aol.com wrote:

 
I haven't yet gone through ClusterDumper. However, ClusterBase would have a
number of points to average out (pointTotal/numPoints, as per k-means),
whereas SoftCluster will have a weighted point total. So I am wondering how
we can reuse ClusterBase here?


Thanks
Pallavi

Robin Anil wrote:

Yes, so that ClusterDumper can print it out.

On Wed, Feb 17, 2010 at 5:02 PM, Pallavi Palleti 
pallavi.pall...@corp.aol.com wrote:

Hi Robin,

When you say reusing ClusterBase, are you planning to extend ClusterBase in
SoftCluster, i.e. SoftCluster extends ClusterBase?


Thanks
Pallavi


Robin Anil wrote:

I have been trying to convert FuzzyKMeans' SoftCluster (which should
ideally be named FuzzyKMeansCluster) to use ClusterBase.

I am getting *the same center* for all the clusters. To aid the conversion,
all I did was remove the center vector from the SoftCluster class and reuse
the one from ClusterBase. This makes essentially no change in the tests,
which pass correctly.

So I am questioning whether the implementation keeps the average center at
all? Anyone who has used FuzzyKMeans experiencing this?


Robin




Re: Fuzzy K Means

2010-02-18 Thread Jeff Eastman
It sounds like k-means is looping because point memberships are 
oscillating between two stable states. Try increasing the convergence 
delta and you will likely terminate.
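Jeff's suggestion in a tiny, simplified sketch (hypothetical check; Mahout's real test compares full center vectors): termination hinges on comparing how far each center moved against the convergence delta, so if memberships oscillate and centers keep moving by a small fixed amount each pass, raising the delta above that amount ends the loop.

```java
// Simplified, hypothetical convergence check: a cluster is converged when
// its center moved less than delta since the last iteration. If point
// memberships oscillate between two stable states, centers keep moving by
// a small fixed amount each pass; a delta above that amount terminates.
public class ConvergenceSketch {
    static boolean converged(double oldCenter, double newCenter, double delta) {
        return Math.abs(newCenter - oldCenter) < delta;
    }

    public static void main(String[] args) {
        double movement = 0.01; // per-iteration oscillation amplitude
        System.out.println(converged(5.0, 5.0 + movement, 0.001)); // false: loops
        System.out.println(converged(5.0, 5.0 + movement, 0.1));   // true: terminates
    }
}
```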


Robin Anil wrote:

Yeah, the Canopy issue is sorted out. I was thinking of adding a flag to add
a point to a single canopy instead of adding it to all canopies. This would
help a lot on large datasets; there is no point in adding to all canopies,
since you will get approximate clustering anyway.

I have cleaned up most of SoftCluster. The error still exists; it seems to
be looping forever now. I will post a patch on the issue; please take a look.

Robin

On Wed, Feb 17, 2010 at 3:35 PM, Jeff Eastman j...@windwardsolutions.com wrote:

  

Robin Anil wrote:

 Hadoop reuses the *same* instance whenever it uses readFields, and I've
 been bitten more than once by assuming otherwise.

Yep! That's our bug. Always assume mutability in Hadoop :). I will see
where the Writable is causing the error. Best would be if we had some test
data and a check to see whether the algorithm is working.

Good hunting. I notice that some of the code in the fuzzy MR unit test has
been commented out, but I have not looked into it further.

I assume you have also sorted out the canopy issue you were having?

Jeff
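The reuse behavior warned about above can be demonstrated without Hadoop at all. Below is a plain-Java sketch in which PointWritable is a hypothetical stand-in whose write/readFields methods mirror the org.apache.hadoop.io.Writable contract: keeping references to the single reused instance silently gives you N copies of the last record.

```java
import java.io.*;
import java.util.*;

// Hypothetical stand-in for a Hadoop Writable (no Hadoop on the classpath
// here); write/readFields mirror the org.apache.hadoop.io.Writable contract.
class PointWritable {
    double x;
    void write(DataOutput out) throws IOException { out.writeDouble(x); }
    void readFields(DataInput in) throws IOException { x = in.readDouble(); }
}

public class WritableReuseDemo {
    // Deserialize three records the way Hadoop does: ONE instance, refilled
    // per record, combined with the buggy pattern of keeping references to it.
    static List<Double> readAllBuggy() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (double v : new double[] {1.0, 2.0, 3.0}) {
                PointWritable p = new PointWritable();
                p.x = v;
                p.write(out);
            }

            DataInput in =
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            PointWritable reused = new PointWritable();
            List<PointWritable> kept = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                reused.readFields(in);
                kept.add(reused); // bug: storing the reused, mutable instance
            }
            // Every kept reference is the same object, now holding the LAST value.
            List<Double> xs = new ArrayList<>();
            for (PointWritable p : kept) {
                xs.add(p.x);
            }
            return xs;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAllBuggy()); // [3.0, 3.0, 3.0], not [1.0, 2.0, 3.0]
    }
}
```

The fix is to copy the record (or extract its fields) before the next readFields call.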




  




Re: Fuzzy K Means

2010-02-18 Thread Robin Anil
Yeah, by killing the job partway through, I find all the centers to be the
same :( which is the main problem.


On Thu, Feb 18, 2010 at 5:51 PM, Jeff Eastman j...@windwardsolutions.com wrote:

 It sounds like k-means is looping because point memberships are oscillating
 between two stable states. Try increasing the convergence delta and you will
 likely terminate.


 Robin Anil wrote:

 Yeah, Canopy issue is sorted out. Was thinking of adding a flag to add
 point
 to a single canopy instead of adding it to all canopies. This would help a
 lot on large datasets. There is no point of adding to all canopies, you
 will
 get approximate clustering anyways

 I have cleaned up most of SoftCluster. Still the error exists. It seems to
 be looping forever now. I will post a patch on the issue take please take
 a
 look

 Robin

 On Wed, Feb 17, 2010 at 3:35 PM, Jeff Eastman j...@windwardsolutions.com
 wrote:



 Robin Anil wrote:



 Hadoop reuses the *same* instance whenever it uses readFields and I've


 been
 bitten more than once by assuming otherwise.




 Yep!. Thats our bug. Always assume mutability in Hadoop :) . I will see
 the
 where the writable is causing the error.
 Best is if we could have some test data and make a check to see if the
 algorithm is working.





 Good hunting. I notice that some of the code in the fuzzy MR unit test
 has
 been commented out but have not looked into it further.

 I assume also you have sorted out the canopy issue you were having?

 Jeff










Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Anish Shah
I tried again by first syncing from trunk and running mvn install as shown
below, and I am getting the same test failures! I am running this on a
Windows 7 machine using Cygwin.

$ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
Checked out revision 911364.

$ mvn clean install
.. lots of junit test successes and the following failure

Results :

Failed tests:
  testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
  testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmean
sClustering)

Tests run: 338, Failures: 2, Errors: 0, Skipped: 0

On Thu, Feb 18, 2010 at 5:40 AM, Robin Anil robin.a...@gmail.com wrote:

 Yeah me neither. Could you try syncing from the trunk



 On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen sro...@gmail.com wrote:

  I'm not seeing any such failures myself, from head.
 
  In case Robin just fixed something, try again from head?
 
 
  On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah avsha...@gmail.com wrote:
   Hi,
  
   I am new to Mahout and going through the initial steps of setting the
   development
   environment on my machine. I checked out the latest code from trunk and
   seeing
   the following failed tests when I ran mvn clean install:
  
 



Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Robin Anil
I am building Revision 911405 on a Mac, and things work fine for me. I am
assuming the same is the case for Sean (Mac).

One reason I can see for an error is that on Windows the output directories
don't get deleted if the filesystem locks them (for reasons I cannot
fathom). Could you try deleting the testdata and output directories that
are created and try again? Those directories could hold data used by the
k-means test that is deleted before the canopy test.

Other than that, if you are building Mahout for usage, do: mvn clean
install -DskipTests=true

Robin


On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah avsha...@gmail.com wrote:

 I tried again by first syncing from the trunk and running mvn install using
 the following
 and getting the same test failures! I am running this on a Windows 7 machine
 using Cygwin.

 $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
 Checked out revision 911364.

 $ mvn clean install
 .. lots of junit test successes and the following failure

 Results :

 Failed tests:
  testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)

  
 testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmean
 sClustering)

 Tests run: 338, Failures: 2, Errors: 0, Skipped: 0

 On Thu, Feb 18, 2010 at 5:40 AM, Robin Anil robin.a...@gmail.com wrote:

  Yeah me neither. Could you try syncing from the trunk
 
 
 
  On Thu, Feb 18, 2010 at 4:08 PM, Sean Owen sro...@gmail.com wrote:
 
   I'm not seeing any such failures myself, from head.
  
   In case Robin just fixed something, try again from head?
  
  
   On Wed, Feb 17, 2010 at 11:04 PM, Anish Shah avsha...@gmail.com
 wrote:
Hi,
   
I am new to Mahout and going through the initial steps of setting the
development
environment on my machine. I checked out the latest code from trunk
 and
seeing
the following failed tests when I ran mvn clean install:
   
  
 



[jira] Created: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)
TestClassifier takes correctLabel from filename instead of from the key
---

 Key: MAHOUT-296
 URL: https://issues.apache.org/jira/browse/MAHOUT-296
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-296 started by Robin Anil.

 TestClassifier takes correctLabel from filename instead of from the key
 ---

 Key: MAHOUT-296
 URL: https://issues.apache.org/jira/browse/MAHOUT-296
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-296.
---

Resolution: Fixed

 TestClassifier takes correctLabel from filename instead of from the key
 ---

 Key: MAHOUT-296
 URL: https://issues.apache.org/jira/browse/MAHOUT-296
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (MAHOUT-296) TestClassifier takes correctLabel from filename instead of from the key

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil closed MAHOUT-296.
-


 TestClassifier takes correctLabel from filename instead of from the key
 ---

 Key: MAHOUT-296
 URL: https://issues.apache.org/jira/browse/MAHOUT-296
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Ted Dunning
Note the different version number here.

I think that Anish has somehow gotten stuck on an old version.  Anish, can
you do a clean checkout and build?

On Thu, Feb 18, 2010 at 6:16 AM, Robin Anil robin.a...@gmail.com wrote:

 I am building Revision: 911405 on a Mac, and things work fine for me. I am
 assuming same is the case for sean(mac)

 ...


 On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah avsha...@gmail.com wrote:

 ...
  $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
  Checked out revision 911364.
 




-- 
Ted Dunning, CTO
DeepDyve


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Sorry about the attachment. See it here: http://yfrog.com/4epicture1pfp


On Fri, Feb 19, 2010 at 1:25 AM, Robin Anil robin.a...@gmail.com wrote:

 I was trying out SeqAccessSparseVector on Canopy clustering using Manhattan
 distance. I found performance to be really bad, so I profiled it with
 YourKit (thanks a lot for providing us a free license).

 Since I was trying out Manhattan distance, there were a lot of A-B
 operations, which created a lot of clone operations: 5% of the total time.
 There were also many A+B operations, for adding a point to the canopy to
 average; these also created a lot of clone operations: 90% of the total time.

 So we definitely need to improve that.

 As a small hack, I made the cluster centers RandomAccess vectors. Things
 are fast again. I don't know whether to commit it or not, but it is
 something to look into in 0.4?

 Robin





Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil robin.a...@gmail.com wrote:

 I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
 distance. I found performance to be really bad. So I profiled it with
 Yourkit(Thanks a lot for providing us free license)

 Since i was trying out manhattan distance, there were a lot of A-B which
 created a lot of clone operation 5% of the total time
 there were also so many A+B for adding a point to the canopy to average.
 this was also creating a lot of clone operations.  90% of the total time


SequentialAccessSparseVector should only be used in a read-only fashion. If
you are creating an average centroid which is sparse but mutating, then it
should be a RandomAccessSparseVector. The points being used to create it
can be SequentialAccessSparseVectors (if they themselves never change), but
then the method called should be
SequentialAccessSparseVector.addTo(RandomAccessSparseVector): this exploits
the fast sequential iteration of SeqAcc and the fast random-access
mutability of RandAcc.
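A self-contained sketch of why the addTo idiom wins (the classes below are illustrative stand-ins, not Mahout's SequentialAccessSparseVector/RandomAccessSparseVector): the sequential representation is iterated in order, while the random-access accumulator absorbs each update in place, so no clone is ever made.

```java
import java.util.*;

// Illustrative stand-ins for the two sparse-vector access patterns:
// sequential (parallel index/value arrays, cheap in-order iteration,
// expensive random writes) vs. random access (hash map, cheap writes).
class SeqSparse {
    final int[] indices;   // sorted
    final double[] values; // read-only here
    SeqSparse(int[] idx, double[] vals) { indices = idx; values = vals; }

    // The addTo idiom: iterate self sequentially, mutate the accumulator.
    void addTo(RandSparse acc) {
        for (int i = 0; i < indices.length; i++) {
            acc.add(indices[i], values[i]);
        }
    }
}

class RandSparse {
    final Map<Integer, Double> map = new HashMap<>();
    void add(int index, double delta) {
        map.merge(index, delta, Double::sum); // O(1) random-access mutation
    }
    double get(int index) { return map.getOrDefault(index, 0.0); }
}

public class AddToSketch {
    public static void main(String[] args) {
        // Accumulate two read-only sparse points into a mutable centroid
        // without cloning anything.
        RandSparse centroid = new RandSparse();
        new SeqSparse(new int[] {0, 3}, new double[] {1.0, 2.0}).addTo(centroid);
        new SeqSparse(new int[] {3, 7}, new double[] {4.0, 8.0}).addTo(centroid);
        System.out.println(centroid.get(3)); // 6.0
    }
}
```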



 So we definitely needs to improve that..

 For a small hack. I made the cluster centers RandomAccess Vector. Things
 are fast again. I dont know whether to commit or not. But something to look
 into in 0.4?


Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so we
can see exactly what the change is?

  -jake


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
File it for 0.3 ?


Robin

On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix jake.man...@gmail.com wrote:

 On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil robin.a...@gmail.com wrote:

  I was trying out SeqAccessSparseVector on Canopy Clustering using
 Manhattan
  distance. I found performance to be really bad. So I profiled it with
  Yourkit(Thanks a lot for providing us free license)
 
  Since i was trying out manhattan distance, there were a lot of A-B which
  created a lot of clone operation 5% of the total time
  there were also so many A+B for adding a point to the canopy to average.
  this was also creating a lot of clone operations.  90% of the total time
 

 SequentialAccessSparseVector should only be used in a read-only fashion.
  If
 you are creating an average centroid which is sparse, but it is mutating,
 then it should be RandomAccessSparseVector.  The points which are being
 used
 to create it can be SequentialAccessSparseVector (if they themselves never
 change), but then the method called should be
 SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
 exploits
 the fast sequential iteration of SeqAcc, and the fast random-access
 mutatability of RandAcc.


 
  So we definitely needs to improve that..
 
  For a small hack. I made the cluster centers RandomAccess Vector. Things
  are fast again. I dont know whether to commit or not. But something to
 look
  into in 0.4?
 

 Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so we
 can see exactly what the change is?

  -jake



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
I dunno, we can file it for whenever, 0.4, and if it turns out it's a really
easy change we can always commit it for 0.3.

  -jake

On Thu, Feb 18, 2010 at 12:29 PM, Robin Anil robin.a...@gmail.com wrote:

 File it for 0.3 ?


 Robin

 On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix jake.man...@gmail.com
 wrote:

  On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil robin.a...@gmail.com
 wrote:
 
   I was trying out SeqAccessSparseVector on Canopy Clustering using
  Manhattan
   distance. I found performance to be really bad. So I profiled it with
   Yourkit(Thanks a lot for providing us free license)
  
   Since i was trying out manhattan distance, there were a lot of A-B
 which
   created a lot of clone operation 5% of the total time
   there were also so many A+B for adding a point to the canopy to
 average.
   this was also creating a lot of clone operations.  90% of the total
 time
  
 
  SequentialAccessSparseVector should only be used in a read-only fashion.
   If
  you are creating an average centroid which is sparse, but it is mutating,
  then it should be RandomAccessSparseVector.  The points which are being
  used
  to create it can be SequentialAccessSparseVector (if they themselves
 never
  change), but then the method called should be
  SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
  exploits
  the fast sequential iteration of SeqAcc, and the fast random-access
  mutatability of RandAcc.
 
 
  
   So we definitely needs to improve that..
  
   For a small hack. I made the cluster centers RandomAccess Vector.
 Things
   are fast again. I dont know whether to commit or not. But something to
  look
   into in 0.4?
  
 
  Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so
 we
  can see exactly what the change is?
 
   -jake
 



[jira] Created: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)
Canopy and Kmeans clustering slows down on using SeqAccVector for center


 Key: MAHOUT-297
 URL: https://issues.apache.org/jira/browse/MAHOUT-297
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4
 Attachments: MAHOUT-297.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

converts centers to randomaccess on first creation

 Canopy and Kmeans clustering slows down on using SeqAccVector for center
 

 Key: MAHOUT-297
 URL: https://issues.apache.org/jira/browse/MAHOUT-297
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4

 Attachments: MAHOUT-297.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Welcome Drew Farris

2010-02-18 Thread Grant Ingersoll
On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the newest 
member of the Mahout committer family.  Drew has been contributing some really 
nice work to Mahout in recent months and I look forward to his continuing 
involvement with Mahout.

Congrats, Drew!


-Grant

Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Now the big perf bottleneck is immutability.

Say, for plus(): it does vector.clone() before doing anything else.
There should be both immutable and mutable plus functions.

Robin



On Fri, Feb 19, 2010 at 2:07 AM, Jake Mannix jake.man...@gmail.com wrote:

 I dunno, we can file it for whenever, 0.4 and if it turns out it's a really
 easy
 change we can always commit it for 0.3.

  -jake

 On Thu, Feb 18, 2010 at 12:29 PM, Robin Anil robin.a...@gmail.com wrote:

  File it for 0.3 ?
 
 
  Robin
 
  On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix jake.man...@gmail.com
  wrote:
 
   On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil robin.a...@gmail.com
  wrote:
  
I was trying out SeqAccessSparseVector on Canopy Clustering using
   Manhattan
distance. I found performance to be really bad. So I profiled it with
Yourkit(Thanks a lot for providing us free license)
   
Since i was trying out manhattan distance, there were a lot of A-B
  which
created a lot of clone operation 5% of the total time
there were also so many A+B for adding a point to the canopy to
  average.
this was also creating a lot of clone operations.  90% of the total
  time
   
  
   SequentialAccessSparseVector should only be used in a read-only
 fashion.
If
   you are creating an average centroid which is sparse, but it is
 mutating,
   then it should be RandomAccessSparseVector.  The points which are being
   used
   to create it can be SequentialAccessSparseVector (if they themselves
  never
   change), but then the method called should be
   SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
   exploits
   the fast sequential iteration of SeqAcc, and the fast random-access
   mutatability of RandAcc.
  
  
   
So we definitely needs to improve that..
   
For a small hack. I made the cluster centers RandomAccess Vector.
  Things
are fast again. I dont know whether to commit or not. But something
 to
   look
into in 0.4?
   
  
   Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so
  we
   can see exactly what the change is?
  
-jake
  
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Grant Ingersoll
If it's as obvious a win as it sounds, I'd say 0.3. We aren't in lockdown
yet, are we?

-Grant

On Feb 18, 2010, at 3:37 PM, Jake Mannix wrote:

 I dunno, we can file it for whenever, 0.4 and if it turns out it's a really
 easy
 change we can always commit it for 0.3.
 
  -jake
 
 On Thu, Feb 18, 2010 at 12:29 PM, Robin Anil robin.a...@gmail.com wrote:
 
 File it for 0.3 ?
 
 
 Robin
 
 On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix jake.man...@gmail.com
 wrote:
 
 On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil robin.a...@gmail.com
 wrote:
 
 I was trying out SeqAccessSparseVector on Canopy Clustering using
 Manhattan
 distance. I found performance to be really bad. So I profiled it with
 Yourkit(Thanks a lot for providing us free license)
 
 Since i was trying out manhattan distance, there were a lot of A-B
 which
 created a lot of clone operation 5% of the total time
 there were also so many A+B for adding a point to the canopy to
 average.
 this was also creating a lot of clone operations.  90% of the total
 time
 
 
 SequentialAccessSparseVector should only be used in a read-only fashion.
 If
 you are creating an average centroid which is sparse, but it is mutating,
 then it should be RandomAccessSparseVector.  The points which are being
 used
 to create it can be SequentialAccessSparseVector (if they themselves
 never
 change), but then the method called should be
 SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
 exploits
 the fast sequential iteration of SeqAcc, and the fast random-access
 mutatability of RandAcc.
 
 
 
 So we definitely needs to improve that..
 
 For a small hack. I made the cluster centers RandomAccess Vector.
 Things
 are fast again. I dont know whether to commit or not. But something to
 look
 into in 0.4?
 
 
 Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so
 we
 can see exactly what the change is?
 
 -jake
 
 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
addTo() is mutable plus.

On Thu, Feb 18, 2010 at 1:04 PM, Robin Anil robin.a...@gmail.com wrote:

 Now the big perf bottle neck is immutability

 Say for plus its doing vector.clone() before doing anything else.
 There should be both immutable and mutable plus functions

 Robin



 On Fri, Feb 19, 2010 at 2:07 AM, Jake Mannix jake.man...@gmail.com
 wrote:

  I dunno, we can file it for whenever, 0.4 and if it turns out it's a
 really
  easy
  change we can always commit it for 0.3.
 
   -jake
 
  On Thu, Feb 18, 2010 at 12:29 PM, Robin Anil robin.a...@gmail.com
 wrote:
 
   File it for 0.3 ?
  
  
   Robin
  
   On Fri, Feb 19, 2010 at 1:56 AM, Jake Mannix jake.man...@gmail.com
   wrote:
  
On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil robin.a...@gmail.com
   wrote:
   
 I was trying out SeqAccessSparseVector on Canopy Clustering using
Manhattan
 distance. I found performance to be really bad. So I profiled it
 with
 Yourkit(Thanks a lot for providing us free license)

 Since i was trying out manhattan distance, there were a lot of A-B
   which
 created a lot of clone operation 5% of the total time
 there were also so many A+B for adding a point to the canopy to
   average.
 this was also creating a lot of clone operations.  90% of the total
   time

   
SequentialAccessSparseVector should only be used in a read-only
  fashion.
 If
you are creating an average centroid which is sparse, but it is
  mutating,
then it should be RandomAccessSparseVector.  The points which are
 being
used
to create it can be SequentialAccessSparseVector (if they themselves
   never
change), but then the method called should be
SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
exploits
the fast sequential iteration of SeqAcc, and the fast random-access
mutatability of RandAcc.
   
   

 So we definitely needs to improve that..

 For a small hack. I made the cluster centers RandomAccess Vector.
   Things
 are fast again. I dont know whether to commit or not. But something
  to
look
 into in 0.4?

   
Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch
 so
   we
can see exactly what the change is?
   
 -jake
   
  
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Ah! It's not being used anywhere :). Should we make that a big task before
0.3? Sweep through the code (mainly clustering) and change all these things.

Robin



On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com wrote:

 Isn't this basically what assign() is for?

 On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil robin.a...@gmail.com wrote:
  Now the big perf bottle neck is immutability
 
  Say for plus its doing vector.clone() before doing anything else.
  There should be both immutable and mutable plus functions
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
I use it (addTo) in decomposer, for exactly this performance issue. Changing
plus into addTo requires care: since plus() leaves its arguments immutable,
there may be code which *assumes* that this is the case, and addTo() leaves
side effects which might not be expected. This bit me hard on the SVD
migration, because I had other assumptions about mutability in there.

  -jake

On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil robin.a...@gmail.com wrote:

 ah! Its not being used anywhere :). Should we make that a big task before
 0.3 ? Sweep through code(mainly clustering) and change all these things.

 Robin



 On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com wrote:

  Isn't this basically what assign() is for?
 
  On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil robin.a...@gmail.com
 wrote:
   Now the big perf bottle neck is immutability
  
   Say for plus its doing vector.clone() before doing anything else.
   There should be both immutable and mutable plus functions
  
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
I just had to change it in one place (and the tests pass, which is scary).
Canopy is really fast now :). It could still be pushed further; now the
bottleneck is minus.

Maybe a subtractFrom along the lines of addTo? Or a mutable negate function
for Vector, before adding to?

Robin



On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix jake.man...@gmail.com wrote:

 I use it (addTo) in decomposer, for exactly this performance issue.
 Changing
 plus into addTo requires care, because since plus() leaves arguments
 immutable,
 there may be code which *assumes* that this is the case, and doing addTo()
 leaves side effects which might not be expected.  This bit me hard on svd
 migration, because I had other assumptions about mutability in there.

  -jake

 On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil robin.a...@gmail.com wrote:

  ah! Its not being used anywhere :). Should we make that a big task before
  0.3 ? Sweep through code(mainly clustering) and change all these things.
 
  Robin
 
 
 
  On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com wrote:
 
   Isn't this basically what assign() is for?
  
   On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil robin.a...@gmail.com
  wrote:
Now the big perf bottle neck is immutability
   
Say for plus its doing vector.clone() before doing anything else.
There should be both immutable and mutable plus functions
   
  
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
To do subtractFrom, you can instead just do

  Vector.assign(otherVector, Functions.minus);

The problem is that only DenseVector has an optimization here: if the
BinaryFunction passed in is additive (an instance of PlusMult), sparse
iteration over otherVector is executed, applying the binary function and
mutating self. AbstractVector should have this optimization in general, as
it would be useful in RandomAccessSparseVector (although not terribly
useful in SequentialAccessSparseVector, but still better than the current
behavior).

  -jake
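A self-contained sketch of the in-place assign and the additive-function shortcut described above (illustrative classes, not Mahout's Vector API): because f(x, 0) == x for an additive function like minus, only the other vector's nonzero entries need visiting, and self is mutated without any clone.

```java
import java.util.function.DoubleBinaryOperator;

// Illustrative dense vector, not Mahout's Vector API: assign mutates self
// in place with a binary function, and the additive shortcut skips entries
// where the other vector is zero (valid whenever f(x, 0) == x, as for minus).
class DenseSketch {
    final double[] values;
    DenseSketch(double[] v) { values = v; }

    // General path: apply f at every index.
    void assign(DenseSketch other, DoubleBinaryOperator f) {
        for (int i = 0; i < values.length; i++) {
            values[i] = f.applyAsDouble(values[i], other.values[i]);
        }
    }

    // Optimized path for additive f (PlusMult-like): visit only the other
    // vector's nonzero entries, mutating self without any clone.
    void assignAdditive(DenseSketch other, DoubleBinaryOperator f) {
        for (int i = 0; i < values.length; i++) {
            if (other.values[i] != 0.0) {
                values[i] = f.applyAsDouble(values[i], other.values[i]);
            }
        }
    }
}

public class AssignMinusSketch {
    public static void main(String[] args) {
        DenseSketch self = new DenseSketch(new double[] {5.0, 5.0, 5.0});
        DenseSketch other = new DenseSketch(new double[] {0.0, 2.0, 0.0});
        self.assignAdditive(other, (a, b) -> a - b); // self -= other, in place
        System.out.println(self.values[0] + " " + self.values[1]); // 5.0 3.0
    }
}
```

With a sparse other vector, the same shortcut amounts to iterating only its stored nonzeros, which is where the real win comes from.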

On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil robin.a...@gmail.com wrote:

 I just had to change it at one place(and the tests pass, which is scary).
 Canopy is really fast now :). Still could be pushed
 Now the bottleneck is minus

 maybe a subtractFrom on the lines of addTo? or a mutable negate function
 for
 vector, before adding to

 Robin



 On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix jake.man...@gmail.com
 wrote:

  I use it (addTo) in decomposer, for exactly this performance issue.
  Changing
  plus into addTo requires care, because since plus() leaves arguments
  immutable,
  there may be code which *assumes* that this is the case, and doing
 addTo()
  leaves side effects which might not be expected.  This bit me hard on svd
  migration, because I had other assumptions about mutability in there.
 
   -jake
 
  On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil robin.a...@gmail.com
 wrote:
 
   ah! Its not being used anywhere :). Should we make that a big task
 before
   0.3 ? Sweep through code(mainly clustering) and change all these
 things.
  
   Robin
  
  
  
   On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com wrote:
  
Isn't this basically what assign() is for?
   
On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil robin.a...@gmail.com
   wrote:
 Now the big perf bottle neck is immutability

 Say for plus its doing vector.clone() before doing anything else.
 There should be both immutable and mutable plus functions

   
  
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
Just to be clear, this does currentVector - otherVector?

currentVector.assign(otherVector, Functions.minus);



On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix jake.man...@gmail.com wrote:

 to do subtractFrom, you can instead just do

  Vector.assign(otherVector, Functions.minus);

 The problem is that while DenseVector has an optimization here: if the
 BinaryFunction passed in is additive (it's an instance of PlusMult),
 sparse iteration over otherVector is executed, applying the binary
 function and mutating self.  AbstractVector should have this optimization
 in general, as it would be useful in RandomAccessSparseVector (although
 not terribly useful in SequentialAccessSparseVector, but still better than
 current).

  -jake

 On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil robin.a...@gmail.com wrote:

  I just had to change it at one place(and the tests pass, which is scary).
  Canopy is really fast now :). Still could be pushed
  Now the bottleneck is minus
 
  maybe a subtractFrom on the lines of addTo? or a mutable negate function
  for
  vector, before adding to
 
  Robin
 
 
 
  On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix jake.man...@gmail.com
  wrote:
 
   I use it (addTo) in decomposer, for exactly this performance issue.
   Changing
   plus into addTo requires care, because since plus() leaves arguments
   immutable,
   there may be code which *assumes* that this is the case, and doing
  addTo()
   leaves side effects which might not be expected.  This bit me hard on
 svd
   migration, because I had other assumptions about mutability in there.
  
-jake
  
   On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil robin.a...@gmail.com
  wrote:
  
ah! Its not being used anywhere :). Should we make that a big task
  before
0.3 ? Sweep through code(mainly clustering) and change all these
  things.
   
Robin
   
   
   
On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com wrote:
   
 Isn't this basically what assign() is for?

 On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil robin.a...@gmail.com
wrote:
  Now the big perf bottle neck is immutability
 
  Say for plus its doing vector.clone() before doing anything else.
  There should be both immutable and mutable plus functions
 

   
  
 



Re: Welcome Drew Farris

2010-02-18 Thread Grant Ingersoll

On Feb 18, 2010, at 4:05 PM, Robin Anil wrote:

 Welcome Drew
 
 @Grant: No customary introduction? :)

Sorry, forgot that.  Drew, tradition is that new committers give a little background 
on themselves.  I can add one tidbit: I worked w/ Drew way back when at 
TextWise, so I'm glad he showed up here!

 
 Robin
 
 On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll gsing...@apache.orgwrote:
 
 On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the
 newest member of the Mahout committer family.  Drew has been contributing
 some really nice work to Mahout in recent months and I look forward to his
 continuing involvement with Mahout.
 
 Congrats, Drew!
 
 
 -Grant




Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
currentVector.assign(otherVector, minus) takes the other vector, and
subtracts
it from currentVector, which mutates currentVector.  If currentVector is
DenseVector,
this is already optimized.  It could be optimized if currentVector is
RandomAccessSparse.

  -jake
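A toy illustration of the mutating semantics being discussed - an in-place subtract that iterates only the non-zeros of the other vector, so no clone is made. The SparseVec class below is a hypothetical minimal sketch, not the actual Mahout Vector API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal sparse vector used only to illustrate the idea;
// this is NOT the Mahout Vector API.
class SparseVec {
    final Map<Integer, Double> entries = new HashMap<>();

    void set(int i, double v) { entries.put(i, v); }

    double get(int i) { return entries.getOrDefault(i, 0.0); }

    // In-place subtract, analogous to assign(other, Functions.minus):
    // iterate only other's non-zeros and mutate this, avoiding a copy.
    SparseVec subtractFrom(SparseVec other) {
        for (Map.Entry<Integer, Double> e : other.entries.entrySet()) {
            set(e.getKey(), get(e.getKey()) - e.getValue());
        }
        return this;
    }
}
```

Because the receiver is mutated, callers that assumed plus()/minus()-style immutability will observe side effects - exactly the migration hazard Jake describes.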

On Thu, Feb 18, 2010 at 1:29 PM, Robin Anil robin.a...@gmail.com wrote:

 Just to be clear, this does:
 currentVector-otherVector ?

 currentVector.assign(otherVector, Functions.minus);



 On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix jake.man...@gmail.com
 wrote:

  to do subtractFrom, you can instead just do
 
   Vector.assign(otherVector, Functions.minus);
 
  The problem is that while DenseVector has an optimization here: if the
  BinaryFunction passed in is additive (it's an instance of PlusMult),
  sparse iteration over otherVector is executed, applying the binary
  function and mutating self.  AbstractVector should have this optimization
  in general, as it would be useful in RandomAccessSparseVector (although
  not terribly useful in SequentialAccessSparseVector, but still better
 than
  current).
 
   -jake
 
  On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil robin.a...@gmail.com
 wrote:
 
   I just had to change it at one place(and the tests pass, which is
 scary).
   Canopy is really fast now :). Still could be pushed
   Now the bottleneck is minus
  
   maybe a subtractFrom on the lines of addTo? or a mutable negate
 function
   for
   vector, before adding to
  
   Robin
  
  
  
   On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix jake.man...@gmail.com
   wrote:
  
I use it (addTo) in decomposer, for exactly this performance issue.
Changing
plus into addTo requires care, because since plus() leaves arguments
immutable,
there may be code which *assumes* that this is the case, and doing
   addTo()
leaves side effects which might not be expected.  This bit me hard on
  svd
migration, because I had other assumptions about mutability in there.
   
 -jake
   
On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil robin.a...@gmail.com
   wrote:
   
 ah! Its not being used anywhere :). Should we make that a big task
   before
 0.3 ? Sweep through code(mainly clustering) and change all these
   things.

 Robin



 On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com
 wrote:

  Isn't this basically what assign() is for?
 
  On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil 
 robin.a...@gmail.com
 wrote:
   Now the big perf bottle neck is immutability
  
   Say for plus its doing vector.clone() before doing anything
 else.
   There should be both immutable and mutable plus functions
  
 

   
  
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
This really doesn't work for me; I can't modify any vectors inside the distance
measure. So I have written a subtract inside the Manhattan distance itself. Works
great for now

On Fri, Feb 19, 2010 at 3:10 AM, Jake Mannix jake.man...@gmail.com wrote:

 currentVector.assign(otherVector, minus) takes the other vector, and
 subtracts
 it from currentVector, which mutates currentVector.  If currentVector is
 DenseVector,
 this is already optimized.  It could be optimized if currentVector is
 RandomAccessSparse.

  -jake

 On Thu, Feb 18, 2010 at 1:29 PM, Robin Anil robin.a...@gmail.com wrote:

  Just to be clear, this does:
  currentVector-otherVector ?
 
  currentVector.assign(otherVector, Functions.minus);
 
 
 
  On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix jake.man...@gmail.com
  wrote:
 
   to do subtractFrom, you can instead just do
  
Vector.assign(otherVector, Functions.minus);
  
   The problem is that while DenseVector has an optimization here: if the
   BinaryFunction passed in is additive (it's an instance of PlusMult),
   sparse iteration over otherVector is executed, applying the binary
   function and mutating self.  AbstractVector should have this
 optimization
   in general, as it would be useful in RandomAccessSparseVector (although
   not terribly useful in SequentialAccessSparseVector, but still better
  than
   current).
  
-jake
  
   On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil robin.a...@gmail.com
  wrote:
  
I just had to change it at one place(and the tests pass, which is
  scary).
Canopy is really fast now :). Still could be pushed
Now the bottleneck is minus
   
maybe a subtractFrom on the lines of addTo? or a mutable negate
  function
for
vector, before adding to
   
Robin
   
   
   
On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix jake.man...@gmail.com
wrote:
   
 I use it (addTo) in decomposer, for exactly this performance issue.
 Changing
 plus into addTo requires care, because since plus() leaves
 arguments
 immutable,
 there may be code which *assumes* that this is the case, and doing
addTo()
 leaves side effects which might not be expected.  This bit me hard
 on
   svd
 migration, because I had other assumptions about mutability in
 there.

  -jake

 On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil robin.a...@gmail.com
wrote:

  ah! Its not being used anywhere :). Should we make that a big
 task
before
  0.3 ? Sweep through code(mainly clustering) and change all these
things.
 
  Robin
 
 
 
  On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com
  wrote:
 
   Isn't this basically what assign() is for?
  
   On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil 
  robin.a...@gmail.com
  wrote:
Now the big perf bottle neck is immutability
   
Say for plus its doing vector.clone() before doing anything
  else.
There should be both immutable and mutable plus functions
   
  
 

   
  
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
2-second canopy clustering over Reuters :D


On Fri, Feb 19, 2010 at 3:33 AM, Robin Anil robin.a...@gmail.com wrote:

 This really doesnt work for, i cant modify any vectors inside distance
 measure. So i have wrote a subtract inside manhattan distance itself. Works
 great for now


 On Fri, Feb 19, 2010 at 3:10 AM, Jake Mannix jake.man...@gmail.comwrote:

 currentVector.assign(otherVector, minus) takes the other vector, and
 subtracts
 it from currentVector, which mutates currentVector.  If currentVector is
 DenseVector,
 this is already optimized.  It could be optimized if currentVector is
 RandomAccessSparse.

  -jake

 On Thu, Feb 18, 2010 at 1:29 PM, Robin Anil robin.a...@gmail.com wrote:

  Just to be clear, this does:
  currentVector-otherVector ?
 
  currentVector.assign(otherVector, Functions.minus);
 
 
 
  On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix jake.man...@gmail.com
  wrote:
 
   to do subtractFrom, you can instead just do
  
Vector.assign(otherVector, Functions.minus);
  
   The problem is that while DenseVector has an optimization here: if the
   BinaryFunction passed in is additive (it's an instance of PlusMult),
   sparse iteration over otherVector is executed, applying the binary
   function and mutating self.  AbstractVector should have this
 optimization
   in general, as it would be useful in RandomAccessSparseVector
 (although
   not terribly useful in SequentialAccessSparseVector, but still better
  than
   current).
  
-jake
  
   On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil robin.a...@gmail.com
  wrote:
  
I just had to change it at one place(and the tests pass, which is
  scary).
Canopy is really fast now :). Still could be pushed
Now the bottleneck is minus
   
maybe a subtractFrom on the lines of addTo? or a mutable negate
  function
for
vector, before adding to
   
Robin
   
   
   
On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix jake.man...@gmail.com
 
wrote:
   
 I use it (addTo) in decomposer, for exactly this performance
 issue.
 Changing
 plus into addTo requires care, because since plus() leaves
 arguments
 immutable,
 there may be code which *assumes* that this is the case, and doing
addTo()
 leaves side effects which might not be expected.  This bit me hard
 on
   svd
 migration, because I had other assumptions about mutability in
 there.

  -jake

 On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil robin.a...@gmail.com
 
wrote:

  ah! Its not being used anywhere :). Should we make that a big
 task
before
  0.3 ? Sweep through code(mainly clustering) and change all these
things.
 
  Robin
 
 
 
  On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com
  wrote:
 
   Isn't this basically what assign() is for?
  
   On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil 
  robin.a...@gmail.com
  wrote:
Now the big perf bottle neck is immutability
   
Say for plus its doing vector.clone() before doing anything
  else.
There should be both immutable and mutable plus functions
   
  
 

   
  
 





[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

Really fast now.

 Canopy and Kmeans clustering slows down on using SeqAccVector for center
 

 Key: MAHOUT-297
 URL: https://issues.apache.org/jira/browse/MAHOUT-297
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4

 Attachments: MAHOUT-297.patch, MAHOUT-297.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

Improvements in TanimotoDistanceMeasure

 Canopy and Kmeans clustering slows down on using SeqAccVector for center
 

 Key: MAHOUT-297
 URL: https://issues.apache.org/jira/browse/MAHOUT-297
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4

 Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch







[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

Changed Euclidean distance measure to use v2 to iterate and v1 to access

 Canopy and Kmeans clustering slows down on using SeqAccVector for center
 

 Key: MAHOUT-297
 URL: https://issues.apache.org/jira/browse/MAHOUT-297
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4

 Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, 
 MAHOUT-297.patch







Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
I have made all the changes; take a look. The same could be done for fuzzy k-means,
Dirichlet, and LDA. Haven't had time to look at the internals yet.



On Fri, Feb 19, 2010 at 3:35 AM, Robin Anil robin.a...@gmail.com wrote:

 2 second canopy clustering over reuters :D



 On Fri, Feb 19, 2010 at 3:33 AM, Robin Anil robin.a...@gmail.com wrote:

 This really doesnt work for, i cant modify any vectors inside distance
 measure. So i have wrote a subtract inside manhattan distance itself. Works
 great for now


 On Fri, Feb 19, 2010 at 3:10 AM, Jake Mannix jake.man...@gmail.comwrote:

 currentVector.assign(otherVector, minus) takes the other vector, and
 subtracts
 it from currentVector, which mutates currentVector.  If currentVector is
 DenseVector,
 this is already optimized.  It could be optimized if currentVector is
 RandomAccessSparse.

  -jake

 On Thu, Feb 18, 2010 at 1:29 PM, Robin Anil robin.a...@gmail.com
 wrote:

  Just to be clear, this does:
  currentVector-otherVector ?
 
  currentVector.assign(otherVector, Functions.minus);
 
 
 
  On Fri, Feb 19, 2010 at 2:57 AM, Jake Mannix jake.man...@gmail.com
  wrote:
 
   to do subtractFrom, you can instead just do
  
Vector.assign(otherVector, Functions.minus);
  
   The problem is that while DenseVector has an optimization here: if
 the
   BinaryFunction passed in is additive (it's an instance of PlusMult),
   sparse iteration over otherVector is executed, applying the binary
   function and mutating self.  AbstractVector should have this
 optimization
   in general, as it would be useful in RandomAccessSparseVector
 (although
   not terribly useful in SequentialAccessSparseVector, but still better
  than
   current).
  
-jake
  
   On Thu, Feb 18, 2010 at 1:19 PM, Robin Anil robin.a...@gmail.com
  wrote:
  
I just had to change it at one place(and the tests pass, which is
  scary).
Canopy is really fast now :). Still could be pushed
Now the bottleneck is minus
   
maybe a subtractFrom on the lines of addTo? or a mutable negate
  function
for
vector, before adding to
   
Robin
   
   
   
On Fri, Feb 19, 2010 at 2:43 AM, Jake Mannix 
 jake.man...@gmail.com
wrote:
   
 I use it (addTo) in decomposer, for exactly this performance
 issue.
 Changing
 plus into addTo requires care, because since plus() leaves
 arguments
 immutable,
 there may be code which *assumes* that this is the case, and
 doing
addTo()
 leaves side effects which might not be expected.  This bit me
 hard on
   svd
 migration, because I had other assumptions about mutability in
 there.

  -jake

 On Thu, Feb 18, 2010 at 1:09 PM, Robin Anil 
 robin.a...@gmail.com
wrote:

  ah! Its not being used anywhere :). Should we make that a big
 task
before
  0.3 ? Sweep through code(mainly clustering) and change all
 these
things.
 
  Robin
 
 
 
  On Fri, Feb 19, 2010 at 2:36 AM, Sean Owen sro...@gmail.com
  wrote:
 
   Isn't this basically what assign() is for?
  
   On Thu, Feb 18, 2010 at 9:04 PM, Robin Anil 
  robin.a...@gmail.com
  wrote:
Now the big perf bottle neck is immutability
   
Say for plus its doing vector.clone() before doing anything
  else.
There should be both immutable and mutable plus functions
   
  
 

   
  
 






Re: Welcome Drew Farris

2010-02-18 Thread Drew Farris
Hi Grant, fellow Mahouts,

Thanks for the chance to join the team. I really look forward to
contributing my skills to the project and learning a great deal as
well.

So, a little bit about myself:

It all started with an Apple //+ back in 1982. Growing up, I never
thought I'd do something serious with computers. In college I studied
Computer Graphics in the Art School, Architecture and ended up getting
a Masters in Information Resource Management on top of that.

Since then I've been a software developer who has brushed up
against information retrieval, search and NLP for many years. I got my
start in search and content management working as a web-developer for
a newspaper in the early days of the Internet.

As Grant mentioned, I've worked at TextWise for a number of years. The
company grew out of a NLP-oriented research group headed by Liz Liddy
at Syracuse University and continues to focus on the commercial
applications of text-oriented technologies albeit with a more
statistical orientation as of late.

While at TextWise, I've worked on projects ranging from
cross-language IR to contextual advertising. Mostly I've been involved
in developing the glue that holds the core algorithms together,
helping them scale and combining the various moving parts of a system
into a cohesive whole. I've had a chance to do everything from web
crawling, document processing, database, visualization, web-app and
distributed systems work. To that end, I've worked on and off with
Lucene, Nutch, and many other projects from the Apache ecosystem for
years.

Reading Programming Collective Intelligence a couple years back
really solidified my interest in machine learning algorithms. After
building a number of different systems to process large amounts of
content, the ability to quickly and effortlessly scale things up with
hadoop/mapreduce really appeals to me. The Mahout project is
wonderful to me in that it combines the things I'm interested in
personally, has relevance to the things I do for work and has a really
outstanding group of people working on it.

I'm looking forward to working with you all,

Drew

On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil robin.a...@gmail.com wrote:
 Welcome Drew

 @Grant: No customary introduction? :)

 Robin

 On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll gsing...@apache.orgwrote:

 On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the
 newest member of the Mahout committer family.  Drew has been contributing
 some really nice work to Mahout in recent months and I look forward to his
 continuing involvement with Mahout.

 Congrats, Drew!


 -Grant



Re: Welcome Drew Farris

2010-02-18 Thread Ted Dunning
We have already enjoyed working with you and look forward to more of it.
Good to have you on board.

On Thu, Feb 18, 2010 at 2:27 PM, Drew Farris drew.far...@gmail.com wrote:

 I'm looking forward to working with you all,




-- 
Ted Dunning, CTO
DeepDyve


[jira] Updated: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-02-18 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-297:
--

Attachment: MAHOUT-297.patch

The last one had a correctness problem in the Manhattan distance. This one fixes it.



 Canopy and Kmeans clustering slows down on using SeqAccVector for center
 

 Key: MAHOUT-297
 URL: https://issues.apache.org/jira/browse/MAHOUT-297
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4

 Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, 
 MAHOUT-297.patch, MAHOUT-297.patch







Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Anish Shah
Ted,

I checked out revision 911542 (after removing the mahout-trunk from my local
machine) and tried again, and I am still getting the same 2 failures when
running mvn clean install!

Anish

On Thu, Feb 18, 2010 at 11:51 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Note the different version number here.

 I think that Anish has somehow gotten stuck on an old version.  Anish, can
 you do a clean checkout and build?

 On Thu, Feb 18, 2010 at 6:16 AM, Robin Anil robin.a...@gmail.com wrote:

  I am building Revision: 911405 on a Mac, and things work fine for me. I
 am
  assuming same is the case for sean(mac)
 
  ...
 
 
  On Thu, Feb 18, 2010 at 6:11 PM, Anish Shah avsha...@gmail.com wrote:
 
  ...
   $ svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
   Checked out revision 911364.
  
 



 --
 Ted Dunning, CTO
 DeepDyve



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Robin Anil
TODO: sum of minus to be optimised without having to hold the intermediate
vector.
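One way to attack this TODO - fusing the subtraction into the distance computation so no intermediate vector is ever allocated - can be sketched as follows. Plain Map-based vectors and the class name are hypothetical stand-ins, not the Mahout implementation:

```java
import java.util.Map;
import java.util.TreeSet;

// Sketch: Manhattan distance |a - b|_1 accumulated on the fly,
// without materializing the intermediate vector a - b.
// Map<Integer, Double> stands in for a sparse vector here.
class FusedManhattan {
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        TreeSet<Integer> indices = new TreeSet<>(a.keySet());
        indices.addAll(b.keySet()); // union of the non-zero index sets
        double sum = 0.0;
        for (int i : indices) {
            sum += Math.abs(a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0));
        }
        return sum;
    }
}
```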


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Ted Dunning
Yes.  addTo is just a specialization of a very, very common case.

On Thu, Feb 18, 2010 at 1:06 PM, Sean Owen sro...@gmail.com wrote:

 Isn't this basically what assign() is for?




-- 
Ted Dunning, CTO
DeepDyve


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Ted Dunning
Actually, this makes the case that we should have something like:

 microMapReduce(aggregatorFunction, aggregatorUnit, binaryMapFunction,
vectorA, vectorB)

The name should be changed after its rhetorical effect has worn off.  As the
Chukwa guys tend to say, it's turtles all the way down.  We can have
map-reduce inside map-reduce.

On Thu, Feb 18, 2010 at 3:41 PM, Robin Anil robin.a...@gmail.com wrote:

 TODO: sum of minus to be optimised without having to hold the intermediate
 vector.




-- 
Ted Dunning, CTO
DeepDyve
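Ted's proposed signature can be sketched over plain double arrays. The names and shapes are illustrative only - nothing like this exists in Mahout:

```java
import java.util.function.DoubleBinaryOperator;

// Sketch of the proposed microMapReduce over two (dense) vectors:
// acc starts at the aggregator's unit, then for each index i,
// acc = agg(acc, map(a[i], b[i])).  No temporary vector is created.
class MicroMR {
    static double microMapReduce(DoubleBinaryOperator agg, double unit,
                                 DoubleBinaryOperator map,
                                 double[] a, double[] b) {
        double acc = unit;
        for (int i = 0; i < a.length; i++) {
            acc = agg.applyAsDouble(acc, map.applyAsDouble(a[i], b[i]));
        }
        return acc;
    }
}
```

With this, dot product is agg = plus, map = times, and sum-squared-difference is agg = plus, map = (x, y) -> (x - y) * (x - y).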


Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Ted Dunning
Darn.  That uses up all of my ideas.

I would vote for a platform issue.

On Thu, Feb 18, 2010 at 3:41 PM, Anish Shah avsha...@gmail.com wrote:

 I checked out revision 911542 (after removing the mahout-trunk from my
 local
 machine) and
 tried again and still getting the same 2 failures upon running mvn clean
 install!




-- 
Ted Dunning, CTO
DeepDyve


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
On Thu, Feb 18, 2010 at 3:58 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Actually, this makes the case that we should have something like:

 microMapReduce(aggregatorFunction, aggregatorUnit, binaryMapFunction,
 vectorA, vectorB)


What would this method mean?  aggregatorUnit means what?  What would this
be a method on?

The reason why we need a specialized function is to do things in a nicely
mutating way: Hadoop M/R is functional in the Lispy sense: read-only
immutable
objects (once on the filesystem).

The only thing more we need than what we have now is in the assign method -
currently we have it with a map, with reduce being the identity (with
replacement -
the calling object becomes the output of the reduce, i.e. the output of the
map):

  Vector.assign(Vector other, BinaryFunction map) {
    // implemented effectively as follows in AbstractVector
    for (int i = 0; i < size(); i++) {
      setQuick(i, map.apply(getQuick(i), other.getQuick(i)));
    }
    return this;
  }

Something more powerful (and sparse-efficient) would be:

  Vector.assign(Vector other, BinaryFunction map, BinaryFunction reduce,
      boolean sparse) {
    Iterator<Element> it = sparse ? other.iterateNonZero() : other.iterateAll();
    while (it.hasNext()) {
      Element e = it.next();
      int i = e.index();
      setQuick(i, map.apply(getQuick(i), e.get()));
    }
    // do stuff with the reduce - what exactly?
    return this;
  }

(is the reduce necessary?)


  -jake


Re: Welcome Drew Farris

2010-02-18 Thread Jake Mannix
Welcome Drew!  I've been using your excellent colloc code quite a bit
in testing my svd stuff (produces nicely bigger vectors out of text!),
looking
forward to more cool stuff (NLP package!  Bring it on! :) ).

  -jake

On Thu, Feb 18, 2010 at 2:27 PM, Drew Farris drew.far...@gmail.com wrote:

 Hi Grant, fellow Mahouts,

 Thanks for the chance to join the team. I really look forward to
 contributing my skills to the project and learning a great deal as
 well.

 So, a little bit about myself:

 It all started with an Apple //+ back in 1982. Growing up, I never
 thought I'd do something serious with computers. In college I studied
 Computer Graphics in the Art School, Architecture and ended up getting
 a Masters in Information Resource Management on top of that.

 Since then I've been a software developer who has brushed up
 against information retrieval, search and NLP for many years. I got my
 start in search and content management working as a web-developer for
 a newspaper in the early days of the Internet.

 As Grant mentioned, I've worked at TextWise for a number of years. The
 company grew out of a NLP-oriented research group headed by Liz Liddy
 at Syracuse University and continues to focus on the commercial
 applications of text-oriented technologies albeit with a more
 statistical orientation as of late.

 While at TextWise, I've worked on projects ranging from
 cross-language IR to contextual advertising. Mostly I've been involved
 in developing the glue that holds the core algorithms together,
 helping them scale and combining the various moving parts of a system
 into a cohesive whole. I've had a chance to do everything from web
 crawling, document processing, database, visualization, web-app and
 distributed systems work. To that end, I've worked on and off with
 Lucene, Nutch, and many other projects from the Apache ecosystem for
 years.

 Reading Programming Collective Intelligence a couple years back
 really solidified my interest in machine learning algorithms. After
 building a number of different systems to process large amounts of
 content, the ability to quickly and effortlessly scale things up with
 hadoop/mapreduce really appeals to me. The Mahout project is
 wonderful to me in that it combines the things I'm interested in
 personally, has relevance to the things I do for work and has a really
 outstanding group of people working on it.

 I'm looking forward to working with you all,

 Drew

 On Thu, Feb 18, 2010 at 4:05 PM, Robin Anil robin.a...@gmail.com wrote:
  Welcome Drew
 
  @Grant: No customary introduction? :)
 
  Robin
 
  On Fri, Feb 19, 2010 at 2:33 AM, Grant Ingersoll gsing...@apache.org
 wrote:
 
  On behalf of the Lucene PMC, I'm happy to announce Drew Farris as the
  newest member of the Mahout committer family.  Drew has been
 contributing
  some really nice work to Mahout in recent months and I look forward to
 his
  continuing involvement with Mahout.
 
  Congrats, Drew!
 
 
  -Grant
 



Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Ted Dunning
On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix jake.man...@gmail.com wrote:

 What would this method mean?  aggregatorUnit means what?  What would this
 be a method on?


This method would apply the mapFunction to each corresponding pair of
elements from the two vectors and then aggregate the results using the
aggregatorFunction.

The unit is the unit of the aggregator and would only be needed if the
vectors have no entries.  We could probably do without it.

This could be a static function or could be a method on vectorA.  Putting
the method on vectorA would probably be better because it could drive many
common optimizations.

Examples of this pattern include sum-squared-difference (agg = plus, map =
compose(sqr, minus)), dot (agg = plus, map = times).

This can be composed with a temporary output vector or sometimes by mutating
one of the operands.  This is not as desirable as just accumulating the
results on the fly, however.

 The reason why we need a specialized function is to do things in a nicely
 mutating way: Hadoop M/R is functional in the Lispy sense: read-only
 immutable objects (once on the filesystem).


We definitely need that too.


  The only thing more we need than what we have now is in the assign method
 -
 currently we have it with a map, with reduce being the identity (with
 replacement -
 the calling object becomes the output of the reduce -ie the output of the
 map):


That can work, but very often requires an extra copy of the vector as in the
distance case that Robin brought up.  The contract there says neither
operand can be changed which forces a vector copy in the current API.  A
mapReduce operation in addition to a map would allow us to avoid that
important case.


Re: New to Mahout - question about the failed test cases

2010-02-18 Thread Anish Shah
I have created https://issues.apache.org/jira/browse/MAHOUT-298 to track
this.

On Thu, Feb 18, 2010 at 6:59 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Darn.  That uses up all of my ideas.

 I would vote for a platform issue.

 On Thu, Feb 18, 2010 at 3:41 PM, Anish Shah avsha...@gmail.com wrote:

  I checked out revision 911542 (after removing the mahout-trunk from my
  local
  machine) and
  tried again and still getting the same 2 failures upon running mvn clean
  install!
 



 --
 Ted Dunning, CTO
 DeepDyve



[jira] Created: (MAHOUT-298) 2 test case fails while trying to mvn clean install after checking out revision 911542 of trunk

2010-02-18 Thread Anish Shah (JIRA)
2 test case fails while trying to mvn clean install after checking out revision 
911542 of trunk
---

 Key: MAHOUT-298
 URL: https://issues.apache.org/jira/browse/MAHOUT-298
 Project: Mahout
  Issue Type: Test
  Components: Clustering
Affects Versions: 0.3
 Environment: Windows 7 with Cygwin
Reporter: Anish Shah
Priority: Minor


I checked out revision 911542 from trunk and seeing the following failed tests 
when I ran mvn clean install:

Results :

Failed tests:
  testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
  testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)

Tests run: 338, Failures: 2, Errors: 0, Skipped: 0

[INFO] 
[ERROR] BUILD FAILURE
[INFO] 
[INFO] There are test failures.

I looked in the surefire-reports and see the following details on the failures:

testKMeansMRJob(org.apache.mahout.clustering.kmeans.TestKmeansClustering)  Time elapsed: 11.169 sec   FAILURE!
junit.framework.AssertionFailedError: clusters[3] expected:<4> but was:<2>
at junit.framework.Assert.fail(Assert.java:47)
...

testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)  Time elapsed: 3.35 sec   FAILURE!
junit.framework.AssertionFailedError: num points[0] expected:<4> but was:<1>
at junit.framework.Assert.fail(Assert.java:47)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Welcome Drew Farris

2010-02-18 Thread Drew Farris
On Thu, Feb 18, 2010 at 7:45 PM, Jake Mannix jake.man...@gmail.com wrote:
 Welcome Drew!  I've been using your excellent colloc code quite a bit
 in testing my svd stuff (produces nicely bigger vectors out of text!),
 looking
 forward to more cool stuff (NLP package!  Bring it on! :) ).


Heh, great to hear! There's lots more stuff I'd like to get in there,
now I only need to figure how to squeeze 48 hours of consciousness
into a day.


Re: Profiling SequentialAccessSparseVector

2010-02-18 Thread Jake Mannix
Don't we already have generalized scalar aggregation?  I thought I committed
that a while back.  Its very useful for inner products, distances, and
stats.

Vector accumulation using a BinaryFunction as a map just needs to be made
more efficient (sparsity and random accessibility taken into account), but
works.

The only remaining piece is something like accumulate(Vector v,
BinaryFunction map, BinaryFunction aggregator) - a method on Matrix, which
aggregates partial map() combinations of each row with the input Vector, and
returns a Vector.  This generalizes times(Vector).  I guess
Matrix.assign(Vector v, BinaryFunction map) could be useful for mutating a
matrix, but on HDFS would operate by making new sequencefiles.

  -jake

On Feb 18, 2010 5:11 PM, Ted Dunning ted.dunn...@gmail.com wrote:

On Thu, Feb 18, 2010 at 4:43 PM, Jake Mannix jake.man...@gmail.com wrote:
 What would this metho...
This method would apply the mapFunction to each corresponding pair of
elements from the two vectors and then aggregate the results using the
aggregatorFunction.

The unit is the unit of the aggregator and would only be needed if the
vectors have no entries.  We could probably do without it.

This could be a static function or could be a method on vectorA.  Putting
the method on vectorA would probably be better because it could drive many
common optimizations.

Examples of this pattern include sum-squared-difference (agg = plus, map =
compose(sqr, minus)), dot (agg = plus, map = times).
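
A minimal sketch of the pattern described above (names and signatures are
illustrative, not Mahout's actual API): the map function is applied to each
corresponding pair of elements, and the results are folded with the
aggregator, so both dot and sum-squared-difference fall out of one loop
with no temporary vector.

```java
import java.util.function.DoubleBinaryOperator;

public class VectorAggregation {

  // Hypothetical aggregate(a, b, map, agg, unit): fold agg over map(a[i], b[i]).
  // The unit seeds the fold and is returned if the vectors have no entries.
  public static double aggregate(double[] a, double[] b,
                                 DoubleBinaryOperator map,
                                 DoubleBinaryOperator agg,
                                 double unit) {
    double result = unit;
    for (int i = 0; i < a.length; i++) {
      result = agg.applyAsDouble(result, map.applyAsDouble(a[i], b[i]));
    }
    return result;
  }

  public static void main(String[] args) {
    double[] x = {1, 2, 3};
    double[] y = {4, 5, 6};
    // dot: agg = plus, map = times
    double dot = aggregate(x, y, (p, q) -> p * q, Double::sum, 0.0);
    // sum-squared-difference: agg = plus, map = compose(sqr, minus)
    double ssd = aggregate(x, y, (p, q) -> (p - q) * (p - q), Double::sum, 0.0);
    System.out.println(dot);  // 32.0
    System.out.println(ssd);  // 27.0
  }
}
```

A sparse implementation would iterate only the nonzero entries (or the union
of them, depending on map), which is where the optimization opportunity lies.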

This can be composed with a temporary output vector or sometimes by mutating
one of the operands.  This is not as desirable as just accumulating the
results on the fly, however.

The reason why we need a specialized function is to do things in a nicely 
mutating way: Hadoop M...
We definitely need that too.

 The only thing more we need than what we have now is in the assign method
 -  currently we ha...
That can work, but very often requires an extra copy of the vector as in the
distance case that Robin brought up.  The contract there says neither
operand can be changed, which forces a vector copy in the current API.  A
mapReduce operation in addition to a map would let us avoid that copy in an
important case.


Re: Welcome Drew Farris

2010-02-18 Thread Grant Ingersoll

On Feb 18, 2010, at 8:32 PM, Drew Farris wrote:

  There's lots more stuff I'd like to get in there,
 now I only need to figure how to squeeze 48 hours of consciousness
 into a day.

I believe there is a compression algorithm for that.


[jira] Created: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable

2010-02-18 Thread Drew Farris (JIRA)
Collocations: improve performance by making Gram BinaryComparable
-

 Key: MAHOUT-299
 URL: https://issues.apache.org/jira/browse/MAHOUT-299
 Project: Mahout
  Issue Type: Improvement
  Components: Utils
Affects Versions: 0.3
Reporter: Drew Farris
Priority: Minor
 Fix For: 0.3


Robin's profiling indicated that a large portion of a run was spent in 
readFields() in Gram due to the deserialization occurring as part of Gram 
comparisons for sorting. He pointed me to BinaryComparable and the 
implementation in Text.

Like Text, in this new implementation, Gram stores its string in binary form. 
When encoding the string at construction time we allocate an extra character's 
worth of data to hold the Gram type information. When sorting Grams, the binary 
arrays are compared instead of deserializing and comparing fields.
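
The scheme described above can be sketched as follows (illustrative only:
the class, field layout, and type codes are assumptions, not the actual
Gram implementation from the patch):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GramSketch implements Comparable<GramSketch> {
  // Hypothetical type codes for the gram type information.
  public static final byte UNIGRAM = 0, NGRAM = 1;

  // Type byte followed by the UTF-8 text; the real patch may lay this out
  // differently, but the idea is the same: one extra slot holds the type.
  private final byte[] bytes;

  public GramSketch(String text, byte type) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    bytes = new byte[utf8.length + 1];
    bytes[0] = type;
    System.arraycopy(utf8, 0, bytes, 1, utf8.length);
  }

  @Override
  public int compareTo(GramSketch other) {
    // Sorting compares the raw byte arrays lexicographically,
    // with no field deserialization.
    return Arrays.compare(bytes, other.bytes);
  }
}
```

In Hadoop this is what BinaryComparable buys you: the shuffle's sort can run
a raw byte comparator without ever calling readFields().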

 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable

2010-02-18 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-299:
---

Attachment: MAHOUT-299.patch

Patch as described above:

Included other cleanups:

* Gram is no longer mutable, except in the case of readFields of course.
* Added explicit NGRAM type, remove constructors that implicitly set type.
* Added unit tests for constructors and writability. One should be added for 
sortability/comparison.
* Better unigram handling in the mappers/reducers (no need to setType on these 
anymore)
* Switched to adjustOrPutValue when accumulating frequencies in 
OpenObjectIntHashMaps

Also, NGramCollector, NGramCollectorTest should be removed from the repo. They 
are no longer relevant. Applying this patch with -E will empty and erase these 
files, but it's up to svn to do the rest.



 Collocations: improve performance by making Gram BinaryComparable
 -

 Key: MAHOUT-299
 URL: https://issues.apache.org/jira/browse/MAHOUT-299
 Project: Mahout
  Issue Type: Improvement
  Components: Utils
Affects Versions: 0.3
Reporter: Drew Farris
Priority: Minor
 Fix For: 0.3

 Attachments: MAHOUT-299.patch


 Robin's profiling indicated that a large portion of a run was spent in 
 readFields() in Gram due to the deserialization occurring as part of Gram 
 comparisons for sorting. He pointed me to BinaryComparable and the 
 implementation in Text.
 Like Text, in this new implementation, Gram stores its string in binary form. 
 When encoding the string at construction time we allocate an extra 
 character's worth of data to hold the Gram type information. When sorting 
 Grams, the binary arrays are compared instead of deserializing and comparing 
 fields.
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-299) Collocations: improve performance by making Gram BinaryComparable

2010-02-18 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-299:
---

Status: Patch Available  (was: Open)

 Collocations: improve performance by making Gram BinaryComparable
 -

 Key: MAHOUT-299
 URL: https://issues.apache.org/jira/browse/MAHOUT-299
 Project: Mahout
  Issue Type: Improvement
  Components: Utils
Affects Versions: 0.3
Reporter: Drew Farris
Priority: Minor
 Fix For: 0.3

 Attachments: MAHOUT-299.patch


 Robin's profiling indicated that a large portion of a run was spent in 
 readFields() in Gram due to the deserialization occurring as part of Gram 
 comparisons for sorting. He pointed me to BinaryComparable and the 
 implementation in Text.
 Like Text, in this new implementation, Gram stores its string in binary form. 
 When encoding the string at construction time we allocate an extra 
 character's worth of data to hold the Gram type information. When sorting 
 Grams, the binary arrays are compared instead of deserializing and comparing 
 fields.
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Welcome Drew Farris

2010-02-18 Thread deneche abdelhakim
Welcome Drew

=D

On Fri, Feb 19, 2010 at 5:02 AM, Grant Ingersoll gsing...@apache.org wrote:

 On Feb 18, 2010, at 8:32 PM, Drew Farris wrote:

  There's lots more stuff I'd like to get in there,
 now I only need to figure how to squeeze 48 hours of consciousness
 into a day.

 I believe there is a compression algorithm for that.



[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

2010-02-18 Thread zhao zhendong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao zhendong updated MAHOUT-232:
-

Attachment: SVMonMahout0.5.1.patch

MapReduce/MapReduceUtil.java
should have been mapreduce/MapReduceUtil.java
the folders should NOT be in camel case. I still see camel casing everywhere.
 Done. Changed MapReduce -> mapreduce, ParallelAlgorithms -> 
 parallelalgorithms and SequentialAlgorithms -> sequentialalgorithms

+  public static final String DEFAULT_HDFS_SERVER = "hdfs://localhost:12009";
+  // For HBASE
+  public static final String DEFAULT_HBASE_SERVER = "localhost:6";
These are read from the hadoop conf and hbase configuration file. Mahout 
shouldn't be doing any sort of configuration internally.
 Hard coding in hadoop and hbase configuration have been removed. The Default 
 HDFS and Hbase setting in SVMParameters only for MapReduce application 
 runtime default setting.

No System.out.Println use the Logger log instead
 Done.

HDFSConfig.java, HDFSReader.java - do away with any hdfs configuration in the 
code. As i said Opening a FileSystem using the Configuration object would 
in-turn decide between local fs or hdfs based on the execution context
 Yes, the sequential algorithms use the principle you mentioned: they 
 determine which file system to choose according to whether the hdfs 
 parameter is given in the training and prediction procedures. HDFSReader 
 only serves the sequential algorithms, not the parallel algorithms based on 
 the Map/Reduce framework. 


 Implementation of sequential SVM solver based on Pegasos
 

 Key: MAHOUT-232
 URL: https://issues.apache.org/jira/browse/MAHOUT-232
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Affects Versions: 0.4
Reporter: zhao zhendong
 Fix For: 0.4

 Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, 
 SequentialSVM_0.3.patch, SequentialSVM_0.4.patch, SVMDataset.patch, 
 SVMonMahout0.5.1.patch, SVMonMahout0.5.patch


 After discussing with people in this community, I decided to re-implement a 
 sequential SVM solver based on Pegasos for the Mahout platform (mahout 
 command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will 
 support HDFS. 
 Sequential SVM based on Pegasos.
 Maxim zhao (zhaozhendong at gmail dot com)
 ---
 Currently, this package provides (Features):
 ---
 1. Sequential SVM linear solver, including training and testing.
 2. Supports the general file system and HDFS right now.
 3. Supports large-scale data set training.
 Because Pegasos only needs to sample a subset of the data, this package can 
 pre-fetch
 a bounded number of samples (e.g. the maximum iteration count) into memory.
 For example: if the data set has 100,000,000 samples and the 
 default maximum iteration is 10,000,
 the package randomly loads only 10,000 samples into memory.
 4. Sequential data set testing, so the package supports large-scale data 
 sets for both training and testing.
 5. Supports parallel classification (testing phase only) based on the 
 Map-Reduce framework.
 6. Supports multi-classification based on the Map-Reduce framework (fully 
 parallelized version).
 7. Supports regression.
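
The pre-fetch idea in feature 3 can be sketched as follows (an illustrative
helper, not the patch's code; the class and method names are assumptions):
instead of loading all n samples, draw at most maxIter distinct indices
uniformly at random and load only those samples into memory.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class SampleLoader {

  // Pick up to maxIter distinct sample indices out of n, to decide which
  // records to pull into memory before training starts.
  public static Set<Long> pickIndices(long n, int maxIter, long seed) {
    Random rnd = new Random(seed);
    long target = Math.min((long) maxIter, n);
    Set<Long> picked = new HashSet<>();
    while (picked.size() < target) {
      // Uniform draw in [0, n); duplicates are discarded by the set.
      picked.add((long) (rnd.nextDouble() * n));
    }
    return picked;
  }
}
```

With n = 100,000,000 and maxIter = 10,000 this touches only 10,000 records,
which is what keeps memory use bounded regardless of data set size.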
 ---
 TODO:
 ---
 1. Multi-classification Probability Prediction
 2. Performance Testing
 ---
 Usage:
 ---
 
 Classification:
 
 
 @@ Training: @@
 
 SVMPegasosTraining.java
 The default argument is:
 -tr ../examples/src/test/resources/svmdataset/train.dat -m 
 ../examples/src/test/resources/svmdataset/SVM.model
 ~~
 @ For the case that training data set on HDFS:@
 ~~
 1. Ensure that your training data set has been uploaded to HDFS:
 hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
 2. Revise the arguments:
 -tr /user/hadoop/train.dat -m 
 ../examples/src/test/resources/svmdataset/SVM.model -hdfs 
 hdfs://localhost:12009
 ~~
 @ Multi-class Training [Based on MapReduce Framework]:@
 ~~
 bin/hadoop jar mahout-core-0.3-SNAPSHOT.job 
 org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver
  -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m