Re: Streaming KMeans clustering

2013-12-27 Thread Dan Filimon
Hi everyone!

So for the two issues:

1. Mapper slowness: this is basically an issue with the searcher being
used. The default is ProjectionSearch which was doing a good job. If the
bottleneck is indeed remove or searchFirst, that sort of points out a
limitation in the basic algorithm (unless it turns out there's something
super dumb going on).

2. Reducer OOM: for this job, if we have m mappers, clustering n points
into k clusters, each mapper should get roughly n / m points to cluster,
and produce k * log(n / m) centroids. The total number of points that the
reducer gets is m * k * log(n / m).

As you can see, this really depends on the particular data set we're
working with. Suppose k is n / 10 and you have m = 10 mappers. That gets
you 10 * (n / 10) * log(n / 10) ~ n log n points that the reducer has to
cluster, which makes this approach useless: you'd have more points at the
end than at the beginning.
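
A rough way to sanity-check these numbers is sketched below. The class and method names are made up for illustration, and the use of the natural logarithm is an assumption, since the thread does not say which base is meant.

```java
/** Back-of-the-envelope estimate of how many centroids reach the reducer. */
final class ReducerLoadEstimate {

    // Centroids emitted by one mapper: k * log(n / m). Natural log is an assumption.
    static double centroidsPerMapper(long n, int m, long k) {
        return k * Math.log((double) n / m);
    }

    // Total centroids the single reducer has to hold: m * k * log(n / m).
    static double reducerCentroids(long n, int m, long k) {
        return m * centroidsPerMapper(n, m, k);
    }

    public static void main(String[] args) {
        long n = 10_000_000L;
        int m = 10;
        // A small k keeps the reducer input far below n.
        System.out.printf("k = 100:    %.0f centroids%n", reducerCentroids(n, m, 100));
        // k = n / 10 pushes the reducer input above n, as described above.
        System.out.printf("k = n / 10: %.0f centroids%n", reducerCentroids(n, m, n / 10));
    }
}
```

Comparing the result against n before launching the job is the kind of check a driver-side warning could be built on.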

In any case, if the number of reducer centroids (the m * k * log (n / m))
is acceptable, there's an option to run another StreamingKMeans in the
reducer: there's the reduceStreamingKMeans flag in the driver.

However, I feel that if you see yourself needing this flag, it probably
shows that this MapReduce approach is not what you want and you should just
run StreamingKMeans directly.

In retrospect, I think there should be code in the driver that checks for
this and spits out a warning. :)

Thoughts?

(Happy Holidays to everyone too! :D)



On Fri, Dec 27, 2013 at 9:59 AM, Sotiris Salloumis wrote:

> Hi Suneel,
>
> Is it possible to upload debug or log messages from the OOM exceptions you
> have seen so we can take a look at them?
>
> Regards
> Sotiris
>
>
> On Thu, Dec 26, 2013 at 8:19 PM, Suneel Marthi wrote:
>
> > I would push the code freeze until this is resolved (which is the reason
> > I had been holding off). This is something that should have been raised
> > for the 0.8 release and I don't think we should defer this to the next one.
> >
> > I heard from people outside of dev@ and user@ who have tried running
> > Streaming KMeans (from 0.8) on their Production clusters on large datasets
> > and had seen the job crash in the Reduce phase due to OOM errors (this is
> > with -Xmx2GB).
> >
> >
> >
> >
> >
> >
> > On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <
> > isa...@apache.org> wrote:
> >
> > On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:
> >
> > > It's when you increase the no. of documents and the size of each
> > > document (add more dimensions) that you start seeing performance
> > > issues, which are:
> > > a) The Mappers take long to complete and it's either the
> > > searcher.remove() or searcher.searchFirst() calls (will check again in
> > > my next attempt) that seem to be the bottleneck.
> > > b) Once the Mappers complete (after several hours) the Reducer dies
> > > with an OOM exception (despite having set -Xmx2G).
> >
> > Given that there seem to be a couple of people experiencing issues I
> > think it makes sense to create a JIRA issue here to track progress - either
> > code improvements or better documentation on how to run this
> > implementation.
> >
> > @Suneel: Does it make sense to push code freeze to after fixing this or
> > should this be communicated as a known defect in the release notes?
> >
> >
> > Isabel
>


[jira] [Resolved] (MAHOUT-1256) Improve the CSV handling code to get vectors

2013-12-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1256.
-

   Resolution: Won't Fix
Fix Version/s: 0.9

Too vague, have a patch lying around to improve reading CSV files as vectors, 
but it's untested and quite hacky.
Since nobody wanted this and we have an upcoming release, dropping this 
completely.

> Improve the CSV handling code to get vectors
> 
>
> Key: MAHOUT-1256
> URL: https://issues.apache.org/jira/browse/MAHOUT-1256
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
>Priority: Minor
> Fix For: 0.9
>
>
> Minor additions to iterate through a CSV file directly (as long as it's only 
> numbers).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Work stopped] (MAHOUT-1256) Improve the CSV handling code to get vectors

2013-12-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1256 stopped by Dan Filimon.

> Improve the CSV handling code to get vectors
> 
>
> Key: MAHOUT-1256
> URL: https://issues.apache.org/jira/browse/MAHOUT-1256
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>    Assignee: Dan Filimon
>Priority: Minor
>
> Minor additions to iterate through a CSV file directly (as long as it's only 
> numbers).





[jira] [Work started] (MAHOUT-1256) Improve the CSV handling code to get vectors

2013-12-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1256 started by Dan Filimon.

> Improve the CSV handling code to get vectors
> 
>
> Key: MAHOUT-1256
> URL: https://issues.apache.org/jira/browse/MAHOUT-1256
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>    Assignee: Dan Filimon
>Priority: Minor
>
> Minor additions to iterate through a CSV file directly (as long as it's only 
> numbers).





[jira] [Commented] (MAHOUT-1312) LocalitySensitiveHashSearch.search does not respect search result limit

2013-08-13 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738200#comment-13738200
 ] 

Dan Filimon commented on MAHOUT-1312:
-

Thanks for pointing this out! I'll have a look at it.
Do you have a reproducible example / suggestions of where this happened?

> LocalitySensitiveHashSearch.search does not respect search result limit
> ---
>
> Key: MAHOUT-1312
> URL: https://issues.apache.org/jira/browse/MAHOUT-1312
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>Reporter: Stevo Slavic
>Assignee: Dan Filimon
>Priority: Minor
>
> According to the documented {{org.apache.mahout.math.neighborhood.Searcher}}
> {{public abstract List<WeightedThing<Vector>> search(Vector query, int
> limit)}} contract, {{limit}} should be the number of results to return.
> {{LocalitySensitiveHashSearch}} implements {{Searcher}} but does not respect 
> that contract, as it can return more results than the given limit.
> This issue was encountered while debugging MAHOUT-1302.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


IntWritable vs VarIntWritable (same for Long)

2013-06-20 Thread Dan Filimon
Hi!

Can anyone explain what the difference between IntWritable and
VarIntWritable is?
When should I use each?
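
For readers with the same question, here is a minimal sketch of the general idea, assuming VarIntWritable uses a variable-length encoding as its name suggests, while a plain int is written as a fixed 4 bytes. This is not Mahout's or Hadoop's code; the class and method names are hypothetical, and the real classes may differ in detail (for example in how negative values are handled).

```java
import java.nio.ByteBuffer;

/** Compares a fixed 4-byte int encoding with a simple variable-length one. */
final class VarIntSketch {

    // Fixed-width encoding: always 4 bytes, whatever the value.
    static int fixedSize(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).position();
    }

    // Variable-length encoding: 7 payload bits per byte, high bit set while more bytes follow.
    // The value is treated as unsigned, so negative ints take the full 5 bytes.
    static byte[] encodeVarInt(int value) {
        byte[] buffer = new byte[5];
        int length = 0;
        while ((value & ~0x7F) != 0) {
            buffer[length++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buffer[length++] = (byte) value;
        return java.util.Arrays.copyOf(buffer, length);
    }

    public static void main(String[] args) {
        for (int value : new int[] {5, 300, 1_000_000, Integer.MAX_VALUE}) {
            System.out.println(value + ": fixed=" + fixedSize(value)
                + " bytes, variable=" + encodeVarInt(value).length + " bytes");
        }
    }
}
```

Under these assumptions the trade-off is that small values get smaller on disk and on the wire, while large values can take one byte more than the fixed form.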


Re: Vectors with 64bit indices?

2013-06-19 Thread Dan Filimon
I understand that it might not be worth it to have all vectors have 64bit
indices all the time.

My use case is very sparse vectors, that have IPs in them for example.
A particular vector will likely have at most 20 IPs, but since Java doesn't
have unsigned types, even IPs need to be hashed (and they're 32bit! :).

As for actual hashes, for the data I'm working with, I see 0.4% of the data
being lost because of collisions. Granted, this is not much, but there's
also not that much data...
It may also be true that the hash function I'm using isn't the best (Java's
hashCode()). On an unrelated note, any suggestions for a better one?

Sean, is this structure you're using available anywhere?

What I'm proposing would not change all the Vectors (at least not at once).
For instance, I'm thinking of a SequentialAccessSparseLongVector class. This
would be straightforward, as we only need an OrderedLongDoubleMapping
(which could be templatized like other math containers) and a long size.
It could still support the same interface as Vector but add a few more
functions... This would be my hack. :)

As for a more long-term solution, perhaps also templatizing Vector to
support ints and longs and have the primitive classes generated at compile
time?
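
The long-indexed vector idea can be sketched roughly as follows: two parallel arrays kept sorted by index, plus a long cardinality. This is an illustration only, not an existing Mahout class; every name in it is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sequential-access sparse vector keyed by long indices. */
final class LongSparseVectorSketch {
    private long[] indices = new long[4];     // kept sorted in ascending order
    private double[] values = new double[4];  // values[i] belongs to indices[i]
    private int numMappings = 0;
    private final long size;                  // logical cardinality, may exceed Integer.MAX_VALUE

    LongSparseVectorSketch(long size) {
        this.size = size;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + " outside [0, " + size + ")");
        }
    }

    // Binary search over the sorted indices; absent entries read as 0.0.
    double get(long index) {
        checkIndex(index);
        int pos = Arrays.binarySearch(indices, 0, numMappings, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    // Overwrites an existing entry or inserts a new one at its sorted position.
    void set(long index, double value) {
        checkIndex(index);
        int pos = Arrays.binarySearch(indices, 0, numMappings, index);
        if (pos >= 0) {
            values[pos] = value;
            return;
        }
        int insertAt = -(pos + 1);
        if (numMappings == indices.length) {
            indices = Arrays.copyOf(indices, numMappings * 2);
            values = Arrays.copyOf(values, numMappings * 2);
        }
        System.arraycopy(indices, insertAt, indices, insertAt + 1, numMappings - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, numMappings - insertAt);
        indices[insertAt] = index;
        values[insertAt] = value;
        numMappings++;
    }

    int getNumMappings() {
        return numMappings;
    }

    public static void main(String[] args) {
        LongSparseVectorSketch vector = new LongSparseVectorSketch(1L << 32);
        // An IPv4 address used directly as an index, with no hashing needed.
        long ip = Integer.toUnsignedLong(0xC0A80001); // 192.168.0.1
        vector.set(ip, 1.0);
        System.out.println(vector.get(ip) + " stored in " + vector.getNumMappings() + " mapping(s)");
    }
}
```

Each stored entry costs 8 bytes for the index plus 8 for the value, which matches the 16 bytes per non-zero entry mentioned in the reply quoted below.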


On Wed, Jun 19, 2013 at 9:22 PM, Jake Mannix wrote:

> long keys are super useful for rows in a matrix (ids for documents), and
> basically free in terms of memory (only one per document), but then for
> symmetry we really do need them in the columns (keying on e.g. termId),
> which is a not-insubstantial cost, but possibly worth it.
>
> Our vectors would be (16* numNonZeroEntries) bytes in footprint.  That's
> pretty hefty, but not too much more than 12.
>
> There are arguments that most of the time, we don't need double values
> either.  Sometimes, we don't need values at all (boolean data), but we
> could certainly have special-purpose Vectors which carry no values and yet
> return 1d for when the key is present.
>
> But changing over all of our keys to long is a pretty big change.  Is it
> worth it?
>
>
> On Wed, Jun 19, 2013 at 10:25 AM, Sean Owen wrote:
>
> > I use 64-bit keys for vector-like data structures, and indeed you may
> > pay a cost in extra RAM, but it has a lot of benefits in simplicity
> > mostly, and making the probability of hash collisions ignorable even
> > at huge scale. I think it's worthwhile overall.
> >
> > On Wed, Jun 19, 2013 at 6:16 PM, Robin Anil wrote:
> > > 
> > > Which joker thought of removing uint from Java?
> > > 
> > >
> > > Dan, the cost of moving to 64 bit for the index is extra RAM usage. My
> > > experiments show that 32 bits is enough to hash down billions of
> > > features. Do we ever need such Quadrillions of features? Can Machine
> > > learning truly work at that scale? Think about these.
> > >
> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >
> > >
> > > On Wed, Jun 19, 2013 at 5:16 AM, Dan Filimon
> > > <dangeorge.fili...@gmail.com> wrote:
> > >
> > >> Also, this is particularly problematic because indices can't be
> > >> negative so only 2^31 elements are actually possible.
> > >>
> > >>
> > >> On Wed, Jun 19, 2013 at 1:15 PM, Dan Filimon
> > >> <dangeorge.fili...@gmail.com> wrote:
> > >>
> > >> > Hi everyone!
> > >> >
> > >> > The current Vector API only supports 32bit maximum indices for
> > >> > Vectors.
> > >> >
> > >> > I feel that 64bits would be more appropriate especially because the
> > >> > indices are likely to be hash values of other data and 32bit will
> > >> > result in quite a few collisions.
> > >> >
> > >> > Also, for some jobs, notably ItemSimilarityJob, this restriction
> > >> > means that we need a special id to index map where we'll collide
> > >> > anyway.
> > >> >
> > >> > What do you think about adding support for 64bit indices?
> > >> > Is anyone at all interested?
> > >> >
> > >>
> >
>
>
>
> --
>
>   -jake
>


Re: Vectors with 64bit indices?

2013-06-19 Thread Dan Filimon
Also, this is particularly problematic because indices can't be negative so
only 2^31 elements are actually possible.


On Wed, Jun 19, 2013 at 1:15 PM, Dan Filimon wrote:

> Hi everyone!
>
> The current Vector API only supports 32bit maximum indices for Vectors.
>
> I feel that 64bits would be more appropriate especially because the
> indices are likely to be hash values of other data and 32bit will result in
> quite a few collisions.
>
> Also, for some jobs, notably ItemSimilarityJob, this restriction means
> that we need a special id to index map where we'll collide anyway.
>
> What do you think about adding support for 64bit indices?
> Is anyone at all interested?
>


Vectors with 64bit indices?

2013-06-19 Thread Dan Filimon
Hi everyone!

The current Vector API only supports 32bit maximum indices for Vectors.

I feel that 64bits would be more appropriate especially because the indices
are likely to be hash values of other data and 32bit will result in quite a
few collisions.

Also, for some jobs, notably ItemSimilarityJob, this restriction means that
we need a special id to index map where we'll collide anyway.

What do you think about adding support for 64bit indices?
Is anyone at all interested?


Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Dan Filimon
I think you can get what you need through the --maxPrefsForUser flag.
For any user with more preferences than that, only a random sample of that
size is kept.
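
One common way to keep a uniform random sample of fixed size from a stream of preferences is reservoir sampling. The sketch below illustrates that technique only; it is not taken from the Mahout job, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Caps a user's preference list at maxPrefs entries using reservoir sampling. */
final class PrefSamplerSketch {

    // Every input item ends up in the result with equal probability (Algorithm R).
    static <T> List<T> sample(Iterable<T> prefs, int maxPrefs, Random random) {
        List<T> reservoir = new ArrayList<>(maxPrefs);
        int seen = 0;
        for (T pref : prefs) {
            seen++;
            if (reservoir.size() < maxPrefs) {
                // Fill phase: the first maxPrefs items are always kept.
                reservoir.add(pref);
            } else {
                // Replace phase: item number 'seen' survives with probability maxPrefs / seen.
                int slot = random.nextInt(seen);
                if (slot < maxPrefs) {
                    reservoir.set(slot, pref);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> history = new ArrayList<>();
        for (int item = 0; item < 100_000; item++) {
            history.add(item);
        }
        List<Integer> kept = sample(history, 300, new Random(42));
        System.out.println("kept " + kept.size() + " of " + history.size() + " items");
    }
}
```

Users with fewer than maxPrefs items pass through unchanged, so only the unusually prolific ones are affected.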



On Jun 18, 2013, at 23:27, Ted Dunning wrote:

> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
> 
> The issue is that if a single user has 100,000 items in their history, we
> learn nothing more than if we picked 300 of those while the former would
> result in processing 10 billion cooccurrences and the latter would result
> in roughly 100,000.  This factor of about 100,000 is so large that it can
> make a big difference in performance.
> 
> I had thought that the code had this down-sampling in place.
> 
> If not, I can add row based down-sampling quite easily.


[jira] [Commented] (MAHOUT-1254) Final round of cleanup for StreamingKMeans

2013-06-16 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684772#comment-13684772
 ] 

Dan Filimon commented on MAHOUT-1254:
-

I fixed that, in another commit. I broke the build when applying a git
patch. :(
Thank you for the header!





> Final round of cleanup for StreamingKMeans
> --
>
> Key: MAHOUT-1254
> URL: https://issues.apache.org/jira/browse/MAHOUT-1254
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Assignee: Dan Filimon
> Fix For: 0.8
>
> Attachments: skm.patch
>
>
> Did a bit of tweaking on StreamingKMeans, driver, mapper and reducer to share 
> more code and make it nicer.
> Need to put this in.



[jira] [Resolved] (MAHOUT-1255) Change BallKMeans weighting to use log(weight)

2013-06-16 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1255.
-

   Resolution: Fixed
Fix Version/s: 0.8

> Change BallKMeans weighting to use log(weight)
> --
>
> Key: MAHOUT-1255
> URL: https://issues.apache.org/jira/browse/MAHOUT-1255
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>    Assignee: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
>
> Weirdness is happening in the reducer when doing k-means++ sampling by 
> weights.
> Change from multiplying the probability with the weight to multiplying by 2 * 
> log(weight).



[jira] [Commented] (MAHOUT-1255) Change BallKMeans weighting to use log(weight)

2013-06-16 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684694#comment-13684694
 ] 

Dan Filimon commented on MAHOUT-1255:
-

Might be something else, but I've been unable to track down what exactly
over the weekend.
So, I'm committing this.

> Change BallKMeans weighting to use log(weight)
> --
>
> Key: MAHOUT-1255
> URL: https://issues.apache.org/jira/browse/MAHOUT-1255
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
>Priority: Minor
>
> Weirdness is happening in the reducer when doing k-means++ sampling by 
> weights.
> Change from multiplying the probability with the weight to multiplying by 2 * 
> log(weight).



[jira] [Resolved] (MAHOUT-1254) Final round of cleanup for StreamingKMeans

2013-06-16 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1254.
-

   Resolution: Fixed
Fix Version/s: 0.8

> Final round of cleanup for StreamingKMeans
> --
>
> Key: MAHOUT-1254
> URL: https://issues.apache.org/jira/browse/MAHOUT-1254
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>    Assignee: Dan Filimon
> Fix For: 0.8
>
> Attachments: skm.patch
>
>
> Did a bit of tweaking on StreamingKMeans, driver, mapper and reducer to share 
> more code and make it nicer.
> Need to put this in.



[jira] [Created] (MAHOUT-1261) TasteHadoopUtils.idToIndex can return an int that has size Integer.MAX_VALUE

2013-06-13 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1261:
---

 Summary: TasteHadoopUtils.idToIndex can return an int that has 
size Integer.MAX_VALUE
 Key: MAHOUT-1261
 URL: https://issues.apache.org/jira/browse/MAHOUT-1261
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering
Affects Versions: 0.8
Reporter: Dan Filimon
Assignee: Sean Owen
Priority: Minor


I'm running ItemSimilarityJob on a very large (~600M by 4B) matrix that's very 
sparse (total set of associations is 630MB).

The job fails because of an IndexException in ToUserVectorsReducer.
TasteHadoopUtils.idToIndex(long id) hashes a long with:
0x7FFFFFFF & Longs.hashCode(id) (line 
o.a.m.cf.taste.hadoop.TasteHadoopUtils:57).

For some id (I don't know what value), the result returned is Integer.MAX_VALUE.
This cannot be set in the userVector because the cardinality of that is also 
Integer.MAX_VALUE and it throws an exception.

So, the issue is that values from 0 to INT_MAX are returned by idToIndex but 
the vector only has 0 to INT_MAX - 1 possible entries.
It's a nasty little off-by-one bug.

I'm thinking of just % size when setting.

[~ssc] & everyone else, thoughts? :)
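
The off-by-one can be reproduced with a small sketch. It assumes the mask is 0x7FFFFFFF, which is what a result of Integer.MAX_VALUE implies, and it uses Long.hashCode from the JDK, which folds the two 32-bit halves with XOR in the same way as Guava's Longs.hashCode. The class and the boundedIndex helper are hypothetical; the helper shows the "% size" idea from the report, not a committed fix.

```java
/** Reproduces the off-by-one between the hashed index range and the vector cardinality. */
final class IdToIndexSketch {

    // Same shape as the hashing described above: results lie in [0, Integer.MAX_VALUE].
    static int idToIndex(long id) {
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    // Proposed workaround: reduce modulo the cardinality so the result lies in [0, cardinality).
    static int boundedIndex(long id, int cardinality) {
        return idToIndex(id) % cardinality;
    }

    public static void main(String[] args) {
        long id = Integer.MAX_VALUE; // one id that hashes to the top of the range
        int cardinality = Integer.MAX_VALUE; // valid indices are 0 .. cardinality - 1
        System.out.println("raw index in range:     " + (idToIndex(id) < cardinality));
        System.out.println("bounded index in range: " + (boundedIndex(id, cardinality) < cardinality));
    }
}
```

The modulo maps the single out-of-range value onto index 0, so it adds one more possible collision there, which seems a small price next to a failed job.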



[jira] [Updated] (MAHOUT-1254) Final round of cleanup for StreamingKMeans

2013-06-12 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1254:


Attachment: skm.patch

This should go on RB, but I can't get it uploaded (it's from SVN).

> Final round of cleanup for StreamingKMeans
> --
>
> Key: MAHOUT-1254
> URL: https://issues.apache.org/jira/browse/MAHOUT-1254
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
> Attachments: skm.patch
>
>
> Did a bit of tweaking on StreamingKMeans, driver, mapper and reducer to share 
> more code and make it nicer.
> Need to put this in.



Re: 0.8 progress

2013-06-12 Thread Dan Filimon
It turns out that my initial estimate of the time it takes to finish these
issues was overly optimistic.
I'm squashed between work and writing my thesis and unforeseen merging
issues.

So, I hate to say this, but could we please postpone this release till
Monday?


On Wed, Jun 12, 2013 at 1:11 PM, Grant Ingersoll wrote:

> Sounds good.
>
> On Jun 11, 2013, at 4:36 PM, Dan Filimon wrote:
>
> > Sorry to rain on everyone's party, but I opened a few more issues I need
> > to take care of before 0.8 final that I had forgotten about.
> > M-1253 to M-1256.
> >
> > I have code for all of these (that I tested, incidentally, that's the
> > code I used for the experiments in the talk :), just need to merge it in
> > and I wanted to have issues to mark as done to keep track of things.
> >
> > Should not take long and I should be done by Thursday.
> > Also, would anyone like to review the code on ReviewBoard? :)
> >
> >
> > On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll wrote:
> >
> >> I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
> >> Thursday, I can roll an RC on Thursday.
> >>
> >> -Grant
> >>
> >> On Jun 11, 2013, at 8:56 AM, Grant Ingersoll wrote:
> >>
> >>> Down to 4 issues!  I would say what they are, but JIRA is flaking out
> >>> again.
> >>>
> >>> My instinct is that 1030 and 1233 can be pushed.  Suneel has been
> >>> working hard to get M-833 in.  Not sure on M-1214, Robin?
> >>>
> >>> -G
> >>>
> >>> On Jun 9, 2013, at 6:10 PM, Grant Ingersoll wrote:
> >>>
> >>>>
> >>>> On Jun 9, 2013, at 6:02 PM, Grant Ingersoll wrote:
> >>>>>>
> >>>>>> M-1067 -- Dmitriy  --  This is an enhancement, should we push?
> >>>>
> >>>> Looks like this was committed already.
> >>>>
> >>>
> >>>
> >>
> >> 
> >> Grant Ingersoll | @gsingers
> >> http://www.lucidworks.com
> >>
> >>
> >>
> >>
> >>
> >>
>
> 
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
>
>
>
>


Reviewboard patches from SVN

2013-06-12 Thread Dan Filimon
I'm having trouble uploading SVN-formatted patches.

The issue in question is:
https://issues.apache.org/jira/browse/MAHOUT-1253

My comment was:
"
I'm trying to add this patch to ReviewBoard.

One of the changes is editing driver.classes.props.
The problem is that this file is being changed. When generating the patch,
the revision it wrote was 1491936.

Then when submitting it to ReviewBoard, it fails saying there's no file
with that revision number.

I then ran 'svn update' and the revision number increased. I thought this
was because my repo was stale and tried again, editing the patch by hand.

Still failed at revision 1492089.
I edited it again and now it's at revision 1492090.

Needless to say, it won't ever upload the patch and it seems to keep
increasing the revision number.

ReviewBoard seems to be doing something it shouldn't, always updating the
repo in some way.

This isn't a problem with the mahout-git repo though.
"

Is anyone else experiencing issues with this?
Also, could anyone check what ReviewBoard is doing? This seems quite wrong.


[jira] [Updated] (MAHOUT-1253) Add experiment tools for StreamingKMeans

2013-06-12 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1253:


Attachment: exp.patch2

I'm trying to add this patch to ReviewBoard.

One of the changes is editing driver.classes.props.
The problem is that this file is being changed. When generating the patch, the 
revision it wrote was 1491936.

Then when submitting it to ReviewBoard, it fails saying there's no file with 
that revision number.

I then ran 'svn update' and the revision number increased. I thought this was 
because my repo was stale and tried again, editing the patch by hand.

Still failed at revision 1492089.
I edited it again and now it's at revision 1492090.

Needless to say, it won't ever upload the patch and it seems to keep increasing 
the revision number.

ReviewBoard seems to be doing something it shouldn't, always updating the repo 
in some way.

This isn't a problem with the mahout-git repo though.

> Add experiment tools for StreamingKMeans
> 
>
> Key: MAHOUT-1253
> URL: https://issues.apache.org/jira/browse/MAHOUT-1253
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Assignee: Dan Filimon
> Attachments: exp.patch2
>
>
> Merge in this patch https://reviews.apache.org/r/11302/



Re: 0.8 progress

2013-06-11 Thread Dan Filimon
Sorry to rain on everyone's party, but I opened a few more issues I need to
take care of before 0.8 final that I had forgotten about.
M-1253 to M-1256.

I have code for all of these (that I tested, incidentally, that's the code
I used for the experiments in the talk :), just need to merge it in and I
wanted to have issues to mark as done to keep track of things.

Should not take long and I should be done by Thursday.
Also, would anyone like to review the code on ReviewBoard? :)


On Tue, Jun 11, 2013 at 5:09 PM, Grant Ingersoll wrote:

> I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by
> Thursday, I can roll an RC on Thursday.
>
> -Grant
>
> On Jun 11, 2013, at 8:56 AM, Grant Ingersoll wrote:
>
> > Down to 4 issues!  I would say what they are, but JIRA is flaking out
> > again.
> >
> > My instinct is that 1030 and 1233 can be pushed.  Suneel has been
> > working hard to get M-833 in.  Not sure on M-1214, Robin?
> >
> > -G
> >
> > On Jun 9, 2013, at 6:10 PM, Grant Ingersoll wrote:
> >
> >>
> >> On Jun 9, 2013, at 6:02 PM, Grant Ingersoll wrote:
> >>>>
> >>>> M-1067 -- Dmitriy  --  This is an enhancement, should we push?
> >>
> >> Looks like this was committed already.
> >>
> >
> >
>
> 
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
>
>
>
>


[jira] [Created] (MAHOUT-1256) Improve the CSV handling code to get vectors

2013-06-11 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1256:
---

 Summary: Improve the CSV handling code to get vectors
 Key: MAHOUT-1256
 URL: https://issues.apache.org/jira/browse/MAHOUT-1256
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dan Filimon
Assignee: Dan Filimon
Priority: Minor


Minor additions to iterate through a CSV file directly (as long as it's only 
numbers).



[jira] [Created] (MAHOUT-1255) Change BallKMeans weighting to use log(weight)

2013-06-11 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1255:
---

 Summary: Change BallKMeans weighting to use log(weight)
 Key: MAHOUT-1255
 URL: https://issues.apache.org/jira/browse/MAHOUT-1255
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dan Filimon
Assignee: Dan Filimon
Priority: Minor


Weirdness is happening in the reducer when doing k-means++ sampling by weights.

Change from multiplying the probability with the weight to multiplying by 2 * 
log(weight).
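
One reading of this change is sketched below: the quantity called "the probability" is scaled either by the raw weight or by 2 * log(weight) before sampling. The names are hypothetical and the natural logarithm is an assumption; this is not the BallKMeans code itself.

```java
/** Contrasts linear and logarithmic weighting of a sampling score. */
final class SeedWeightingSketch {

    // Scheme described as the old behaviour: the score grows linearly with the weight.
    static double linearScore(double probability, double weight) {
        return probability * weight;
    }

    // Scheme described as the new behaviour: heavy points are damped by 2 * log(weight).
    // Natural log is an assumption; a weight of exactly 1 yields a score of 0.
    static double logScore(double probability, double weight) {
        return probability * 2 * Math.log(weight);
    }

    public static void main(String[] args) {
        double probability = 0.5;
        for (double weight : new double[] {2, 100, 10_000}) {
            System.out.println("weight " + weight
                + ": linear=" + linearScore(probability, weight)
                + ", log=" + logScore(probability, weight));
        }
    }
}
```

With the log form, a centroid that summarizes thousands of points is still favoured over a light one, but by a much smaller margin.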



[jira] [Created] (MAHOUT-1254) Final round of cleanup for StreamingKMeans

2013-06-11 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1254:
---

 Summary: Final round of cleanup for StreamingKMeans
 Key: MAHOUT-1254
 URL: https://issues.apache.org/jira/browse/MAHOUT-1254
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dan Filimon
Assignee: Dan Filimon


Did a bit of tweaking on StreamingKMeans, driver, mapper and reducer to share 
more code and make it nicer.

Need to put this in.



[jira] [Created] (MAHOUT-1253) Add experiment tools for StreamingKMeans

2013-06-11 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1253:
---

 Summary: Add experiment tools for StreamingKMeans
 Key: MAHOUT-1253
 URL: https://issues.apache.org/jira/browse/MAHOUT-1253
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Dan Filimon
Assignee: Dan Filimon


Merge in this patch https://reviews.apache.org/r/11302/




Re: Welcome new committers Gokhan Capan and Stevo Slavic

2013-06-10 Thread Dan Filimon
Congratulations to the both of you! :)
It's great to have you on board!


On Tue, Jun 11, 2013 at 3:58 AM, Stevo Slavić wrote:

> Thanks Grant, Suneel and rest of the team,
>
> I'm a Java software developer and OSS enthusiast from Serbia with 7 years
> of professional experience in IT industry.
> Together with teams I've been part of, I have designed, built and
> successfully delivered multiple applications and websites from various
> business domains (online media, e-government, telecommunications,
> e-commerce). In both small and large enterprise scale apps, open source
> technologies and communities around them were and remain to be one of the
> key components and ingredients for success.
>
> It's always a great pleasure for me to give back to OSS projects that I
> use, through submitting patches or just being good community member.
> So far I've contributed to and been involved the most on Spring framework
> and other associated projects from the Spring portfolio.
>
> Back in April last year I rediscovered my passion and interest in machine
> learning, AI and computer science in general through prof. Andrew Ng's
> Coursera machine learning MOOC, which I successfully completed. Going from
> ML theory to practice, through the mist of Big Data hype, led me to the
> greatness of the Apache Mahout project.
>
> You all do me great honor by accepting me into the team, team of
> exceptional individuals yet great team players, with such positive and
> creative atmosphere.
> My contributions to the project so far were rather limited, and in near
> future they are likely to remain so as I still have lots to learn first.
> At least in the beginning, more than anything else I expect that I'll be
> able to contribute to the project by making it even more approachable to
> general audience of IT practitioners like myself through actively promoting
> it, supporting users on the mailing list to my best, and working on the
> documentation. Level of commitment will surely increase with time.
>
> I thank you all once more for this wonderful opportunity, and wish us and
> the project lots of success!
>
> Kind regards,
> Stevo Slavic.
>
>
> On Tue, Jun 11, 2013 at 1:10 AM, Suneel Marthi wrote:
>
> > Congrats Gokhan and Stevo!!
> >
> >
> >
> >
> > 
> >  From: Grant Ingersoll 
> > To: "dev@mahout.apache.org" 
> > Sent: Monday, June 10, 2013 5:04 PM
> > Subject: Welcome new committers Gokhan Capan and Stevo Slavic
> >
> >
> > Please join me in congratulating Mahout's newest committers, Gokhan Capan
> > and Stevo Slavic, both of whom have been contributing to Mahout for some
> > time now.
> >
> > Gokhan, Stevo, new committer tradition is to give a brief background on
> > yourself, so you have the floor!
> >
> > Congrats,
> > Grant
> >
>


Re: Clustering Examples

2013-06-10 Thread Dan Filimon
Yes, in fact I have quite a few example classes that I was meaning to
integrate. :)
I'll get to it this evening or tomorrow (my time).


On Mon, Jun 10, 2013 at 8:07 AM, Robin Anil  wrote:

>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Jun 9, 2013 at 4:18 PM, Grant Ingersoll wrote:
>
>> Dan,
>>
>> Any chance we can get the Streaming K-Means stuff worked into the
>> examples (cluster-reuters, etc.) before 0.8?
>>
>> -Grant
>>
>
>


[jira] [Commented] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679108#comment-13679108
 ] 

Dan Filimon commented on MAHOUT-1211:
-

I'm also fixing the Closeables.close(writer, true) calls. [~smarthi] said that 
for writers, these should be "false" to not swallow the exception.

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Stevo Slavic
>Assignee: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch
>
>
> Deprecated Guava {{Closeables.closeQuietly}} API has to be replaced; its 
> usage is a code smell, and that method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon reassigned MAHOUT-1211:
---

Assignee: Dan Filimon  (was: Grant Ingersoll)

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Stevo Slavic
>    Assignee: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch
>
>
> Deprecated Guava {{Closeables.closeQuietly}} API has to be replaced; its 
> usage is a code smell, and that method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.



Re: Suggested 0.8 Code Freeze Date

2013-06-03 Thread Dan Filimon
+1


On Jun 3, 2013, at 0:26, Grant Ingersoll  wrote:

> I'd like to suggest a code freeze of June 10th 2013 for finishing 0.8 bugs.
> 
> If they aren't in by then, they will get pushed, unless they are blockers.
> 
> After that, I will create the release candidates.
> 
> -Grant


[jira] [Assigned] (MAHOUT-958) NullPointerException in RepresentativePointsMapper when running cluster-reuters.sh example with kmeans

2013-06-02 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon reassigned MAHOUT-958:
--

Assignee: Dan Filimon  (was: Grant Ingersoll)

> NullPointerException in RepresentativePointsMapper when running 
> cluster-reuters.sh example with kmeans
> --
>
> Key: MAHOUT-958
> URL: https://issues.apache.org/jira/browse/MAHOUT-958
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.6
> Environment: {code}
> > uname -a
> Linux 3.2.1-3.fc16.x86_64 #1 SMP Mon Jan 23 15:36:17 UTC 2012 x86_64 x86_64 
> x86_64 GNU/Linux
> {code}
> {code}
> > java -version
> java version "1.7.0_02"
> Java(TM) SE Runtime Environment (build 1.7.0_02-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 22.0-b10, mixed mode)
> {code}
> Hadoop Version: 0.20.203.0, r1099333
>Reporter: Rares Vernica
>Assignee: Dan Filimon
> Fix For: 0.8
>
> Attachments: MAHOUT-958.patch
>
>
> {code}
> > svn info
> Path: .
> URL: http://svn.apache.org/repos/asf/mahout/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 1235544
> Node Kind: directory
> Schedule: normal
> Last Changed Author: tdunning
> Last Changed Rev: 1231800
> Last Changed Date: 2012-01-15 16:01:38 -0800 (Sun, 15 Jan 2012)
> {code}
> {code}
> > ./examples/bin/cluster-reuters.sh
> ...
> 1. kmeans clustering
> ...
> Inter-Cluster Density: NaN
> Intra-Cluster Density: 0.0
> CDbw Inter-Cluster Density: 0.0
> CDbw Intra-Cluster Density: NaN
> CDbw Separation: 0.0
> 12/01/24 16:08:47 INFO clustering.ClusterDumper: Wrote 20 clusters
> 12/01/24 16:08:47 INFO driver.MahoutDriver: Program took 126749 ms (Minutes: 
> 2.11248335)
> {code}
> All five "{{Representative Points Driver}}" jobs fail.
> {code}
> 2012-01-24 16:07:11,555 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded 
> the native-hadoop library
> 2012-01-24 16:07:11,881 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: data buffer = 
> 79691776/99614720
> 2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: record buffer 
> = 262144/327680
> 2012-01-24 16:07:11,956 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName vernica for UID 1000 from the native implementation
> 2012-01-24 16:07:11,981 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.lang.NullPointerException
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.mapPoint(RepresentativePointsMapper.java:73)
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:60)
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:40)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>   at org.apache.hadoop.mapred.Child.main(Child.java:253)
> {code}



[jira] [Resolved] (MAHOUT-1237) Total cost isn't computed properly

2013-06-02 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1237.
-

Resolution: Fixed

Committed revision 1488777.

> Total cost isn't computed properly
> --
>
> Key: MAHOUT-1237
> URL: https://issues.apache.org/jira/browse/MAHOUT-1237
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Assignee: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
>
> The problem is that it adds up cluster weights instead of computing the sum 
> of all the distances.



[jira] [Created] (MAHOUT-1237) Total cost isn't computed properly

2013-06-02 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1237:
---

 Summary: Total cost isn't computed properly
 Key: MAHOUT-1237
 URL: https://issues.apache.org/jira/browse/MAHOUT-1237
 Project: Mahout
  Issue Type: Bug
Reporter: Dan Filimon
Assignee: Dan Filimon
Priority: Minor


The problem is that it adds up cluster weights instead of computing the sum of 
all the distances.
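
To make the distinction concrete, here is a minimal Python sketch. It is not Mahout code: the function names, the toy data and the choice of Euclidean distance are all assumptions made for illustration. It contrasts the intended quantity (sum of distances from each point to its nearest centroid) with what the bug effectively reported (sum of cluster weights).

```python
import math

def nearest_distance(point, centroids):
    """Distance from a point to its closest centroid (Euclidean, by assumption)."""
    return min(math.dist(point, c) for c in centroids)

def total_cost(points, centroids):
    """Intended quantity: sum over all points of the distance to the nearest centroid."""
    return sum(nearest_distance(p, centroids) for p in points)

def summed_weights(weights):
    """What the bug effectively computed: the sum of the cluster weights."""
    return sum(weights)

# Hypothetical toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centroids = [(0.0, 1.0), (10.0, 2.0)]
weights = [2, 2]  # number of points assigned to each centroid

cost = total_cost(points, centroids)
weight_sum = summed_weights(weights)
print("total cost:", cost, "sum of weights:", weight_sum)
```

If each weight counts the points assigned to a cluster, the summed weights only recover the number of points, so they say nothing about how tight the clusters are.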



[jira] [Updated] (MAHOUT-1237) Total cost isn't computed properly

2013-06-02 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1237:


Affects Version/s: 0.8

> Total cost isn't computed properly
> --
>
> Key: MAHOUT-1237
> URL: https://issues.apache.org/jira/browse/MAHOUT-1237
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Assignee: Dan Filimon
>Priority: Minor
>
> The problem is that it adds up cluster weights instead of computing the sum 
> of all the distances.



[jira] [Resolved] (MAHOUT-1224) Add the option of running a StreamingKMeans pass in the Reducer before BallKMeans

2013-06-02 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1224.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1488766.

> Add the option of running a StreamingKMeans pass in the Reducer before 
> BallKMeans
> -
>
> Key: MAHOUT-1224
> URL: https://issues.apache.org/jira/browse/MAHOUT-1224
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Assignee: Dan Filimon
> Fix For: 0.8
>
>
> Sometimes, the number of points passed to the reducer from the mappers in the 
> StreamingKMeansDriver job is too large to fit into memory.
> In that case, applying another StreamingKMeans pass can collapse the mapper 
> intermediate clusters to a more manageable size to be clustered.



[jira] [Resolved] (MAHOUT-1223) Point skipped in StreamingKMeans when iterating through centroids from a reducer

2013-05-21 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1223.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1484747.

> Point skipped in StreamingKMeans when iterating through centroids from a 
> reducer
> 
>
> Key: MAHOUT-1223
> URL: https://issues.apache.org/jira/browse/MAHOUT-1223
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
>
> When calling StreamingKMeans in the reducer (to collapse the number of 
> clusters so they can fit into memory), the clustering is done on the Hadoop 
> reducer iterable.
> Currently, the first Centroid is added directly as a special case and then is 
> skipped when iterating through the main loop.
> However, Hadoop reducer iterables cannot be rewound, which causes SKM to 
> skip one point.



[jira] [Resolved] (MAHOUT-1222) Fix total weight in FastProjectionSearch

2013-05-21 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1222.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1484697.

> Fix total weight in FastProjectionSearch
> 
>
> Key: MAHOUT-1222
> URL: https://issues.apache.org/jira/browse/MAHOUT-1222
> Project: Mahout
>  Issue Type: Bug
>        Reporter: Dan Filimon
> Fix For: 0.8
>
>
> Sometimes when removing a Vector that's in pendingAdditions, the wrong Vector 
> gets removed.
> This happens because the closest Vector is removed rather than the one that's 
> equal.



[jira] [Commented] (MAHOUT-1223) Point skipped in StreamingKMeans when iterating through centroids from a reducer

2013-05-20 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662084#comment-13662084
 ] 

Dan Filimon commented on MAHOUT-1223:
-

Patch: https://reviews.apache.org/r/11242/

> Point skipped in StreamingKMeans when iterating through centroids from a 
> reducer
> 
>
> Key: MAHOUT-1223
> URL: https://issues.apache.org/jira/browse/MAHOUT-1223
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
>Priority: Minor
>
> When calling StreamingKMeans in the reducer (to collapse the number of 
> clusters so they can fit into memory), the clustering is done on the Hadoop 
> reducer iterable.
> Currently, the first Centroid is added directly as a special case and then is 
> skipped when iterating through the main loop.
> However, Hadoop reducer iterables cannot be rewound, which causes SKM to 
> skip one point.



[jira] [Created] (MAHOUT-1224) Add the option of running a StreamingKMeans pass in the Reducer before BallKMeans

2013-05-20 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1224:
---

 Summary: Add the option of running a StreamingKMeans pass in the 
Reducer before BallKMeans
 Key: MAHOUT-1224
 URL: https://issues.apache.org/jira/browse/MAHOUT-1224
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.8
Reporter: Dan Filimon


Sometimes, the number of points passed to the reducer from the mappers in the 
StreamingKMeansDriver job is too large to fit into memory.

In that case, applying another StreamingKMeans pass can collapse the mapper 
intermediate clusters to a more manageable size to be clustered.



[jira] [Commented] (MAHOUT-1222) Fix total weight in FastProjectionSearch

2013-05-20 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662071#comment-13662071
 ] 

Dan Filimon commented on MAHOUT-1222:
-

https://reviews.apache.org/r/11240/

> Fix total weight in FastProjectionSearch
> 
>
> Key: MAHOUT-1222
> URL: https://issues.apache.org/jira/browse/MAHOUT-1222
> Project: Mahout
>  Issue Type: Bug
>    Reporter: Dan Filimon
>
> Sometimes when removing a Vector that's in pendingAdditions, the wrong Vector 
> gets removed.
> This happens because the closest Vector is removed rather than the one that's 
> equal.



[jira] [Comment Edited] (MAHOUT-1222) Fix total weight in FastProjectionSearch

2013-05-20 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662071#comment-13662071
 ] 

Dan Filimon edited comment on MAHOUT-1222 at 5/20/13 3:40 PM:
--

Patch: https://reviews.apache.org/r/11240/

  was (Author: dfilimon):
https://reviews.apache.org/r/11240/
  
> Fix total weight in FastProjectionSearch
> 
>
> Key: MAHOUT-1222
> URL: https://issues.apache.org/jira/browse/MAHOUT-1222
> Project: Mahout
>  Issue Type: Bug
>    Reporter: Dan Filimon
>
> Sometimes when removing a Vector that's in pendingAdditions, the wrong Vector 
> gets removed.
> This happens because the closest Vector is removed rather than the one that's 
> equal.



[jira] [Created] (MAHOUT-1222) Fix total weight in FastProjectionSearch

2013-05-20 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1222:
---

 Summary: Fix total weight in FastProjectionSearch
 Key: MAHOUT-1222
 URL: https://issues.apache.org/jira/browse/MAHOUT-1222
 Project: Mahout
  Issue Type: Bug
Reporter: Dan Filimon


Sometimes when removing a Vector that's in pendingAdditions, the wrong Vector 
gets removed.
This happens because the closest Vector is removed rather than the one that's 
equal.
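
As an illustration of why "closest" and "equal" can differ, here is a small Python sketch. It is not the FastProjectionSearch code: the helper names are hypothetical, and the tie under cosine distance is only one way the two notions can diverge, since the issue does not say how the mismatch arose in practice.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; vectors pointing the same way are at distance 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def remove_closest(pending, query):
    """Problematic pattern: drop whichever pending vector is nearest to the query."""
    victim = min(pending, key=lambda v: cosine_distance(v, query))
    pending.remove(victim)
    return victim

def remove_equal(pending, query):
    """Safer pattern: drop only a vector that is equal to the query."""
    pending.remove(query)  # list.remove matches by equality; raises ValueError if absent
    return query

# (2, 0) and (1, 0) point the same way, so both are at cosine distance 0 from (1, 0).
buggy = [(2.0, 0.0), (1.0, 0.0)]
fixed = list(buggy)

removed_by_closest = remove_closest(buggy, (1.0, 0.0))
removed_by_equal = remove_equal(fixed, (1.0, 0.0))
print("closest removed:", removed_by_closest, "equal removed:", removed_by_equal)
```

Removing by equality keeps the bookkeeping consistent even when several stored vectors are equally close to the query.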



[jira] [Created] (MAHOUT-1223) Point skipped in StreamingKMeans when iterating through centroids from a reducer

2013-05-20 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1223:
---

 Summary: Point skipped in StreamingKMeans when iterating through 
centroids from a reducer
 Key: MAHOUT-1223
 URL: https://issues.apache.org/jira/browse/MAHOUT-1223
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Dan Filimon
Priority: Minor


When calling StreamingKMeans in the reducer (to collapse the number of clusters 
so they can fit into memory), the clustering is done on the Hadoop reducer 
iterable.

Currently, the first Centroid is added directly as a special case and then is 
skipped when iterating through the main loop.
However, Hadoop reducer iterables cannot be rewound, which causes SKM to 
skip one point.
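
The failure mode can be reproduced outside Hadoop. The Python sketch below is an analogy, not the StreamingKMeans code: `OneShotIterable`, `cluster_buggy` and `cluster_fixed` are made-up names, and the class only mimics an iterable whose iterator cannot be restarted.

```python
from itertools import islice

class OneShotIterable:
    """Mimics an iterable that cannot be rewound: every iter() call
    resumes the same underlying iterator instead of starting over."""
    def __init__(self, items):
        self._it = iter(items)

    def __iter__(self):
        return self._it

def cluster_buggy(values):
    # The first element is handled as a special case ...
    seen = [next(iter(values))]
    # ... and "skip the first element" assumes iteration restarts from the top.
    for v in islice(iter(values), 1, None):
        seen.append(v)
    return seen

def cluster_fixed(values):
    it = iter(values)      # take one iterator and keep using it
    seen = [next(it)]      # first element
    for v in it:           # the rest, with nothing skipped
        seen.append(v)
    return seen

buggy_result = cluster_buggy(OneShotIterable([1, 2, 3, 4]))
fixed_result = cluster_fixed(OneShotIterable([1, 2, 3, 4]))
list_result = cluster_buggy([1, 2, 3, 4])  # a rewindable list hides the bug
print(buggy_result, fixed_result, list_result)
```

With a plain list the special-case-then-skip pattern looks correct, which is presumably why the problem only surfaces on the reducer side.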



[jira] [Commented] (MAHOUT-1224) Add the option of running a StreamingKMeans pass in the Reducer before BallKMeans

2013-05-20 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662088#comment-13662088
 ] 

Dan Filimon commented on MAHOUT-1224:
-

Patch: https://reviews.apache.org/r/11243/

> Add the option of running a StreamingKMeans pass in the Reducer before 
> BallKMeans
> -
>
> Key: MAHOUT-1224
> URL: https://issues.apache.org/jira/browse/MAHOUT-1224
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> Sometimes, the number of points passed to the reducer from the mappers in the 
> StreamingKMeansDriver job is too large to fit into memory.
> In that case, applying another StreamingKMeans pass can collapse the mapper 
> intermediate clusters to a more manageable size to be clustered.



[jira] [Commented] (MAHOUT-1217) Nearest neighbor searchers sometimes fail to remove points

2013-05-20 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661850#comment-13661850
 ] 

Dan Filimon commented on MAHOUT-1217:
-

Awesome, thank you Suneel!

> Nearest neighbor searchers sometimes fail to remove points
> --
>
> Key: MAHOUT-1217
> URL: https://issues.apache.org/jira/browse/MAHOUT-1217
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> When updating a Centroid in StreamingKMeans, the Centroid needs to be removed 
> and its updated version added.
> When removing points in a searcher that are already there, sometimes the 
> searcher fails to return the closest point (the one being searched for) 
> causing a RuntimeException.
> This has been observed for TF-IDF vectors with SquaredEuclideanDistance and 
> CosineDistance and FastProjectionSearch.



[jira] [Resolved] (MAHOUT-1217) Nearest neighbor searchers sometimes fail to remove points

2013-05-20 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1217.
-

   Resolution: Fixed
Fix Version/s: 0.8

> Nearest neighbor searchers sometimes fail to remove points
> --
>
> Key: MAHOUT-1217
> URL: https://issues.apache.org/jira/browse/MAHOUT-1217
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
> Fix For: 0.8
>
>
> When updating a Centroid in StreamingKMeans, the Centroid needs to be removed 
> and its updated version added.
> When removing points in a searcher that are already there, sometimes the 
> searcher fails to return the closest point (the one being searched for) 
> causing a RuntimeException.
> This has been observed for TF-IDF vectors with SquaredEuclideanDistance and 
> CosineDistance and FastProjectionSearch.



[jira] [Resolved] (MAHOUT-1219) LSHSearcher not always faster than BruteSearcher

2013-05-20 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1219.
-

Resolution: Cannot Reproduce

Will probably need to change this searcher implementation, but this is a 
priority for later.

> LSHSearcher not always faster than BruteSearcher
> 
>
> Key: MAHOUT-1219
> URL: https://issues.apache.org/jira/browse/MAHOUT-1219
> Project: Mahout
>  Issue Type: Test
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
>
> This is a known issue and the performance of LocalitySensitiveHashSearch 
> needs to be further investigated.
> Currently, the one "benchmark" that does this, SearchQualityTest, is too 
> variable to be informative.
> So, I'm removing LSHSearcher from SearchQualityTest.



[jira] [Commented] (MAHOUT-1219) LSHSearcher not always faster than BruteSearcher

2013-05-17 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660823#comment-13660823
 ] 

Dan Filimon commented on MAHOUT-1219:
-

The StreamingKMeansTest is now also wonky because of LSH.

> LSHSearcher not always faster than BruteSearcher
> 
>
> Key: MAHOUT-1219
> URL: https://issues.apache.org/jira/browse/MAHOUT-1219
> Project: Mahout
>  Issue Type: Test
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
>
> This is a known issue and the performance of LocalitySensitiveHashSearch 
> needs to be further investigated.
> Currently, the one "benchmark" that does this, SearchQualityTest, is too 
> variable to be informative.
> So, I'm removing LSHSearcher from SearchQualityTest.



[jira] [Created] (MAHOUT-1219) LSHSearcher not always faster than BruteSearcher

2013-05-17 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1219:
---

 Summary: LSHSearcher not always faster than BruteSearcher
 Key: MAHOUT-1219
 URL: https://issues.apache.org/jira/browse/MAHOUT-1219
 Project: Mahout
  Issue Type: Test
Affects Versions: 0.8
Reporter: Dan Filimon
Priority: Minor


This is a known issue and the performance of LocalitySensitiveHashSearch needs 
to be further investigated.
Currently, the one "benchmark" that does this, SearchQualityTest, is too 
variable to be informative.

So, I'm removing LSHSearcher from SearchQualityTest.



[jira] [Commented] (MAHOUT-1217) Nearest neighbor searchers sometimes fail to remove points

2013-05-17 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660533#comment-13660533
 ] 

Dan Filimon commented on MAHOUT-1217:
-

Possible fix: https://reviews.apache.org/r/11219

> Nearest neighbor searchers sometimes fail to remove points
> --
>
> Key: MAHOUT-1217
> URL: https://issues.apache.org/jira/browse/MAHOUT-1217
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> When updating a Centroid in StreamingKMeans, the Centroid needs to be removed 
> and its updated version added.
> When removing points in a searcher that are already there, sometimes the 
> searcher fails to return the closest point (the one being searched for) 
> causing a RuntimeException.
> This has been observed for TF-IDF vectors with SquaredEuclideanDistance and 
> CosineDistance and FastProjectionSearch.



[jira] [Resolved] (MAHOUT-1216) Add locality sensitive hashing and a LocalitySensitiveHash searcher

2013-05-17 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1216.
-

Resolution: Fixed

Committed revision 1483702.

> Add locality sensitive hashing and a LocalitySensitiveHash searcher
> ---
>
> Key: MAHOUT-1216
> URL: https://issues.apache.org/jira/browse/MAHOUT-1216
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>
> This issue tackles the LocalitySensitiveHashSearch, which was initially 
> supposed to be part of MAHOUT-1156.
> It adds HashedVector, the class that adds the LSH to vectors, and a new searcher 
> (although a better implementation is possible), and adds support in the 
> existing tests and new StreamingKMeans infrastructure.



[jira] [Commented] (MAHOUT-1216) Add locality sensitive hashing and a LocalitySensitiveHash searcher

2013-05-17 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660498#comment-13660498
 ] 

Dan Filimon commented on MAHOUT-1216:
-

Committed revision 1483702.

> Add locality sensitive hashing and a LocalitySensitiveHash searcher
> ---
>
> Key: MAHOUT-1216
> URL: https://issues.apache.org/jira/browse/MAHOUT-1216
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> This issue tackles the LocalitySensitiveHashSearch, which was initially 
> supposed to be part of MAHOUT-1156.
> It adds HashedVector, the class that adds the LSH to vectors, and a new searcher 
> (although a better implementation is possible), and adds support in the 
> existing tests and new StreamingKMeans infrastructure.



[jira] [Commented] (MAHOUT-1217) Nearest neighbor searchers sometimes fail to remove points

2013-05-17 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660454#comment-13660454
 ] 

Dan Filimon commented on MAHOUT-1217:
-

So ProjectionSearch is indeed working properly then?

> Nearest neighbor searchers sometimes fail to remove points
> --
>
> Key: MAHOUT-1217
> URL: https://issues.apache.org/jira/browse/MAHOUT-1217
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> When updating a Centroid in StreamingKMeans, the Centroid needs to be removed 
> and its updated version added.
> When removing points in a searcher that are already there, sometimes the 
> searcher fails to return the closest point (the one being searched for) 
> causing a RuntimeException.
> This has been observed for TF-IDF vectors with SquaredEuclideanDistance and 
> CosineDistance and FastProjectionSearch.



[jira] [Commented] (MAHOUT-1218) Streamimg k-means fails when the number of clusters specified is <= estimated map clusters

2013-05-17 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660434#comment-13660434
 ] 

Dan Filimon commented on MAHOUT-1218:
-

This is still an issue I think.

If k is the number of clusters, n the number of points and m is the number of 
mappers, there should be (k log n) clusters after the map phase.
BUT, the meaning of the -km parameter is the number of clusters to generate 
from ONE mapper.

So, if it's set to the recommended (k log n) for each mapper, we'll end up with 
m (k log n) clusters, which is way more than we need. This causes the reducer to 
fail because it can't fit all the points in memory.

For a given mapper, assuming the splits are equal, if there are (n / m) points 
to cluster, the number of clusters it should create is (k log (n / m)).

For the concrete case we were playing with, n = 4*10^6, k = 20, m = 114 (and 
the log is actually ln).

- If we follow the recommendation and set km = k log n for every mapper, 
that's about 304 clusters per mapper, so in total the reducer has to cluster 
about 34660 points.
- If we go for km = k log (n/m) for every mapper, that's about 209 clusters per 
mapper, which gets us about 24K points to cluster at the end.
- If we go even further and suggest km = (k log n)/m for every mapper, that's 
fewer than 3 clusters per mapper. That's probably too few to capture the 
clustering of those points, but chances are it'll still do a good job if km = 
100.
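
To make the arithmetic above easy to re-check, here is a small sketch that evaluates the three candidate km formulas for the concrete case (n = 4*10^6, k = 20, m = 114, natural log). The function names are made up for illustration and are not part of the Mahout driver.

```python
import math


def clusters_per_mapper(k, n, m):
    """Return the three candidate km values discussed above (natural log)."""
    return {
        "k_log_n": k * math.log(n),             # recommended value, per mapper
        "k_log_n_over_m": k * math.log(n / m),  # scaled to the mapper's split
        "k_log_n_div_m": k * math.log(n) / m,   # global budget divided evenly
    }


def reducer_points(km, m):
    """Total centroids reaching the reducer when each of m mappers emits km."""
    return km * m


n, k, m = 4 * 10**6, 20, 114
for name, km in clusters_per_mapper(k, n, m).items():
    print(f"{name}: km ~ {km:.1f}, reducer input ~ {reducer_points(km, m):.0f}")
```

Since km is only an initial estimate, the exact rounding (floor or ceiling) should not matter much in practice.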

And in any case, for any value of km, it is just an initial estimate, so we 
don't care about what _exactly_ the value is.
As we're streaming points through the mapper, the number of clusters gets 
increased to k * log (numProcessedDatapoints).
It might even make sense to have fewer than that many clusters at the beginning 
to ensure there aren't too many collapses.
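
A rough sketch of how the cluster budget could grow while points stream through a mapper, assuming the rule is max(initial km, k * ln(points seen)). The function name and the integer truncation are assumptions for illustration, not the actual StreamingKMeans code.

```python
import math


def target_clusters(initial_km, k, points_seen):
    """Cluster budget after points_seen points (hypothetical helper).

    The budget never drops below the initial estimate and grows like
    k * ln(points_seen) once that quantity exceeds the estimate.
    """
    if points_seen < 1:
        return initial_km
    return max(initial_km, int(k * math.log(points_seen)))
```

With a small initial km the budget is driven by the data almost immediately, while a large initial km only acts as a floor.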

So, I think it's safe to say that the precondition in the driver is a bit 
excessive. Maybe pop up a warning, but I don't think it should prevent the job 
from running. It's not like a small km is violating any assumptions.

Somewhat offtopic:
Looking at the paper again, what is the base of the logarithm exactly? It makes 
a difference in these calculations, and in the code we're using log base e. I 
would have expected the base to be 2, but honestly, since it's just O(k log n), 
the actual base doesn't matter in the asymptotic notation.
Still, what is the right value for the code?
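
For a sense of how much the base matters in practice, this sketch compares k log n in base e and base 2 for the numbers used above. Changing the base only rescales the result by a constant (1 / ln 2, roughly 1.44), which is why it disappears in the O(k log n) notation but still changes the concrete km.

```python
import math

k, n = 20, 4 * 10**6

base_e = k * math.log(n)   # what the code uses, per the comment above
base_2 = k * math.log2(n)  # what one might expect from the paper

# The ratio is 1 / ln(2) for any n > 1, so the choice of base is a constant factor.
ratio = base_2 / base_e
print(f"base e: {base_e:.1f}, base 2: {base_2:.1f}, ratio: {ratio:.4f}")
```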

> Streaming k-means fails when the number of clusters specified is <= estimated 
> map clusters
> --
>
> Key: MAHOUT-1218
> URL: https://issues.apache.org/jira/browse/MAHOUT-1218
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 0.8
>
>
> Running Streaming k-means with CosineDistanceMeasure, Fast Projection Search, 
> number of clusters k = 60, number of estimated map clusters -km = 60.
> {Code}
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid number 
> of estimated map clusters; There must be more than the final number of 
> clusters (k log n vs k)
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.configureOptionsForWorkers(StreamingKMeansDriver.java:327)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.configureOptionsForWorkers(StreamingKMeansDriver.java:280)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:227)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>   at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:472)
> {Code}



[jira] [Created] (MAHOUT-1217) Nearest neighbor searchers sometimes fail to remove points

2013-05-16 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1217:
---

 Summary: Nearest neighbor searchers sometimes fail to remove points
 Key: MAHOUT-1217
 URL: https://issues.apache.org/jira/browse/MAHOUT-1217
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.8
Reporter: Dan Filimon


When updating a Centroid in StreamingKMeans, the Centroid needs to be removed 
and its updated version added.

When removing points in a searcher that are already there, sometimes the 
searcher fails to return the closest point (the one being searched for) causing 
a RuntimeException.

This has been observed for TF-IDF vectors with SquaredEuclideanDistance and 
CosineDistance and FastProjectionSearch.
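
A minimal sketch of the remove-then-add update described above, using an exact brute-force searcher. The class and method names are hypothetical and do not mirror Mahout's Searcher API. With an exact search the lookup always finds the point being removed; the failure reported here would correspond to an approximate searcher returning some other point from search_first.

```python
import math


class BruteSearcher:
    """Hypothetical exact nearest-neighbor searcher over tuples of floats."""

    def __init__(self):
        self.points = []

    def add(self, p):
        self.points.append(tuple(p))

    def search_first(self, q):
        # Exact search: scan every stored point.
        return min(self.points, key=lambda p: math.dist(p, q))

    def remove(self, p):
        closest = self.search_first(p)
        if closest != tuple(p):
            # An approximate searcher can end up here even if p is stored.
            raise RuntimeError("closest point is not the point being removed")
        self.points.remove(closest)


def update_centroid(searcher, old, new):
    """Replace a stored centroid by removing it and adding its new version."""
    searcher.remove(old)
    searcher.add(new)
```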



[jira] [Commented] (MAHOUT-1216) Add locality sensitive hashing and a LocalitySensitiveHash searcher

2013-05-16 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659351#comment-13659351
 ] 

Dan Filimon commented on MAHOUT-1216:
-

Patch available at https://reviews.apache.org/r/11193/

> Add locality sensitive hashing and a LocalitySensitiveHash searcher
> ---
>
> Key: MAHOUT-1216
> URL: https://issues.apache.org/jira/browse/MAHOUT-1216
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> This issue tackles the LocalitySensitiveHashSearch, that was initially 
> supposed to be part of MAHOUT-1156.
> It adds HashedVector, the class that adds the LSH to vectors, a new searcher 
> (although a better implementation is possible) and adds support in the 
> existing tests and new StreamingKMeans infrastructure.



[jira] [Created] (MAHOUT-1216) Add locality sensitive hashing and a LocalitySensitiveHash searcher

2013-05-16 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1216:
---

 Summary: Add locality sensitive hashing and a 
LocalitySensitiveHash searcher
 Key: MAHOUT-1216
 URL: https://issues.apache.org/jira/browse/MAHOUT-1216
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.8
Reporter: Dan Filimon


This issue tackles the LocalitySensitiveHashSearch, that was initially supposed 
to be part of MAHOUT-1156.

It adds HashedVector, the class that adds the LSH to vectors, a new searcher 
(although a better implementation is possible) and adds support in the existing 
tests and new StreamingKMeans infrastructure.
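
For readers unfamiliar with the idea, here is a generic random-hyperplane LSH sketch. It is an assumption about the general technique only and is not claimed to match how HashedVector or the searcher in this patch compute their hashes. All names are made up for illustration.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_hash(vector, hyperplanes):
    """Pack the sign of each projection into one bit of an integer."""
    bits = 0
    for i, plane in enumerate(hyperplanes):
        dot = sum(a * b for a, b in zip(plane, vector))
        if dot > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Vectors separated by a small angle tend to agree on most bits, so a searcher can use the Hamming distance between hashes as a cheap filter before computing real distances.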



[jira] [Resolved] (MAHOUT-1181) Adding StreamingKMeans MapReduce classes

2013-05-15 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1181.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1482907.

> Adding StreamingKMeans MapReduce classes
> 
>
> Key: MAHOUT-1181
> URL: https://issues.apache.org/jira/browse/MAHOUT-1181
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>    Reporter: Dan Filimon
> Fix For: 0.8
>
> Attachments: MAHOUT_1181.patch, MAHOUT_1181_props.patch, 
> MAHOUT_1181_test.patch
>
>
> This patch implements the MapReduce version of StreamingKMeans for 
> MAHOUT-1154.
> It adds 5 new classes:
> - CentroidWritable: class representing a centroid that can be written to a 
> SeqFile
> - StreamingKMeansDriver: class implementing AbstractJob that is the entry 
> point to the mapreduction
> - StreamingKMeansMapper: mapper, running StreamingKMeans (see MAHOUT-1162) 
> clustering the points one by one
> - StreamingKMeansReducer: reducer, running BallKMeans (see MAHOUT-1162) a 
> number of times and picking the clustering with the lowest total clustering 
> cost.
> The cost is determined by randomly splitting the incoming centroids into a 
> "training" and "test" set, computing the centroids on the training set and 
> the cost on the test set. The intent is to see whether the centroids actually 
> describe the distribution of the points or not.
> - StreamingKMeansUtilsMR: helper class with a method to instantiate a searcher 
> from a Configuration.
> Additionally, there is a test class StreamingKMeansTestMR that tests the 
> mapper, the reducer, and the two together using MRUnit.
> !!!
> Since MRUnit is now a dependency, the core pom.xml file adds MRUnit as a 
> dependency. We depend on snapshot 1.0 which is not yet released (it will be 
> very soon), hence the updated pom.xml is not provided for now.
> !!!
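
The reducer's selection step described above can be sketched roughly as follows. This assumes the cost is the sum of distances from each held-out point to its nearest centroid, and that the candidate clusterings come from repeated BallKMeans runs on the training part. All names are hypothetical and this is not the StreamingKMeansReducer code.

```python
import math
import random


def split(points, test_fraction=0.2, seed=0):
    """Randomly split points into (train, test) lists."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]


def total_cost(centroids, points):
    """Sum of distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points)


def pick_best(candidate_clusterings, test_points):
    """Keep the clustering with the lowest cost on the held-out points."""
    return min(candidate_clusterings, key=lambda cs: total_cost(cs, test_points))
```

Scoring on the held-out part is what lets the cost say something about how well the centroids describe the distribution rather than just the points they were fitted on.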



Re: Review Request: MAHOUT-1181: Adds StreamingKMeans MapReduce classes

2013-05-12 Thread Dan Filimon
Guys, please can anyone have a look at this patch? I'd really like to
merge. :)


On Sat, May 11, 2013 at 10:03 AM, Dan Filimon
wrote:

>This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10193/
>   Review request for mahout, Ted Dunning, Jake Mannix, Sebastian
> Schelter, Suneel Marthi, and Robin Anil.
> By Dan Filimon.
>
> *Updated May 11, 2013, 7:03 a.m.*
> Changes
>
> Ping, please have a look at the map/reduce classes.
>
>   Description
>
> This depends (loosely) on https://reviews.apache.org/r/10194/
>
> This patch implements the MapReduce version of StreamingKMeans for 
> MAHOUT-1154.
>
> It adds 5 new classes:
> - CentroidWritable: class representing a centroid that can be written to a 
> SeqFile
> - StreamingKMeansDriver: class implementing AbstractJob that is the entry 
> point to the mapreduction
> - StreamingKMeansMapper: mapper, running StreamingKMeans (see MAHOUT-1162) 
> clustering the points one by one
> - StreamingKMeansReducer: reducer, running BallKMeans (see MAHOUT-1162) a 
> number of times and picking the clustering with the lowest total clustering 
> cost.
> The cost is determined by randomly splitting the incoming centroids into a 
> "training" and "test" set, computing the centroids on the training set and 
> the cost on the test set. The intent is to see whether the centroids actually 
> describe the distribution of the points or not.
> - StreamingKMeansUtilsMR: helper class with a method to instantiate a searcher 
> from a Configuration.
>
> Additionally, there is a test class StreamingKMeansTestMR that tests the 
> mapper, the reducer, and the two together using MRUnit.
>
> !!!
> Since MRUnit is now a dependency, the core pom.xml file adds MRUnit as a 
> dependency. We depend on snapshot 1.0 which is not yet released (it will be 
> very soon), hence the updated pom.xml is not provided for now.
> !!!
>
>   Testing
>
> See StreamingKMeansTestMR for the tests. These are all performed on data 
> sampled from a "hypercube" distribution (there is a multinormal distribution 
> at each vertex of the cube).
> Additionally there are ongoing tests on the 20 newsgroups data set (and some 
> more are on the way).
>
>   Diffs
>
>- core/src/main/java/org/apache/mahout/clustering/ClusteringUtils.java
>(PRE-CREATION)
>- 
> core/src/main/java/org/apache/mahout/clustering/streaming/cluster/BallKMeans.java
>(PRE-CREATION)
>- 
> core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/CentroidWritable.java
>(PRE-CREATION)
>- 
> core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
>(PRE-CREATION)
>- 
> core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
>(PRE-CREATION)
>- 
> core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansReducer.java
>(PRE-CREATION)
>- 
> core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java
>(PRE-CREATION)
>- 
> core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansUtilsMR.java
>(PRE-CREATION)
>- 
> core/src/test/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansTestMR.java
>(PRE-CREATION)
>- src/conf/driver.classes.default.props (ac45eef)
>
> View Diff <https://reviews.apache.org/r/10193/diff/>
>


Re: Streaming KMeans distance cutoff

2013-05-09 Thread Dan Filimon
I haven't noticed, but it makes me feel somewhat (irrationally :) better
knowing that the points don't come through in the same order they
previously came in.
I thought of maybe having a flag, but I'm kind of split on the issue.

Even if they aren't shuffled, we need to copy them to another list before
collapsing anyway so we'd still be looping through them once.


On Thu, May 9, 2013 at 10:09 PM, Andy Twigg  wrote:

> Hi Dan,
>
> Sure. I took a quick look just now and it looks good. Did you notice that
> shuffling before collapsing was helping, hence keeping it in? It didn't
> make much difference for me.
>
> Andy
>
>
>
> On 9 May 2013 16:05, Dan Filimon  wrote:
>
>> Andy, would you like to review the final version of the clustering code
>> before it goes in [1]?
>> [1] https://reviews.apache.org/r/10194/
>>
>> Ted, it's pretty much done. Okay it and I'll commit.
>>
>>
>> On Wed, May 8, 2013 at 11:57 PM, Ted Dunning wrote:
>>
>>> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <
>>> dangeorge.fili...@gmail.com>wrote:
>>>
>>> > > > I think it avoids the need of the special way we handle the
>>> increase of
>>> > > > distanceCutoff by beta in another if.
>>> > > >
>>> > >
>>> > > Sure.  Sounds right and all.
>>> > >
>>> > > But experiment will tell better.
>>> >
>>>
>>> yes.
>>>
>>> But I definitely saw cases where the same cutoff caused the centroid
>>> count
>>> to decrease.  In my mind, continuing to increase the cutoff in those
>>> cases
>>> is a bad thing.  A smaller cutoff is more conservative in that it will
>>> preserve more data in the sketch.  Until we see it preserving too much
>>> data, we don't need to increase the cutoff.
>>>
>>
>> I kept the overshoot just to be safe in the CL.
>>
>> > > > ... They
>>> > > > actually call it a "facility cost" rather than a distance,
>>> probably for
>>> > > > this reason.
>>> >
>>>
>>> Btw... the reason that they call it a facility cost is because they are
>>> referring to a different literature.  With k-means, k is traditionally
>>> fixed.  With facility assignment, it is traditionally not.  The problems
>>> are otherwise quite similar.  The reason for the difference in
>>> nomenclature
>>> is because the facility assignment stuff comes from operations research,
>>> not computer science.
>>>
>>
>> Ah, well that explains it. :)
>>
>> ... I'm uncomfortable with the distanceCutoff growing too high, but I'll
>>> > just
>>> > put the blame on that one on the data.
>>> >
>>>
>>> I am uncomfortable as well.
>>>
>>> This is one reason I would like to only increase the distanceCutoff when
>>> a
>>> small value proves ineffective.
>>
>>
>> Alright, this is the version that's going in.
>>
>>
>>>  > StreamingKMeans + BallKMeans gave good results compared to Mahout
>>> KMeans on
>>> > other data sets (similar kinds of clusters and good looking Dunn and
>>> > Davies-Bouldin indices).
>>> >
>>>
>>> You hide this gem in a long email!!!
>>>
>>> Good news.
>>
>>
>> Yeah. :)
>> It's comparable to Mahout KMeans quality wise, and very tweakable.
>> The speed improvements should be apparent on large data sets that we run
>> on Hadoop.
>>
>> > >
>>> >
>>> > > The estimate we give it at the beginning is only valid as long as not
>>> > > > enough datapoints have been processed to go over k log n.
>>> > > >
>>> > >
>>> > > Are we talking about clusterOvershoot here?  Or the numClusters
>>> > over-ride?
>>> >
>>> >
>>> > We collapse the clusters when the number of actual centroids is over
>>> > clusterOvershoot * numClusters.
>>> > I'm thinking that since numClusters increases anyway, clusterOvershoot
>>> > means we end up with more clusters than we need (not bad per se, but
>>> trying
>>> > to get rid of variables).
>>> >
>>>
>>> I view it as numClusters is the minimum number of clusters that we want
>>> to
>>> see.  ClusterOverShoot says that we can go a ways above the minimum, but
>>> we
>>> hopefully will just collapse back down to the minimum or above.
>>>
>>>
>>>
>>> > > Well, we have seen cases where the over-shoot needed to be >1.
>>>  Those may
>>> > > have gone away with better adaptation, but I think that they probably
>>> > still
>>> > > can happen.
>>> > >
>>> >
>>> > Sorry, what do you mean by adaptation here?
>>> >
>>>
>>> Better adjustment and use of the distanceCutoff.  This should make the
>>> collapse in the recursive clustering be less dramatic and more
>>> predictable.
>>>  That will make the system require less over-shoot.
>>>
>>
>>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.tw...@cs.ox.ac.uk | +447799647538
>


Re: Streaming KMeans distance cutoff

2013-05-09 Thread Dan Filimon
Andy, would you like to review the final version of the clustering code
before it goes in [1]?
[1] https://reviews.apache.org/r/10194/

Ted, it's pretty much done. Okay it and I'll commit.


On Wed, May 8, 2013 at 11:57 PM, Ted Dunning  wrote:

> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon  >wrote:
>
> > > > I think it avoids the need of the special way we handle the increase
> of
> > > > distanceCutoff by beta in another if.
> > > >
> > >
> > > Sure.  Sounds right and all.
> > >
> > > But experiment will tell better.
> >
>
> yes.
>
> But I definitely saw cases where the same cutoff caused the centroid count
> to decrease.  In my mind, continuing to increase the cutoff in those cases
> is a bad thing.  A smaller cutoff is more conservative in that it will
> preserve more data in the sketch.  Until we see it preserving too much
> data, we don't need to increase the cutoff.
>

I kept the overshoot just to be safe in the CL.

> > > ... They
> > > > actually call it a "facility cost" rather than a distance, probably
> for
> > > > this reason.
> >
>
> Btw... the reason that they call it a facility cost is because they are
> referring to a different literature.  With k-means, k is traditionally
> fixed.  With facility assignment, it is traditionally not.  The problems
> are otherwise quite similar.  The reason for the difference in nomenclature
> is because the facility assignment stuff comes from operations research,
> not computer science.
>

Ah, well that explains it. :)

... I'm uncomfortable with the distanceCutoff growing too high, but I'll
> > just
> > put the blame on that one on the data.
> >
>
> I am uncomfortable as well.
>
> This is one reason I would like to only increase the distanceCutoff when a
> small value proves ineffective.


Alright, this is the version that's going in.


>  > StreamingKMeans + BallKMeans gave good results compared to Mahout
> KMeans on
> > other data sets (similar kinds of clusters and good looking Dunn and
> > Davies-Bouldin indices).
> >
>
> You hide this gem in a long email!!!
>
> Good news.


Yeah. :)
It's comparable to Mahout KMeans quality wise, and very tweakable.
The speed improvements should be apparent on large data sets that we run on
Hadoop.

> >
> >
> > > The estimate we give it at the beginning is only valid as long as not
> > > > enough datapoints have been processed to go over k log n.
> > > >
> > >
> > > Are we talking about clusterOvershoot here?  Or the numClusters
> > over-ride?
> >
> >
> > We collapse the clusters when the number of actual centroids is over
> > clusterOvershoot * numClusters.
> > I'm thinking that since numClusters increases anyway, clusterOvershoot
> > means we end up with more clusters than we need (not bad per se, but
> trying
> > to get rid of variables).
> >
>
> I view it as numClusters is the minimum number of clusters that we want to
> see.  ClusterOverShoot says that we can go a ways above the minimum, but we
> hopefully will just collapse back down to the minimum or above.
>
>
>
> > > Well, we have seen cases where the over-shoot needed to be >1.  Those
> may
> > > have gone away with better adaptation, but I think that they probably
> > still
> > > can happen.
> > >
> >
> > Sorry, what do you mean by adaptation here?
> >
>
> Better adjustment and use of the distanceCutoff.  This should make the
> collapse in the recursive clustering be less dramatic and more predictable.
>  That will make the system require less over-shoot.
>


Re: Streaming KMeans distance cutoff

2013-05-08 Thread Dan Filimon
On Wed, May 8, 2013 at 8:09 PM, Ted Dunning  wrote:

> On Wed, May 8, 2013 at 10:00 AM, Dan Filimon  >wrote:
>
> > On Wed, May 8, 2013 at 7:48 PM, Ted Dunning 
> wrote:
> >
> > > Inline
> > >
> > >
> > > >> He told me two things:
> > > >> - that we should multiply the distance / distanceCutoff ratio by the
> > > >> weight
> > > >> of the point we're trying to cluster so as to avoid collapsing
> larger
> > > >> clusters
> > > >>
> > > >
> > > This makes sense, but I have no idea what the effect will be.
> > >
> >
> > I think it avoids the need of the special way we handle the increase of
> > distanceCutoff by beta in another if.
> >
>
> Sure.  Sounds right and all.
>
> But experiment will tell better.
>
>
> > > >  - the initial cutoff they use is 1 / numClusters basically
> > > >>
> > > >
> > > This is just wrong.
> > >
> > > It is fine in theoretical settings with synthetic clusters of unit
> > > characteristic scale, but is not invariant over uniform scaling so it
> > can't
> > > be correct.
> >
> >
> > But the thing is it's not really a distance at all. I've seen it increase
> > to far beyond what a maximum distance between clusters should be.
> > So, for example, in a case where the maximum distance is 2, it went all
> the
> > way up to 14. :)
> >
> > It's only a distance conceptually but it's more like a "factor". They
> > actually call it a "facility cost" rather than a distance, probably for
> > this reason.
> >
>
> It has units of distance and thus should be invariant with respect to
> uniform rescaling.  No way around that.  That means that it has to be based
> on a measurement from the data.
>
> And since it is a probabilistic bound, it is OK if it exceeds other
> distances.  14 versus 2 does seem a bit high, but conceptually exceeding
> the distance is not a problem.  I think that the 14 versus 2 problem is
> something else.
>

Okay, yes, it should be a distance. You're right that especially given that
it's in a ratio, the result should be unit-less.
I'm uncomfortable with the distanceCutoff growing too high, but I'll just
put the blame on that one on the data.

StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans on
other data sets (similar kinds of clusters and good looking Dunn and
Davies-Bouldin indices).


> > > >> Additionally, clusterOvershoot, the thing we're using to delay
> > distance
> > > >> cutoff increases also seems somewhat unnecessary. Why use it and
> get a
> > > lot
> > > >> more centroids than what we asked for.
> > > >>
> > > >
> > > Well, it needs to be there until we get more run-time.  Getting more
> > > centroids doesn't hurt and getting too few definitely does hurt.  Thus
> we
> > > need some insurance.
> >
> >
> > Well, it would get as many clusters as it needs to. And we increase
> > numClusters as we go until it's clusterLogFactor *
> numProcessedDatapoints.
> > I understand the point that we should get more rather than less
> centroids,
> > but k log n (what we're getting at the end) seems fine.
> >
>
> The key is not the number at the end.  It is that the number never drops
> below enough centroids.


Hmm, I guess you're referring to some actual cases of this that you've
observed. On the data I tried, it seemed like collapsing did fairly little
to reduce the number of clusters, especially without increasing the
distance cutoff.

So this would happen when the distance cutoff would grow too big. Still, I
think this is probably taken care of by factoring in the weight of the
point when computing the ratio. That is there so two big clusters don't
merge when they shouldn't.


>
> > The estimate we give it at the beginning is only valid as long as not
> > enough datapoints have been processed to go over k log n.
> >
>
> Are we talking about clusterOvershoot here?  Or the numClusters over-ride?


We collapse the clusters when the number of actual centroids is over
clusterOvershoot * numClusters.
I'm thinking that since numClusters increases anyway, clusterOvershoot
means we end up with more clusters than we need (not bad per se, but trying
to get rid of variables).
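
The collapse trigger described here amounts to a single comparison. A minimal sketch, with hypothetical function names, showing how clusterOvershoot widens the band of centroids tolerated before a collapse:

```python
def collapse_threshold(num_clusters, cluster_overshoot):
    """Largest centroid count tolerated before collapsing."""
    return cluster_overshoot * num_clusters


def should_collapse(num_centroids, num_clusters, cluster_overshoot=1.0):
    """True when the sketch holds more centroids than the overshoot allows."""
    return num_centroids > collapse_threshold(num_clusters, cluster_overshoot)
```

With an overshoot of 1.0 the sketch collapses as soon as it exceeds numClusters; a value of 2.0 lets it hold up to twice that many first.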


>  > I think we already do better than some of the guarantees in the paper.
> > >
> >
> > I think we do the same, and I'm wondering whether the extra knobs do
> > anything or not. I tried fiddling with them a bit and they didn't seem to
> > do much.
> > I also haven't worked through the math for our particular version.
> >
>
> Well, we have seen cases where the over-shoot needed to be >1.  Those may
> have gone away with better adaptation, but I think that they probably still
> can happen.
>

Sorry, what do you mean by adaptation here?


> The distanceCutoff initialization is blatantly important if you try scaling
> the test data by 1e-9.  Without adapting to the data, you will get exactly
> one cluster.
>

Okay, the nail is firmly in the coffin of that idea. :)
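
To see the scaling argument in numbers, this sketch compares the distance / distanceCutoff ratio for a fixed cutoff of 1 / numClusters against a cutoff measured from the data. The smallest non-zero pairwise distance is used here purely as one possible data-derived choice (an assumption), and all names are hypothetical.

```python
import math


def scaled(points, s):
    """Multiply every coordinate of every point by s."""
    return [tuple(s * x for x in p) for p in points]


def fixed_cutoff(num_clusters):
    """The 1 / numClusters initialization discussed above."""
    return 1.0 / num_clusters


def data_cutoff(points):
    """Smallest non-zero pairwise distance (one possible data-based start)."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(points)
        for b in points[i + 1:]
        if a != b
    )


def ratio(distance, cutoff):
    """The distance / distanceCutoff ratio that drives new-centroid creation."""
    return distance / cutoff
```

After scaling the data by 1e-9, the ratio against the fixed cutoff becomes vanishingly small, so nearly every point would be folded into an existing centroid. The ratio against the data-derived cutoff stays the same.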


Re: Streaming KMeans distance cutoff

2013-05-08 Thread Dan Filimon
On Wed, May 8, 2013 at 7:48 PM, Ted Dunning  wrote:

> Inline
>
>
> On Wed, May 8, 2013 at 8:49 AM, Andy Twigg  wrote:
>
> > both of those make sense to me.
> >
> >
> >
> > On 8 May 2013 16:45, Dan Filimon  wrote:
> >
> >> Hi Ted!
> >>
> >> I recently talked to one of the authors of streaming k-means, Adam
> >> Meyerson
> >> asking about the distance cutoff as I wasn't sure of a right value for
> >> this.
> >>
> >> He told me two things:
> >> - that we should multiply the distance / distanceCutoff ratio by the
> >> weight
> >> of the point we're trying to cluster so as to avoid collapsing larger
> >> clusters
> >>
> >
> This makes sense, but I have no idea what the effect will be.
>

I think it removes the need for the special case where we handle the increase
of distanceCutoff by beta in a separate if.

We can multiply it by beta every single time with the same results. The
comment said as much:

// In the original algorithm, with distributions with sharp scale effects, the
// distanceCutoff can grow to an excessive size leading sub-clustering to collapse
// the centroids set too much. This test prevents increase in distanceCutoff if
// the current value is doing well at collapsing the clusters.
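
A sketch of the two update rules being compared, under the assumption that "doing well" means the collapse brought the centroid count down to numClusters or below. The function and parameter names are hypothetical; the always_increase flag stands in for the simpler "multiply by beta every time" variant.

```python
def next_cutoff(distance_cutoff, beta, centroids_after_collapse, num_clusters,
                always_increase=False):
    """Return the distanceCutoff to use after a collapse pass."""
    # Guarded rule: only grow the cutoff if collapsing left too many centroids.
    if always_increase or centroids_after_collapse > num_clusters:
        return distance_cutoff * beta
    return distance_cutoff
```

The guarded rule keeps the cutoff small for as long as it is effective, which preserves more data in the sketch.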


> >  - the initial cutoff they use is 1 / numClusters basically
> >>
> >
> This is just wrong.
>
> It is fine in theoretical settings with synthetic clusters of unit
> characteristic scale, but is not invariant over uniform scaling so it can't
> be correct.


But the thing is it's not really a distance at all. I've seen it increase
to far beyond what a maximum distance between clusters should be.
So, for example, in a case where the maximum distance is 2, it went all the
way up to 14. :)

It's only a distance conceptually but it's more like a "factor". They
actually call it a "facility cost" rather than a distance, probably for
this reason.


> >
> >> As I tested the code on multiple well known data sets, this got me
> >> thinking
> >> of removing the distanceCutoff all together.
> >> It seems like just another parameter to get right with only limited real
> >> value of fiddling with it.
> >>
> >
> I think it is core to the algorithm.  It is adapted to the data in any
> case.
>
>
> >> Additionally, clusterOvershoot, the thing we're using to delay distance
> >> cutoff increases also seems somewhat unnecessary. Why use it and get a
> lot
> >> more centroids than what we asked for.
> >>
> >
> Well, it needs to be there until we get more run-time.  Getting more
> centroids doesn't hurt and getting too few definitely does hurt.  Thus we
> need some insurance.


Well, it would get as many clusters as it needs to. And we increase
numClusters as we go until it's clusterLogFactor * log(numProcessedDatapoints).
I understand the point that we should get more rather than fewer centroids,
but k log n (what we're getting at the end) seems fine.
The estimate we give it at the beginning is only valid as long as not
enough datapoints have been processed to go over k log n.

 >
> >> I want to post a final version for review, but I just wanted to mention
> >> these two things.
> >>
> >> It's not like they "hurt" really, they just don't seem to be helping too
> >> much and I'd rather have something that more closely matches the
> >> theoretical guarantees in the paper.
> >>
> >> What do you think?
> >>
> >
> I think we already do better than some of the guarantees in the paper.
>

I think we do the same, and I'm wondering whether the extra knobs do
anything or not. I tried fiddling with them a bit and they didn't seem to
do much.
I also haven't worked through the math for our particular version.


Streaming KMeans distance cutoff

2013-05-08 Thread Dan Filimon
Hi Ted!

I recently talked to one of the authors of streaming k-means, Adam Meyerson,
asking about the distance cutoff, as I wasn't sure what the right value for it is.

He told me two things:
- that we should multiply the distance / distanceCutoff ratio by the weight
of the point we're trying to cluster so as to avoid collapsing larger
clusters
- the initial cutoff they use is 1 / numClusters basically
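
The first suggestion can be sketched as follows, assuming the usual streaming k-means step in which a point starts a new centroid with probability proportional to distance / distanceCutoff. The function names are hypothetical and the clamp to 1.0 is an assumption for readability.

```python
import random


def new_cluster_probability(distance, distance_cutoff, weight=1.0):
    """Probability that a point of the given weight starts a new centroid."""
    return min(1.0, weight * distance / distance_cutoff)


def becomes_new_centroid(distance, distance_cutoff, weight=1.0, rng=random):
    """Draw once: True means create a new centroid, False means merge."""
    return rng.random() < new_cluster_probability(distance, distance_cutoff, weight)
```

A heavy point (for example, a centroid that already summarizes many points) is then much less likely to be merged into a neighbor than a single raw point at the same distance.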

As I tested the code on multiple well-known data sets, this got me thinking
of removing the distanceCutoff altogether.
It seems like just another parameter to get right, with only limited real
value in fiddling with it.

Additionally, clusterOvershoot, the thing we're using to delay distance
cutoff increases, also seems somewhat unnecessary. Why use it and get a lot
more centroids than what we asked for?

I want to post a final version for review, but I just wanted to mention
these two things.

It's not like they "hurt" really, they just don't seem to be helping too
much and I'd rather have something that more closely matches the
theoretical guarantees in the paper.

What do you think?


[jira] [Updated] (MAHOUT-1156) Adding nearest neighbor Searchers

2013-05-05 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1156:


Description: 
Adding the Searcher, UpdatableSearcher abstract classes defining what a 
nearest-neighbor searcher does.

The following implementations are available in the o.a.m.math.neighborhood 
package:
- BruteSearch
- ProjectionSearch
- FastProjectionSearch
- LocalitySensitiveHashSearch [oddly broken, NOT included here]

Additionally there are 2 new abstract classes available:
- Searcher
- UpdatableSearcher

This is part of https://issues.apache.org/jira/browse/MAHOUT-1154

There are no more test issues.

Committed revision 1479307.

  was:
Adding the Searcher, UpdatableSearcher abstract classes defining what a 
nearest-neighbor searcher does.

The following implementations are available in the o.a.m.math.neighborhood 
package:
- BruteSearch
- ProjectionSearch
- FastProjectionSearch
- LocalitySensityHashSearch [oddly broken, not included here]

Additionally there are 2 new abstract classes available:
- Searcher
- UpdatableSearcher

This is part of https://issues.apache.org/jira/browse/MAHOUT-1154

There are no more test issues.


> Adding nearest neighbor Searchers
> -
>
> Key: MAHOUT-1156
> URL: https://issues.apache.org/jira/browse/MAHOUT-1156
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>    Reporter: Dan Filimon
> Attachments: MAHOUT_1156_with_test.patch
>
>
> Adding the Searcher, UpdatableSearcher abstract classes defining what a 
> nearest-neighbor searcher does.
> The following implementations are available in the o.a.m.math.neighborhood 
> package:
> - BruteSearch
> - ProjectionSearch
> - FastProjectionSearch
> - LocalitySensitiveHashSearch [oddly broken, NOT included here]
> Additionally there are 2 new abstract classes available:
> - Searcher
> - UpdatableSearcher
> This is part of https://issues.apache.org/jira/browse/MAHOUT-1154
> There are no more test issues.
> Committed revision 1479307.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (MAHOUT-1156) Adding nearest neighbor Searchers

2013-05-05 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1156.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1479307.

> Adding nearest neighbor Searchers
> -
>
> Key: MAHOUT-1156
> URL: https://issues.apache.org/jira/browse/MAHOUT-1156
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>    Reporter: Dan Filimon
> Fix For: 0.8
>
> Attachments: MAHOUT_1156_with_test.patch
>
>
> Adding the Searcher, UpdatableSearcher abstract classes defining what a 
> nearest-neighbor searcher does.
> The following implementations are available in the o.a.m.math.neighborhood 
> package:
> - BruteSearch
> - ProjectionSearch
> - FastProjectionSearch
> - LocalitySensitiveHashSearch [oddly broken, NOT included here]
> Additionally there are 2 new abstract classes available:
> - Searcher
> - UpdatableSearcher
> This is part of https://issues.apache.org/jira/browse/MAHOUT-1154
> There are no more test issues.
> Committed revision 1479307.



[jira] [Updated] (MAHOUT-1156) Adding nearest neighbor Searchers

2013-05-05 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1156:


Description: 
Adding the Searcher, UpdatableSearcher abstract classes defining what a 
nearest-neighbor searcher does.

The following implementations are available in the o.a.m.math.neighborhood 
package:
- BruteSearch
- ProjectionSearch
- FastProjectionSearch
- LocalitySensitiveHashSearch [oddly broken, not included here]

Additionally there are 2 new abstract classes available:
- Searcher
- UpdatableSearcher

This is part of https://issues.apache.org/jira/browse/MAHOUT-1154

There are no more test issues.

  was:
Adding the Searcher, UpdatableSearcher abstract classes defining what a 
nearest-neighbor searcher does.

The following implementations are available in the o.a.m.math.neighborhood 
package:
- BruteSearch
- ProjectionSearch
- FastProjectionSearch [BROKEN! this throws a ConcurrentModificationException 
because sometimes the collection being iterated through for search() is 
modified by calling reindex()! this needs more thought]
- LocalitySensitiveHashSearch

Additionally there are 2 new abstract classes available:
- Searcher
- UpdatableSearcher

This is part of https://issues.apache.org/jira/browse/MAHOUT-1154

There are no more test issues.


> Adding nearest neighbor Searchers
> -
>
> Key: MAHOUT-1156
> URL: https://issues.apache.org/jira/browse/MAHOUT-1156
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>    Reporter: Dan Filimon
> Attachments: MAHOUT_1156_with_test.patch
>
>
> Adding the Searcher, UpdatableSearcher abstract classes defining what a 
> nearest-neighbor searcher does.
> The following implementations are available in the o.a.m.math.neighborhood 
> package:
> - BruteSearch
> - ProjectionSearch
> - FastProjectionSearch
> - LocalitySensitiveHashSearch [oddly broken, not included here]
> Additionally there are 2 new abstract classes available:
> - Searcher
> - UpdatableSearcher
> This is part of https://issues.apache.org/jira/browse/MAHOUT-1154
> There are no more test issues.



Locality Sensitive Hash Searcher confusion

2013-05-05 Thread Dan Filimon
Hi Ted,

I was looking at committing the searchers after fixing Sebastian's comments
in [1], but after re-running the tests, LSHSearcher turns out not to
actually be faster than BruteSearcher.

I know writing benchmarks is tricky in Java, but still, this searcher
should be way better than brute search.
So, I stepped through the code and it turns out it never adjusts the
hashLimit for example.
This means that it effectively computes the distance between every point
and the query in addition to projecting which makes it slower.
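
For illustration only, here is a minimal sketch of what a hash limit is
supposed to buy. It is not the code from LocalitySensitiveHashSearch [2]; the
class HashLimitSketch and its methods are hypothetical, and it assumes each
point already has a 64-bit hash signature. The full distance is computed only
when the signature differs from the query's in at most hashLimit bits, so a
limit that stays at 64 prunes nothing and the search degrades to brute force
plus hashing overhead.

```java
import java.util.Arrays;

public class HashLimitSketch {
  /** Number of bits in which two hash signatures differ. */
  static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
  }

  /** Squared Euclidean distance between two equally sized arrays. */
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Returns {index of the closest candidate, number of full distance
   * computations}. The index is -1 if no point passed the hash filter.
   */
  static int[] search(double[][] points, long[] hashes, double[] query,
                      long queryHash, int hashLimit) {
    int best = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    int evaluated = 0;
    for (int i = 0; i < points.length; i++) {
      // Cheap filter: skip points whose signature is too far from the query's.
      if (hammingDistance(hashes[i], queryHash) > hashLimit) {
        continue;
      }
      evaluated++;
      double d = squaredDistance(points[i], query);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return new int[] {best, evaluated};
  }

  public static void main(String[] args) {
    double[][] points = {{0, 0}, {1, 1}, {5, 5}};
    long[] hashes = {0b0000L, 0b0001L, 0b1111L};  // made-up signatures
    double[] query = {0.9, 0.9};
    System.out.println(Arrays.toString(search(points, hashes, query, 0b0001L, 1)));
    System.out.println(Arrays.toString(search(points, hashes, query, 0b0001L, 64)));
  }
}
```

How the limit should be adjusted dynamically is exactly the open question
here, so the sketch takes it as a fixed parameter.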

I never really looked at this class thoroughly so maybe have a look at the
current searcher [2] and see what is going on? Also, what was the paper
that explained this method of dynamically adjusting the hash limit? You
probably sent it to me but I forgot it.

Thanks!

[1] https://reviews.apache.org/r/10195/
[2]
https://github.com/dfilimon/mahout/blob/vector/core/src/main/java/org/apache/mahout/math/neighborhood/LocalitySensitiveHashSearch.java


Accidentally closed issues for 0.8

2013-05-04 Thread Dan Filimon
Oops! I closed some issues that I should have waited for the 0.8 release to
close. They are:
https://issues.apache.org/jira/browse/MAHOUT-1202
https://issues.apache.org/jira/browse/MAHOUT-1155
https://issues.apache.org/jira/browse/MAHOUT-1135

Sorry about that! :)


[jira] [Closed] (MAHOUT-1135) Unify decorated vectors in DecoratedVector

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon closed MAHOUT-1135.
---


> Unify decorated vectors in DecoratedVector
> -
>
> Key: MAHOUT-1135
> URL: https://issues.apache.org/jira/browse/MAHOUT-1135
> Project: Mahout
>  Issue Type: Wish
>  Components: Math
>Affects Versions: 1.0
>    Reporter: Dan Filimon
>Priority: Minor
>  Labels: improvement, vector
>
> I'm finding the current Vector classes in Mahout a bit confusing.
> The vector implementations are just fine, I'm talking more about the decorated 
> vectors:
> WeightedVector
> MatrixSlice
> NamedVector
> I propose using a single DecoratedVector type that can easily be extended.
> For example, right now MatrixSlice doesn't even implement the Vector 
> interface.
> So,
> WeightedVector -> DecoratedVector>
> MatrixSlice -> DecoratedVector
> NamedVector -> DecoratedVector
> We could even keep the names (maybe changing MatrixSlice to something like 
> IndexedVector though?) by extending DecoratedVector.
> I'd be willing to fix this if people think it's a good idea.
> What about it? :)



[jira] [Closed] (MAHOUT-1155) Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202)

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon closed MAHOUT-1155.
---


> Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202)
> -
>
> Key: MAHOUT-1155
> URL: https://issues.apache.org/jira/browse/MAHOUT-1155
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT_1155.patch, MAHOUT_1155.patch
>
>
> There are two changes in this issue:
> - making MatrixSlice a Vector by extending DelegatingVector;
> - making a few changes to the vector cloning code so that when cloning a 
> Centroid, the result is also a Centroid.
> This is part of the changes in 
> https://issues.apache.org/jira/browse/MAHOUT-1154
> The Centroid changes will now be part of the larger changes to Vectors:
> https://issues.apache.org/jira/browse/MAHOUT-1202



[jira] [Closed] (MAHOUT-1202) Speed up Vector operations

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon closed MAHOUT-1202.
---


> Speed up Vector operations
> --
>
> Key: MAHOUT-1202
> URL: https://issues.apache.org/jira/browse/MAHOUT-1202
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>
> Vector assign() and aggregate() can be significantly improved in some 
> conditions taking into account the different properties of the vectors we're 
> working with.
> This issue relates to the design document at 
> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#heading=h.koi571fvwha3jj
> and the patch at
> https://reviews.apache.org/r/10669
> The benchmarks are at
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc&pli=1#gid=10
> and while there are a few regressions (which will be fixed later regarding 
> RandomAccessSparseVectors), it improves a lot of benchmarks as well as cleans 
> up the code significantly.
> Part 1, the new function interfaces, is merged. [Committed revision 1478853.]



[jira] [Resolved] (MAHOUT-1202) Speed up Vector operations

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1202.
-

Resolution: Fixed

Committed revision 1478958.

> Speed up Vector operations
> --
>
> Key: MAHOUT-1202
> URL: https://issues.apache.org/jira/browse/MAHOUT-1202
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>
> Vector assign() and aggregate() can be significantly improved in some 
> conditions taking into account the different properties of the vectors we're 
> working with.
> This issue relates to the design document at 
> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#heading=h.koi571fvwha3jj
> and the patch at
> https://reviews.apache.org/r/10669
> The benchmarks are at
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc&pli=1#gid=10
> and while there are a few regressions (which will be fixed later regarding 
> RandomAccessSparseVectors), it improves a lot of benchmarks as well as cleans 
> up the code significantly.
> Part 1, the new function interfaces, is merged. [Committed revision 1478853.]



Re: Really long running tests

2013-05-03 Thread Dan Filimon
After the updates I mentioned in the last e-mail this happens:

[trunk] [~2m 17s]
new 10 7704.7 1.52101e-14 0.0 1.52101e-14
new 30 9395.2 1.52101e-14 0.0 1.52101e-14
new 80 9400.5 1.52101e-14 0.0 1.52101e-14
new 180 12842.9 1.52101e-14 0.0 1.52101e-14
new 380 5654.4 1.52101e-14 0.0 1.52101e-14
new 880 6880.4 1.52101e-14 0.0 1.52101e-14
old 10 36023.5 2.84217e-14 0.0 2.84217e-14
old 30 55919.1 2.84217e-14 0.0 2.84217e-14
old 80 58147.7 2.84217e-14 0.0 2.84217e-14
old 180 54198.4 2.84217e-14 0.0 2.84217e-14
old 380 37995.1 2.84217e-14 0.0 2.84217e-14
old 880 44413.5 2.84217e-14 0.0 2.84217e-14
Speedup is about 6.5 times

[new vector ops] [~1m 35s]
new 10 17086.4 1.44329e-14 0.0 1.44329e-14
new 30 10960.8 1.44329e-14 0.0 1.44329e-14
new 80 11734.8 1.44329e-14 0.0 1.44329e-14
new 180 34976.0 1.44329e-14 0.0 1.44329e-14
new 380 16007.1 1.44329e-14 0.0 1.44329e-14
new 880 13113.1 1.44329e-14 0.0 1.44329e-14
old 10 46757.8 2.65343e-14 0.0 2.65343e-14
old 30 23111.9 2.65343e-14 0.0 2.65343e-14
old 80 16987.3 2.65343e-14 0.0 2.65343e-14
old 180 15822.4 2.65343e-14 0.0 2.65343e-14
old 380 28304.9 2.65343e-14 0.0 2.65343e-14
old 880 15262.6 2.65343e-14 0.0 2.65343e-14
Speedup is about 1.4 times (FAILS)

[new vector ops, before changes] [~3m 7s]
new 10 11642.0 1.14353e-14 0.0 1.14353e-14
new 30 8169.5 1.14353e-14 0.0 1.14353e-14
new 80 8446.0 1.14353e-14 0.0 1.14353e-14
new 180 8429.7 1.14353e-14 0.0 1.14353e-14
new 380 9316.2 1.14353e-14 0.0 1.14353e-14
new 880 10924.3 1.14353e-14 0.0 1.14353e-14
old 10 55476.1 2.59792e-14 0.0 2.59792e-14
old 30 64453.2 2.59792e-14 0.0 2.59792e-14
old 80 59954.5 2.59792e-14 0.0 2.59792e-14
old 180 71600.2 2.59792e-14 0.0 2.59792e-14
old 380 70933.0 2.59792e-14 0.0 2.59792e-14
old 880 63348.3 2.59792e-14 0.0 2.59792e-14
Speedup is about 6.3 times

Which of these is better?
A row is printed out from line 224:

  System.out.printf("%s %d\t%.1f\t%g\t%g\t%g\n",
      label, n, (t1 - t0) / 1.0e3 / n, maxIdent, maxError, warmup);

and the 3rd column seems to be the time.
It fluctuates and doesn't seem to depend on count...

Which of these 3 runs is better?


On Fri, May 3, 2013 at 7:51 PM, Dan Filimon wrote:

> I think I found out why, for the QR test.
>
> First off, it's stable and not seed dependent (on my machine anyway,
> haven't looked too closely).
>
> Trunk takes about 2 minutes and my new vector branch takes more than 3.
> From what I've seen the problem is twofold:
> - norm1 is still slower in the new code
> - VectorViews suck at iterating through. They create a new
> DecoratorElement for every nonzero (so the index can be adjusted).
> The problem is that when picking the best algorithm, I made
> getIteratorAdvanceCost be the same as the vector being viewed not realizing
> the impact of creating new elements.
>
> I'll get back to you after I:
> - change norm1 to what it used to be
> - tweak the iterator advance cost for vector views
>
> On Fri, May 3, 2013 at 7:23 PM, Ted Dunning  wrote:
>
>> Shouldn't depend on seed.
>>
>> Very odd.
>>
>> Sent from my iPhone
>>
>> On May 3, 2013, at 8:24, Robin Anil  wrote:
>>
>> > QRDecompositionTest: I saw this from time to time. Sometimes it runs in
>> 0.2
>> > seconds sometimes 100s. Seed related?
>> >
>> >
>> >
>> > On Fri, May 3, 2013 at 9:59 AM, Dan Filimon <
>> dangeorge.fili...@gmail.com>wrote:
>> >
>> >> QRDecompositionTest.fasterThanBefore() and most of the tests in
>> >> fpm.pfpgrowth take a really long time to run (FPGrowthSyntheticDataTest
>> >> took 98s on my machine).
>> >>
>> >> Could we do something about these?
>> >> Maybe move fasterThanBefore() into a benchmark and out of the tests
>> (like
>> >> VectorBenchmark) and simplify the fpm.* tests somehow?
>> >>
>> >> Thoughts? Thanks!
>> >>
>>
>
>


Re: Really long running tests

2013-05-03 Thread Dan Filimon
I think I found out why, for the QR test.

First off, it's stable and not seed dependent (on my machine anyway,
haven't looked too closely).

Trunk takes about 2 minutes and my new vector branch takes more than 3.
From what I've seen the problem is twofold:
- norm1 is still slower in the new code
- VectorViews suck at iterating through. They create a new DecoratorElement
for every nonzero (so the index can be adjusted).
The problem is that when picking the best algorithm, I made
getIteratorAdvanceCost be the same as the vector being viewed not realizing
the impact of creating new elements.
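
To show the kind of overhead meant here, a minimal sketch follows. It does not
use Mahout's VectorView or its element classes; Element and viewIterator are
hypothetical stand-ins, and for brevity the view iterates over every entry
rather than only the nonzeros. With reuse set to false the iterator allocates
a fresh wrapper per entry so that the index can be shifted into the view's
coordinates; with reuse set to true a single mutable wrapper is updated in
place.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class ViewIteratorSketch {
  /** Hypothetical stand-in for a vector element: an index plus a value. */
  static final class Element {
    static int created = 0;  // allocation counter, for illustration only
    int index;
    double value;

    Element() {
      created++;
    }
  }

  /** Iterates over values[offset .. offset + length), with view-relative indices. */
  static Iterator<Element> viewIterator(final double[] values, final int offset,
                                        final int length, final boolean reuse) {
    return new Iterator<Element>() {
      private int position = 0;
      private final Element shared = reuse ? new Element() : null;

      @Override
      public boolean hasNext() {
        return position < length;
      }

      @Override
      public Element next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        // Either reuse the one shared wrapper or allocate a new one per entry.
        Element e = reuse ? shared : new Element();
        e.index = position;                  // index adjusted to the view
        e.value = values[offset + position];
        position++;
        return e;
      }
    };
  }

  /** Sums all values produced by the iterator. */
  static double sum(Iterator<Element> it) {
    double total = 0.0;
    while (it.hasNext()) {
      total += it.next().value;
    }
    return total;
  }
}
```

The reused wrapper is only safe when callers do not hold on to elements across
calls to next(), which is the usual trade-off with this pattern. Either way,
the per-entry cost is something the advance cost estimate has to account for.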

I'll get back to you after I:
- change norm1 to what it used to be
- tweak the iterator advance cost for vector views

On Fri, May 3, 2013 at 7:23 PM, Ted Dunning  wrote:

> Shouldn't depend on seed.
>
> Very odd.
>
> Sent from my iPhone
>
> On May 3, 2013, at 8:24, Robin Anil  wrote:
>
> > QRDecompositionTest: I saw this from time to time. Sometimes it runs in
> 0.2
> > seconds sometimes 100s. Seed related?
> >
> >
> >
> > On Fri, May 3, 2013 at 9:59 AM, Dan Filimon  >wrote:
> >
> >> QRDecompositionTest.fasterThanBefore() and most of the tests in
> >> fpm.pfpgrowth take a really long time to run (FPGrowthSyntheticDataTest
> >> took 98s on my machine).
> >>
> >> Could we do something about these?
> >> Maybe move fasterThanBefore() into a benchmark and out of the tests
> (like
> >> VectorBenchmark) and simplify the fpm.* tests somehow?
> >>
> >> Thoughts? Thanks!
> >>
>


[jira] [Updated] (MAHOUT-1202) Speed up Vector operations

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1202:


Description: 
Vector assign() and aggregate() can be significantly improved in some 
conditions taking into account the different properties of the vectors we're 
working with.

This issue relates to the design document at 
https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#heading=h.koi571fvwha3jj

and the patch at
https://reviews.apache.org/r/10669

The benchmarks are at
https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc&pli=1#gid=10

and while there are a few regressions (which will be fixed later regarding 
RandomAccessSparseVectors), it improves a lot of benchmarks as well as cleans 
up the code significantly.

Part 1, the new function interfaces, is merged. [Committed revision 1478853.]

  was:
Vector assign() and aggregate() can be significantly improved in some 
conditions taking into account the different properties of the vectors we're 
working with.

This issue relates to the design document at 
https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#heading=h.koi571fvwha3jj

and the patch at
https://reviews.apache.org/r/10669

The benchmarks are at
https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc&pli=1#gid=10

and while there are a few regressions (which will be fixed later regarding 
RandomAccessSparseVectors), it improves a lot of benchmarks as well as cleans 
up the code significantly.


> Speed up Vector operations
> --
>
> Key: MAHOUT-1202
> URL: https://issues.apache.org/jira/browse/MAHOUT-1202
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.8
>Reporter: Dan Filimon
>
> Vector assign() and aggregate() can be significantly improved in some 
> conditions taking into account the different properties of the vectors we're 
> working with.
> This issue relates to the design document at 
> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#heading=h.koi571fvwha3jj
> and the patch at
> https://reviews.apache.org/r/10669
> The benchmarks are at
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc&pli=1#gid=10
> and while there are a few regressions (which will be fixed later regarding 
> RandomAccessSparseVectors), it improves a lot of benchmarks as well as cleans 
> up the code significantly.
> Part 1, the new function interfaces, is merged. [Committed revision 1478853.]



[jira] [Commented] (MAHOUT-1203) Problem in PhD Topic

2013-05-03 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648531#comment-13648531
 ] 

Dan Filimon commented on MAHOUT-1203:
-

Saeed,

It sounds like you're having trouble deciding on a topic for your thesis. The 
best way to get advice about this from our community is to ask on our mailing 
lists [1].

This is however not a bug/issue with Mahout itself so please close this issue 
and send an e-mail to the user mailing list.

[1] http://mahout.apache.org/mailinglists.html#Mahout User List

> Problem in PhD Topic
> 
>
> Key: MAHOUT-1203
> URL: https://issues.apache.org/jira/browse/MAHOUT-1203
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, Clustering, Integration
>Affects Versions: 0.4, 0.5, 0.6, 0.7
> Environment: Ubuntu12.04
>Reporter: saeed iqbal 
>  Labels: newbie
>   Original Estimate: 612h
>  Remaining Estimate: 612h
>
> Recently, I studied the literature on cloud computing, Hadoop, Mahout and 
> machine learning algorithms. I am working on Hadoop in my PhD study. 
> But now I am confused about my topic and can't identify a PhD topic; please 
> guide me (Hadoop and Mahout).



[jira] [Resolved] (MAHOUT-1155) Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202)

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1155.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1478836.

> Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202)
> -
>
> Key: MAHOUT-1155
> URL: https://issues.apache.org/jira/browse/MAHOUT-1155
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT_1155.patch, MAHOUT_1155.patch
>
>
> There are two changes in this issue:
> - making MatrixSlice a Vector by extending DelegatingVector;
> - making a few changes to the vector cloning code so that when cloning a 
> Centroid, the result is also a Centroid.
> This is part of the changes in 
> https://issues.apache.org/jira/browse/MAHOUT-1154
> The Centroid changes will now be part of the larger changes to Vectors:
> https://issues.apache.org/jira/browse/MAHOUT-1202



[jira] [Updated] (MAHOUT-1155) Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202)

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1155:


Summary: Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202)  
(was: Make MatrixSlice a Vector and fix Centroid cloning)

> Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202)
> -
>
> Key: MAHOUT-1155
> URL: https://issues.apache.org/jira/browse/MAHOUT-1155
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
> Attachments: MAHOUT_1155.patch, MAHOUT_1155.patch
>
>
> There are two changes in this issue:
> - making MatrixSlice a Vector by extending DelegatingVector;
> - making a few changes to the vector cloning code so that when cloning a 
> Centroid, the result is also a Centroid.
> This is part of the changes in 
> https://issues.apache.org/jira/browse/MAHOUT-1154
> The Centroid changes will now be part of the larger changes to Vectors:
> https://issues.apache.org/jira/browse/MAHOUT-1202



[jira] [Updated] (MAHOUT-1155) Make MatrixSlice a Vector and fix Centroid cloning

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon updated MAHOUT-1155:


Description: 
There are two changes in this issue:
- making MatrixSlice a Vector by extending DelegatingVector;
- making a few changes to the vector cloning code so that when cloning a 
Centroid, the result is also a Centroid.

This is part of the changes in https://issues.apache.org/jira/browse/MAHOUT-1154

The Centroid changes will now be part of the larger changes to Vectors:
https://issues.apache.org/jira/browse/MAHOUT-1202

  was:
There are two changes in this issue:
- making MatrixSlice a Vector by extending DelegatingVector;
- making a few changes to the vector cloning code so that when cloning a 
Centroid, the result is also a Centroid.

This is part of the changes in https://issues.apache.org/jira/browse/MAHOUT-1154


> Make MatrixSlice a Vector and fix Centroid cloning
> --
>
> Key: MAHOUT-1155
> URL: https://issues.apache.org/jira/browse/MAHOUT-1155
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
> Attachments: MAHOUT_1155.patch, MAHOUT_1155.patch
>
>
> There are two changes in this issue:
> - making MatrixSlice a Vector by extending DelegatingVector;
> - making a few changes to the vector cloning code so that when cloning a 
> Centroid, the result is also a Centroid.
> This is part of the changes in 
> https://issues.apache.org/jira/browse/MAHOUT-1154
> The Centroid changes will now be part of the larger changes to Vectors:
> https://issues.apache.org/jira/browse/MAHOUT-1202



[jira] [Resolved] (MAHOUT-1135) Unify decorated vectors in DecoratedVector

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1135.
-

Resolution: Won't Fix

Experimented with this a while back, but for the most important use case 
(WeightedVector), it needed a Pair and the overhead of an additional object was 
deemed prohibitive.

Simply extending DelegatingVector should be enough.

> Unify decorated vectors in DecoratedVector
> -
>
> Key: MAHOUT-1135
> URL: https://issues.apache.org/jira/browse/MAHOUT-1135
> Project: Mahout
>  Issue Type: Wish
>  Components: Math
>Affects Versions: 1.0
>Reporter: Dan Filimon
>Priority: Minor
>  Labels: improvement, vector
>
> I'm finding the current Vector classes in Mahout a bit confusing.
> The vector implementations are just fine, I'm talking more about the decorated 
> vectors:
> WeightedVector
> MatrixSlice
> NamedVector
> I propose using a single DecoratedVector type that can easily be extended.
> For example, right now MatrixSlice doesn't even implement the Vector 
> interface.
> So,
> WeightedVector -> DecoratedVector>
> MatrixSlice -> DecoratedVector
> NamedVector -> DecoratedVector
> We could even keep the names (maybe changing MatrixSlice to something like 
> IndexedVector though?) by extending DecoratedVector.
> I'd be willing to fix this if people think it's a good idea.
> What about it? :)



Re: Committing to mahout-git?

2013-05-03 Thread Dan Filimon
Thanks,

I can't directly use the github mirror but applying the formatted patch
worked fine!


On Thu, May 2, 2013 at 8:57 PM, Robin Anil  wrote:

> diffs from git can be applied on svn using
>
> patch -P1 < patch.file
>
> I tried this with your patches. I dont know much about the apache mahout
> git mirror to answer the former question
>


[jira] [Resolved] (MAHOUT-1190) SequentialAccessSparseVector function assignment is very slow and other iterator woes

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1190.
-

Resolution: Duplicate

Moved into this other issue as it grew in scope:
https://issues.apache.org/jira/browse/MAHOUT-1202

> SequentialAccessSparseVector function assignment is very slow and other 
> iterator woes
> -
>
> Key: MAHOUT-1190
> URL: https://issues.apache.org/jira/browse/MAHOUT-1190
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>    Reporter: Dan Filimon
> Attachments: MAHOUT-1190-1.patch, MAHOUT-1190-iterator-fix.patch, 
> MAHOUT-1190-iterator-fix.patch, MAHOUT-1190-iterator-fix.patch, 
> MAHOUT-1190.patch, MAHOUT-1190-seq-dot-product.patch, 
> MAHOUT-1190-seq-dot-product.patch
>
>
> Currently when calling .assign() on a SASV with another vector and a custom 
> function, it will iterate through it and assign every single entry while also 
> referring to it by index.
> This makes the process *hugely* expensive (on a run of BallKMeans on the 20 
> newsgroups data set, profiling reveals that 92% of the runtime was spent 
> assigning the vectors).
> Here's a prototype patch:
> https://github.com/dfilimon/mahout/commit/63998d82bb750150a6ae09052dadf6c326c62d3d
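
As a rough illustration of the alternative (not the prototype patch linked
above), the sketch below assigns f(a, b) over two sparse vectors stored as
sorted index/value arrays by merging them in one pass, instead of looking up
every entry by index. The Sparse class and assign method are hypothetical, and
the sketch assumes f(0, 0) == 0 so that indices absent from both vectors can
be skipped.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

public class SparseMergeSketch {
  /** Hypothetical sparse vector: strictly increasing indices with their values. */
  static final class Sparse {
    final int[] indices;
    final double[] values;

    Sparse(int[] indices, double[] values) {
      this.indices = indices;
      this.values = values;
    }
  }

  /** Computes f(a[i], b[i]) for every index present in a or b with a single merge. */
  static Sparse assign(Sparse a, Sparse b, DoubleBinaryOperator f) {
    int[] idx = new int[a.indices.length + b.indices.length];
    double[] val = new double[idx.length];
    int i = 0;
    int j = 0;
    int n = 0;
    while (i < a.indices.length || j < b.indices.length) {
      boolean hasA = i < a.indices.length;
      boolean hasB = j < b.indices.length;
      // Take from a (or b) when it holds the smallest remaining index.
      boolean takeA = hasA && (!hasB || a.indices[i] <= b.indices[j]);
      boolean takeB = hasB && (!hasA || b.indices[j] <= a.indices[i]);
      int index = takeA ? a.indices[i] : b.indices[j];
      double x = takeA ? a.values[i++] : 0.0;
      double y = takeB ? b.values[j++] : 0.0;
      double r = f.applyAsDouble(x, y);
      if (r != 0.0) {  // keep the result sparse
        idx[n] = index;
        val[n] = r;
        n++;
      }
    }
    return new Sparse(Arrays.copyOf(idx, n), Arrays.copyOf(val, n));
  }
}
```

The merge touches each stored entry once, so the cost is linear in the number
of nonzeros of the two vectors rather than in repeated per-index lookups.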



[jira] [Created] (MAHOUT-1202) Speed up Vector operations

2013-05-03 Thread Dan Filimon (JIRA)
Dan Filimon created MAHOUT-1202:
---

 Summary: Speed up Vector operations
 Key: MAHOUT-1202
 URL: https://issues.apache.org/jira/browse/MAHOUT-1202
 Project: Mahout
  Issue Type: Improvement
  Components: Math
Affects Versions: 0.8
Reporter: Dan Filimon


Vector assign() and aggregate() can be significantly improved in some 
conditions taking into account the different properties of the vectors we're 
working with.

This issue relates to the design document at 
https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#heading=h.koi571fvwha3jj

and the patch at
https://reviews.apache.org/r/10669

The benchmarks are at
https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc&pli=1#gid=10

and while there are a few regressions (which will be fixed later regarding 
RandomAccessSparseVectors), it improves a lot of benchmarks as well as cleans 
up the code significantly.



[jira] [Commented] (MAHOUT-1117) Vectors are not hashable

2013-05-03 Thread Dan Filimon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648348#comment-13648348
 ] 

Dan Filimon commented on MAHOUT-1117:
-

About this, how would you go about creating a set of Vectors?
Would it be a HashSet or a TreeSet?

> Vectors are not hashable
> 
>
> Key: MAHOUT-1117
> URL: https://issues.apache.org/jira/browse/MAHOUT-1117
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0
>    Reporter: Dan Filimon
>Priority: Minor
>
> No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode().
> In working on improving clustering in Mahout, Ted Dunning wrote prototype 
> code for Streaming KMeans and Ball KMeans, that I'm working with him on. 
> These need to be used together in the MapReduce version.
> However, in Ball KMeans, we initialize the clusters using a probabilistic 
> approach similar to k-means++. This however requires a 
> Multinomial distribution of the points we want to cluster to 
> pick the centroids.
> Internally, the Multinomial uses a HashMap to keep track of the values it 
> can sample from.
> Since Vectors don't override Object's hashCode(), it is possible to get the 
> same value multiple times in the map (as long as the references differ).
> This is less of an issue because of how we're adding the vectors to the 
> multinomial (we can guarantee that the references will be unique) and once 
> MAHOUT-1116 is resolved the hashing will work okay for our needs.
> It still seems that it would be useful to have hashable vectors.
> What do you think? And what would a hash function look like?
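
One possible answer to the hash function question is sketched below: a small wrapper whose hashCode() and equals() are computed from the element values, so that a HashSet or HashMap treats two vectors with the same contents as the same key. HashableVector and countDistinct are hypothetical names used only for illustration; they are not Mahout classes, and the sketch assumes dense storage.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical value-semantics wrapper around a dense array of doubles.
final class HashableVector {
  private final double[] values;

  HashableVector(double... values) {
    // Defensive copy: the hash must not change after the vector is put in a set.
    this.values = values.clone();
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof HashableVector
        && Arrays.equals(values, ((HashableVector) other).values);
  }

  @Override
  public int hashCode() {
    // Content-based hash, consistent with equals() above.
    return Arrays.hashCode(values);
  }

  // Counts vectors that differ by content rather than by reference.
  static int countDistinct(HashableVector... vectors) {
    Set<HashableVector> distinct = new HashSet<>(Arrays.asList(vectors));
    return distinct.size();
  }
}
```

A content-based hash is only safe while the vector is not mutated after insertion, which is why the sketch copies the array. A TreeSet would instead need a total ordering on vectors (for example lexicographic), which is a separate design decision.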



[jira] [Resolved] (MAHOUT-1189) CosineDistanceMeasure doesn't return 0 for two 0 vectors

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1189.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1478733.

> CosineDistanceMeasure doesn't return 0 for two 0 vectors
> 
>
> Key: MAHOUT-1189
> URL: https://issues.apache.org/jira/browse/MAHOUT-1189
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
>
> CosineDistanceMeasure for two equal vectors should always return 0 like for 
> any other distance measure, however it returns 1.
> This patch fixes this issue.
> Also, note that it's not necessarily obvious what the return value should be 
> since the cosine of two 0-length vectors isn't defined.
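
The corner case described above can be illustrated with a minimal sketch on plain arrays. CosineDistanceSketch is a hypothetical class, not the committed Mahout code. Returning 0 for two zero vectors follows the issue description; returning 1 when exactly one vector is zero is an assumption made here for the sake of the example.

```java
// Hypothetical stand-alone cosine distance; not the Mahout implementation.
final class CosineDistanceSketch {
  static double distance(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    double denominator = Math.sqrt(normA) * Math.sqrt(normB);
    if (denominator == 0.0) {
      // The cosine is undefined here. Two zero vectors are equal, so report 0;
      // a zero vector against a non-zero one is treated as maximally distant
      // (an assumption of this sketch).
      return (normA == 0.0 && normB == 0.0) ? 0.0 : 1.0;
    }
    // Clamp to [-1, 1] so rounding error cannot produce a negative distance.
    double cosine = Math.max(-1.0, Math.min(1.0, dot / denominator));
    return 1.0 - cosine;
  }
}
```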



[jira] [Resolved] (MAHOUT-1180) Multinomial throws ConcurrentModificationException when iterating and setting probabilities

2013-05-03 Thread Dan Filimon (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Filimon resolved MAHOUT-1180.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed revision 1478723.

> Multinomial throws ConcurrentModificationException when iterating and 
> setting probabilities
> --
>
> Key: MAHOUT-1180
> URL: https://issues.apache.org/jira/browse/MAHOUT-1180
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>    Reporter: Dan Filimon
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT_1180.patch, MAHOUT_1180.patch
>
>
> Here's [1] an example of the problem (from BallKMeans, lines 225-232, [2]).
> When iterating through the elements in a Multinomial and updating the 
> probabilities, sometimes newWeight becomes 0 (because of using 
> CosineDistances).
> When setting a weight to 0 in Multinomial, the element is removed from the 
> items hash map while using the hash map for iteration.
> This causes a ConcurrentModificationException.
> [1] https://gist.github.com/dfilimon/5270234
> [2] 
> https://github.com/dfilimon/mahout/blob/skm/core/src/main/java/org/apache/mahout/clustering/streaming/cluster/BallKMeans.java#L225
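
The failure mode described above can be reproduced with a plain java.util.HashMap. The sketch below is not the Multinomial code; WeightUpdateSketch, scaleUnsafe and scaleSafe are hypothetical names. It shows the structural removal during iteration that triggers the exception, and one common remedy, which is to remove through the iterator.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class WeightUpdateSketch {
  // Removes zero-weight entries through the map while iterating over it.
  // The next call to the iterator detects the structural change and throws
  // ConcurrentModificationException.
  static void scaleUnsafe(Map<String, Double> weights, double factor) {
    for (String key : weights.keySet()) {
      double newWeight = weights.get(key) * factor;
      if (newWeight == 0.0) {
        weights.remove(key);          // structural modification mid-iteration
      } else {
        weights.put(key, newWeight);  // overwriting an existing key is fine
      }
    }
  }

  // Same update, but removal goes through the iterator, which is allowed.
  static void scaleSafe(Map<String, Double> weights, double factor) {
    Iterator<Map.Entry<String, Double>> it = weights.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Double> entry = it.next();
      double newWeight = entry.getValue() * factor;
      if (newWeight == 0.0) {
        it.remove();
      } else {
        entry.setValue(newWeight);
      }
    }
  }
}
```

An alternative with the same effect is to collect the keys to drop in a separate list and remove them once the loop has finished.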



Committing to mahout-git?

2013-05-02 Thread Dan Filimon
Hi,

I've put off committing my changes for too long already, and now that I'm
finally doing it, I would *really* love to push changes to the github repo.

Is this possible right now, or do I have to check out an SVN version,
fiddle with the diff and get it working?

Thanks! :)


Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-05-02 Thread Dan Filimon
Robin, I addressed the 7 points you mentioned, except for the one about
empty vectors.
It's unclear what aggregate should do for empty vectors (throw an exception
or return 0), so I opted for returning 0 everywhere.
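
To make the choice concrete, here is a minimal sketch of an aggregate over two plain arrays that returns 0 for empty input. AggregateSketch is a hypothetical class and the parameter order is an assumption of the example, not necessarily the signature used in the patch.

```java
import java.util.function.DoubleBinaryOperator;

final class AggregateSketch {
  // Combines x[i] and y[i] element by element, then folds the combined
  // values with the aggregator. Empty inputs yield 0 by convention.
  static double aggregate(double[] x, double[] y,
                          DoubleBinaryOperator aggregator,
                          DoubleBinaryOperator combiner) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("size mismatch");
    }
    if (x.length == 0) {
      return 0.0;  // the alternative would be to throw an exception here
    }
    double result = combiner.applyAsDouble(x[0], y[0]);
    for (int i = 1; i < x.length; i++) {
      result = aggregator.applyAsDouble(result, combiner.applyAsDouble(x[i], y[i]));
    }
    return result;
  }
}
```

With addition as the aggregator and multiplication as the combiner this computes a dot product, for which 0 is the natural result on empty input. For other aggregators, such as max or min, 0 is not the identity, which is part of why the right return value is debatable.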

The new tests are VectorBinaryAssignTest, VectorBinaryAssignCostTest,
VectorBinaryAggregateTest, VectorBinaryAggregateCostTest and FunctionTest.

The Times benchmark should run more quickly now – there was a tiny bug in
createOptimizedCopy that asked if "isDense()" rather than
"vector.isDense()" which made it not create the right kind of vector.

Comments-wise, I would *really* like this to be over; I feel like I've
explained the whole iteration to death by now in tons of places.



On Thu, May 2, 2013 at 8:07 PM, Dan Filimon wrote:

>This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10669/
>   Review request for mahout, Ted Dunning, Sebastian Schelter, and Robin
> Anil.
> By Dan Filimon.
>
> *Updated May 2, 2013, 5:07 p.m.*
> Changes
>
> This is the nearly-done version of the code.
>
> The bulk of the changes are in AbstractVector, VectorBinaryAssign, 
> VectorBinaryAggregate and Functions.
>
> As for comments in the new code, I felt the names are fairly self explanatory 
> so didn't comment every class.
> I did however write comments on top describing the changes as well as write 
> tests and link to the design doc.
>
>   Description
>
> This patch contains code cleaning up AbstractVector and making the operations 
> as fast as possible while still having a high level interface.
>
> The main changes are in AbstractVector as well as new methods in 
> DoubleDoubleFunction.
>
>   Testing
>
> The vector tests pass, but it's likely that the patch in its current state is 
> broken as other, unrelated tests (BallKMeans...) are failing.
> Also, my Hadoop conf is broken so I didn't run all the core tests. Anyone?
>
> I can't seem to find the bug, so _please_ have a closer look. It's still a 
> work in progress.
>
> The benchmarks seem comparable (although there are some jarring differences – 
> Minkowski distance seems a lot slower in new-dan-1 than old-trunk-2). It may 
> be however that this is just variance due to the load of the machine at the 
> time. I'm having trouble interpreting the benchmarks in general, so anyone 
> who could give me a hand is more than welcome.
>
>   Diffs (updated)
>
>- 
> core/src/main/java/org/apache/mahout/common/distance/ChebyshevDistanceMeasure.java
>(0a064c9)
>- 
> core/src/main/java/org/apache/mahout/common/distance/CosineDistanceMeasure.java
>(0c51591)
>- 
> core/src/main/java/org/apache/mahout/common/distance/ManhattanDistanceMeasure.java
>(c98e5f7)
>- 
> core/src/main/java/org/apache/mahout/common/distance/MinkowskiDistanceMeasure.java
>(b650a8d)
>- 
> core/src/main/java/org/apache/mahout/common/distance/TanimotoDistanceMeasure.java
>(d32c42a)
>- core/src/main/java/org/apache/mahout/ep/Mapping.java (900a0b8)
>- 
> core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/GivensThinSolver.java
>(7c391ef)
>- 
> core/src/test/java/org/apache/mahout/clustering/iterator/TestClusterClassifier.java
>(62c5acf)
>- 
> core/src/test/java/org/apache/mahout/common/distance/CosineDistanceMeasureTest.java
>(50b03f0)
>- math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
>(e12aa38)
>- math/src/main/java/org/apache/mahout/math/AbstractVector.java
>(090aa7a)
>- math/src/main/java/org/apache/mahout/math/Centroid.java (0c42196)
>- math/src/main/java/org/apache/mahout/math/ConstantVector.java
>(51d67d4)
>- math/src/main/java/org/apache/mahout/math/DelegatingVector.java
>(12220d4)
>- math/src/main/java/org/apache/mahout/math/DenseVector.java (41c356b)
>- 
> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
>(094003b)
>- math/src/main/java/org/apache/mahout/math/MatrixSlice.java (7f79c96)
>- math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
>(af70727)
>- math/src/main/java/org/apache/mahout/math/NamedVector.java (4b7a41d)
>- math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
>(650d82d)
>- math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
>(d1ea93a)
>- math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
>(6f85692)
>- 
> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
>(21982f9)
>- math/src/main/java/org/apache/mahout/math/Vector.java (2f8b417)
>- math/src/main/java/org/apache/mahout

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-30 Thread Dan Filimon
Yeah, we talked about 7 a while back in a chat, thanks for reminding me!

As for Times, that's really weird. That extra code should commute the
vectors if one is more dense than the other, so the performance of
Seq.fn(Dense) and Dense.fn(Seq) should be the same. And, even weirder, when
I ran the benchmark with a breakpoint on the case where the lhs is Dense, it
never triggered.

I'm gonna just start cleaning it up and hopefully come up with something
later.


On Tue, Apr 30, 2013 at 10:38 PM, Robin Anil  wrote:

> Before I forget. One more thing
> 7. Add test harness for functions. So if say a function says
> isLikeLeftPlus() == true. Take random values from Double Range -inf to +inf
> to make sure its true for those values.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Tue, Apr 30, 2013 at 2:19 PM, Robin Anil  wrote:
>
>> Tried with your changes pulled in. Other than maybe variance due to my
>> process state (of my macbook). The benchmarks that had the regression don't
>> show any marked improvement.
>>
>> See column Y
>>
>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=8
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Tue, Apr 30, 2013 at 12:55 PM, Dan Filimon <
>> dangeorge.fili...@gmail.com> wrote:
>>
>>> Robin, regarding Times, I think it should work the same now. I changed
>>> the swapping condition in AbstractVector.times to something more readable.
>>> As for norm1, it looks like it's exactly the same. I don't see what's
>>> causing the slowdown other than the indirection.
>>>
>>> Could you please try the new version on your machine?
>>>
>>
>>
>


Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-30 Thread Dan Filimon
Robin, regarding Times, I think it should work the same now. I changed the
swapping condition in AbstractVector.times to something more readable.
As for norm1, it looks like it's exactly the same. I don't see what's
causing the slowdown other than the indirection.

Could you please try the new version on your machine?


Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-30 Thread Dan Filimon
I'm looking at the norm1 and times regressions again, maybe there's
something I missed.

I agree with 1 through 3.

About 4, 5, do you think we'd lose too much precision?

About 6, you're giving examples of tests, not different special cases,
right?

As for the names, they're unfortunate, but I picked these after Ted said
the initial ones (isLikeLeftMult was isLikeF0XEquals0...) were even worse.
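
The kind of test harness discussed in point 7 could look roughly like the sketch below, which checks the property behind isLikeLeftMult (f(0, x) = 0, going by the old name isLikeF0XEquals0) on randomly sampled doubles. FunctionPropertySketch, BinaryFn and holdsLeftMult are hypothetical names, not part of the patch.

```java
import java.util.Random;

final class FunctionPropertySketch {
  // Stand-in for a two-argument function on doubles.
  interface BinaryFn {
    double apply(double x, double y);
  }

  // Returns true if f(0, y) == 0 for every sampled finite y.
  static boolean holdsLeftMult(BinaryFn f, long seed, int samples) {
    Random random = new Random(seed);
    for (int i = 0; i < samples; i++) {
      // Random bit patterns cover the whole double range, not just [0, 1).
      double y = Double.longBitsToDouble(random.nextLong());
      if (Double.isNaN(y) || Double.isInfinite(y)) {
        continue;  // 0 * infinity is NaN, so non-finite inputs are skipped
      }
      if (f.apply(0.0, y) != 0.0) {
        return false;
      }
    }
    return true;
  }
}
```

Note that the sketch deliberately excludes infinities: even plain multiplication does not satisfy f(0, x) = 0 at x = +inf or -inf, so a harness that samples those values needs to decide how to treat them.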


On Tue, Apr 30, 2013 at 7:01 PM, Robin Anil  wrote:

> I see that the end is tantalizingly near. Few other review comments:
>
> 1) Remove all unused code.
> 2) Do not allow construction of empty vectors. Just makes no sense (Unless
> someone strongly disagrees).
> 3) Comment all classes (AssignNonzerosIterateThisLookupThat etc).
> 4) Change < Constants.EPSILON checks to == 0.0.
> 5) Change > Constants.EPSILON check to !=0  ( I am sure you are using
> Math.abs for such checks so it should be safe)
> 6) Need some individual tests for each of
> (AssignNonzerosIterateThisLookupThat etc. for toy examples, it might as
> well be doing PLUS_ABS or MINUS for the entire thing)
>
> Apart from this I am not too happy with names of isLikeLeftMult,
> isLikeRightPlus etc. But I dont have a good alternate either. Please run
> this by Ted.
>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Tue, Apr 30, 2013 at 10:10 AM, Robin Anil  wrote:
>
> > Yes the incrementQuick is a known speed booster (due to half the number
> of
> > key hash generation). You can leave that to me. I can make it faster
> after
> > you check this in. It might require some refactor of the increment quick
> > interface.
> >
> > What about the regressions in SeqSparseVector norm1? and
> Dense.times(Seq)?
> > Can you explain that?
> >
>


Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-30 Thread Dan Filimon
So now, RandomAccessSparseVector seems to be the most affected.
Pretty much every regression is related to RASV. Could it be that it's
better to handle it as a non-constant time update Vector and drop the
in-place updates?
Otherwise, the code that implements Minus is pretty much the same as trunk:

  // Update x in place using only the non-zero elements of y.
  Iterator<Vector.Element> yi = y.iterateNonZero();
  Vector.Element ye;
  while (yi.hasNext()) {
    ye = yi.next();
    x.setQuick(ye.index(), f.apply(x.getQuick(ye.index()), ye.get()));
  }
  return x;

The main difference is that it's not going through incrementQuick which for
RASVs skips a few calls.
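
As a rough analogy for why the two paths differ on a hash-backed vector, the sketch below uses a java.util.HashMap in place of the vector's internal map. IncrementSketch and its methods are hypothetical; the real RandomAccessSparseVector uses its own primitive hash map, so this only illustrates the difference in lookup count.

```java
import java.util.HashMap;
import java.util.Map;

final class IncrementSketch {
  // getQuick + setQuick style: the key is hashed once to read the old
  // value and a second time to write the new one.
  static void addViaGetSet(Map<Integer, Double> x, int index, double delta) {
    Double current = x.get(index);
    x.put(index, (current == null ? 0.0 : current) + delta);
  }

  // incrementQuick style: a single map operation locates the slot once
  // and updates it in place.
  static void addViaIncrement(Map<Integer, Double> x, int index, double delta) {
    x.merge(index, delta, Double::sum);
  }
}
```

Both methods leave the map in the same state; only the number of hash computations per update differs.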


On Tue, Apr 30, 2013 at 5:45 AM, Robin Anil  wrote:

> After this change the benchmark actually takes about 41 minutes :)
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 29, 2013 at 9:45 PM, Robin Anil  wrote:
>
> > After increasing the lead time to 15 seconds, I believe I am giving
> enough
> > time for JIT to take place. This way we are measuring only the JITed code
> > not the interpreted code. There is a flag in JVM with which you can
> change
> > the threshold -XX:CompileThreshold=<n> (default: 10000 on the server VM). I didn't want to
> mess
> > with it to make it as close as possible to the real world.
> >
> > I have also disabled the cluster benchmarks as they are not executed
> > enough times for JIT to compile them. I have to look more at them. for
> now
> > ignore them for your patch.
> >
> > But you can see that variance is greatly reduced (See Sheet Version8)
> >
> >
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=8
> >
> > There are only a handful of regressions Norm1 (for Seq and Rand), Minus
> > and Plus (for Rand) and a few others.
> >
> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >
> >
> > On Mon, Apr 29, 2013 at 11:37 AM, Robin Anil 
> wrote:
> >
> >> I have a few ideas to tweak around with the JIT to make the benchmarks
> >> more stable. I will ping back once I get some time to do something.
> >>
> >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>
> >>
> >> On Mon, Apr 29, 2013 at 10:32 AM, Dan Filimon <
> >> dangeorge.fili...@gmail.com> wrote:
> >>
> >>> That's the weirdness, it's _exactly_ the same code.
> >>>
> >>>
> >>> On Mon, Apr 29, 2013 at 6:31 PM, Robin Anil 
> >>> wrote:
> >>>
> >>> > I will pull your Patch in and take a look a close look at it tonight.
> >>> Whats
> >>> > different in Run3 in Version7 v/s Run2 and Run1. Because Run3 looks
> >>> really
> >>> > good and the regressions of 30% in some are actually not as bad and
> >>> nothing
> >>> > no longer looks ridiculously blocking the patch.
> >>> >
> >>> >
> >>> > On Mon, Apr 29, 2013 at 10:13 AM, Dan Filimon
> >>> > wrote:
> >>> >
> >>> > > I uploaded the latest patch to my vector branch on github [0].
> >>> > >
> >>> > > The latest revision is, as always, here [1], "Version 7
> >>> > (Assign/Aggregate;
> >>> > > 1 sec runtime)". Tests pass for this one except for some aggregate
> >>> tests
> >>> > > (it's not clear if aggregate should throw an exception when the
> >>> vectors
> >>> > are
> >>> > > empty or return 0; I chose return 0).
> >>> > >
> >>> > > There's quite a bit of variance, a lot of it even in code that I
> >>> haven't
> >>> > > changed between runs. So, norm2 shows regressions in version 7 and
> >>> > > improvements in version 6 and the code not having been changed.
> >>> > >
> >>> > > The dot product is also oscillating in version 7, showing different
> >>> > results
> >>> > > in the different runs (of the same algorithm).
> >>> > >
> >>> > > Except for some cleanup, I don't think I can do much better than
> >>> this, so
> >>> > > could someone please also run the benchmarks and I'll update the
> >>> patch on
> >>> > > ReviewBoard?
> >>> > >
> >>> > > [0] https://github.com/dfilimon/mahout/tree/vector
> >>> > > [1]
> >>> > >
> >>> > >
> >

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-29 Thread Dan Filimon
That's the weirdness, it's _exactly_ the same code.


On Mon, Apr 29, 2013 at 6:31 PM, Robin Anil  wrote:

> I will pull your Patch in and take a look a close look at it tonight. Whats
> different in Run3 in Version7 v/s Run2 and Run1. Because Run3 looks really
> good and the regressions of 30% in some are actually not as bad and nothing
> no longer looks ridiculously blocking the patch.
>
>
> On Mon, Apr 29, 2013 at 10:13 AM, Dan Filimon
> wrote:
>
> > I uploaded the latest patch to my vector branch on github [0].
> >
> > The latest revision is, as always, here [1], "Version 7
> (Assign/Aggregate;
> > 1 sec runtime)". Tests pass for this one except for some aggregate tests
> > (it's not clear if aggregate should throw an exception when the vectors
> are
> > empty or return 0; I chose return 0).
> >
> > There's quite a bit of variance, a lot of it even in code that I haven't
> > changed between runs. So, norm2 shows regressions in version 7 and
> > improvements in version 6 and the code not having been changed.
> >
> > The dot product is also oscillating in version 7, showing different
> results
> > in the different runs (of the same algorithm).
> >
> > Except for some cleanup, I don't think I can do much better than this, so
> > could someone please also run the benchmarks and I'll update the patch on
> > ReviewBoard?
> >
> > [0] https://github.com/dfilimon/mahout/tree/vector
> > [1]
> >
> >
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=7
> >
> >
> > On Sun, Apr 28, 2013 at 11:18 PM, Ted Dunning 
> > wrote:
> >
> > > It would *really* help to put a legend on these spreadsheets even if
> you
> > > have described it in the past.
> > >
> > > I take it that these are number of ops and bigger is better, right?
> > >
> > > And where is a description of what patch runs 1 and 2 actually are?
> > >
> > >
> > >
> > > On Sun, Apr 28, 2013 at 2:23 AM, Dan Filimon <
> > dangeorge.fili...@gmail.com>wrote:
> > >
> > >> Here's [1] the most recent benchmarks, after refactoring the Assign
> into
> > >> separate classes.
> > >> It's now using a 10 second runtime and a 1 second lead time and these
> > are
> > >> only the benchmarks involving assign().
> > >>
> > >> Oddly enough there are regressions with RandomAccessSparseVectors'
> > plus()
> > >> and minus() despite these working like trunk from what I can tell from
> > the
> > >> tests (VectorBinaryAssignTest).
> > >>
> > >> Also, there's quite a bit of variance for some reason... Clone, Create
> > >>  are just two of the odd ones.
> > >>
> > >> [1]
> > >>
> >
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
> > >>
> > >>
> > >> On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil 
> > wrote:
> > >>
> > >>> Do in place for rand an dense. Assign a bug to me to speed up rasv
> > >>> GetElement.
> > >>> On Apr 25, 2013 2:56 PM, "Dan Filimon" 
> > >>> wrote:
> > >>>
> > >>> > Nearly done splitting the code up, but I'm not sure what the costs
> > >>> should
> > >>> > ideally be.
> > >>> >
> > >>> > Robin, you proposed: cost of iteration + cost of lookup + cost of
> > >>> update
> > >>> > (if its not in-place)
> > >>> >
> > >>> > This sounds like it's per element, rather than for the entire
> vector.
> > >>> Also,
> > >>> > are we just going to assume that there will be as many updates as
> > >>> non-zeros
> > >>> > or as large as the array is? How does the cost of an update factor
> > in?
> > >>> >
> > >>> > And, is "in-place" just for DenseVectors?
> > >>> >
> > >>> >
> > >>> > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil 
> > >>> wrote:
> > >>> >
> > >>> > > Depends on the speed of copying a 1M double array v/s doing a
> > calloc
> > >>> +
> > >>> > > copying 1000 non zeros (Assuming java is doing that underneath).
> > >>> > >

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-29 Thread Dan Filimon
I uploaded the latest patch to my vector branch on github [0].

The latest revision is, as always, here [1], "Version 7 (Assign/Aggregate;
1 sec runtime)". Tests pass for this one except for some aggregate tests
(it's not clear if aggregate should throw an exception when the vectors are
empty or return 0; I chose return 0).

There's quite a bit of variance, a lot of it even in code that I haven't
changed between runs. For example, norm2 shows regressions in version 7 and
improvements in version 6 even though the code has not changed.

The dot product is also oscillating in version 7, showing different results
in the different runs (of the same algorithm).

Except for some cleanup, I don't think I can do much better than this, so
could someone please also run the benchmarks and I'll update the patch on
ReviewBoard?

[0] https://github.com/dfilimon/mahout/tree/vector
[1]
https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=7


On Sun, Apr 28, 2013 at 11:18 PM, Ted Dunning  wrote:

> It would *really* help to put a legend on these spreadsheets even if you
> have described it in the past.
>
> I take it that these are number of ops and bigger is better, right?
>
> And where is a description of what patch runs 1 and 2 actually are?
>
>
>
> On Sun, Apr 28, 2013 at 2:23 AM, Dan Filimon 
> wrote:
>
>> Here's [1] the most recent benchmarks, after refactoring the Assign into
>> separate classes.
>> It's now using a 10 second runtime and a 1 second lead time and these are
>> only the benchmarks involving assign().
>>
>> Oddly enough there are regressions with RandomAccessSparseVectors' plus()
>> and minus() despite these working like trunk from what I can tell from the
>> tests (VectorBinaryAssignTest).
>>
>> Also, there's quite a bit of variance for some reason... Clone, Create
>>  are just two of the odd ones.
>>
>> [1]
>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
>>
>>
>> On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil  wrote:
>>
>>> Do in place for rand an dense. Assign a bug to me to speed up rasv
>>> GetElement.
>>> On Apr 25, 2013 2:56 PM, "Dan Filimon" 
>>> wrote:
>>>
>>> > Nearly done splitting the code up, but I'm not sure what the costs
>>> should
>>> > ideally be.
>>> >
>>> > Robin, you proposed: cost of iteration + cost of lookup + cost of
>>> update
>>> > (if its not in-place)
>>> >
>>> > This sounds like it's per element, rather than for the entire vector.
>>> Also,
>>> > are we just going to assume that there will be as many updates as
>>> non-zeros
>>> > or as large as the array is? How does the cost of an update factor in?
>>> >
>>> > And, is "in-place" just for DenseVectors?
>>> >
>>> >
>>> > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil 
>>> wrote:
>>> >
>>> > > Depends on the speed of copying a 1M double array v/s doing a calloc
>>> +
>>> > > copying 1000 non zeros (Assuming java is doing that underneath).
>>> > >
>>> > > --
>>> > > Robin Anil
>>> > >
>>> > >
>>> > >
>>> > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <
>>> > dangeorge.fili...@gmail.com
>>> > > >wrote:
>>> > >
>>> > > > Right, but is clone() generally slower than assigning? That
>>> strikes me
>>> > as
>>> > > > odd; doesn't Java optimize copying the internal structures (there
>>> are
>>> > > > arrays underneath after all)?
>>> > > >
>>> > > >
>>> > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil >> >
>>> > > wrote:
>>> > > >
>>> > > >> Seems like for dense clone is slower than like().assign I need to
>>> test
>>> > > it
>>> > > >> with different sizes to be sure. I kept it from the old behavior.
>>> > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <
>>> dangeorge.fili...@gmail.com>
>>> > > >> wrote:
>>> > > >>
>>> > > >>> Okay, so I should split it further into smaller sub-cases that
>>> handle
>>> > > >>> each Vector type. I tried making so that these match the cases
>>&

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-28 Thread Dan Filimon
Pushed aggregate() too, but the results still don't look that great.


On Sun, Apr 28, 2013 at 9:10 PM, Dan Filimon wrote:

> Yeah. It's there, but if you wait for a bit before having a look I'll also
> have results from aggregate().
>
>
> On Sun, Apr 28, 2013 at 9:08 PM, Robin Anil  wrote:
>
>> Do you have the code committed in your repo?
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Sun, Apr 28, 2013 at 4:23 AM, Dan Filimon > >wrote:
>>
>> > Here's [1] the most recent benchmarks, after refactoring the Assign into
>> > separate classes.
>> > It's now using a 10 second runtime and a 1 second lead time and these
>> are
>> > only the benchmarks involving assign().
>> >
>> > Oddly enough there are regressions with RandomAccessSparseVectors'
>> plus()
>> > and minus() despite these working like trunk from what I can tell from
>> the
>> > tests (VectorBinaryAssignTest).
>> >
>> > Also, there's quite a bit of variance for some reason... Clone, Create
>>  are
>> > just two of the odd ones.
>> >
>> > [1]
>> >
>> >
>> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
>> >
>> >
>> > On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil 
>> wrote:
>> >
>> > > Do in place for rand an dense. Assign a bug to me to speed up rasv
>> > > GetElement.
>> > > On Apr 25, 2013 2:56 PM, "Dan Filimon" 
>> > > wrote:
>> > >
>> > > > Nearly done splitting the code up, but I'm not sure what the costs
>> > should
>> > > > ideally be.
>> > > >
>> > > > Robin, you proposed: cost of iteration + cost of lookup + cost of
>> > update
>> > > > (if its not in-place)
>> > > >
>> > > > This sounds like it's per element, rather than for the entire
>> vector.
>> > > Also,
>> > > > are we just going to assume that there will be as many updates as
>> > > non-zeros
>> > > > or as large as the array is? How does the cost of an update factor
>> in?
>> > > >
>> > > > And, is "in-place" just for DenseVectors?
>> > > >
>> > > >
>> > > > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil 
>> > > wrote:
>> > > >
>> > > > > Depends on the speed of copying a 1M double array v/s doing a
>> calloc
>> > +
>> > > > > copying 1000 non zeros (Assuming java is doing that underneath).
>> > > > >
>> > > > > --
>> > > > > Robin Anil
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <
>> > > > dangeorge.fili...@gmail.com
>> > > > > >wrote:
>> > > > >
>> > > > > > Right, but is clone() generally slower than assigning? That
>> strikes
>> > > me
>> > > > as
>> > > > > > odd; doesn't Java optimize copying the internal structures
>> (there
>> > are
>> > > > > > arrays underneath after all)?
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <
>> robina...@apache.org
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > >> Seems like for dense clone is slower than like().assign I need
>> to
>> > > test
>> > > > > it
>> > > > > >> with different sizes to be sure. I kept it from the old
>> behavior.
>> > > > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <
>> > dangeorge.fili...@gmail.com
>> > > >
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >>> Okay, so I should split it further into smaller sub-cases that
>> > > handle
>> > > > > >>> each Vector type. I tried making so that these match the
>> cases in
>> > > the
>> > > > > >>> document to the extent possible.
>> > > > > >>> You're right. It is ugly and I need to split it up.
>> > > > > >>>

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-28 Thread Dan Filimon
Yeah. It's there, but if you wait for a bit before having a look I'll also
have results from aggregate().


On Sun, Apr 28, 2013 at 9:08 PM, Robin Anil  wrote:

> Do you have the code committed in your repo?
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Apr 28, 2013 at 4:23 AM, Dan Filimon  >wrote:
>
> > Here's [1] the most recent benchmarks, after refactoring the Assign into
> > separate classes.
> > It's now using a 10 second runtime and a 1 second lead time and these are
> > only the benchmarks involving assign().
> >
> > Oddly enough there are regressions with RandomAccessSparseVectors' plus()
> > and minus() despite these working like trunk from what I can tell from
> the
> > tests (VectorBinaryAssignTest).
> >
> > Also, there's quite a bit of variance for some reason... Clone, Create
>  are
> > just two of the odd ones.
> >
> > [1]
> >
> >
> https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
> >
> >
> > On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil 
> wrote:
> >
> > > Do in place for rand an dense. Assign a bug to me to speed up rasv
> > > GetElement.
> > > On Apr 25, 2013 2:56 PM, "Dan Filimon" 
> > > wrote:
> > >
> > > > Nearly done splitting the code up, but I'm not sure what the costs
> > should
> > > > ideally be.
> > > >
> > > > Robin, you proposed: cost of iteration + cost of lookup + cost of
> > update
> > > > (if its not in-place)
> > > >
> > > > This sounds like it's per element, rather than for the entire vector.
> > > Also,
> > > > are we just going to assume that there will be as many updates as
> > > non-zeros
> > > > or as large as the array is? How does the cost of an update factor
> in?
> > > >
> > > > And, is "in-place" just for DenseVectors?
> > > >
> > > >
> > > > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil 
> > > wrote:
> > > >
> > > > > Depends on the speed of copying a 1M double array v/s doing a
> calloc
> > +
> > > > > copying 1000 non zeros (Assuming java is doing that underneath).
> > > > >
> > > > > --
> > > > > Robin Anil
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <
> > > > dangeorge.fili...@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > Right, but is clone() generally slower than assigning? That
> strikes
> > > me
> > > > as
> > > > > > odd; doesn't Java optimize copying the internal structures (there
> > are
> > > > > > arrays underneath after all)?
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <
> robina...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > >> Seems like for dense clone is slower than like().assign I need
> to
> > > test
> > > > > it
> > > > > >> with different sizes to be sure. I kept it from the old
> behavior.
> > > > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <
> > dangeorge.fili...@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Okay, so I should split it further into smaller sub-cases that
> > > handle
> > > > > >>> each Vector type. I tried making so that these match the cases
> in
> > > the
> > > > > >>> document to the extent possible.
> > > > > >>> You're right. It is ugly and I need to split it up.
> > > > > >>>
> > > > > >>> I removed the OrderedIntDoubleMapping (but with another if...)
> > but
> > > > one
> > > > > >>> more thing that helped is making the assign() only go through
> the
> > > > > non-zeros
> > > > > >>> when setting up a new copy.
> > > > > >>>
> > > > > >>> About the copying, I saw that the optimized copy doesn't always
> > > call
> > > > > >>> clone clone on the Vector. Why is this?
> > 

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-28 Thread Dan Filimon
Here are [1] the most recent benchmarks, after refactoring the Assign into
separate classes.
It's now using a 10 second runtime and a 1 second lead time and these are
only the benchmarks involving assign().

Oddly enough there are regressions with RandomAccessSparseVectors' plus()
and minus() despite these working like trunk from what I can tell from the
tests (VectorBinaryAssignTest).

Also, there's quite a bit of variance for some reason... Clone and Create are
just two of the odd ones.

[1]
https://docs.google.com/spreadsheet/ccc?key=0AochdzPoBmWodG9RTms1UG40YlNQd3ByUFpQY0FLWmc#gid=4
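
For readers unfamiliar with the setup, a timed benchmark with a lead (warm-up) period and a fixed measurement window could look roughly like the sketch below. This is a hypothetical harness, not the benchmark code used for these numbers; TimedBenchmarkSketch, countCalls and spin are made-up names, and the durations in main are scaled down from the 1 second / 10 second settings mentioned above.

```java
import java.util.Arrays;

// Hypothetical timing harness; not the benchmark code used in the thread.
final class TimedBenchmarkSketch {
  // Runs op repeatedly for leadMillis without reporting (warm-up), then for
  // runMillis, and returns the number of calls made in the measured window.
  static long countCalls(Runnable op, long leadMillis, long runMillis) {
    spin(op, leadMillis);
    return spin(op, runMillis);
  }

  // Calls op in a loop until the time budget is used up; returns the call count.
  private static long spin(Runnable op, long millis) {
    long budgetNanos = millis * 1_000_000L;
    long start = System.nanoTime();
    long calls = 0;
    while (System.nanoTime() - start < budgetNanos) {
      op.run();
      calls++;
    }
    return calls;
  }

  public static void main(String[] args) {
    double[] data = new double[1_000];
    // Scaled-down durations (100 ms lead, 1000 ms run) so the demo finishes quickly.
    long calls = countCalls(() -> Arrays.fill(data, 1.0), 100, 1_000);
    System.out.println("calls in measured window: " + calls);
  }
}
```

A short lead time relative to JIT warm-up is one possible source of run-to-run variance, though that is only a guess here.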


On Fri, Apr 26, 2013 at 1:30 AM, Robin Anil  wrote:

> Do in place for rand and dense. Assign a bug to me to speed up RASV
> getElement().
> On Apr 25, 2013 2:56 PM, "Dan Filimon" 
> wrote:
>
> > Nearly done splitting the code up, but I'm not sure what the costs should
> > ideally be.
> >
> > Robin, you proposed: cost of iteration + cost of lookup + cost of update
> > (if it's not in-place)
> >
> > This sounds like it's per element, rather than for the entire vector. Also,
> > are we just going to assume that there will be as many updates as non-zeros,
> > or as many as the array is large? How does the cost of an update factor in?
> >
> > And, is "in-place" just for DenseVectors?
> >
> >
> > On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil wrote:
> >
> > > Depends on the speed of copying a 1M double array vs. doing a calloc +
> > > copying 1000 non-zeros (assuming Java is doing that underneath).
> > >
> > > --
> > > Robin Anil
> > >
> > >
> > >
> > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> > >
> > > > Right, but is clone() generally slower than assigning? That strikes me as
> > > > odd; doesn't Java optimize copying the internal structures (there are
> > > > arrays underneath after all)?
> > > >
> > > >
> > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil wrote:
> > > >
> > > >> Seems like for dense, clone is slower than like().assign. I need to test it
> > > >> with different sizes to be sure. I kept it from the old behavior.
> > > >> On Apr 25, 2013 2:12 AM, "Dan Filimon"
> > > >> wrote:
> > > >>
> > > >>> Okay, so I should split it further into smaller sub-cases that handle
> > > >>> each Vector type. I tried to make these match the cases in the
> > > >>> document to the extent possible.
> > > >>> You're right. It is ugly and I need to split it up.
> > > >>>
> > > >>> I removed the OrderedIntDoubleMapping (but with another if...) but one
> > > >>> more thing that helped is making the assign() only go through the non-zeros
> > > >>> when setting up a new copy.
> > > >>>
> > > >>> About the copying, I saw that the optimized copy doesn't always call
> > > >>> clone() on the Vector. Why is this?
> > > >>>
> > > >>> Thanks for the test and the feedback! :)
> > > >>>
> > > >>>
> > > >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil wrote:
> > > >>>
> > > >>>> Yes, it's reaching dev. I checked out your code to take a look. I
> > > >>>> understand you are still trying to mix isSequentialAccess etc. into the
> > > >>>> validity calculation. The whole point of this architecture is to completely
> > > >>>> unroll things to be more modular instead of creating giant spaghetti.
> > > >>>>
> > > >>>> Here is an example of what I am trying to say.
> > > >>>>
> > > >>>> This is your current implementation of AssignIterateOneLookupOther.
> > > >>>> There are so many conditionals.
> > > >>>>
> > > >>>>
> > > >>>>    @Override
> > > >>>>    public Vector assign(Vector x, Vector y, DoubleDoubleFunction f) {
> > > >>>>      return !swap ? assignInner(x, y, f) : assignInner(y, x, f);
> > > >>>>    }

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-25 Thread Dan Filimon
Nearly done splitting the code up, but I'm not sure what the costs should
ideally be.

Robin, you proposed: cost of iteration + cost of lookup + cost of update
(if it's not in-place)

This sounds like it's per element, rather than for the entire vector. Also,
are we just going to assume that there will be as many updates as non-zeros,
or as many as the array is large? How does the cost of an update factor in?

And, is "in-place" just for DenseVectors?
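
To make the question concrete, here is one possible reading of the proposed formula as a whole-vector estimate, sketched in Java. Everything in it is an assumption for illustration: AssignCostSketch and estimate are hypothetical names, the unit costs are arbitrary numbers, and it assumes one potential update per iterated element, which is exactly the point that is still open above.

```java
// Hypothetical cost model; names and unit costs are illustrative assumptions.
final class AssignCostSketch {
  // Whole-vector estimate: every iterated element pays for iteration and one
  // lookup in the other vector; an update cost is added per element only when
  // the result cannot be written in place.
  static double estimate(int iteratedElements, double iterateCost,
                         double lookupCost, double updateCost, boolean inPlace) {
    double perElement = iterateCost + lookupCost + (inPlace ? 0.0 : updateCost);
    return iteratedElements * perElement;
  }

  public static void main(String[] args) {
    // Same vector, with and without in-place updates.
    System.out.println(estimate(1_000, 1.0, 2.0, 5.0, true));
    System.out.println(estimate(1_000, 1.0, 2.0, 5.0, false));
  }
}
```

With this reading the per-element formula is simply multiplied by the number of elements iterated (non-zeros for sparse vectors, the full size for dense ones), so the in-place and merge variants differ only in the update term.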


On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil  wrote:

> Depends on the speed of copying a 1M double array vs. doing a calloc +
> copying 1000 non-zeros (assuming Java is doing that underneath).
>
> --
> Robin Anil
>
>
>
> On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon wrote:
>
> > Right, but is clone() generally slower than assigning? That strikes me as
> > odd; doesn't Java optimize copying the internal structures (there are
> > arrays underneath after all)?
> >
> >
> > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil wrote:
> >
> >> Seems like for dense, clone is slower than like().assign. I need to test it
> >> with different sizes to be sure. I kept it from the old behavior.
> >> On Apr 25, 2013 2:12 AM, "Dan Filimon"
> >> wrote:
> >>
> >>> Okay, so I should split it further into smaller sub-cases that handle
> >>> each Vector type. I tried to make these match the cases in the
> >>> document to the extent possible.
> >>> You're right. It is ugly and I need to split it up.
> >>>
> >>> I removed the OrderedIntDoubleMapping (but with another if...) but one
> >>> more thing that helped is making the assign() only go through the non-zeros
> >>> when setting up a new copy.
> >>>
> >>> About the copying, I saw that the optimized copy doesn't always call
> >>> clone() on the Vector. Why is this?
> >>>
> >>> Thanks for the test and the feedback! :)
> >>>
> >>>
> >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil wrote:
> >>>
> >>>> Yes, it's reaching dev. I checked out your code to take a look. I
> >>>> understand you are still trying to mix isSequentialAccess etc. into the
> >>>> validity calculation. The whole point of this architecture is to completely
> >>>> unroll things to be more modular instead of creating giant spaghetti.
> >>>>
> >>>> Here is an example of what I am trying to say.
> >>>>
> >>>> This is your current implementation of AssignIterateOneLookupOther.
> >>>> There are so many conditionals.
> >>>>
> >>>>
> >>>>    @Override
> >>>>    public Vector assign(Vector x, Vector y, DoubleDoubleFunction f) {
> >>>>      return !swap ? assignInner(x, y, f) : assignInner(y, x, f);
> >>>>    }
> >>>>
> >>>>    public Vector assignInner(Vector x, Vector y, DoubleDoubleFunction f) {
> >>>>      Iterator<Vector.Element> xi = x.iterateNonZero();
> >>>>      Vector.Element xe;
> >>>>      Vector.Element ye;
> >>>>      OrderedIntDoubleMapping updates = new OrderedIntDoubleMapping();
> >>>>      while (xi.hasNext()) {
> >>>>        xe = xi.next();
> >>>>        ye = y.getElement(xe.index());
> >>>>        if (!swap) {
> >>>>          xe.set(f.apply(xe.get(), ye.get()));
> >>>>        } else {
> >>>>          if (ye.get() != 0.0 || y.isAddConstantTime()) {
> >>>>            ye.set(f.apply(ye.get(), xe.get()));
> >>>>          } else {
> >>>>            updates.set(xe.index(), f.apply(ye.get(), xe.get()));
> >>>>          }
> >>>>        }
> >>>>      }
> >>>>      if (swap && !y.isAddConstantTime()) {
> >>>>        y.mergeUpdates(updates);
> >>>>      }
> >>>>      return swap ? y : x;
> >>>>    }
> >>>>
> >>>> Split this into.
> >>>>
> >>>> IterateThisLookupThatInplaceUpdate
> >>>> IterateThatLookupThisInplaceUpdate

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-25 Thread Dan Filimon
Right, but is clone() generally slower than assigning? That strikes me as
odd; doesn't Java optimize copying the internal structures (there are
arrays underneath after all)?
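
The trade-off in question can be sketched with plain Java arrays: cloning copies every slot of the backing array, zeros included, while allocating a fresh zero-filled array and writing only the non-zeros touches far fewer entries when the data is sparse. This is an illustrative stand-in, not Mahout's clone() or like().assign() code; CopySketch and its method names are hypothetical, and the sizes in main just mirror the 1M / 1000 figures mentioned in the thread.

```java
import java.util.Arrays;

// Hypothetical stand-ins for the two copy strategies; not Mahout code.
final class CopySketch {
  // Strategy 1: clone the whole backing array, zeros included.
  static double[] copyByClone(double[] dense) {
    return dense.clone();
  }

  // Strategy 2: allocate a zero-filled array and write only the non-zero
  // entries, given as parallel arrays of indices and values.
  static double[] copyNonZeros(int size, int[] indices, double[] values) {
    double[] out = new double[size]; // Java zero-fills new arrays
    for (int i = 0; i < indices.length; i++) {
      out[indices[i]] = values[i];
    }
    return out;
  }

  public static void main(String[] args) {
    int size = 1_000_000;
    int nonZeros = 1_000;
    int[] indices = new int[nonZeros];
    double[] values = new double[nonZeros];
    double[] dense = new double[size];
    for (int i = 0; i < nonZeros; i++) {
      indices[i] = i * (size / nonZeros); // spread the non-zeros evenly
      values[i] = i + 1.0;
      dense[indices[i]] = values[i];
    }
    double[] a = copyByClone(dense);
    double[] b = copyNonZeros(size, indices, values);
    // Both strategies should produce the same contents.
    System.out.println(Arrays.equals(a, b));
  }
}
```

Which of the two is faster in practice would have to be measured for each size and sparsity; the sketch only shows that both produce the same result.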


On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil  wrote:

> Seems like for dense, clone is slower than like().assign. I need to test it
> with different sizes to be sure. I kept it from the old behavior.
> On Apr 25, 2013 2:12 AM, "Dan Filimon" 
> wrote:
>
>> Okay, so I should split it further into smaller sub-cases that handle
>> each Vector type. I tried to make these match the cases in the
>> document to the extent possible.
>> You're right. It is ugly and I need to split it up.
>>
>> I removed the OrderedIntDoubleMapping (but with another if...) but one
>> more thing that helped is making the assign() only go through the non-zeros
>> when setting up a new copy.
>>
>> About the copying, I saw that the optimized copy doesn't always call
>> clone() on the Vector. Why is this?
>>
>> Thanks for the test and the feedback! :)
>>
>>
>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil  wrote:
>>
>>> Yes, it's reaching dev. I checked out your code to take a look. I
>>> understand you are still trying to mix isSequentialAccess etc. into the
>>> validity calculation. The whole point of this architecture is to completely
>>> unroll things to be more modular instead of creating giant spaghetti.
>>>
>>> Here is an example of what I am trying to say.
>>>
>>> This is your current implementation of AssignIterateOneLookupOther.
>>> There are so many conditionals.
>>>
>>>
>>>    @Override
>>>    public Vector assign(Vector x, Vector y, DoubleDoubleFunction f) {
>>>      return !swap ? assignInner(x, y, f) : assignInner(y, x, f);
>>>    }
>>>
>>>    public Vector assignInner(Vector x, Vector y, DoubleDoubleFunction f) {
>>>      Iterator<Vector.Element> xi = x.iterateNonZero();
>>>      Vector.Element xe;
>>>      Vector.Element ye;
>>>      OrderedIntDoubleMapping updates = new OrderedIntDoubleMapping();
>>>      while (xi.hasNext()) {
>>>        xe = xi.next();
>>>        ye = y.getElement(xe.index());
>>>        if (!swap) {
>>>          xe.set(f.apply(xe.get(), ye.get()));
>>>        } else {
>>>          if (ye.get() != 0.0 || y.isAddConstantTime()) {
>>>            ye.set(f.apply(ye.get(), xe.get()));
>>>          } else {
>>>            updates.set(xe.index(), f.apply(ye.get(), xe.get()));
>>>          }
>>>        }
>>>      }
>>>      if (swap && !y.isAddConstantTime()) {
>>>        y.mergeUpdates(updates);
>>>      }
>>>      return swap ? y : x;
>>>    }
>>>
>>> Split this into.
>>>
>>> IterateThisLookupThatInplaceUpdate
>>> IterateThatLookupThisInplaceUpdate
>>> IterateThisLookupThatMergeUpdate
>>> IterateThatLookupThisMergeUpdate
>>>
>>> Removing the conditionals by itself will speed up the operations you see
>>> regressions on.
>>>
>>> Also note that getElement() is not constant add time for RASV. In the
>>> earlier version we tried as far as possible to set in place, which avoided
>>> the extra hash key computation. getElement() in AbstractVector is really
>>> optimized for DenseVector, and that looks like the cause of the regression
>>> you are seeing in RASV. Another one is the unused new
>>> OrderedIntDoubleMapping(). Removing all of these will win back the loss.
>>>
>>> Now the cost for each can be computed as: cost of iteration + cost of
>>> lookup + cost of update (if it's not in-place)
>>>
>>> See the attached file for a mock test. You will have to rework the
>>> expectations but the framework should be understandable.
>>>
>>>
>>> --
>>> Robin Anil
>>>
>>>
>>>
>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon <
>>> dangeorge.fili...@gmail.com> wrote:
>>>
>>>> Here's [1] a link to the "design document" of the new vector operations
>>>> so I can lay out the ideas behind what I'm doing more clearly.
>>>>
>>>> I'd like anyone who can to have a look and 

Re: Review Request: MAHOUT-1192 [2]: Speed up Vector Operations

2013-04-25 Thread Dan Filimon
Okay, so I should split it further into smaller sub-cases that handle each
Vector type. I tried to make these match the cases in the document
to the extent possible.
You're right. It is ugly and I need to split it up.

I removed the OrderedIntDoubleMapping (but with another if...) but one more
thing that helped is making the assign() only go through the non-zeros when
setting up a new copy.

About the copying, I saw that the optimized copy doesn't always call
clone() on the Vector. Why is this?

Thanks for the test and the feedback! :)
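
As a rough illustration of the split under discussion (one class per iterate/lookup/update combination, with no swap flag), here is a minimal Java sketch. It does not use Mahout's Vector API: a sparse vector is modelled as a TreeMap from index to value, AssignOp is a hypothetical interface, and java.util.function.DoubleBinaryOperator stands in for DoubleDoubleFunction. Only two of the four variants named in the quoted message are shown, and in both the result is written into x.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

// Simplified stand-in: a sparse vector is a map from index to stored value.
// These are NOT Mahout's classes; the names only mirror those in the thread.
interface AssignOp {
  Map<Integer, Double> assign(Map<Integer, Double> x, Map<Integer, Double> y,
                              DoubleBinaryOperator f);
}

// Iterate x's stored entries, look up y, write each result straight back into x.
final class IterateThisLookupThatInplaceUpdate implements AssignOp {
  @Override
  public Map<Integer, Double> assign(Map<Integer, Double> x, Map<Integer, Double> y,
                                     DoubleBinaryOperator f) {
    for (Map.Entry<Integer, Double> e : x.entrySet()) {
      double yv = y.getOrDefault(e.getKey(), 0.0);
      e.setValue(f.applyAsDouble(e.getValue(), yv));
    }
    return x;
  }
}

// Iterate y's stored entries, look up x, buffer the results and merge them
// into x afterwards, so x is not structurally modified while it is being read.
final class IterateThatLookupThisMergeUpdate implements AssignOp {
  @Override
  public Map<Integer, Double> assign(Map<Integer, Double> x, Map<Integer, Double> y,
                                     DoubleBinaryOperator f) {
    Map<Integer, Double> updates = new TreeMap<>();
    for (Map.Entry<Integer, Double> e : y.entrySet()) {
      double xv = x.getOrDefault(e.getKey(), 0.0);
      updates.put(e.getKey(), f.applyAsDouble(xv, e.getValue()));
    }
    x.putAll(updates);
    return x;
  }
}

final class AssignSplitDemo {
  public static void main(String[] args) {
    Map<Integer, Double> x = new TreeMap<>();
    x.put(0, 1.0);
    x.put(2, 3.0);
    Map<Integer, Double> y = new TreeMap<>();
    y.put(2, 4.0);
    y.put(5, 6.0);
    // x = x + y, iterating y's entries and merging the results into x.
    System.out.println(new IterateThatLookupThisMergeUpdate().assign(x, y, Double::sum));
  }
}
```

Each variant is only valid for some functions: iterating x alone is correct when f(0, y) = 0 (multiplication, say), and iterating y alone is correct when f(x, 0) = x (plus, say). Choosing a variant up front is what removes the per-element conditionals.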


On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil  wrote:

> Yes, it's reaching dev. I checked out your code to take a look. I understand
> you are still trying to mix isSequentialAccess etc. into the validity
> calculation. The whole point of this architecture is to completely unroll
> things to be more modular instead of creating giant spaghetti.
>
> Here is an example of what I am trying to say.
>
> This is your current implementation of AssignIterateOneLookupOther. There
> are so many conditionals.
>
>
>    @Override
>    public Vector assign(Vector x, Vector y, DoubleDoubleFunction f) {
>      return !swap ? assignInner(x, y, f) : assignInner(y, x, f);
>    }
>
>    public Vector assignInner(Vector x, Vector y, DoubleDoubleFunction f) {
>      Iterator<Vector.Element> xi = x.iterateNonZero();
>      Vector.Element xe;
>      Vector.Element ye;
>      OrderedIntDoubleMapping updates = new OrderedIntDoubleMapping();
>      while (xi.hasNext()) {
>        xe = xi.next();
>        ye = y.getElement(xe.index());
>        if (!swap) {
>          xe.set(f.apply(xe.get(), ye.get()));
>        } else {
>          if (ye.get() != 0.0 || y.isAddConstantTime()) {
>            ye.set(f.apply(ye.get(), xe.get()));
>          } else {
>            updates.set(xe.index(), f.apply(ye.get(), xe.get()));
>          }
>        }
>      }
>      if (swap && !y.isAddConstantTime()) {
>        y.mergeUpdates(updates);
>      }
>      return swap ? y : x;
>    }
>
> Split this into.
>
> IterateThisLookupThatInplaceUpdate
> IterateThatLookupThisInplaceUpdate
> IterateThisLookupThatMergeUpdate
> IterateThatLookupThisMergeUpdate
>
> Removing the conditionals by itself will speed up the operations you see
> regressions on.
>
> Also note that getElement() is not constant add time for RASV. In the earlier
> version we tried as far as possible to set in place, which avoided the extra
> hash key computation. getElement() in AbstractVector is really optimized for
> DenseVector, and that looks like the cause of the regression you are seeing
> in RASV. Another one is the unused new OrderedIntDoubleMapping(). Removing
> all of these will win back the loss.
>
> Now the cost for each can be computed as: cost of iteration + cost of
> lookup + cost of update (if it's not in-place)
>
> See the attached file for a mock test. You will have to rework the
> expectations but the framework should be understandable.
>
>
> --
> Robin Anil
>
>
>
> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon 
> wrote:
>
>> Here's [1] a link to the "design document" of the new vector operations
>> so I can lay out the ideas behind what I'm doing more clearly.
>>
>> I'd like anyone who can to have a look and comment.
>>
>> Ted,
>> I know you'd like me to get back to working on clustering, but currently,
>> BallKMeans is roughly 2x as slow as Mahout's KMeans.
>> This is because of centroid.update() for one (it doesn't have
>> "interesting" properties that are exploited in the current) and also
>> because we're just doing lots of Vector operations: dot
>> products for projections, hash codes for unique sets of candidates,
>> equality checks, etc.
>>
>> Here's a snapshot of the hotspots for one run of BallKMeans (20
>> newsgroups, unprojected, i.e. 20K sparse vectors with 200K dimensions and
>> ~100 nonzero values):
>>
>>   Time (ms)  Name
>>      104069  org.apache.mahout.math.AbstractVector.dot(Vector)
>>       30159  org.apache.mahout.math.DelegatingVector.hashCode()
>>       13720  org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator, Iterator, DoubleDoubleFunction)
>>        2911  org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping)
>>        2620  java.lang.ClassLoader.loadClass(String, boolean)
>>        2620  sun.misc.Launcher$AppClassLoader.loadClass(String, boolean)
