Re: eclipse codestyle.xml?

2009-11-24 Thread Drew Farris
Great -- this works now. Thanks!

On Tue, Nov 24, 2009 at 10:20 AM, Grant Ingersoll  wrote:
> Actually, the Mahout wiki links are out of date.  I'll update.
>


[jira] Updated: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Jake Mannix (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-206:
---

Attachment: MAHOUT-206.patch

This adds SparseVector back as the parent of both sparse impls, itself 
extending AbstractVector.

There are certainly optimizations which can still be done, and again, no work 
has been done to dig through all the places where RandomAccessSparseVectors are 
created and choose whether it would be better to have 
SequentialAccessSparseVector.

This patch also has the fix for MAHOUT-207.

> Separate and clearly label different SparseVector implementations
> -
>
> Key: MAHOUT-206
> URL: https://issues.apache.org/jira/browse/MAHOUT-206
> Project: Mahout
>  Issue Type: Improvement
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-206.patch, MAHOUT-206.patch
>
>
> Shashi's last patch on MAHOUT-165 swapped out the int/double parallel array 
> impl of SparseVector for an OpenIntDoubleMap (hash-based) one.  We actually 
> need both, as I think I've mentioned a gazillion times.
> There was a patch, long ago, on MAHOUT-165, in which Ted had 
> OrderedIntDoubleVector, and OpenIntDoubleHashVector (or something to that 
> effect), and neither of them are called SparseVector.  I like this, because 
> it forces people to choose what kind of SparseVector they want (and they 
> should: sparse is an optimization, and the client should make a conscious 
> decision what they're optimizing for).  
> We could call them RandomAccessSparseVector and SequentialAccessSparseVector, 
> to be really obvious.
> But really, the important part is we have both.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Sean Owen
True, and that advantage rarely comes up.

However declaring abstract methods in an abstract class has exactly
the same problem that adding interface methods does -- and I take it
this is the heart of the problem -- all implementors get broken
immediately. If you intend to add methods this way, neither one is any
different; both have the same problem.

Abstract classes afford the possibility of adding methods plus
implementation, without breaking anybody, so yeah I'm into abstract
classes. But then that's no argument against an abstract class +
interface, which would add a small bit of flexibility too.

I don't feel strongly about it but do think the interface + abstract
approach is more conventional than declaring APIs through abstract
classes.

On Tue, Nov 24, 2009 at 8:49 PM, Yonik Seeley
 wrote:
> The only advantage an interface has over an abstract class is multiple
> inheritance.
> You can use abstract classes like interfaces: make it possible to
> override all methods, and avoid state unless absolutely needed for
> back compat changes.


Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Yonik Seeley
On Tue, Nov 24, 2009 at 3:30 PM, Sean Owen  wrote:
> I'm willing to be convinced but what is the theoretical argument for this?

Rather the opposite - it's a practical argument gained through experience.

> I am all for interfaces *and* abstract classes. You write the API in
> terms of interfaces for maximum flexibility. You provide abstract
> partial implementations for convenience. Everyone is happy.

The only advantage an interface has over an abstract class is multiple
inheritance.
You can use abstract classes like interfaces: make it possible to
override all methods, and avoid state unless absolutely needed for
back compat changes.

-Yonik


Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Grant Ingersoll

On Nov 24, 2009, at 3:30 PM, Sean Owen wrote:

> I'm willing to be convinced but what is the theoretical argument for this?

See the Lucene archives:  http://search.lucidimagination.com.  There has been a 
lot of discussion on it.  And I mean a lot.  And then some.  :-)  Search for 
anything on interfaces, abstract classes or back compatibility.

> 
> I am all for interfaces *and* abstract classes. You write the API in
> terms of interfaces for maximum flexibility. You provide abstract
> partial implementations for convenience. Everyone is happy.

I agree.

> 
> The best argument I've seen against it is that it can be overkill. In
> super-performance-critical situations the dynamic dispatch overhead is
> perhaps worth thinking about, but that's rare. What else?
> 
> I take the point about interfaces changing, but this is significant
> when you expect a lot of third-party implementers of your interfaces.
> I don't think that is true here.

I disagree here.  In open source, you never know where the next good idea is 
coming from.  We should just always assume they will change.


> 
> On Tue, Nov 24, 2009 at 8:08 PM, Ted Dunning  wrote:
>> Yes.  Interfaces are the problem that commons math have boxed themselves in
>> with.  The Hadoop crew (especially Doug C) are adamant about using as few
>> interfaces as possible except as mixin signals and only in cases where the
>> interface really is going to be very, very stable.
>> 
>> Our vector interfaces are definitely not going to be that stable for quite a
>> while.
>> 
>> On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix  wrote:
>> 
>>> Well we do use AbstractVector.  Are you suggesting that we *not* have a
>>> Vector interface
>>> at all, and *only* have an abstract base class?  Similarly for Matrix?
>>> 
>>>  -jake
>>> 
>>> On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning 
>>> wrote:
>>> 
 We should use abstract classes almost everywhere instead of interfaces to
 ease backward compatibility issues with user written extensions to
>>> Vectors
 and Matrices.
 
 On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA)  wrote:
 
> It seems like there is still some commonality between the two
> implementations (size, cardinality, etc.) that I think it would be
> worthwhile to keep SparseVector as an abstract class which the other
>>> two
> extend.
> 
 
 
 
 --
 Ted Dunning, CTO
 DeepDyve
 
>>> 
>> 
>> 
>> 
>> --
>> Ted Dunning, CTO
>> DeepDyve
>> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search



Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Grant Ingersoll
Yes, I have lived this pain for a long time with Lucene.  Personally, though, a 
lot of the pain comes from a fairly strict back compatibility policy that to me 
isn't always well founded given the release cycle Lucene usually operates 
under.  I've always wished there was a @introducing annotation for interfaces, 
such that you could tell people what is coming down the pike. 

I also often feel the right answer is a combination of both.  New methods could 
be added on a new interface that is then applied to an Abstract class, thus it 
can be inherited by downstream implementors.  People who don't inherit from the 
Abstract can choose to add the new interface if they see fit.

For now, we don't have any back compat commitments.   I think once we get to 
0.9, we can decide on that.
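The combination described above (new methods added on a new interface which the shared abstract base class then implements, so downstream subclasses inherit them for free) can be sketched as follows. All names here are hypothetical stand-ins, not Mahout's actual classes:

```java
// Sketch of the interface + abstract-class evolution pattern:
// a "new" interface (Norms) is implemented once on the abstract base,
// so existing subclasses pick it up without being broken.
public class InterfaceEvolution {
  interface Vector {
    double get(int i);
  }

  interface Norms { // the interface added later
    double norm1();
  }

  static abstract class AbstractVector implements Vector, Norms {
    abstract int size();

    // Default implementation inherited by every subclass of the base;
    // third-party classes implementing only Vector are untouched.
    public double norm1() {
      double sum = 0.0;
      for (int i = 0; i < size(); i++) {
        sum += Math.abs(get(i));
      }
      return sum;
    }
  }

  static class DenseVector extends AbstractVector {
    private final double[] values;

    DenseVector(double[] values) {
      this.values = values;
    }

    int size() {
      return values.length;
    }

    public double get(int i) {
      return values[i];
    }
  }
}
```

Classes that extend the abstract base inherit norm1() for free; classes that only implement Vector can opt in to Norms when they choose.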


On Nov 24, 2009, at 3:21 PM, Jake Mannix wrote:

> Oof.
> 
> So you're arguing this as a temporary thing, until our interfaces stabilize?
> It makes
> unit testing much harder this way, but I guess I see the rationale.
> 
> If we do this, we need to leave a lot out of that base class - there may be
> some really
> big differences in implementation of these classes (for example: distributed
> / hdfs
> backed matrices vs locally memory-resident ones), so very very little should
> be
> assumed in the base impl.  I guess more can be done in the vector case,
> however.
> 
>  -jake
> 
> On Tue, Nov 24, 2009 at 12:08 PM, Ted Dunning  wrote:
> 
>> Yes.  Interfaces are the problem that commons math have boxed themselves in
>> with.  The Hadoop crew (especially Doug C) are adamant about using as few
>> interfaces as possible except as mixin signals and only in cases where the
>> interface really is going to be very, very stable.
>> 
>> Our vector interfaces are definitely not going to be that stable for quite
>> a
>> while.
>> 
>> On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix 
>> wrote:
>> 
>>> Well we do use AbstractVector.  Are you suggesting that we *not* have a
>>> Vector interface
>>> at all, and *only* have an abstract base class?  Similarly for Matrix?
>>> 
>>> -jake
>>> 
>>> On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning 
>>> wrote:
>>> 
 We should use abstract classes almost everywhere instead of interfaces
>> to
 ease backward compatibility issues with user written extensions to
>>> Vectors
 and Matrices.
 
 On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) <
>> j...@apache.org
> wrote:
 
> It seems like there is still some commonality between the two
> implementations (size, cardinality, etc.) that I think it would be
> worthwhile to keep SparseVector as an abstract class which the other
>>> two
> extend.
> 
 
 
 
 --
 Ted Dunning, CTO
 DeepDyve
 
>>> 
>> 
>> 
>> 
>> --
>> Ted Dunning, CTO
>> DeepDyve
>> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search



Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Sean Owen
I'm willing to be convinced but what is the theoretical argument for this?

I am all for interfaces *and* abstract classes. You write the API in
terms of interfaces for maximum flexibility. You provide abstract
partial implementations for convenience. Everyone is happy.

The best argument I've seen against it is that it can be overkill. In
super-performance-critical situations the dynamic dispatch overhead is
perhaps worth thinking about, but that's rare. What else?

I take the point about interfaces changing, but this is significant
when you expect a lot of third-party implementers of your interfaces.
I don't think that is true here.

On Tue, Nov 24, 2009 at 8:08 PM, Ted Dunning  wrote:
> Yes.  Interfaces are the problem that commons math have boxed themselves in
> with.  The Hadoop crew (especially Doug C) are adamant about using as few
> interfaces as possible except as mixin signals and only in cases where the
> interface really is going to be very, very stable.
>
> Our vector interfaces are definitely not going to be that stable for quite a
> while.
>
> On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix  wrote:
>
>> Well we do use AbstractVector.  Are you suggesting that we *not* have a
>> Vector interface
>> at all, and *only* have an abstract base class?  Similarly for Matrix?
>>
>>  -jake
>>
>> On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning 
>> wrote:
>>
>> > We should use abstract classes almost everywhere instead of interfaces to
>> > ease backward compatibility issues with user written extensions to
>> Vectors
>> > and Matrices.
>> >
>> > On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) > > >wrote:
>> >
>> > > It seems like there is still some commonality between the two
>> > > implementations (size, cardinality, etc.) that I think it would be
>> > > worthwhile to keep SparseVector as an abstract class which the other
>> two
>> > > extend.
>> > >
>> >
>> >
>> >
>> > --
>> > Ted Dunning, CTO
>> > DeepDyve
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Jake Mannix
Oof.

So you're arguing this as a temporary thing, until our interfaces stabilize?
It makes
unit testing much harder this way, but I guess I see the rationale.

If we do this, we need to leave a lot out of that base class - there may be
some really
big differences in implementation of these classes (for example: distributed
/ hdfs
backed matrices vs locally memory-resident ones), so very very little should
be
assumed in the base impl.  I guess more can be done in the vector case,
however.

  -jake

On Tue, Nov 24, 2009 at 12:08 PM, Ted Dunning  wrote:

> Yes.  Interfaces are the problem that commons math have boxed themselves in
> with.  The Hadoop crew (especially Doug C) are adamant about using as few
> interfaces as possible except as mixin signals and only in cases where the
> interface really is going to be very, very stable.
>
> Our vector interfaces are definitely not going to be that stable for quite
> a
> while.
>
> On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix 
> wrote:
>
> > Well we do use AbstractVector.  Are you suggesting that we *not* have a
> > Vector interface
> > at all, and *only* have an abstract base class?  Similarly for Matrix?
> >
> >  -jake
> >
> > On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning 
> > wrote:
> >
> > > We should use abstract classes almost everywhere instead of interfaces
> to
> > > ease backward compatibility issues with user written extensions to
> > Vectors
> > > and Matrices.
> > >
> > > On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) <
> j...@apache.org
> > > >wrote:
> > >
> > > > It seems like there is still some commonality between the two
> > > > implementations (size, cardinality, etc.) that I think it would be
> > > > worthwhile to keep SparseVector as an abstract class which the other
> > two
> > > > extend.
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Ted Dunning
Yes.  Interfaces are the problem that commons math have boxed themselves in
with.  The Hadoop crew (especially Doug C) are adamant about using as few
interfaces as possible except as mixin signals and only in cases where the
interface really is going to be very, very stable.

Our vector interfaces are definitely not going to be that stable for quite a
while.

On Tue, Nov 24, 2009 at 12:03 PM, Jake Mannix  wrote:

> Well we do use AbstractVector.  Are you suggesting that we *not* have a
> Vector interface
> at all, and *only* have an abstract base class?  Similarly for Matrix?
>
>  -jake
>
> On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning 
> wrote:
>
> > We should use abstract classes almost everywhere instead of interfaces to
> > ease backward compatibility issues with user written extensions to
> Vectors
> > and Matrices.
> >
> > On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA)  > >wrote:
> >
> > > It seems like there is still some commonality between the two
> > > implementations (size, cardinality, etc.) that I think it would be
> > > worthwhile to keep SparseVector as an abstract class which the other
> two
> > > extend.
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve


Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Jake Mannix
Well we do use AbstractVector.  Are you suggesting that we *not* have a
Vector interface
at all, and *only* have an abstract base class?  Similarly for Matrix?

  -jake

On Tue, Nov 24, 2009 at 11:57 AM, Ted Dunning  wrote:

> We should use abstract classes almost everywhere instead of interfaces to
> ease backward compatibility issues with user written extensions to Vectors
> and Matrices.
>
> On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA)  >wrote:
>
> > It seems like there is still some commonality between the two
> > implementations (size, cardinality, etc.) that I think it would be
> > worthwhile to keep SparseVector as an abstract class which the other two
> > extend.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Ted Dunning
We should use abstract classes almost everywhere instead of interfaces to
ease backward compatibility issues with user written extensions to Vectors
and Matrices.

On Tue, Nov 24, 2009 at 9:38 AM, Grant Ingersoll (JIRA) wrote:

> It seems like there is still some commonality between the two
> implementations (size, cardinality, etc.) that I think it would be
> worthwhile to keep SparseVector as an abstract class which the other two
> extend.
>



-- 
Ted Dunning, CTO
DeepDyve


[jira] Created: (MAHOUT-209) Add aggregate() methods for Vector

2009-11-24 Thread Jake Mannix (JIRA)
Add aggregate() methods for Vector
--

 Key: MAHOUT-209
 URL: https://issues.apache.org/jira/browse/MAHOUT-209
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
 Environment: all
Reporter: Jake Mannix
Priority: Minor
 Fix For: 0.3


As discussed in MAHOUT-165 at some point, Vector (and Matrix, but let's put 
that on a separate ticket) could do with a nice exposure of methods like the 
following:

{code}
// this can get optimized, of course

  public double aggregate(Vector other, BinaryFunction aggregator,
                          BinaryFunction combiner) {
    double result = 0;
    for (int i = 0; i < size(); i++) {
      result = aggregator.apply(result, combiner.apply(getQuick(i), other.getQuick(i)));
    }
    return result;
  }
{code}
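As a standalone sketch of what such an aggregate() would compute, here is the same idea over plain arrays. BinaryFunction here is a stand-in assumption, not Mahout's actual function interface; with plus as the aggregator and times as the combiner, the result is a dot product:

```java
// Sketch of the proposed aggregate() semantics over plain arrays.
public class AggregateSketch {
  interface BinaryFunction {
    double apply(double a, double b);
  }

  static double aggregate(double[] v, double[] w,
                          BinaryFunction aggregator, BinaryFunction combiner) {
    double result = 0;
    for (int i = 0; i < v.length; i++) {
      // combine the paired elements, then fold into the running result
      result = aggregator.apply(result, combiner.apply(v[i], w[i]));
    }
    return result;
  }

  public static void main(String[] args) {
    double[] v = {1, 2, 3};
    double[] w = {4, 5, 6};
    // plus as aggregator, times as combiner: 1*4 + 2*5 + 3*6
    double dot = aggregate(v, w, (a, b) -> a + b, (a, b) -> a * b);
    System.out.println(dot); // prints 32.0
  }
}
```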

[jira] Created: (MAHOUT-208) Vector.getLengthSquared() is dangerously optimized

2009-11-24 Thread Jake Mannix (JIRA)
Vector.getLengthSquared() is dangerously optimized
--

 Key: MAHOUT-208
 URL: https://issues.apache.org/jira/browse/MAHOUT-208
 Project: Mahout
  Issue Type: Bug
  Components: Matrix
Affects Versions: 0.1
 Environment: all
Reporter: Jake Mannix
 Fix For: 0.3


SparseVector and DenseVector both cache the value of lengthSquared, so that 
subsequent calls to it get the cached value.  Great, except the cache is never 
cleared - calls to set/setQuick or assign or anything, all leave the cached 
value unchanged.  

Mutating method calls should set lengthNorm to -1 so that the cache is cleared.

This could be a really nasty bug if hit.
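One possible shape of the fix, as a sketch with hypothetical names (the real classes are SparseVector and DenseVector): compute lengthSquared lazily, cache it, and have every mutator reset the cache to a negative sentinel:

```java
// Sketch: lazy lengthSquared cache that is invalidated on mutation.
public class CachedLengthVector {
  private final double[] values;
  private double lengthSquared = -1.0; // negative means cache is invalid

  public CachedLengthVector(double[] values) {
    this.values = values.clone();
  }

  public void set(int index, double value) {
    values[index] = value;
    lengthSquared = -1.0; // the fix: every mutator must clear the cache
  }

  public double getLengthSquared() {
    if (lengthSquared >= 0.0) {
      return lengthSquared; // cached value is still valid
    }
    double sum = 0.0;
    for (double v : values) {
      sum += v * v;
    }
    lengthSquared = sum;
    return sum;
  }
}
```

Without the reset in set(), a vector built as {3, 4} would keep reporting 25 after the 3 is overwritten, which is exactly the staleness the bug report describes.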

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements

2009-11-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782080#action_12782080
 ] 

Grant Ingersoll commented on MAHOUT-207:


All makes sense.  Per the refactoring in MAHOUT-206, I think this argues even 
more for an abstract SparseVector implementation that can handle some of the 
common code.

> AbstractVector.hashCode() should not care about the order of iteration over 
> elements
> 
>
> Key: MAHOUT-207
> URL: https://issues.apache.org/jira/browse/MAHOUT-207
> Project: Mahout
>  Issue Type: Bug
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-207.patch
>
>
> As was discussed in MAHOUT-165, hashCode can be implemented simply like this:
> {code} 
> public int hashCode() {
> final int prime = 31;
> int result = prime + ((name == null) ? 0 : name.hashCode());
> result = prime * result + size();
> Iterator<Element> iter = iterateNonZero();
> while (iter.hasNext()) {
>   Element ele = iter.next();
>   long v = Double.doubleToLongBits(ele.get());
>   result += (ele.index() * (int)(v^(v>>32)));
> }
> return result;
>   }
> {code}
> which obviates the need to sort the elements in the case of a random access 
> hash-based implementation.  Also, (ele.index() * (int)(v^(v>>32)) ) == 0 when 
> v = Double.doubleToLongBits(0d), which avoids the wrong hashCode() for sparse 
> vectors which have zero elements returned from the iterateNonZero() iterator.
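The order-independence claimed above follows because each element's contribution is combined with addition, which is commutative and associative. A small standalone check (plain arrays, not the actual AbstractVector code):

```java
// Demonstrates that an addition-based combiner makes the hash independent
// of element iteration order, and that explicit zero values contribute 0.
public class OrderFreeHash {
  static int hash(int[] indices, double[] values) {
    int result = 0;
    for (int i = 0; i < indices.length; i++) {
      long v = Double.doubleToLongBits(values[i]);
      // fold the long to an int; addition makes the order irrelevant
      result += indices[i] * (int) (v ^ (v >>> 32));
    }
    return result;
  }

  public static void main(String[] args) {
    int h1 = hash(new int[]{1, 3, 5}, new double[]{1.1, 2.2, 3.3});
    // same entries, visited in a different order
    int h2 = hash(new int[]{5, 1, 3}, new double[]{3.3, 1.1, 2.2});
    System.out.println(h1 == h2); // true
  }
}
```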

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements

2009-11-24 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782077#action_12782077
 ] 

Jake Mannix commented on MAHOUT-207:


We definitely should include the optimization that set(i, 0) works fast, but 
it's not trivial to do.  Hash-based impls do, yes, need to remove the previous 
entry.  But array-based sparse vectors should do what, exactly? 

A vector represented as { indices: int[] { 1, 3, 5 }, values: double[] { 1.1, 
2.2, 3.3 } } to start, gets a call to setQuick(3, 0), so that in the current 
implementation it becomes { indices: int[] { 1, 3, 5 }, values: double[] { 1.1, 
0, 3.3 } }.  What would you suggest be done to "remove" the entry efficiently?

What would be useful is for all sparse vector impls to have a "compact()" method 
which removes all zeroes and finds a small-space representation.

But either way, we should not *require* in the contract for sparse vector that 
there be no zero values.
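A minimal sketch of what such a hypothetical compact() could do for the parallel-array representation, dropping explicit zeros and shrinking both arrays:

```java
import java.util.Arrays;

// Sketch of a compact() for a parallel-array sparse vector: drops explicit
// zero entries and shrinks the backing arrays to fit.
public class ParallelArraySparse {
  int[] indices;
  double[] values;

  ParallelArraySparse(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  void compact() {
    int kept = 0;
    for (int i = 0; i < values.length; i++) {
      if (values[i] != 0.0) {
        // shift surviving entries down in place
        indices[kept] = indices[i];
        values[kept] = values[i];
        kept++;
      }
    }
    indices = Arrays.copyOf(indices, kept);
    values = Arrays.copyOf(values, kept);
  }
}
```

Using the example from the comment: { indices: {1, 3, 5}, values: {1.1, 0, 3.3} } after setQuick(3, 0) compacts to { indices: {1, 5}, values: {1.1, 3.3} }.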

Regarding the call to equivalent, you're right, that should not be done as a 
static method - if equivalent were non-static, then it could be overridden to be 
done smartly by subclasses (i.e. if one of the two being compared is a 
DenseVector or RandomAccessSparseVector, iterate over the other checking the 
dense one via getQuick(), and if both of them are 
SequentialAccessSparseVectors, both iterators can be walked in parallel).  
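That parallel walk over two SequentialAccessSparseVectors is a classic sorted merge; a standalone sketch over plain index/value arrays (assumed sorted by index, not the actual Mahout code), shown here for a dot product:

```java
// Sketch: dot product of two sorted sparse vectors by walking both
// index arrays in parallel, O(nnz1 + nnz2) with no random access.
public class SparseDot {
  static double dot(int[] ia, double[] va, int[] ib, double[] vb) {
    double sum = 0.0;
    int a = 0;
    int b = 0;
    while (a < ia.length && b < ib.length) {
      if (ia[a] == ib[b]) {
        sum += va[a] * vb[b]; // index present in both vectors
        a++;
        b++;
      } else if (ia[a] < ib[b]) {
        a++; // index only in the first vector, contributes 0
      } else {
        b++; // index only in the second vector, contributes 0
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    double d = dot(new int[]{1, 3, 5}, new double[]{1.0, 2.0, 3.0},
                   new int[]{3, 5, 9}, new double[]{4.0, 5.0, 6.0});
    System.out.println(d); // 2*4 + 3*5 = 23.0
  }
}
```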

> AbstractVector.hashCode() should not care about the order of iteration over 
> elements
> 
>
> Key: MAHOUT-207
> URL: https://issues.apache.org/jira/browse/MAHOUT-207
> Project: Mahout
>  Issue Type: Bug
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-207.patch
>
>
> As was discussed in MAHOUT-165, hashCode can be implemented simply like this:
> {code} 
> public int hashCode() {
> final int prime = 31;
> int result = prime + ((name == null) ? 0 : name.hashCode());
> result = prime * result + size();
> Iterator<Element> iter = iterateNonZero();
> while (iter.hasNext()) {
>   Element ele = iter.next();
>   long v = Double.doubleToLongBits(ele.get());
>   result += (ele.index() * (int)(v^(v>>32)));
> }
> return result;
>   }
> {code}
> which obviates the need to sort the elements in the case of a random access 
> hash-based implementation.  Also, (ele.index() * (int)(v^(v>>32)) ) == 0 when 
> v = Double.doubleToLongBits(0d), which avoids the wrong hashCode() for sparse 
> vectors which have zero elements returned from the iterateNonZero() iterator.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements

2009-11-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782064#action_12782064
 ] 

Grant Ingersoll commented on MAHOUT-207:


Aren't we losing some of the benefits of SparseVector with this explicit set 
to zero stuff (by having to call equivalent)?  I've wondered in the past how a 
Sparse implementation should handle something like setQuick(i, 0).  One 
approach is to set it, but the other is to ignore it and possibly remove any 
previous nonzero entry, right?  Seems like tradeoffs w/ both.

> AbstractVector.hashCode() should not care about the order of iteration over 
> elements
> 
>
> Key: MAHOUT-207
> URL: https://issues.apache.org/jira/browse/MAHOUT-207
> Project: Mahout
>  Issue Type: Bug
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-207.patch
>
>
> As was discussed in MAHOUT-165, hashCode can be implemented simply like this:
> {code} 
> public int hashCode() {
> final int prime = 31;
> int result = prime + ((name == null) ? 0 : name.hashCode());
> result = prime * result + size();
> Iterator<Element> iter = iterateNonZero();
> while (iter.hasNext()) {
>   Element ele = iter.next();
>   long v = Double.doubleToLongBits(ele.get());
>   result += (ele.index() * (int)(v^(v>>32)));
> }
> return result;
>   }
> {code}
> which obviates the need to sort the elements in the case of a random access 
> hash-based implementation.  Also, (ele.index() * (int)(v^(v>>32)) ) == 0 when 
> v = Double.doubleToLongBits(0d), which avoids the wrong hashCode() for sparse 
> vectors which have zero elements returned from the iterateNonZero() iterator.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782054#action_12782054
 ] 

Grant Ingersoll commented on MAHOUT-206:


Jake, there's something weird in this patch in regards to SparseVector.  It 
didn't delete the file, but instead left it empty.

It seems like there is still some commonality between the two implementations 
(size, cardinality, etc.) that I think it would be worthwhile to keep 
SparseVector as an abstract class which the other two extend.

> Separate and clearly label different SparseVector implementations
> -
>
> Key: MAHOUT-206
> URL: https://issues.apache.org/jira/browse/MAHOUT-206
> Project: Mahout
>  Issue Type: Improvement
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-206.patch
>
>
> Shashi's last patch on MAHOUT-165 swapped out the int/double parallel array 
> impl of SparseVector for an OpenIntDoubleMap (hash-based) one.  We actually 
> need both, as I think I've mentioned a gazillion times.
> There was a patch, long ago, on MAHOUT-165, in which Ted had 
> OrderedIntDoubleVector, and OpenIntDoubleHashVector (or something to that 
> effect), and neither of them are called SparseVector.  I like this, because 
> it forces people to choose what kind of SparseVector they want (and they 
> should: sparse is an optimization, and the client should make a conscious 
> decision what they're optimizing for).  
> We could call them RandomAccessSparseVector and SequentialAccessSparseVector, 
> to be really obvious.
> But really, the important part is we have both.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements

2009-11-24 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782049#action_12782049
 ] 

Jake Mannix commented on MAHOUT-207:


It looks like the work done on MAHOUT-159 did not use addition as the combiner 
on the hashCode() for Elements of the vector, so the answer was iteration-order 
dependent.  Unit tests also didn't check what happened if a sparse vector had 
explicitly zero values set on it, which should not affect hashCode() or equals() 
computation (the latter was fine, the former was not!).

> AbstractVector.hashCode() should not care about the order of iteration over 
> elements
> 
>
> Key: MAHOUT-207
> URL: https://issues.apache.org/jira/browse/MAHOUT-207
> Project: Mahout
>  Issue Type: Bug
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-207.patch
>
>
> As was discussed in MAHOUT-165, hashCode can be implemented simply like this:
> {code} 
> public int hashCode() {
> final int prime = 31;
> int result = prime + ((name == null) ? 0 : name.hashCode());
> result = prime * result + size();
> Iterator<Element> iter = iterateNonZero();
> while (iter.hasNext()) {
>   Element ele = iter.next();
>   long v = Double.doubleToLongBits(ele.get());
>   result += (ele.index() * (int)(v^(v>>32)));
> }
> return result;
>   }
> {code}
> which obviates the need to sort the elements in the case of a random access 
> hash-based implementation.  Also, (ele.index() * (int)(v^(v>>32)) ) == 0 when 
> v = Double.doubleToLongBits(0d), which avoids the wrong hashCode() for sparse 
> vectors which have zero elements returned from the iterateNonZero() iterator.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements

2009-11-24 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782051#action_12782051
 ] 

Ted Dunning commented on MAHOUT-207:



I think that 159 is superseded by this work.

> AbstractVector.hashCode() should not care about the order of iteration over 
> elements
> 
>
> Key: MAHOUT-207
> URL: https://issues.apache.org/jira/browse/MAHOUT-207
> Project: Mahout
>  Issue Type: Bug
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-207.patch
>
>
> As was discussed in MAHOUT-165, hashCode can be implemented simply like this:
> {code} 
> public int hashCode() {
> final int prime = 31;
> int result = prime + ((name == null) ? 0 : name.hashCode());
> result = prime * result + size();
> Iterator<Element> iter = iterateNonZero();
> while (iter.hasNext()) {
>   Element ele = iter.next();
>   long v = Double.doubleToLongBits(ele.get());
>   result += (ele.index() * (int)(v^(v>>32)));
> }
> return result;
>   }
> {code}
> which obviates the need to sort the elements in the case of a random access 
> hash-based implementation.  Also, (ele.index() * (int)(v^(v>>32)) ) == 0 when 
> v = Double.doubleToLongBits(0d), which avoids the wrong hashCode() for sparse 
> vectors which have zero elements returned from the iterateNonZero() iterator.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-206:
--

Assignee: Grant Ingersoll

> Separate and clearly label different SparseVector implementations
> -
>
> Key: MAHOUT-206
> URL: https://issues.apache.org/jira/browse/MAHOUT-206
> Project: Mahout
>  Issue Type: Improvement
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-206.patch
>
>
> Shashi's last patch on MAHOUT-165 swapped out the int/double parallel array 
> impl of SparseVector for an OpenIntDoubleMap (hash-based) one.  We actually 
> need both, as I think I've mentioned a gazillion times.
> There was a patch, long ago, on MAHOUT-165, in which Ted had 
> OrderedIntDoubleVector, and OpenIntDoubleHashVector (or something to that 
> effect), and neither of them is called SparseVector.  I like this, because 
> it forces people to choose what kind of SparseVector they want (and they 
> should: sparse is an optimization, and the client should make a conscious 
> decision what they're optimizing for).  
> We could call them RandomAccessSparseVector and SequentialAccessSparseVector, 
> to be really obvious.
> But really, the important part is we have both.
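As a rough illustration of the trade-off being named here (a sketch only, not the actual Mahout classes or API), the two strategies differ mainly in what a random get() costs versus how cheap ordered iteration and storage are:

```java
// Illustrative sketch only: NOT the Mahout implementations, just the two
// storage strategies the ticket wants clients to choose between.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SparseChoice {

    // hash-backed: cheap random get()/set(), no index ordering maintained
    static final class RandomAccess {
        final Map<Integer, Double> map = new HashMap<>();
        void set(int i, double v) { map.put(i, v); }
        double get(int i) { return map.getOrDefault(i, 0.0); }
    }

    // sorted parallel arrays: minimal memory and fast in-order scans,
    // but a random get() costs an O(log n) binary search
    static final class SequentialAccess {
        final int[] indices;
        final double[] values;
        SequentialAccess(int[] indices, double[] values) {
            this.indices = indices;  // assumed sorted ascending
            this.values = values;
        }
        double get(int i) {
            int pos = Arrays.binarySearch(indices, i);
            return pos >= 0 ? values[pos] : 0.0;
        }
    }

    public static void main(String[] args) {
        RandomAccess ra = new RandomAccess();
        ra.set(5, 2.0);
        SequentialAccess sa =
            new SequentialAccess(new int[] {1, 5, 9}, new double[] {1.0, 2.0, 3.0});
        System.out.println(ra.get(5) + " " + sa.get(5) + " " + sa.get(4));
        // prints 2.0 2.0 0.0
    }
}
```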




[jira] Assigned: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements

2009-11-24 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-207:
--

Assignee: Grant Ingersoll

> AbstractVector.hashCode() should not care about the order of iteration over 
> elements
> 
>
> Key: MAHOUT-207
> URL: https://issues.apache.org/jira/browse/MAHOUT-207
> Project: Mahout
>  Issue Type: Bug
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-207.patch
>
>
> As was discussed in MAHOUT-165, hashCode can be implemented simply like this:
> {code}
> public int hashCode() {
>   final int prime = 31;
>   int result = prime + ((name == null) ? 0 : name.hashCode());
>   result = prime * result + size();
>   Iterator<Element> iter = iterateNonZero();
>   while (iter.hasNext()) {
>     Element ele = iter.next();
>     long v = Double.doubleToLongBits(ele.get());
>     result += ele.index() * (int) (v ^ (v >> 32));
>   }
>   return result;
> }
> {code}
> which obviates the need to sort the elements in the case of a random-access 
> hash-based implementation.  Also, ele.index() * (int) (v ^ (v >> 32)) == 0 when 
> v = Double.doubleToLongBits(0d), which avoids a wrong hashCode() for sparse 
> vectors whose iterateNonZero() iterator returns zero-valued elements.




[jira] Commented: (MAHOUT-207) AbstractVector.hashCode() should not care about the order of iteration over elements

2009-11-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782041#action_12782041
 ] 

Grant Ingersoll commented on MAHOUT-207:


How does this all relate to https://issues.apache.org/jira/browse/MAHOUT-159?



> AbstractVector.hashCode() should not care about the order of iteration over 
> elements
> 
>
> Key: MAHOUT-207
> URL: https://issues.apache.org/jira/browse/MAHOUT-207
> Project: Mahout
>  Issue Type: Bug
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
> Fix For: 0.3
>
> Attachments: MAHOUT-207.patch
>
>
> As was discussed in MAHOUT-165, hashCode can be implemented simply like this:
> {code}
> public int hashCode() {
>   final int prime = 31;
>   int result = prime + ((name == null) ? 0 : name.hashCode());
>   result = prime * result + size();
>   Iterator<Element> iter = iterateNonZero();
>   while (iter.hasNext()) {
>     Element ele = iter.next();
>     long v = Double.doubleToLongBits(ele.get());
>     result += ele.index() * (int) (v ^ (v >> 32));
>   }
>   return result;
> }
> {code}
> which obviates the need to sort the elements in the case of a random-access 
> hash-based implementation.  Also, ele.index() * (int) (v ^ (v >> 32)) == 0 when 
> v = Double.doubleToLongBits(0d), which avoids a wrong hashCode() for sparse 
> vectors whose iterateNonZero() iterator returns zero-valued elements.




[jira] Resolved: (MAHOUT-201) OrderedIntDoubleMapping / SparseVector is unnecessarily slow

2009-11-24 Thread Jake Mannix (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix resolved MAHOUT-201.


Resolution: Duplicate

The patch for this is currently included in the patch for MAHOUT-206 (which 
would have badly broken this patch anyway).

> OrderedIntDoubleMapping / SparseVector is unnecessarily slow
> 
>
> Key: MAHOUT-201
> URL: https://issues.apache.org/jira/browse/MAHOUT-201
> Project: Mahout
>  Issue Type: Improvement
>  Components: Matrix
>Affects Versions: 0.2
>Reporter: Jake Mannix
> Fix For: 0.3
>
> Attachments: MAHOUT-201.patch
>
>
> In the work on MAHOUT-165, I find that while Colt's sparse vector 
> implementation is great from a hashing standpoint (it's memory efficient and 
> fast for random-access), they don't provide anything like the 
> OrderedIntDoublePair - i.e. a vector implementation which is *not* fast for 
> random access, or out-of-order modification, but is minimally sized 
> memory-wise and blazingly fast for doing read-only dot-products and vector 
> sums (where the latter is read-only on inputs, and is creating new output) 
> with each other, and with DenseVectors.
> This line of thinking got me looking back at the current SparseVector 
> implementation we have in Mahout, because it *is* based on an int[] and a 
> double[].  Unfortunately, it's not at all optimized for the cases where it 
> can outperform all other sparse impls:
> * it should override dot(Vector) and plus(Vector) to check whether the input 
> is a DenseVector or a SparseVector (or, once we have an OpenIntDoubleMap 
> implementation of SparseVector, that case as well), and do specialized 
> operations here.
> * even when those particular methods aren't being used, the AllIterator and 
> NonZeroIterator inner classes are very inefficient:
> ** minor things like caching the values.numMappings() and values.getIndices 
> in final instance variables in the Iterators
> ** the huge performance drain of Element.get(): {code} public double get() { 
>  return values.get(ind);  } {code}, which is implemented as a binary search 
> on the index array (at a position that was already known!) followed by the 
> array lookup
> This last point is probably the entire reason why we've seen performance 
> problems with the SparseVector, as it's in both the NonZeroIterator and the 
> AllIterator, and so turned any O(numNonZeroElements) operations into 
> O(numNonZeroElements * log(numNonZeroElements)) (with some additional 
> constant factors for too much indirection thrown in for good measure).
> Unless there is another JIRA ticket which has a patch fixing this which I 
> didn't notice, I can whip up a patch (I've got a similar implementation over 
> in decomposer I can pull stuff out of, although mine is simpler because it is 
> immutable, so it's not just a copy and paste).
> We don't have any benchmarking code anywhere yet, do we?  Is there a JIRA 
> ticket open for that already?
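For reference, the kind of read-only dot product that a sorted parallel-array impl can be "blazingly fast" at is a single O(n1 + n2) merge pass over both index arrays. A minimal sketch follows; this is not the patch itself, and all names here are illustrative:

```java
// Sketch of a merge-style sparse dot product over two vectors stored as
// sorted parallel (int[] indices, double[] values) arrays. Not Mahout code.
public class SparseDot {

    static double dot(int[] ia, double[] va, int[] ib, double[] vb) {
        double sum = 0.0;
        int a = 0, b = 0;
        // both index arrays are assumed sorted ascending, so one linear
        // merge pass finds every shared index -- no binary searches needed
        while (a < ia.length && b < ib.length) {
            if (ia[a] == ib[b]) {
                sum += va[a++] * vb[b++]; // index present in both vectors
            } else if (ia[a] < ib[b]) {
                a++; // index only in the first vector: contributes 0
            } else {
                b++; // index only in the second vector: contributes 0
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double d = dot(new int[] {0, 3, 7}, new double[] {1.0, 2.0, 4.0},
                       new int[] {3, 7, 8}, new double[] {5.0, 0.5, 9.0});
        System.out.println(d); // prints 12.0  (2*5 + 4*0.5)
    }
}
```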




Re: Moving ahead to Hadoop 0.22

2009-11-24 Thread Ted Dunning
Well, it is wrong at some level, and it will become more and more wrong.

I have heard from Chris Wenzel that the cost of moving past 0.19 was pretty
high.  It would be good to do that when we can do it whole-heartedly.

(what is the situation with 21?)

On Tue, Nov 24, 2009 at 1:23 AM, Sean Owen  wrote:

> My alternative is to go back to 0.19, which isn't the end of the world,
> but feels wrong.
>



-- 
Ted Dunning, CTO
DeepDyve


[jira] Updated: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-24 Thread Jake Mannix (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-206:
---

Attachment: MAHOUT-206.patch

Patch renames the hash-based SparseVector to RandomAccessSparseVector, and 
adds back the old form of SparseVector (backed by OrderedIntDoubleMapping), 
now named SequentialAccessSparseVector.

Unit tests added covering both forms of sparse vector.

BUT, I have not gone through all the places where we use SparseVector instances 
and deliberately picked SequentialAccess or RandomAccess choices for the sparse 
impl.  All usage is done as RandomAccessSparseVector (i.e., via the trunk 
refactor-rename of SparseVector to RandomAccessSparseVector), and we 
will need to intentionally swap in SequentialAccessSparseVector in the 
primarily immutable/read-only cases.

> Separate and clearly label different SparseVector implementations
> -
>
> Key: MAHOUT-206
> URL: https://issues.apache.org/jira/browse/MAHOUT-206
> Project: Mahout
>  Issue Type: Improvement
>  Components: Matrix
>Affects Versions: 0.2
> Environment: all
>Reporter: Jake Mannix
> Fix For: 0.3
>
> Attachments: MAHOUT-206.patch
>
>
> Shashi's last patch on MAHOUT-165 swapped out the int/double parallel array 
> impl of SparseVector for an OpenIntDoubleMap (hash-based) one.  We actually 
> need both, as I think I've mentioned a gazillion times.
> There was a patch, long ago, on MAHOUT-165, in which Ted had 
> OrderedIntDoubleVector, and OpenIntDoubleHashVector (or something to that 
> effect), and neither of them is called SparseVector.  I like this, because 
> it forces people to choose what kind of SparseVector they want (and they 
> should: sparse is an optimization, and the client should make a conscious 
> decision what they're optimizing for).  
> We could call them RandomAccessSparseVector and SequentialAccessSparseVector, 
> to be really obvious.
> But really, the important part is we have both.




Re: eclipse codestyle.xml?

2009-11-24 Thread Grant Ingersoll
Actually, the Mahout wiki links are out of date.  I'll update.

On Nov 24, 2009, at 10:19 AM, Grant Ingersoll wrote:

> Hmm, weird.  They must have gotten lost when the ASF upgraded MoinMoin.  They 
> are the same as Lucene's: http://wiki.apache.org/lucene-java/HowToContribute
> 
> On Nov 24, 2009, at 10:11 AM, Drew Farris wrote:
> 
>> Hi All,
>> 
>> On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, The
>> link at the bottom of the page to the eclipse codestyle.xml for
>> Mahout's coding conventions seems to be broken.
>> 
>> Does anyone have a codestyle.xml for eclipse available?
>> 
>> Thanks,
>> 
>> Drew
> 
> 




Re: eclipse codestyle.xml?

2009-11-24 Thread Simon Willnauer
We updated the Lucene ones during ApacheCon - this should work now, though!

On Tue, Nov 24, 2009 at 4:19 PM, Grant Ingersoll  wrote:
> Hmm, weird.  They must have gotten lost when the ASF upgraded MoinMoin.  They 
> are the same as Lucene's: http://wiki.apache.org/lucene-java/HowToContribute
>
> On Nov 24, 2009, at 10:11 AM, Drew Farris wrote:
>
>> Hi All,
>>
>> On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, The
>> link at the bottom of the page to the eclipse codestyle.xml for
>> Mahout's coding conventions seems to be broken.
>>
>> Does anyone have a codestyle.xml for eclipse available?
>>
>> Thanks,
>>
>> Drew
>
>
>


Re: eclipse codestyle.xml?

2009-11-24 Thread Grant Ingersoll
Hmm, weird.  They must have gotten lost when the ASF upgraded MoinMoin.  They 
are the same as Lucene's: http://wiki.apache.org/lucene-java/HowToContribute

On Nov 24, 2009, at 10:11 AM, Drew Farris wrote:

> Hi All,
> 
> On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, The
> link at the bottom of the page to the eclipse codestyle.xml for
> Mahout's coding conventions seems to be broken.
> 
> Does anyone have a codestyle.xml for eclipse available?
> 
> Thanks,
> 
> Drew




eclipse codestyle.xml?

2009-11-24 Thread Drew Farris
Hi All,

On the wiki, http://cwiki.apache.org/MAHOUT/howtocontribute.html, The
link at the bottom of the page to the eclipse codestyle.xml for
Mahout's coding conventions seems to be broken.

Does anyone have a codestyle.xml for eclipse available?

Thanks,

Drew


[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions

2009-11-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781896#action_12781896
 ] 

Grant Ingersoll commented on MAHOUT-204:


Yeah, go ahead and submit the patch, then do the formatting.

> Better integration of Mahout matrix capabilities with Colt Matrix additions
> ---
>
> Key: MAHOUT-204
> URL: https://issues.apache.org/jira/browse/MAHOUT-204
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-204-author-cleanup.patch
>
>
> Per MAHOUT-165, we need to refactor the matrix package structures a bit to be 
> more coherent and clean.  For instance, there are two levels of matrix 
> packages now, so those should be rectified.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781858#action_12781858
 ] 

Sean Owen commented on MAHOUT-103:
--

Yes, this is basically item-based recommendation. With some superficial 
changes, it would exactly fit that model. Co-occurrence here is like a 
similarity metric, which is ultimately used as a weighting. Canonically this 
value would be in [-1,1], and you can easily map [1, ∞) into that range, of 
course.
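One simple such mapping (my own choice for illustration, not something specified in the ticket) sends a co-occurrence count n >= 1 monotonically into [0, 1), which sits inside [-1, 1]:

```java
// Sketch: one monotone mapping of co-occurrence counts into [-1, 1].
// The formula 1 - 1/n is an illustrative choice, not Mahout's.
public class CoocSimilarity {

    static double toSimilarity(int count) {
        // count = 1 -> 0.0; count -> infinity -> approaches 1.0
        return 1.0 - 1.0 / count;
    }

    public static void main(String[] args) {
        System.out.println(toSimilarity(1)); // prints 0.0
        System.out.println(toSimilarity(4)); // prints 0.75
    }
}
```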

Next you're sort of estimating preferences when you add up co-occurrence 
values. Canonically, you'd be doing a weighted average over M1 - M3. This is 
the same thing -- you're just not dividing by 3.

The result is conceptually the same, though different approaches would yield 
slightly different results. I'm not necessarily suggesting you change the 
algorithm. At the same time I am also about to implement this very same thing 
-- the more 'canonical' form, to go hand-in-hand with the existing 
GenericItemBasedRecommender. I'd rather avoid duplication, and would like to 
make the Hadoop-based implementation as analogous to the existing code as 
possible. All I'd say is, go ahead, and maybe we look at generalizing it or 
shifting these concepts towards the canonical setup later.

Look at GenericIRStatsEvaluator and subclass for precision-recall approaches.

> Co-occurence based nearest neighbourhood
> 
>
> Key: MAHOUT-103
> URL: https://issues.apache.org/jira/browse/MAHOUT-103
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Ankur
>Assignee: Ankur
> Attachments: jira-103.patch, mahout-103.patch.v1
>
>
> Nearest neighborhood type queries for users/items can be answered efficiently 
> and effectively by analyzing the co-occurrence model of a user/item w.r.t 
> another. This patch aims at providing an implementation for answering such 
> queries based upon simple co-occurrence counts.




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-24 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781838#action_12781838
 ] 

Ankur commented on MAHOUT-103:
--

For this co-occurrence based recommender I am planning to write a set of 
map-reduce jobs that compute recommendations for users as follows:

1. Take the user's item history.
2. For each item in the history, fetch the top-N similar items (similarity 
based on co-occurrence).
3. Add the co-occurrence scores if an item appears more than once (NOT a 
weighted avg). Consider, e.g., a user history { M1, M2, M3 } and the top-3 
similar movies for each of these, along with co-occurrence scores:

M1 -> (A, 5), (B, 4), (C, 2)
M2 -> (D, 6), (E, 3), (F, 2)
M3 -> (G, 8), (C, 5), (B, 2)  

So the final scores in decreasing order will look like
(G, 8)
(C, 7)
(B, 6)
(D, 6)
(A, 5)
(E, 3)
(F, 2)

The idea I want to capture is that a candidate item gets a higher score if it 
is similar to more items in the user's click history.

Do you see any issues with this approach? Any better approach that you can 
think of?
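Step 3's aggregation, applied to the example above, can be sketched as follows (illustrative Java, not the proposed map-reduce jobs; aggregate() is a hypothetical helper):

```java
// Sketch of step 3: sum co-occurrence scores of candidate items across all
// items in the user's history. Not the actual map-reduce implementation.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CoocScores {

    // an item reached from several history items accumulates all its scores
    static Map<String, Integer> aggregate(List<Map<String, Integer>> similarPerHistoryItem) {
        Map<String, Integer> scores = new HashMap<>();
        for (Map<String, Integer> similar : similarPerHistoryItem) {
            for (Map.Entry<String, Integer> e : similar.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // top-3 similar items for the history { M1, M2, M3 } from the example
        List<Map<String, Integer>> sims = Arrays.asList(
            Map.of("A", 5, "B", 4, "C", 2),   // M1
            Map.of("D", 6, "E", 3, "F", 2),   // M2
            Map.of("G", 8, "C", 5, "B", 2));  // M3
        Map<String, Integer> scores = aggregate(sims);
        System.out.println(scores.get("C")); // prints 7  (2 + 5)
        System.out.println(scores.get("B")); // prints 6  (4 + 2)
    }
}
```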

As for the precision-recall test, I am still trying to see how to divide the 
data into 'train' and 'test' sets for a fair evaluation. How do we do it in 
the existing code?

> Co-occurence based nearest neighbourhood
> 
>
> Key: MAHOUT-103
> URL: https://issues.apache.org/jira/browse/MAHOUT-103
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Ankur
>Assignee: Ankur
> Attachments: jira-103.patch, mahout-103.patch.v1
>
>
> Nearest neighborhood type queries for users/items can be answered efficiently 
> and effectively by analyzing the co-occurrence model of a user/item w.r.t 
> another. This patch aims at providing an implementation for answering such 
> queries based upon simple co-occurrence counts.




[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions

2009-11-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781834#action_12781834
 ] 

Sean Owen commented on MAHOUT-204:
--

I am happy to do this, but in order to avoid a massive merge conflict, could 
you go ahead and submit the above patch first?

> Better integration of Mahout matrix capabilities with Colt Matrix additions
> ---
>
> Key: MAHOUT-204
> URL: https://issues.apache.org/jira/browse/MAHOUT-204
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-204-author-cleanup.patch
>
>
> Per MAHOUT-165, we need to refactor the matrix package structures a bit to be 
> more coherent and clean.  For instance, there are two levels of matrix 
> packages now, so those should be rectified.




Re: Moving ahead to Hadoop 0.22

2009-11-24 Thread Sean Owen
Early report from my testing is it's going to break a lot of our code,
so, perhaps a bridge too far now.

There's one reason I'm keen to move forward and it's not merely
wanting to be on the bleeding edge, far from it. It's that 0.20.x does
not work at all for my jobs. It runs into bugs that 0.20.1 does not
appear to fix. So I'm kind of between a rock and hard place. My
alternative is to go back to 0.19, which isn't the end of the world,
but feels wrong.

On Tue, Nov 24, 2009 at 8:25 AM, Robin Anil  wrote:
> 0.22 is supposed to stabilize the new mapreduce package and remove
> support for the old mapred package. So I am guessing the reason for
> moving to 0.22 would go side by side with conversion of all our
> existing mapred programs to mapreduce ones. And I believe I read 
> somewhere that this is the API that is going to be used as they move 
> closer to Hadoop 1.0.
>
> Robin
>
>
>
> On Tue, Nov 24, 2009 at 1:10 PM, Ted Dunning  wrote:
>> I hate it when people on the commons list start whining about dropping
>> support for java version -17 (joke, exaggeration warning), but here I am
>> about to do it.
>>
>> I still run 0.19.  A fair number of production clusters run 0.18.3.  Many
>> run 0.20
>>
>> Hopefully when 1.0 comes out there will be a large move to that, but is it
>> important yet to gauge where our largest audience is or even will be in,
>> say, 6-12 months?
>>
>> I don't see a large benefit to moving to the latest just because it is the
>> latest.  I do see a benefit in moving to a future-proof API sooner rather
>> than later, but is 0.22 realistic yet?
>>
>> On Mon, Nov 23, 2009 at 5:22 PM, Sean Owen  wrote:
>>
>>> In my own client, I'm forging ahead to Hadoop 0.22 to see if it works
>>> for me. If it's not too much change to update what we've got to 0.22
>>> (and the change is nonzero) and it works better for me, maybe we can
>>> jump ahead to depend on it.
>>>
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>


Re: Moving ahead to Hadoop 0.22

2009-11-24 Thread Robin Anil
0.22 is supposed to stabilize the new mapreduce package and remove
support for the old mapred package. So I am guessing the reason for
moving to 0.22 would go side by side with conversion of all our
existing mapred programs to mapreduce ones. And I believe I read 
somewhere that this is the API that is going to be used as they move 
closer to Hadoop 1.0.

Robin



On Tue, Nov 24, 2009 at 1:10 PM, Ted Dunning  wrote:
> I hate it when people on the commons list start whining about dropping
> support for java version -17 (joke, exaggeration warning), but here I am
> about to do it.
>
> I still run 0.19.  A fair number of production clusters run 0.18.3.  Many
> run 0.20
>
> Hopefully when 1.0 comes out there will be a large move to that, but is it
> important yet to gauge where our largest audience is or even will be in,
> say, 6-12 months?
>
> I don't see a large benefit to moving to the latest just because it is the
> latest.  I do see a benefit in moving to a future-proof API sooner rather
> than later, but is 0.22 realistic yet?
>
> On Mon, Nov 23, 2009 at 5:22 PM, Sean Owen  wrote:
>
>> In my own client, I'm forging ahead to Hadoop 0.22 to see if it works
>> for me. If it's not too much change to update what we've got to 0.22
>> (and the change is nonzero) and it works better for me, maybe we can
>> jump ahead to depend on it.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>