Re: What index structure does kNN algorithm use in mahout?

2010-01-08 Thread Grant Ingersoll

On Jan 8, 2010, at 2:17 PM, Ted Dunning wrote:

> kNN stands for k-nearest neighbor.

Yeah, I know.  Just wasn't sure on the context of the question.

> 
> On Fri, Jan 8, 2010 at 3:34 AM, Grant Ingersoll  wrote:
> 
>> Do you mean K-Means?
>> 
>> On Jan 7, 2010, at 3:50 AM, xiao yang wrote:
>> 
>>> Like R-tree.
>>> Or it compares each record for every query?
>>> 
>>> Thanks!
>>> Xiao



Re: What index structure does kNN algorithm use in mahout?

2010-01-08 Thread Ted Dunning
kNN stands for k-nearest neighbor.

On Fri, Jan 8, 2010 at 3:34 AM, Grant Ingersoll  wrote:

> Do you mean K-Means?
>
> On Jan 7, 2010, at 3:50 AM, xiao yang wrote:
>
> > Like R-tree.
> > Or it compares each record for every query?
> >
> > Thanks!
> > Xiao
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Ted Dunning, CTO
DeepDyve


Re: MapReduce Unit Testing

2010-01-08 Thread Ted Dunning
Mostly I depend on very strong unit tests for the mapper and reducer
separately.

As far as I have heard, MRUnit is the only game in town for creating simple
unit tests for combining the mapper and reducer.
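
For anyone who wants a concrete starting point, here is a rough sketch of such a
combined test using MRUnit's MapReduceDriver (this assumes MRUnit's new-API
driver in org.apache.hadoop.mrunit.mapreduce; the little word-count mapper and
reducer below are illustrative stand-ins, not Mahout code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

public class WordCountMRUnitTest {

  // Trivial word-count mapper/reducer, defined inline so the sketch is self-contained.
  static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        ctx.write(new Text(token), ONE);
      }
    }
  }

  static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  @Test
  public void mapperAndReducerTogether() throws Exception {
    // Drives one map pass, an in-memory shuffle/sort, and one reduce pass.
    new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>()
        .withMapper(new TokenMapper())
        .withReducer(new SumReducer())
        .withInput(new LongWritable(1), new Text("mahout mahout hadoop"))
        .withOutput(new Text("hadoop"), new IntWritable(1))   // keys arrive sorted
        .withOutput(new Text("mahout"), new IntWritable(2))
        .runTest();
  }
}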

On Fri, Jan 8, 2010 at 4:27 AM, zhao zhendong wrote:

> Does anybody know of an off-the-shelf unit testing package for the MapReduce
> framework? MRUnit is good, but this package can only be found in Cloudera's
> own Hadoop.
>



-- 
Ted Dunning, CTO
DeepDyve


Re: compareTo() issue

2010-01-08 Thread Sean Owen
If the placement doesn't matter, why is returning 0 a problem? I'm
just wondering whether this introduces some subtle bugs, in the way
that not implementing hashCode/equals does. It may happen to work here,
but later...

The overflow problem is remote, but not trivial... can support be
large? Anywhere near a billion? Then it's a possible bug.
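
As a concrete illustration of the sign flip (a standalone sketch, not code from
this thread):

// Sketch: why returning "a - b" from compareTo() is unsafe once values
// approach a billion -- the int subtraction wraps and the sign flips.
public class CompareOverflowDemo {
  public static void main(String[] args) {
    int a = 2000000000;           // a > b, so a comparison result should be positive
    int b = -2000000000;
    System.out.println(a - b);    // prints -294967296: overflowed, wrong sign
  }
}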

On Fri, Jan 8, 2010 at 6:26 PM, Robin Anil  wrote:
> That one was specific to the ordering of sub-patterns in the FPGrowth stage. I
> did that as an optimisation, wherein an object may land at an arbitrary place
> in the heap if two patterns are of equal length and support. Since it is the
> most-called function in the entire algorithm, I got some performance benefit
> from it, because there were many patterns of the same length and support but
> with different patterns underneath. Plus there is no need to compare the two
> arrays, which would be expensive. If the compareTo contract has to be
> maintained, I will move it out to another class and use it as the comparator
> during heap initialization.
>
>
> public int compareTo(Pattern cr2) {
>    long support2 = cr2.support();
>    int length2 = cr2.length();
>    if (support == support2) {
>      if (length == length2) {
>        // if they are of same length and support order randomly
>        return 1;
>      } else {
>        return length - length2;
>      }
>    } else {
>      if (support > support2) {
>        return 1;
>      } else {
>        return -1;
>      }
>    }
>  }
>
> On Fri, Jan 8, 2010 at 2:58 PM, Sean Owen  wrote:
>
>> I see some compareTo() methods with logic like this --
>>
>> int a = object1.foo();
>> int b = object2.foo();
>> if (a == b) {
>>  return 1; // order randomly
>> } else {
>>  return a - b;
>> }
>>
>> Three problems here:
>> - This does not produce a random ordering when used with a sort; it's
>> quite deterministic
>> - This violates the contract of compareTo() -- it must be true that
>> a.compareTo(b) == -b.compareTo(a), at least in sign
>> - "a-b" can overflow and give the wrong sign
>>
>> Mind if I fix?
>>
>> There's still code going in with lots of variances from what I think
>> are our agreed standards too.
>>
>> Sean
>>
>


Re: compareTo() issue

2010-01-08 Thread Robin Anil
That one was specific to the ordering of sub-patterns in the FPGrowth stage. I
did that as an optimisation, wherein an object may land at an arbitrary place
in the heap if two patterns are of equal length and support. Since it is the
most-called function in the entire algorithm, I got some performance benefit
from it, because there were many patterns of the same length and support but
with different patterns underneath. Plus there is no need to compare the two
arrays, which would be expensive. If the compareTo contract has to be
maintained, I will move it out to another class and use it as the comparator
during heap initialization.


public int compareTo(Pattern cr2) {
  long support2 = cr2.support();
  int length2 = cr2.length();
  if (support == support2) {
    if (length == length2) {
      // if they are of same length and support order randomly
      return 1;
    } else {
      return length - length2;
    }
  } else {
    if (support > support2) {
      return 1;
    } else {
      return -1;
    }
  }
}
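
For reference, a minimal sketch of that separate comparator could look like the
following (the class name is hypothetical; Pattern's support() and length() are
the methods used in the snippet above; equal support and length now compare as
0, so the contract holds, and explicit comparisons avoid any subtraction
overflow):

import java.util.Comparator;

// Sketch only: a heap-initialization comparator as described above.  Assumes
// the Pattern class from the snippet is visible; the class name is hypothetical.
final class PatternHeapComparator implements Comparator<Pattern> {
  @Override
  public int compare(Pattern a, Pattern b) {
    long s1 = a.support();
    long s2 = b.support();
    if (s1 != s2) {
      // explicit comparison instead of subtraction, so large supports cannot overflow
      return s1 < s2 ? -1 : 1;
    }
    int l1 = a.length();
    int l2 = b.length();
    if (l1 != l2) {
      return l1 < l2 ? -1 : 1;
    }
    // equal support and length: report equality instead of an arbitrary 1,
    // keeping the Comparator contract intact while still skipping the
    // expensive array comparison described above
    return 0;
  }
}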

On Fri, Jan 8, 2010 at 2:58 PM, Sean Owen  wrote:

> I see some compareTo() methods with logic like this --
>
> int a = object1.foo();
> int b = object2.foo();
> if (a == b) {
>  return 1; // order randomly
> } else {
>  return a - b;
> }
>
> Three problems here:
> - This does not produce a random ordering when used with a sort; it's
> quite deterministic
> - This violates the contract of compareTo() -- it must be true that
> a.compareTo(b) == -b.compareTo(a), at least in sign
> - "a-b" can overflow and give the wrong sign
>
> Mind if I fix?
>
> There's still code going in with lots of variances from what I think
> are our agreed standards too.
>
> Sean
>


Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup

2010-01-08 Thread Drew Farris
I wonder if we can get the hadoop people to include source jars with
their snapshots?

On Fri, Jan 8, 2010 at 11:23 AM, Sean Owen  wrote:
> I need a fix after 0.20.1, that's the primary reason. As a bonus, we
> don't have to maintain our own version. The downside is relying on a
> SNAPSHOT, but seems worth it to me.
>
> On Fri, Jan 8, 2010 at 4:02 PM, zhao zhendong  wrote:
>> Thanks Drew,
>>
>> +1 from me for using a stable hadoop release, such as 0.20.1. The reason
>> is obvious :)
>>
>> Cheers,
>> Zhendong
>>
>>
>


Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup

2010-01-08 Thread Sean Owen
I need a fix after 0.20.1, that's the primary reason. As a bonus, we
don't have to maintain our own version. The downside is relying on a
SNAPSHOT, but seems worth it to me.

On Fri, Jan 8, 2010 at 4:02 PM, zhao zhendong  wrote:
> Thanks Drew,
>
> +1 from me for using a stable hadoop release, such as 0.20.1. The reason
> is obvious :)
>
> Cheers,
> Zhendong
>
>


Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup

2010-01-08 Thread zhao zhendong
Thanks Drew,

+1 from me for using a stable hadoop release, such as 0.20.1. The reason
is obvious :)

Cheers,
Zhendong


On Fri, Jan 8, 2010 at 10:23 PM, Drew Farris  wrote:

> First, apologies for propagating a problem here.
>
> Since 0.20.2 is a snapshot, there's no release of hadoop that
> corresponds directly to it. In maven terms, a snapshot could be
> anything after 0.20.1 but prior to a formal release of 0.20.2. In the
> repo, they are timestamped.
>
> We're pulling this jar from the Apache snapshot repo here:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/hadoop/hadoop-core/0.20.2-SNAPSHOT/
>
> Unfortunately, hadoop doesn't publish a source jar alongside their
> snapshot, which makes it a bit difficult to pin down what was used for
> the build of the jar. From the maven repo, you can see the jar was
> generated 2009.11.13, 1:09:47 GMT, so as Ted suggests, one option would
> be to check out the hadoop-core sources from that date and use those
> as a basis for your work on MAHOUT-232.
>
> I'm not 100% comfortable with this. When I picked up the work for
> MAHOUT-238, I noticed that hadoop-0.20.2 SNAPSHOT was in core yet
> hadoop-0.20.1, the mahout hand-rolled version, was used in math and
> thus propagated as a transitive dependency across all of the projects.
>
> I believe we moved to a SNAPSHOT so that we could use one of the
> official hadoop dependencies instead of our hand-rolled version of
> 0.20.1. IIRC, there were also some bugfixes that we required, but I'm not
> certain about that.
>
> Does anyone feel that we should move back to 0.20.1 and use a stable
> hadoop release instead of using the snapshot, so that the version of
> the code we depend on is clearer?
>
> Drew
>
> On Fri, Jan 8, 2010 at 2:08 AM, Ted Dunning  wrote:
> > SVN should help: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/
> >
> > If you look at the release notes, you should be able to discern what made up
> > 20.2 if it is a real release (looking at common made it look like it isn't).
> >
> > On Thu, Jan 7, 2010 at 9:46 PM, zhao zhendong  >wrote:
> >
> >> Where can I find the source code of hadoop-0.20.2?
> >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
-

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>

Department of Computer Science
School of Computing
National University of Singapore

><><><><><><><><><><><><><><><>
Homepage:http://zhaozhendong.googlepages.com
Mail: zhaozhend...@gmail.com
>>><><><><><><><><<><>><><<


Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup

2010-01-08 Thread Drew Farris
First, apologies for propagating a problem here.

Since 0.20.2 is a snapshot, there's no release of hadoop that
corresponds directly to it. In maven terms, a snapshot could be
anything after 0.20.1 but prior to a formal release of 0.20.2. In the
repo, they are timestamped.

We're pulling this jar from the Apache snapshot repo here:
https://repository.apache.org/content/groups/snapshots/org/apache/hadoop/hadoop-core/0.20.2-SNAPSHOT/

Unfortunately, hadoop doesn't publish a source jar alongside their
snapshot, which makes it a bit difficult to pin down what was used for
the build of the jar. From the maven repo, you can see the jar was
generated 2009.11.13, 1:09:47 GMT, so as Ted suggests, one option would
be to check out the hadoop-core sources from that date and use those
as a basis for your work on MAHOUT-232.

I'm not 100% comfortable with this. When I picked up the work for
MAHOUT-238, I noticed that hadoop-0.20.2 SNAPSHOT was in core yet
hadoop-0.20.1, the mahout hand-rolled version, was used in math and
thus propagated as a transitive dependency across all of the projects.

I believe we moved to a SNAPSHOT so that we could use one of the
official hadoop dependencies instead of our hand-rolled version of
0.20.1. IIRC, there were also some bugfixes that we required, but I'm not
certain about that.

Does anyone feel that we should move back to 0.20.1 and use a stable
hadoop release instead of using the snapshot, so that the version of
the code we depend on is clearer?

Drew

On Fri, Jan 8, 2010 at 2:08 AM, Ted Dunning  wrote:
> SVN should help: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/
>
> If you look at the release notes, you should be able to discern what made up
> 20.2 if it is a real release (looking at common made it look like it isn't).
>
> On Thu, Jan 7, 2010 at 9:46 PM, zhao zhendong wrote:
>
>> Where can I find the source code of hadoop-0.20.2?
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: MapReduce Unit Testing

2010-01-08 Thread Sean Owen
Off-the-shelf?
MRUnit is open source and Apache licensed, so I don't see why you can't use it.

On Fri, Jan 8, 2010 at 12:27 PM, zhao zhendong  wrote:
> Hi,
>
> Does anybody know of an off-the-shelf unit testing package for the MapReduce
> framework? MRUnit is good, but this package can only be found in Cloudera's
> own Hadoop.
>
> Cheers,
> Zhendong
> --
> -
>
> Zhen-Dong Zhao (Maxim)
>
> <><<><><><><><><><>><><><><><>>
>
> Department of Computer Science
> School of Computing
> National University of Singapore
>
>><><><><><><><><><><><><><><><>
> Homepage:http://zhaozhendong.googlepages.com
> Mail: zhaozhend...@gmail.com
<><><><><><><><<><>><><<
>


MapReduce Unit Testing

2010-01-08 Thread zhao zhendong
Hi,

Does anybody know of an off-the-shelf unit testing package for the MapReduce
framework? MRUnit is good, but this package can only be found in Cloudera's
own Hadoop.

Cheers,
Zhendong
-- 
-

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>

Department of Computer Science
School of Computing
National University of Singapore

><><><><><><><><><><><><><><><>
Homepage:http://zhaozhendong.googlepages.com
Mail: zhaozhend...@gmail.com
>>><><><><><><><><<><>><><<


Re: What index structure does kNN algorithm use in mahout?

2010-01-08 Thread Grant Ingersoll
Do you mean K-Means?

On Jan 7, 2010, at 3:50 AM, xiao yang wrote:

> Like R-tree.
> Or it compares each record for every query?
> 
> Thanks!
> Xiao

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: svn commit: r896922 [1/3] - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/common/ core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ core/src/main/java/org/apache/mahout/fpm/pfp

2010-01-08 Thread deneche abdelhakim
the build is successful, thanks =D

On Fri, Jan 8, 2010 at 9:23 AM, Robin Anil  wrote:
> Try Now
>


compareTo() issue

2010-01-08 Thread Sean Owen
I see some compareTo() methods with logic like this --

int a = object1.foo();
int b = object2.foo();
if (a == b) {
  return 1; // order randomly
} else {
  return a - b;
}

Three problems here:
- This does not produce a random ordering when used with a sort; it's
quite deterministic
- This violates the contract of compareTo() -- it must be true that
a.compareTo(b) == -b.compareTo(a), at least in sign
- "a-b" can overflow and give the wrong sign

Mind if I fix?

There's still code going in with lots of variances from what I think
are our agreed standards too.

Sean


Re: svn commit: r896922 [1/3] - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/common/ core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ core/src/main/java/org/apache/mahout/fpm/pfp

2010-01-08 Thread Robin Anil
Try Now