Re: What index structure does kNN algorithm use in mahout?
On Jan 8, 2010, at 2:17 PM, Ted Dunning wrote: > kNN stands for k-nearest neighbor. Yeah, I know. Just wasn't sure on the context of the question. > > On Fri, Jan 8, 2010 at 3:34 AM, Grant Ingersoll wrote: > >> Do you mean K-Means? >> >> On Jan 7, 2010, at 3:50 AM, xiao yang wrote: >> >>> Like R-tree. >>> Or it compares each record for every query? >>> >>> Thanks! >>> Xiao
Re: What index structure does kNN algorithm use in mahout?
kNN stands for k-nearest neighbor. On Fri, Jan 8, 2010 at 3:34 AM, Grant Ingersoll wrote: > Do you mean K-Means? > > On Jan 7, 2010, at 3:50 AM, xiao yang wrote: > > > Like R-tree. > > Or it compares each record for every query? > > > > Thanks! > > Xiao > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem using Solr/Lucene: > http://www.lucidimagination.com/search > > -- Ted Dunning, CTO DeepDyve
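The second alternative in the quoted question, comparing each record for every query, amounts to a linear scan with no index structure. The sketch below only illustrates that idea. It is not Mahout's implementation, the thread does not say what Mahout does, and the names BruteForceKnnSketch, squaredDistance and nearest are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustration only: a hypothetical linear-scan kNN, not Mahout code.
public final class BruteForceKnnSketch {

  // Squared Euclidean distance; the square root is skipped because it does not change the ranking.
  static double squaredDistance(double[] p, double[] q) {
    double sum = 0.0;
    for (int i = 0; i < p.length; i++) {
      double d = p[i] - q[i];
      sum += d * d;
    }
    return sum;
  }

  // Returns the indices of the k records closest to the query, nearest first.
  // Every record is compared with the query, so one query costs O(n) distance computations.
  static List<Integer> nearest(double[][] records, double[] query, int k) {
    final double[] distances = new double[records.length];
    List<Integer> indices = new ArrayList<Integer>();
    for (int i = 0; i < records.length; i++) {
      distances[i] = squaredDistance(records[i], query);
      indices.add(i);
    }
    Collections.sort(indices, new Comparator<Integer>() {
      @Override
      public int compare(Integer i, Integer j) {
        return Double.compare(distances[i], distances[j]);
      }
    });
    return new ArrayList<Integer>(indices.subList(0, Math.min(k, indices.size())));
  }

  public static void main(String[] args) {
    double[][] records = {{0, 0}, {5, 5}, {1, 1}, {9, 9}};
    System.out.println(nearest(records, new double[] {0.9, 0.9}, 2));
  }
}
```

A spatial index such as the R-tree mentioned in the question would aim to avoid visiting every record per query; the scan above is the baseline it would be compared against.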
Re: MapReduce Unit Testing
Mostly I depend on very strong unit tests for the mapper and reducer separately. As far as I have heard, MRUnit is the only game in town for creating simple unit tests for combining the mapper and reducer. On Fri, Jan 8, 2010 at 4:27 AM, zhao zhendong wrote: > Does anybody know a out-off-shift Unit testing package for Mapreduce > framework? MRUnit is good, but this package only can be found in Cloudera > own Hadoop. > -- Ted Dunning, CTO DeepDyve
Re: compareTo() issue
If the placement doesn't matter, why is returning 0 a problem? I'm just wondering if this doesn't introduce some subtle bugs in the way that not implementing hashCode/equals does. It may happen to work here but later...

The overflow problem is remote, but not trivial... can support be large? Like anywhere near a billion? Then it's a possible bug.

On Fri, Jan 8, 2010 at 6:26 PM, Robin Anil wrote:
> That one was specific to ordering of sub patterns in Fpgrowth stage. I did
> that as an optimisation where in the object needs to be at a random place in
> the heap if they are of equal length and support. Since it is the most
> called function in the entire algorithm, I got some performance benefit from
> it, because there were many patterns of the same length and support by
> different pattern underneath. Plus there is no need of comparing the two
> arrays which would be expensive. If the compareTo contract has to be
> maintained, I will move it out to another class and use it as the comparator
> during heap initialization.
>
> public int compareTo(Pattern cr2) {
>   long support2 = cr2.support();
>   int length2 = cr2.length();
>   if (support == support2) {
>     if (length == length2) {
>       // if they are of same length and support order randomly
>       return 1;
>     } else {
>       return length - length2;
>     }
>   } else {
>     if (support > support2) {
>       return 1;
>     } else {
>       return -1;
>     }
>   }
> }
>
> On Fri, Jan 8, 2010 at 2:58 PM, Sean Owen wrote:
>
>> I see some compareTo() methods with logic like this --
>>
>> int a = object1.foo();
>> int b = object2.foo();
>> if (a == b) {
>>   return 1; // order randomly
>> } else {
>>   return a - b;
>> }
>>
>> Three problems here:
>> - This does not produce a random ordering when used with a sort; it's
>> quite deterministic
>> - This violates the contract of compareTo() -- true that
>> a.compareTo(b) = -b.compareTo(a)
>> - "a-b" can overflow and give the wrong sign
>>
>> Mind if I fix?
>>
>> There's still code going in with lots of variances from what I think
>> are our agreed standards too.
>>
>> Sean
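To make the overflow concern concrete, here is a small sketch of how an "a - b" comparison can report the wrong sign. The class and method names are hypothetical and not taken from Mahout. Note that the wraparound needs operands of opposite sign and large magnitude; two non-negative ints cannot overflow when subtracted.

```java
// Illustration only: SubtractionOverflowDemo is a hypothetical class, not Mahout code.
public final class SubtractionOverflowDemo {

  // Subtraction-based comparison, as in the quoted snippet; the int result can wrap around.
  static int compareBySubtraction(int a, int b) {
    return a - b;
  }

  // Overflow-safe comparison: uses only relational operators, no arithmetic.
  static int compareSafely(int a, int b) {
    return a < b ? -1 : (a > b ? 1 : 0);
  }

  public static void main(String[] args) {
    int a = 2_000_000_000;
    int b = -2_000_000_000;
    // a is greater than b, yet a - b exceeds Integer.MAX_VALUE and wraps to a negative value.
    System.out.println("subtraction: " + compareBySubtraction(a, b));
    System.out.println("relational:  " + compareSafely(a, b));
  }
}
```

Comparing with relational operators costs about the same as the subtraction and removes the question of how large the values can get.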
Re: compareTo() issue
That one was specific to the ordering of sub patterns in the FPGrowth stage. I did that as an optimisation wherein the object needs to be at a random place in the heap if they are of equal length and support. Since it is the most called function in the entire algorithm, I got some performance benefit from it, because there were many patterns of the same length and support but different patterns underneath. Plus there is no need to compare the two arrays, which would be expensive. If the compareTo contract has to be maintained, I will move it out to another class and use it as the comparator during heap initialization.

public int compareTo(Pattern cr2) {
  long support2 = cr2.support();
  int length2 = cr2.length();
  if (support == support2) {
    if (length == length2) {
      // if they are of same length and support order randomly
      return 1;
    } else {
      return length - length2;
    }
  } else {
    if (support > support2) {
      return 1;
    } else {
      return -1;
    }
  }
}

On Fri, Jan 8, 2010 at 2:58 PM, Sean Owen wrote:
> I see some compareTo() methods with logic like this --
>
> int a = object1.foo();
> int b = object2.foo();
> if (a == b) {
>   return 1; // order randomly
> } else {
>   return a - b;
> }
>
> Three problems here:
> - This does not produce a random ordering when used with a sort; it's
> quite deterministic
> - This violates the contract of compareTo() -- true that
> a.compareTo(b) = -b.compareTo(a)
> - "a-b" can overflow and give the wrong sign
>
> Mind if I fix?
>
> There's still code going in with lots of variances from what I think
> are our agreed standards too.
>
> Sean
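The alternative proposed in this message, moving the ordering into a separate comparator that is handed to the heap, might look roughly like the sketch below. It is an assumption-laden illustration: SimplePattern is a hypothetical stand-in for the real Pattern class, java.util.PriorityQueue stands in for whatever heap the FPGrowth code uses, and ties return 0 rather than 1.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustration only: all names here are hypothetical stand-ins, not the Mahout classes.
public final class PatternHeapSketch {

  // Minimal stand-in carrying only the two fields the ordering looks at.
  static final class SimplePattern {
    final long support;
    final int length;

    SimplePattern(long support, int length) {
      this.support = support;
      this.length = length;
    }
  }

  // Orders by support, then by length. Equal support and length compare as 0,
  // so the heap is free to place such patterns anywhere relative to each other
  // and the underlying arrays are never compared.
  static final Comparator<SimplePattern> BY_SUPPORT_THEN_LENGTH = new Comparator<SimplePattern>() {
    @Override
    public int compare(SimplePattern p1, SimplePattern p2) {
      if (p1.support != p2.support) {
        return p1.support < p2.support ? -1 : 1;
      }
      if (p1.length != p2.length) {
        return p1.length < p2.length ? -1 : 1;
      }
      return 0;
    }
  };

  public static void main(String[] args) {
    // The comparator is supplied once, when the heap is created.
    PriorityQueue<SimplePattern> heap = new PriorityQueue<SimplePattern>(11, BY_SUPPORT_THEN_LENGTH);
    heap.add(new SimplePattern(5L, 3));
    heap.add(new SimplePattern(2L, 4));
    heap.add(new SimplePattern(5L, 1));
    System.out.println("head support: " + heap.peek().support);
  }
}
```

Keeping the ordering in a comparator leaves the class's own compareTo(), if it has one, free to follow the usual contract, while the hot path still does only two primitive comparisons.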
Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup
I wonder if we can get the hadoop people to include source jars with their snapshots? On Fri, Jan 8, 2010 at 11:23 AM, Sean Owen wrote: > I need a fix after 0.20.1, that's the primary reason. As a bonus, we > don't have to maintain our own version. The downside is relying on a > SNAPSHOT, but seems worth it to me. > > On Fri, Jan 8, 2010 at 4:02 PM, zhao zhendong wrote: >> Thanks Drew, >> >> +1 for me to maintain a stable hadoop release, such as 0.20.1. The reason is >> obvious :) >> >> Cheers, >> Zhendong >> >> >
Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup
I need a fix after 0.20.1, that's the primary reason. As a bonus, we don't have to maintain our own version. The downside is relying on a SNAPSHOT, but seems worth it to me. On Fri, Jan 8, 2010 at 4:02 PM, zhao zhendong wrote: > Thanks Drew, > > +1 for me to maintain a stable hadoop release, such as 0.20.1. The reason is > obvious :) > > Cheers, > Zhendong > >
Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup
Thanks Drew,

+1 for me to maintain a stable hadoop release, such as 0.20.1. The reason is obvious :)

Cheers,
Zhendong

On Fri, Jan 8, 2010 at 10:23 PM, Drew Farris wrote:
> First, apologies for propagating a problem here.
>
> Since 0.20.2 is a snapshot, there's no release of hadoop that
> corresponds directly to it. In maven terms, a snapshot could be
> anything after 0.20.1 but prior to a formal release of 0.20.2. In the
> repo, they are timestamped.
>
> We're pulling this jar from the apache snapshot repo here:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/hadoop/hadoop-core/0.20.2-SNAPSHOT/
>
> Unfortunately, hadoop doesn't publish a source jar alongside their
> snapshot, which makes it a bit difficult to pin down what was used for
> the build of the jar. From the maven repo, you can see the jar was
> generated 2009.11.13, 1:09:47 GMT, so as Ted suggests one option would
> be to check out the hadoop-core sources from that date and use those
> as a basis for your work on MAHOUT-232
>
> I'm not 100% comfortable with this. When I picked up the work for
> MAHOUT-238, I noticed that hadoop-0.20.2 SNAPSHOT was in core yet
> hadoop-0.20.1, the mahout hand-rolled version, was used in math and
> thus propagated as a transitive dependency across all of the projects.
>
> I believe we moved to a SNAPSHOT so that we could use one of the
> official hadoop dependencies instead of our hand-rolled version of
> 0.20.1. IIRC, there are some bugfixes that we required too, but I'm not
> certain about that?
>
> Does anyone feel that we should move back to 0.20.1 and use a stable
> hadoop release instead of using the snapshot, so that the version of
> the code we depend on is clearer?
>
> Drew
>
> On Fri, Jan 8, 2010 at 2:08 AM, Ted Dunning wrote:
> > SVN should help: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/
> >
> > If you look at the release notes, you should be able to discern what made up
> > 20.2 if it is a real release (looking at common made it look like it isn't).
> >
> > On Thu, Jan 7, 2010 at 9:46 PM, zhao zhendong wrote:
> >
> >> Where can I find the source code of hadoop-0.20.2?
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>

--
Zhen-Dong Zhao (Maxim)
<><<><><><><><><><>><><><><><>>
Department of Computer Science
School of Computing
National University of Singapore
><><><><><><><><><><><><><><><>
Homepage:http://zhaozhendong.googlepages.com
Mail: zhaozhend...@gmail.com
>>><><><><><><><><<><>><><<
Re: [jira] Commented: (MAHOUT-238) Further Dependency Cleanup
First, apologies for propagating a problem here.

Since 0.20.2 is a snapshot, there's no release of hadoop that corresponds directly to it. In maven terms, a snapshot could be anything after 0.20.1 but prior to a formal release of 0.20.2. In the repo, they are timestamped.

We're pulling this jar from the apache snapshot repo here:

https://repository.apache.org/content/groups/snapshots/org/apache/hadoop/hadoop-core/0.20.2-SNAPSHOT/

Unfortunately, hadoop doesn't publish a source jar alongside their snapshot, which makes it a bit difficult to pin down what was used for the build of the jar. From the maven repo, you can see the jar was generated 2009.11.13, 1:09:47 GMT, so as Ted suggests one option would be to check out the hadoop-core sources from that date and use those as a basis for your work on MAHOUT-232

I'm not 100% comfortable with this. When I picked up the work for MAHOUT-238, I noticed that hadoop-0.20.2 SNAPSHOT was in core yet hadoop-0.20.1, the mahout hand-rolled version, was used in math and thus propagated as a transitive dependency across all of the projects.

I believe we moved to a SNAPSHOT so that we could use one of the official hadoop dependencies instead of our hand-rolled version of 0.20.1. IIRC, there are some bugfixes that we required too, but I'm not certain about that?

Does anyone feel that we should move back to 0.20.1 and use a stable hadoop release instead of using the snapshot, so that the version of the code we depend on is clearer?

Drew

On Fri, Jan 8, 2010 at 2:08 AM, Ted Dunning wrote:
> SVN should help: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/
>
> If you look at the release notes, you should be able to discern what made up
> 20.2 if it is a real release (looking at common made it look like it isn't).
>
> On Thu, Jan 7, 2010 at 9:46 PM, zhao zhendong wrote:
>
>> Where can I find the source code of hadoop-0.20.2?
>
> --
> Ted Dunning, CTO
> DeepDyve
Re: MapReduce Unit Testing
Off-the-shelf? MRUnit is open source and Apache licensed, so I don't see why you can't use it. On Fri, Jan 8, 2010 at 12:27 PM, zhao zhendong wrote: > Hi, > > Does anybody know a out-off-shift Unit testing package for Mapreduce > framework? MRUnit is good, but this package only can be found in Cloudera > own Hadoop. > > Cheers, > Zhendong > -- > - > > Zhen-Dong Zhao (Maxim) > > <><<><><><><><><><>><><><><><>> > > Department of Computer Science > School of Computing > National University of Singapore > >><><><><><><><><><><><><><><><> > Homepage:http://zhaozhendong.googlepages.com > Mail: zhaozhend...@gmail.com <><><><><><><><<><>><><< >
MapReduce Unit Testing
Hi, Does anybody know a out-off-shift Unit testing package for Mapreduce framework? MRUnit is good, but this package only can be found in Cloudera own Hadoop. Cheers, Zhendong -- - Zhen-Dong Zhao (Maxim) <><<><><><><><><><>><><><><><>> Department of Computer Science School of Computing National University of Singapore ><><><><><><><><><><><><><><><> Homepage:http://zhaozhendong.googlepages.com Mail: zhaozhend...@gmail.com >>><><><><><><><><<><>><><<
Re: What index structure does kNN algorithm use in mahout?
Do you mean K-Means? On Jan 7, 2010, at 3:50 AM, xiao yang wrote: > Like R-tree. > Or it compares each record for every query? > > Thanks! > Xiao -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: svn commit: r896922 [1/3] - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/common/ core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ core/src/main/java/org/apache/mahout/fpm/pfp
the build is successful, thanks =D On Fri, Jan 8, 2010 at 9:23 AM, Robin Anil wrote: > Try Now >
compareTo() issue
I see some compareTo() methods with logic like this --

int a = object1.foo();
int b = object2.foo();
if (a == b) {
  return 1; // order randomly
} else {
  return a - b;
}

Three problems here:
- This does not produce a random ordering when used with a sort; it's quite deterministic
- This violates the contract of compareTo() -- it must be true that sgn(a.compareTo(b)) == -sgn(b.compareTo(a))
- "a-b" can overflow and give the wrong sign

Mind if I fix?

There's still code going in with lots of variances from what I think are our agreed standards too.

Sean
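The contract problem in the snippet above can be checked mechanically: when the two values are equal, swapping the arguments still yields 1, so the sign never flips. The sketch below demonstrates this, with hypothetical names (CompareToContractSketch, brokenCompare, fixedCompare, isAntisymmetric) that do not come from the Mahout code base.

```java
// Illustration only: all names are hypothetical, not taken from Mahout.
public final class CompareToContractSketch {

  // Same logic as the snippet above: ties return 1, otherwise the difference.
  static int brokenCompare(int a, int b) {
    if (a == b) {
      return 1;
    }
    return a - b;
  }

  // One possible fix: ties return 0 and no subtraction is involved.
  static int fixedCompare(int a, int b) {
    if (a == b) {
      return 0;
    }
    return a < b ? -1 : 1;
  }

  // True when swapping the arguments flips the sign, as the compareTo() contract requires.
  static boolean isAntisymmetric(int forward, int backward) {
    return Integer.signum(forward) == -Integer.signum(backward);
  }

  public static void main(String[] args) {
    // Equal values: the broken version claims each side is greater than the other.
    System.out.println("broken, tie: " + isAntisymmetric(brokenCompare(7, 7), brokenCompare(7, 7)));
    System.out.println("fixed, tie:  " + isAntisymmetric(fixedCompare(7, 7), fixedCompare(7, 7)));
    System.out.println("fixed, 3 vs 9: " + isAntisymmetric(fixedCompare(3, 9), fixedCompare(9, 3)));
  }
}
```

The same run also illustrates the first point: the tie case always returns 1 for the same inputs, so nothing about the resulting order is random.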
Re: svn commit: r896922 [1/3] - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/common/ core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ core/src/main/java/org/apache/mahout/fpm/pfp
Try Now