[jira] Commented: (MAHOUT-235) GenericSorting.java also needs replacing
[ https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795707#action_12795707 ] Benson Margulies commented on MAHOUT-235: - the odd thing is that I got here by modifying one of the many working copies to use an external swapper. On Dec 31, 2009, at 6:24 PM, "Ted Dunning (JIRA)" > GenericSorting.java also needs replacing > > > Key: MAHOUT-235 > URL: https://issues.apache.org/jira/browse/MAHOUT-235 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies > Attachments: oops.diff > > > It looks like GenericSorting.java is more code of the same dubious parentage > that needs the same treatment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (MAHOUT-235) GenericSorting.java also needs replacing
[ https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795704#action_12795704 ] Ted Dunning edited comment on MAHOUT-235 at 12/31/09 11:24 PM: --- I just looked at this. Didn't find the problem, but did make some comments in the code and clarified it a bit. My guess is that the main partition loop leaves b and c in the wrong order. If i get another chance to look at this, I will put in some assertions that match what I think the loop invariants should be. That will probably make things obvious. Here is my commented version fo the key sort routine: {noformat} private static void quickSort0(int start, int end, IntComparator comp, Swapper swap) { int length = end - start; if (length < 7) { insertionSort(start, end, comp, swap); return; } int middle = (start + end) / 2; if (length > 7) { int bottom = start; int top = end - 1; if (length > 40) { // for lots of data, bottom, middle and top are medians near the beginning, middle or end of the data int skosh = length / 8; bottom = med3(bottom, bottom + skosh, bottom + (2 * skosh), comp); middle = med3(middle - skosh, middle, middle + skosh, comp); top = med3(top - (2 * skosh), top - skosh, top, comp); } middle = med3(bottom, middle, top, comp); } int partitionIndex = middle; // an index, not a value. // regions from a to b and from c to d are what we will recursively sort int a, b, c, d; a = b = start; c = d = end - 1; while (b <= c) { // copy all values equal to the partition value to before a..b. In the process, advance b // as long as values less than the partition or equal are found, also stop when a..b collides with c..d int comparison = comp.compare(b, partitionIndex); while (b <= c && comparison <= 0) { if (comparison == 0) { swap.swap(a, b); a++; } b++; comparison = comp.compare(b, partitionIndex); } // at this point [start..a) has partition values, [a..b) has values < partition // also, either b>c or v[b] > partition value comparison = comp.compare(c, partitionIndex); while (c >= b && comparison >= 0) { if (comparison == 0) { swap.swap(c, d); d--; } c--; comparison = comp.compare(c, partitionIndex); } // now we also know that [d..end] contains partition values, // [c..d) contains values > partition value // also, either b>c or (v[b] > partition OR v[c] < partition) if (b <= c) { // v[b] > partition OR v[c] < partition // swapping will let us continue to grow the two regions swap.swap(b, c); b++; c--; } } // now we know // b = c+1 // [start..a) and [d..end) contain partition value // all of [a..b) are less than partition // all of [c..d) are greater than partition // shift [a..b) to beginning length = Math.min(a - start, b - a); int l = start; int h = b - length; while (length-- > 0) { swap.swap(l, h); l++; h++; } // shift [c..d) to end length = Math.min(d - c, end - 1 - d); l = b; h = end - length; while (length-- > 0) { swap.swap(l, h); l++; h++; } // recurse left and right length = b - a; if (length > 0) { quickSort0(start, start + length, comp, swap); } length = d - c; if (length > 0) { quickSort0(end - length, end, comp, swap); } } /** * In-place insertion sort that is fast for pre-sorted data. * * @param start Where to start sorting (inclusive) * @param end Where to stop (exclusive) * @param comp Sort order. * @param swap How to swap items. */ private static void insertionSort(int start, int end, IntComparator comp, Swapper swap) { for (int i = start + 1; i < end; i++) { for (int j = i; j > start && comp.compare(j - 1, j) > 0; j--) { swap.swap(j - 1, j); } } } {noformat} was (Author: tdunning): I just looked at this. Didn't find the problem, but did make some comments in the code and clarified it a bit. My guess is that the main partition loop leaves b and c in the wrong order. If i get another chance to look at this, I will put in some assertions that match what I think the loop invariants should be. That will probably make things obvious. Here is my commented version fo the key sort routine: {unformatted} private static void quickSort0(int start, int end, IntComparator comp, Swapper swap) { int length = end - start; if (length < 7) { insertionSort(start, end, comp, swap); return; } int middle = (start + end) / 2; if (length > 7) { int bottom = s
[jira] Commented: (MAHOUT-235) GenericSorting.java also needs replacing
[ https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795704#action_12795704 ] Ted Dunning commented on MAHOUT-235: I just looked at this. Didn't find the problem, but did make some comments in the code and clarified it a bit. My guess is that the main partition loop leaves b and c in the wrong order. If i get another chance to look at this, I will put in some assertions that match what I think the loop invariants should be. That will probably make things obvious. Here is my commented version fo the key sort routine: {unformatted} private static void quickSort0(int start, int end, IntComparator comp, Swapper swap) { int length = end - start; if (length < 7) { insertionSort(start, end, comp, swap); return; } int middle = (start + end) / 2; if (length > 7) { int bottom = start; int top = end - 1; if (length > 40) { // for lots of data, bottom, middle and top are medians near the beginning, middle or end of the data int skosh = length / 8; bottom = med3(bottom, bottom + skosh, bottom + (2 * skosh), comp); middle = med3(middle - skosh, middle, middle + skosh, comp); top = med3(top - (2 * skosh), top - skosh, top, comp); } middle = med3(bottom, middle, top, comp); } int partitionIndex = middle; // an index, not a value. // regions from a to b and from c to d are what we will recursively sort int a, b, c, d; a = b = start; c = d = end - 1; while (b <= c) { // copy all values equal to the partition value to before a..b. In the process, advance b // as long as values less than the partition or equal are found, also stop when a..b collides with c..d int comparison = comp.compare(b, partitionIndex); while (b <= c && comparison <= 0) { if (comparison == 0) { swap.swap(a, b); a++; } b++; comparison = comp.compare(b, partitionIndex); } // at this point [start..a) has partition values, [a..b) has values < partition // also, either b>c or v[b] > partition value comparison = comp.compare(c, partitionIndex); while (c >= b && comparison >= 0) { if (comparison == 0) { swap.swap(c, d); d--; } c--; comparison = comp.compare(c, partitionIndex); } // now we also know that [d..end] contains partition values, // [c..d) contains values > partition value // also, either b>c or (v[b] > partition OR v[c] < partition) if (b <= c) { // v[b] > partition OR v[c] < partition // swapping will let us continue to grow the two regions swap.swap(b, c); b++; c--; } } // now we know // b = c+1 // [start..a) and [d..end) contain partition value // all of [a..b) are less than partition // all of [c..d) are greater than partition // shift [a..b) to beginning length = Math.min(a - start, b - a); int l = start; int h = b - length; while (length-- > 0) { swap.swap(l, h); l++; h++; } // shift [c..d) to end length = Math.min(d - c, end - 1 - d); l = b; h = end - length; while (length-- > 0) { swap.swap(l, h); l++; h++; } // recurse left and right length = b - a; if (length > 0) { quickSort0(start, start + length, comp, swap); } length = d - c; if (length > 0) { quickSort0(end - length, end, comp, swap); } } /** * In-place insertion sort that is fast for pre-sorted data. * * @param start Where to start sorting (inclusive) * @param end Where to stop (exclusive) * @param comp Sort order. * @param swap How to swap items. */ private static void insertionSort(int start, int end, IntComparator comp, Swapper swap) { for (int i = start + 1; i < end; i++) { for (int j = i; j > start && comp.compare(j - 1, j) > 0; j--) { swap.swap(j - 1, j); } } } {unformatted} > GenericSorting.java also needs replacing > > > Key: MAHOUT-235 > URL: https://issues.apache.org/jira/browse/MAHOUT-235 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies > Attachments: oops.diff > > > It looks like GenericSorting.java is more code of the same dubious parentage > that needs the same treatment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[math] I need an extra pair of eyeballs
I attached a patch to MAHOUT-235 that fails its unit test. I haven't spend \hours/ staring at it, but I have spent some time, and I feel like all this quicksort code is running together in my head. If someone is feeling charitable, or wants an opportunity to snicker at my expense ...
[jira] Updated: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-230: --- Resolution: Fixed Status: Resolved (was: Patch Available) Anyone can. > Replace org.apache.mahout.math.Sorting with code of clear provenance > > > Key: MAHOUT-230 > URL: https://issues.apache.org/jira/browse/MAHOUT-230 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Grant Ingersoll > Fix For: 0.3 > > Attachments: replace-sorting.diff > > Original Estimate: 72h > Remaining Estimate: 72h > > org.apache.mahout.math.Sorting looks as if the original author borrowed from > the Sun JRE, based on the private internal function names and contents. That > code has a restrictive license. We need to take the equivalent file > (java.util.Arrays) from Apache Harmony and use it as the basis for a clean > replacement. > The problematic code are the quickSort and mergeSort functions, which extend > 'Arrays' by supporting slices of arrays and custom sorting predicate > functions. > One might also wistfully note that the more recent JDKs from Sun have > deployed different (and one hopes) better sort algorithms that 1.5 and/or > Harmony, so a really energetic person might build implementations in here to > match. However, expediency calls for just bashing on the Harmony > implementation to solve the problem at hand. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance
[ https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795666#action_12795666 ] Benson Margulies commented on MAHOUT-230: - Grant, do you resolve or do I? > Replace org.apache.mahout.math.Sorting with code of clear provenance > > > Key: MAHOUT-230 > URL: https://issues.apache.org/jira/browse/MAHOUT-230 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Grant Ingersoll > Fix For: 0.3 > > Attachments: replace-sorting.diff > > Original Estimate: 72h > Remaining Estimate: 72h > > org.apache.mahout.math.Sorting looks as if the original author borrowed from > the Sun JRE, based on the private internal function names and contents. That > code has a restrictive license. We need to take the equivalent file > (java.util.Arrays) from Apache Harmony and use it as the basis for a clean > replacement. > The problematic code are the quickSort and mergeSort functions, which extend > 'Arrays' by supporting slices of arrays and custom sorting predicate > functions. > One might also wistfully note that the more recent JDKs from Sun have > deployed different (and one hopes) better sort algorithms that 1.5 and/or > Harmony, so a really energetic person might build implementations in here to > match. However, expediency calls for just bashing on the Harmony > implementation to solve the problem at hand. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-235) GenericSorting.java also needs replacing
[ https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-235: Attachment: oops.diff This doesn't work, and I'm a bit stumped. > GenericSorting.java also needs replacing > > > Key: MAHOUT-235 > URL: https://issues.apache.org/jira/browse/MAHOUT-235 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies > Attachments: oops.diff > > > It looks like GenericSorting.java is more code of the same dubious parentage > that needs the same treatment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio
[ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795604#action_12795604 ] Grant Ingersoll commented on MAHOUT-163: Yep, I committed a change w/ those things being configurable. > Get (better) cluster labels using Log Likelihood Ratio > -- > > Key: MAHOUT-163 > URL: https://issues.apache.org/jira/browse/MAHOUT-163 > Project: Mahout > Issue Type: Improvement >Reporter: Shashikant Kore >Assignee: Grant Ingersoll > Fix For: 0.3 > > Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, > MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, > mahout-cluster-labels-llr.patch > > > Log Likelihood Ratio (LLR) is a better technique to identify cluster labels > instead of the top features of the centroid vector. LLR finds terms/phrases > which are common in the cluster but rare outside. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795603#action_12795603 ] Grant Ingersoll commented on MAHOUT-220: Yeah, I don't think Utils should need to depend on core. I would think it should be: Most things depend on Math (i.e. Vectors) including core and utils Utils should be standalone tools for getting things into the appropriate mathematical representation which can be consumed by core, et. al. bq. Why is utils in its own module anyways, instead of just being in core? I imagined utils to be the place where things that did useful ancillary tasks lived, such as converting ARFF to Mahout Vector or converting a Lucene index to Mahout Vector or converting a whole slew of raw text to Mahout Vector. That way the core wouldn't need to be muddied by the various dependencies and could be as lightweight as possible. So, overtime, Utils may grow to have a dep. on Tika for instance or a pipeline like OpenPipeline, but the core need not know anything about it. Bleah, indeed! > Mahout Bayes Code cleanup > - > > Key: MAHOUT-220 > URL: https://issues.apache.org/jira/browse/MAHOUT-220 > Project: Mahout > Issue Type: Improvement > Components: Classification >Affects Versions: 0.3 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.3 > > Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch > > > Following isabel's checkstyle, I am adding a whole slew of code cleanup with > the following exceptions > 1. Line length used is 120 instead of 80. > 2. static final log is kept as is. not LOG. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [math] MAHOUT-235/GenericSorting, looking for a clue
It is needed, but on the drive in this morning I thought of a perfectly straightforward way to extend the code in Sorting to do what it does. It's got something that nothing else has: sort (startIndex, endIndex, Comparator, Swapper) but I can see how to add that without understanding the code we got from CERN which is covered with comments indicating JDK derivation. On Thu, Dec 31, 2009 at 7:01 AM, Grant Ingersoll wrote: > +1. The JVM provides a mergeSort, right? Lucene has a quickSort > implementation that I believe is pretty fast. > > On Dec 31, 2009, at 1:09 AM, Ted Dunning wrote: > >> Nuke it if there isn't an obvious need for the code. >> >> (my opinion anyway) >> >> On Wed, Dec 30, 2009 at 6:10 PM, Benson Margulies >> wrote: >> >>> Folks, >>> >>> GenericSorting contains quick and merge sorts where the sort procedure >>> has no direct access to the data. It takes a swapper and a comparator, >>> and the idea is that the comparator takes the indices and uses them to >>> look in some mysterious place. >>> >>> I am not understanding the implementation of quicksort, which is doing >>> things with the indices which I would expect to require access to the >>> values. Something tells me that this is old news. >>> >>> --benson >>> >> >> >> >> -- >> Ted Dunning, CTO >> DeepDyve > >
Re: [math] MAHOUT-235/GenericSorting, looking for a clue
+1. The JVM provides a mergeSort, right? Lucene has a quickSort implementation that I believe is pretty fast. On Dec 31, 2009, at 1:09 AM, Ted Dunning wrote: > Nuke it if there isn't an obvious need for the code. > > (my opinion anyway) > > On Wed, Dec 30, 2009 at 6:10 PM, Benson Margulies > wrote: > >> Folks, >> >> GenericSorting contains quick and merge sorts where the sort procedure >> has no direct access to the data. It takes a swapper and a comparator, >> and the idea is that the comparator takes the indices and uses them to >> look in some mysterious place. >> >> I am not understanding the implementation of quicksort, which is doing >> things with the indices which I would expect to require access to the >> values. Something tells me that this is old news. >> >> --benson >> > > > > -- > Ted Dunning, CTO > DeepDyve
[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio
[ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795570#action_12795570 ] Shashikant Kore commented on MAHOUT-163: Grant, Yes, it should have been configurable number. If the corpus size is big (tens of thousands of documents or more), the size I was working with, most likely such clusters are formed by outliers. Ignoring such clusters doesn't have any impact on the quality. > Get (better) cluster labels using Log Likelihood Ratio > -- > > Key: MAHOUT-163 > URL: https://issues.apache.org/jira/browse/MAHOUT-163 > Project: Mahout > Issue Type: Improvement >Reporter: Shashikant Kore >Assignee: Grant Ingersoll > Fix For: 0.3 > > Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, > MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, > mahout-cluster-labels-llr.patch > > > Log Likelihood Ratio (LLR) is a better technique to identify cluster labels > instead of the top features of the centroid vector. LLR finds terms/phrases > which are common in the cluster but rare outside. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.