[jira] Commented: (MAHOUT-235) GenericSorting.java also needs replacing

2009-12-31 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795707#action_12795707
 ] 

Benson Margulies commented on MAHOUT-235:
-

the odd thing is that I got here by modifying
one of the many working copies to use an
external swapper.

On Dec 31, 2009, at 6:24 PM, "Ted Dunning (JIRA)"   



> GenericSorting.java also needs replacing
> 
>
> Key: MAHOUT-235
> URL: https://issues.apache.org/jira/browse/MAHOUT-235
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
> Attachments: oops.diff
>
>
> It looks like GenericSorting.java is more code of the same dubious parentage 
> that needs the same treatment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (MAHOUT-235) GenericSorting.java also needs replacing

2009-12-31 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795704#action_12795704
 ] 

Ted Dunning edited comment on MAHOUT-235 at 12/31/09 11:24 PM:
---

I just looked at this.  Didn't find the problem, but did make some comments in 
the code and clarified it a bit.  My guess is that the main partition loop 
leaves b and c in the wrong order.  If i get another chance to look at this, I 
will put in some assertions that match what I think the loop invariants should 
be.  That will probably make things obvious.

Here is my commented version fo the key sort routine:
{noformat}
  private static void quickSort0(int start, int end, IntComparator comp, 
Swapper swap) {
int length = end - start;
if (length < 7) {
  insertionSort(start, end, comp, swap);
  return;
}
int middle = (start + end) / 2;
if (length > 7) {
  int bottom = start;
  int top = end - 1;
  if (length > 40) {
// for lots of data, bottom, middle and top are medians near the 
beginning, middle or end of the data
int skosh = length / 8;
bottom = med3(bottom, bottom + skosh, bottom + (2 * skosh), comp);
middle = med3(middle - skosh, middle, middle + skosh, comp);
top = med3(top - (2 * skosh), top - skosh, top, comp);
  }
  middle = med3(bottom, middle, top, comp);
}

int partitionIndex = middle; // an index, not a value.

// regions from a to b and from c to d are what we will recursively sort
int a, b, c, d;
a = b = start;
c = d = end - 1;
while (b <= c) {
  // copy all values equal to the partition value to before a..b.  In the 
process, advance b
  // as long as values less than the partition or equal are found, also 
stop when a..b collides with c..d
  int comparison = comp.compare(b, partitionIndex);
  while (b <= c && comparison <= 0) {
if (comparison == 0) {
  swap.swap(a, b);
  a++;
}
b++;
comparison = comp.compare(b, partitionIndex);
  }
  // at this point [start..a) has partition values, [a..b) has values < 
partition
  // also, either b>c or v[b] > partition value

  comparison = comp.compare(c, partitionIndex);
  while (c >= b && comparison >= 0) {
if (comparison == 0) {
  swap.swap(c, d);
  d--;
}
c--;
comparison = comp.compare(c, partitionIndex);
  }
  // now we also know that [d..end] contains partition values,
  // [c..d) contains values > partition value
  // also, either b>c or (v[b] > partition OR v[c] < partition)

  if (b <= c) {
// v[b] > partition OR v[c] < partition
// swapping will let us continue to grow the two regions
swap.swap(b, c);
b++;
c--;
  }
}
// now we know
// b = c+1
// [start..a) and [d..end) contain partition value
// all of [a..b) are less than partition
// all of [c..d) are greater than partition

// shift [a..b) to beginning
length = Math.min(a - start, b - a);
int l = start;
int h = b - length;
while (length-- > 0) {
  swap.swap(l, h);
  l++;
  h++;
}

// shift [c..d) to end
length = Math.min(d - c, end - 1 - d);
l = b;
h = end - length;
 while (length-- > 0) {
  swap.swap(l, h);
  l++;
  h++;
}

// recurse left and right
length = b - a;
if (length > 0) {
  quickSort0(start, start + length, comp, swap);
}

length = d - c;
if (length > 0) {
  quickSort0(end - length, end, comp, swap);
}
  }

  /**
   * In-place insertion sort that is fast for pre-sorted data.
   *
   * @param start Where to start sorting (inclusive)
   * @param end   Where to stop (exclusive)
   * @param comp  Sort order.
   * @param swap  How to swap items.
   */
  private static void insertionSort(int start, int end, IntComparator comp, 
Swapper swap) {
for (int i = start + 1; i < end; i++) {
  for (int j = i; j > start && comp.compare(j - 1, j) > 0; j--) {
swap.swap(j - 1, j);
  }
}
  }
{noformat}


  was (Author: tdunning):

I just looked at this.  Didn't find the problem, but did make some comments in 
the code and clarified it a bit.  My guess is that the main partition loop 
leaves b and c in the wrong order.  If i get another chance to look at this, I 
will put in some assertions that match what I think the loop invariants should 
be.  That will probably make things obvious.

Here is my commented version fo the key sort routine:
{unformatted}
  private static void quickSort0(int start, int end, IntComparator comp, 
Swapper swap) {
int length = end - start;
if (length < 7) {
  insertionSort(start, end, comp, swap);
  return;
}
int middle = (start + end) / 2;
if (length > 7) {
  int bottom = s

[jira] Commented: (MAHOUT-235) GenericSorting.java also needs replacing

2009-12-31 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795704#action_12795704
 ] 

Ted Dunning commented on MAHOUT-235:



I just looked at this.  Didn't find the problem, but did make some comments in 
the code and clarified it a bit.  My guess is that the main partition loop 
leaves b and c in the wrong order.  If i get another chance to look at this, I 
will put in some assertions that match what I think the loop invariants should 
be.  That will probably make things obvious.

Here is my commented version fo the key sort routine:
{unformatted}
  private static void quickSort0(int start, int end, IntComparator comp, 
Swapper swap) {
int length = end - start;
if (length < 7) {
  insertionSort(start, end, comp, swap);
  return;
}
int middle = (start + end) / 2;
if (length > 7) {
  int bottom = start;
  int top = end - 1;
  if (length > 40) {
// for lots of data, bottom, middle and top are medians near the 
beginning, middle or end of the data
int skosh = length / 8;
bottom = med3(bottom, bottom + skosh, bottom + (2 * skosh), comp);
middle = med3(middle - skosh, middle, middle + skosh, comp);
top = med3(top - (2 * skosh), top - skosh, top, comp);
  }
  middle = med3(bottom, middle, top, comp);
}

int partitionIndex = middle; // an index, not a value.

// regions from a to b and from c to d are what we will recursively sort
int a, b, c, d;
a = b = start;
c = d = end - 1;
while (b <= c) {
  // copy all values equal to the partition value to before a..b.  In the 
process, advance b
  // as long as values less than the partition or equal are found, also 
stop when a..b collides with c..d
  int comparison = comp.compare(b, partitionIndex);
  while (b <= c && comparison <= 0) {
if (comparison == 0) {
  swap.swap(a, b);
  a++;
}
b++;
comparison = comp.compare(b, partitionIndex);
  }
  // at this point [start..a) has partition values, [a..b) has values < 
partition
  // also, either b>c or v[b] > partition value

  comparison = comp.compare(c, partitionIndex);
  while (c >= b && comparison >= 0) {
if (comparison == 0) {
  swap.swap(c, d);
  d--;
}
c--;
comparison = comp.compare(c, partitionIndex);
  }
  // now we also know that [d..end] contains partition values,
  // [c..d) contains values > partition value
  // also, either b>c or (v[b] > partition OR v[c] < partition)

  if (b <= c) {
// v[b] > partition OR v[c] < partition
// swapping will let us continue to grow the two regions
swap.swap(b, c);
b++;
c--;
  }
}
// now we know
// b = c+1
// [start..a) and [d..end) contain partition value
// all of [a..b) are less than partition
// all of [c..d) are greater than partition

// shift [a..b) to beginning
length = Math.min(a - start, b - a);
int l = start;
int h = b - length;
while (length-- > 0) {
  swap.swap(l, h);
  l++;
  h++;
}

// shift [c..d) to end
length = Math.min(d - c, end - 1 - d);
l = b;
h = end - length;
 while (length-- > 0) {
  swap.swap(l, h);
  l++;
  h++;
}

// recurse left and right
length = b - a;
if (length > 0) {
  quickSort0(start, start + length, comp, swap);
}

length = d - c;
if (length > 0) {
  quickSort0(end - length, end, comp, swap);
}
  }

  /**
   * In-place insertion sort that is fast for pre-sorted data.
   *
   * @param start Where to start sorting (inclusive)
   * @param end   Where to stop (exclusive)
   * @param comp  Sort order.
   * @param swap  How to swap items.
   */
  private static void insertionSort(int start, int end, IntComparator comp, 
Swapper swap) {
for (int i = start + 1; i < end; i++) {
  for (int j = i; j > start && comp.compare(j - 1, j) > 0; j--) {
swap.swap(j - 1, j);
  }
}
  }
{unformatted}


> GenericSorting.java also needs replacing
> 
>
> Key: MAHOUT-235
> URL: https://issues.apache.org/jira/browse/MAHOUT-235
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
> Attachments: oops.diff
>
>
> It looks like GenericSorting.java is more code of the same dubious parentage 
> that needs the same treatment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[math] I need an extra pair of eyeballs

2009-12-31 Thread Benson Margulies
I attached a patch to MAHOUT-235 that fails its unit test. I haven't
spend \hours/ staring at it, but I have spent some time, and I feel
like all this quicksort code is running together in my head. If
someone is feeling charitable, or wants an opportunity to snicker at
my expense ...


[jira] Updated: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-31 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-230:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Anyone can.

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-230) Replace org.apache.mahout.math.Sorting with code of clear provenance

2009-12-31 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795666#action_12795666
 ] 

Benson Margulies commented on MAHOUT-230:
-

Grant, do you resolve or do I?

> Replace org.apache.mahout.math.Sorting with code of clear provenance
> 
>
> Key: MAHOUT-230
> URL: https://issues.apache.org/jira/browse/MAHOUT-230
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: replace-sorting.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> org.apache.mahout.math.Sorting looks as if the original author borrowed from 
> the Sun JRE, based on the private internal function names and contents. That 
> code has a restrictive license. We need to take the equivalent file 
> (java.util.Arrays) from Apache Harmony and use it as the basis for a clean 
> replacement.
> The problematic code are the quickSort and mergeSort functions, which extend 
> 'Arrays' by supporting slices of arrays and custom sorting predicate 
> functions. 
> One might also wistfully note that the more recent JDKs from Sun have 
> deployed different (and one hopes) better sort algorithms that 1.5 and/or 
> Harmony, so a really energetic person might build implementations in here to 
> match. However, expediency calls for just bashing on the Harmony 
> implementation to solve the problem at hand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-235) GenericSorting.java also needs replacing

2009-12-31 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated MAHOUT-235:


Attachment: oops.diff

This doesn't work, and I'm a bit stumped.

> GenericSorting.java also needs replacing
> 
>
> Key: MAHOUT-235
> URL: https://issues.apache.org/jira/browse/MAHOUT-235
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
> Attachments: oops.diff
>
>
> It looks like GenericSorting.java is more code of the same dubious parentage 
> that needs the same treatment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

2009-12-31 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795604#action_12795604
 ] 

Grant Ingersoll commented on MAHOUT-163:


Yep, I committed a change w/ those things being configurable.

> Get (better) cluster labels using Log Likelihood Ratio
> --
>
> Key: MAHOUT-163
> URL: https://issues.apache.org/jira/browse/MAHOUT-163
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Shashikant Kore
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, 
> MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, 
> mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels 
> instead of the top features of the centroid vector. LLR finds terms/phrases 
> which are common in the cluster but rare outside. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-220) Mahout Bayes Code cleanup

2009-12-31 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795603#action_12795603
 ] 

Grant Ingersoll commented on MAHOUT-220:


Yeah, I don't think Utils should need to depend on core.  I would think it 
should be:

Most things depend on Math (i.e. Vectors) including core and utils
Utils should be standalone tools for getting things into the appropriate 
mathematical representation which can be consumed by core, et. al.

bq. Why is utils in its own module anyways, instead of just being in core?

I imagined utils to be the place where things that did useful ancillary tasks 
lived, such as converting ARFF to Mahout Vector or converting a Lucene index to 
Mahout Vector or converting a whole slew of raw text to Mahout Vector.  That 
way the core wouldn't need to be muddied by the various dependencies and could 
be as lightweight as possible.  So, overtime, Utils may grow to have a dep. on 
Tika for instance or a pipeline like OpenPipeline, but the core need not know 
anything about it.

Bleah, indeed!

> Mahout Bayes Code cleanup
> -
>
> Key: MAHOUT-220
> URL: https://issues.apache.org/jira/browse/MAHOUT-220
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch
>
>
> Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
> the following exceptions
> 1.  Line length used is 120 instead of 80. 
> 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [math] MAHOUT-235/GenericSorting, looking for a clue

2009-12-31 Thread Benson Margulies
It is needed, but on the drive in this morning I thought of a
perfectly straightforward way to extend the code in Sorting to do what
it does.

It's got something that nothing else has:

sort (startIndex, endIndex, Comparator, Swapper)

but I can see how to add that without understanding the code we got
from CERN which is covered with comments indicating JDK derivation.

On Thu, Dec 31, 2009 at 7:01 AM, Grant Ingersoll  wrote:
> +1.  The JVM provides a mergeSort, right?  Lucene has a quickSort 
> implementation that I believe is pretty fast.
>
> On Dec 31, 2009, at 1:09 AM, Ted Dunning wrote:
>
>> Nuke it if there isn't an obvious need for the code.
>>
>> (my opinion anyway)
>>
>> On Wed, Dec 30, 2009 at 6:10 PM, Benson Margulies 
>> wrote:
>>
>>> Folks,
>>>
>>> GenericSorting contains quick and merge sorts where the sort procedure
>>> has no direct access to the data. It takes a swapper and a comparator,
>>> and the idea is that the comparator takes the indices and uses them to
>>> look in some mysterious place.
>>>
>>> I am not understanding the implementation of quicksort, which is doing
>>> things with the indices which I would expect to require access to the
>>> values. Something tells me that this is old news.
>>>
>>> --benson
>>>
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>
>


Re: [math] MAHOUT-235/GenericSorting, looking for a clue

2009-12-31 Thread Grant Ingersoll
+1.  The JVM provides a mergeSort, right?  Lucene has a quickSort 
implementation that I believe is pretty fast.

On Dec 31, 2009, at 1:09 AM, Ted Dunning wrote:

> Nuke it if there isn't an obvious need for the code.
> 
> (my opinion anyway)
> 
> On Wed, Dec 30, 2009 at 6:10 PM, Benson Margulies 
> wrote:
> 
>> Folks,
>> 
>> GenericSorting contains quick and merge sorts where the sort procedure
>> has no direct access to the data. It takes a swapper and a comparator,
>> and the idea is that the comparator takes the indices and uses them to
>> look in some mysterious place.
>> 
>> I am not understanding the implementation of quicksort, which is doing
>> things with the indices which I would expect to require access to the
>> values. Something tells me that this is old news.
>> 
>> --benson
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve



[jira] Commented: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

2009-12-31 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795570#action_12795570
 ] 

Shashikant Kore commented on MAHOUT-163:


Grant, 

Yes, it should have been configurable number.  

If the corpus size is big (tens of thousands of documents or more), the size I 
was working with, most likely such clusters are formed by outliers. Ignoring 
such clusters doesn't have any impact on the quality.


> Get (better) cluster labels using Log Likelihood Ratio
> --
>
> Key: MAHOUT-163
> URL: https://issues.apache.org/jira/browse/MAHOUT-163
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Shashikant Kore
>Assignee: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-163-17sep.patch, MAHOUT-163.patch, 
> MAHOUT-163.patch, MAHOUT-163.patch, MAHOUT-163.patch, mahout-163.patch, 
> mahout-cluster-labels-llr.patch
>
>
> Log Likelihood Ratio (LLR) is a better technique to identify cluster labels 
> instead of the top features of the centroid vector. LLR finds terms/phrases 
> which are common in the cluster but rare outside. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.