[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)

2013-06-03 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672852#comment-13672852
 ] 

Dawid Weiss commented on MAHOUT-1225:
-

Ehm. I've closed this issue as per Robin's comment above, but I don't think this 
was the right way to go -- it should have been left open (with a fixed 
resolution) until a release is made. Apologies for the noise. I can't reopen it 
now -- probably missing some Jira karma. Please correct my mistake if you have 
admin rights: reopen, then bulk-close at release time. Thanks!

 Sets and maps incorrectly clear() their state arrays (potential endless loops)
 --

 Key: MAHOUT-1225
 URL: https://issues.apache.org/jira/browse/MAHOUT-1225
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7
 Environment: Eclipse, linux Fedora 17, Java 1.7, Mahout Maths 
 collections (Set) 0.7, hppc 0.4.3
Reporter: Sophie Sperner
Assignee: Dawid Weiss
  Labels: hashset, java, mahout, test
 Fix For: 0.7

 Attachments: hppc-0.4.3.jar, MAHOUT-1225.patch, MAHOUT-1225.patch, 
 MAHOUT-1225.patch, mahout-math-0.8-SNAPSHOT.jar

   Original Estimate: 48h
  Remaining Estimate: 48h

 The code I attached hangs forever; Eclipse does not print a stack trace 
 because the program never terminates. So I decided to make a small test.java 
 file that you can easily run.
 This code has a main function that simply runs the getItemList() method, which 
 successfully executes the getDataset() method (here, please download the 
 mushroom.dat dataset and set its full path in the filePath string variable) 
 and then hangs (the problem happens on the fourth columnValues.add() call). 
 After the dataset is read into the X array, the code simply goes through X 
 column by column and collects the distinct items in each column.
 If you uncomment IntSet columnValues = new IntOpenHashSet(); and the 
 corresponding import headers, then everything works just fine (you will 
 also need to include the hppc jar file found at 
 http://labs.carrotsearch.com/hppc.html or below in the attachment).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: mutual information issue in logLikelihoodRatio

2013-06-03 Thread Ted Dunning
So looking at the tests, this makes things look less horrifying.

org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood

assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1),
0.01);
assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10),
0.1);
assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0,
100000), 0.1);
assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995,
1000, 100000), 0.001);
assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000,
1000, 100000), 0.001);
assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000,
1000, 99000), 0.001);

Next step is to determine whether these values are correct.  I recognize
the first two.

I put these values into my R script and got a successful load.  I think
that this means that the code is somehow correct, regardless of your
reading of it.  I don't have time right now to read the code in detail, but
I think that things are working.

You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R
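For a quick cross-check of those numbers outside R, the unnormalized-entropy formulation can be sketched in plain Java. This is a hedged, standalone sketch (class and method names are illustrative, not Mahout's actual source), using row/column sums as 陈文龙 suggests; it reproduces the values in LogLikelihoodTest.

```java
// Hedged sketch: reproduces LogLikelihood.logLikelihoodRatio test values
// from the unnormalized-entropy definition. Names are illustrative; this
// is not the Mahout implementation.
public class LlrSketch {
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // unnormalized entropy of a count vector: xLogX(sum) - sum of xLogX(k_i)
    static double entropy(long... counts) {
        long sum = 0;
        double terms = 0.0;
        for (long k : counts) {
            terms += xLogX(k);
            sum += k;
        }
        return xLogX(sum) - terms;
    }

    // LLR = 2 * (H(rowSums) + H(colSums) - H(matrix)), clamped at 0
    // to absorb round-off error
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        System.out.println(llr(1, 0, 0, 1));   // ~2.772589 (= 4 ln 2)
        System.out.println(llr(10, 0, 0, 10)); // ~27.72589
    }
}
```

Note that llr(1, 0, 0, 1) = 4 ln 2 exactly, which matches the first assertion above.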



On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning ted.dunn...@gmail.com wrote:


 This is a horrifying possibility.  I thought we had several test cases in
 place to verify this code.

 Let me look.  I wonder if the code you have found is not referenced
 somehow.


 On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 qzche...@gmail.com wrote:

 The definition of
 org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
 long k12, long k21, long k22):

 public static double logLikelihoodRatio(long k11, long k12, long k21,
 long k22) {
   Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
   // note that we have counts here, not probabilities, and that the
   // entropy is not normalized.
   double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
   double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
   double matrixEntropy = entropy(k11, k12, k21, k22);
   if (rowEntropy + columnEntropy > matrixEntropy) {
     // round off error
     return 0.0;
   }
   return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
 }

 The rowEntropy and columnEntropy computed here might be wrong; I
 think it should be:

   double rowEntropy = entropy(k11 + k12, k21 + k22);
   double columnEntropy = entropy(k11 + k21, k12 + k22);

 which is the same as LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k))),
 referred from
 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .

 LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in
 this example) and I is the mutual information.

 [image: inline image 1] where x is the eventA value (1 or 2) and y is the
 eventB value (1 or 2). p(x,y) = kxy/N, p(x) = p(x,1) + p(x,2), e.g. p(1,1) =
 k11/N.


 [image: inline image 2] here we get mutual_information = H(k) - H(rowSums(k)) -
 H(colSums(k))

 The Mahout version of unnormalized entropy(k11,k12,k21,k22) = N * H(k), so
 we get:

   entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) -
   entropy(k11+k21, k12+k22) = N * (H(k) - H(rowSums(k)) - H(colSums(k)))

 which, multiplied by 2.0, is just the LLR.

 Is the org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong
 or have I misunderstood something?





[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672909#comment-13672909
 ] 

Robin Anil commented on MAHOUT-1225:


Could you elaborate on the buggy scenario? I don't see an option to reopen it 
myself.



[jira] [Commented] (MAHOUT-950) Change BtJob to use new MultipleOutputs API

2013-06-03 Thread Tom White (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672918#comment-13672918
 ] 

Tom White commented on MAHOUT-950:
--

MAPREDUCE-3607 is in Hadoop 1.0.1, so this patch should allow Mahout to work 
against that version of Hadoop or a later one.

 Change BtJob to use new MultipleOutputs API
 ---

 Key: MAHOUT-950
 URL: https://issues.apache.org/jira/browse/MAHOUT-950
 Project: Mahout
  Issue Type: Improvement
  Components: Math
Reporter: Tom White
 Fix For: 1.0

 Attachments: MAHOUT-950.patch


 BtJob uses a mixture of the old and new MapReduce API to allow it to use 
 MultipleOutputs (which isn't available in Hadoop 0.20/1.0). This fails when 
 run against 0.23 (see MAHOUT-822), so we should change BtJob to use the new 
 MultipleOutputs API. (Hopefully the new MultipleOutputs API will be made 
 available in a 1.x release - see MAPREDUCE-3607.)



[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass

2013-06-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Angel López Otero updated MAHOUT-663:
--

Attachment: MAHOUT-663.patch

 Rationalize hadoop job creation with respect to setJarByClass
 -

 Key: MAHOUT-663
 URL: https://issues.apache.org/jira/browse/MAHOUT-663
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.4, 0.5
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.6

 Attachments: MAHOUT-663.patch


 Mahout includes a series of driver classes that create hadoop jobs via static 
 methods.
 Each one of these calls job.setJarByClass(itself.class).
 Unfortunately, this subverts the hadoop support for putting additional jars 
 in the lib directory of a job jar, since the class passed in is not a class 
 that lives in the ordinary section of the job jar.
 The effect of this is to force users of Mahout (and Mahout's own example job 
 jar) to unpack the mahout-core jar into the main section, instead of just 
 treating it as a 'lib' dependency.
 It seems to me that all the static job creators should be refactored into a 
 public method that returns the Job object (and does NOT call 
 waitForCompletion), with the existing wrapper kept on top. Users could call 
 the new methods and make their own call to setJarByClass.



[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass

2013-06-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Angel López Otero updated MAHOUT-663:
--

Attachment: (was: MAHOUT-663.patch)



Re: Suggested 0.8 Code Freeze Date

2013-06-03 Thread Sebastian Schelter
+1 on that.

On 03.06.2013 00:26, Grant Ingersoll wrote:
 I'd like to suggest a code freeze of June 10th 2013 for finishing 0.8 bugs.
 
 If they aren't in by then, they will get pushed, unless they are blockers.
 
 After that, I will create the release candidates.
 
 -Grant
 



Re: Suggested 0.8 Code Freeze Date

2013-06-03 Thread Dan Filimon
+1


On Jun 3, 2013, at 0:26, Grant Ingersoll gsing...@apache.org wrote:

 I'd like to suggest a code freeze of June 10th 2013 for finishing 0.8 bugs.
 
 If they aren't in by then, they will get pushed, unless they are blockers.
 
 After that, I will create the release candidates.
 
 -Grant


[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672981#comment-13672981
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

OK, I read up on partitioners and I'd agree, Matt: this is effectively Hadoop's 
way of doing what I proposed and doesn't pollute the M/R code, so I'm going to 
go forward w/ your patch.

 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Grant Ingersoll
  Labels: clusterpp
 Fix For: 0.8

 Attachments: MAHOUT-1103.patch


 After running kmeans clustering on a set of ~3M points, clusterpp fails to 
 populate directories for some clusters, no matter what k is.
 I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
 Even with k=2 only one cluster directory was created. For each reducer that 
 fails to produce directories there is an empty part-r-* file in the output 
 directory.
 Here is my command sequence for the k=2 run:
 {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
 2clusters/pca-clusters -dm 
 org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
 -cl
 bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
 2clusters.txt
 bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
 The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
 containing 2585843 and 1156624 points respectively.
 Discussion on the user mailing list suggested that this might be caused by 
 the default hadoop hash partitioner. The hashes of these two clusters aren't 
 identical, but they are close. Putting both cluster names into a Text and 
 calling hashCode() gives:
 VL-3742464 -> -685560454
 VL-3742466 -> -685560452
 Finally, when running with -xm sequential, everything performs as expected.



Re: Performance of primitive collections

2013-06-03 Thread Robin Anil
Dawid, do you have your existing benchmark code against
fastutil/hppc/trove? Given the performance improvements we made in the last
couple of months, I am itching to revisit the numbers.

Robin


[jira] [Created] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Maysam Yabandeh (JIRA)
Maysam Yabandeh created MAHOUT-1238:
---

 Summary: VectorWritable's bug with VectorView of sparse vectors
 Key: MAHOUT-1238
 URL: https://issues.apache.org/jira/browse/MAHOUT-1238
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7, 0.8
Reporter: Maysam Yabandeh
 Fix For: 0.8, 0.7


VectorWritable raises an exception if it is used on a VectorView of a sparse 
vector. The reason is that the sparse vector writes only the non-zero elements, 
while VectorView's implementation of getNumNondefaultElements() returns the 
size of the entire data. Later, when reading the vector, VectorWritable expects 
to read more items than were written.
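The failure mode is easy to reproduce in isolation. The sketch below is hedged: it uses an illustrative (index, value) format, not Mahout's actual VectorWritable layout, but it shows the same mismatch between a header count taken from the view's full dimension and a body containing only the nonzero entries.

```java
// Hedged sketch of the size mismatch (illustrative wire format, not the
// real VectorWritable layout): the writer emits only nonzero entries,
// but the header count comes from the view's full dimension.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class SizeMismatch {
    // serialize: header = claimed entry count, body = (index, value) pairs
    // for the nonzero elements only
    static byte[] write(double[] view, int claimedCount) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(claimedCount);
        for (int i = 0; i < view.length; i++) {
            if (view[i] != 0.0) {        // sparse writer: nonzeros only
                out.writeInt(i);
                out.writeDouble(view[i]);
            }
        }
        out.close();
        return bos.toByteArray();
    }

    // deserialize: trusts the header count; returns false when the stream
    // runs dry first (the reported exception in VectorWritable)
    static boolean read(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int n = in.readInt();
        try {
            for (int i = 0; i < n; i++) {
                in.readInt();
                in.readDouble();
            }
            return true;
        } catch (EOFException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        double[] view = {0.0, 3.0, 0.0, 0.0, 5.0};
        // bug: claimed count = full size, like getNumNondefaultElements()
        // on a VectorView, while only 2 nonzeros are actually written
        System.out.println(read(write(view, view.length))); // false
        System.out.println(read(write(view, 2)));           // true
    }
}
```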



[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1238:


Attachment: MAHOUT-1238.patch

I am attaching the patch that fixes the bug.



[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1238:


Status: Patch Available  (was: Open)

Submitting the patch to get Hudson comments.



[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1238:
---

Assignee: Robin Anil



Re: Performance of primitive collections

2013-06-03 Thread Robin Anil
There have been some improvements in the hashmaps, so I would like to re-run
these tests.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Jun 3, 2013 at 5:46 AM, Sebastiano Vigna vi...@di.unimi.it wrote:

 On 3 Jun 2013, at 12:43 PM, Robin Anil robin.a...@gmail.com wrote:

  Dawid, do you have your existing benchmark code against
 fastutil/hppc/trove. Since the performance improvements we made in last
 couple of months I am itching to revisit the numbers.

 If you're interested, they did a thorough job here:

 http://blog.aggregateknowledge.com/2011/12/12/big-memory-part-4/

 Ciao,

 seba




Re: Performance of primitive collections

2013-06-03 Thread Dawid Weiss
 Dawid, do you have your existing benchmark code against
 fastutil/hppc/trove. Since the performance improvements we made in last
 couple of months I am itching to revisit the numbers.

I can rerun those that I have -- will let you know! Should I use the
master branch or the official latest release?

Dawid


[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)

2013-06-03 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673031#comment-13673031
 ] 

Dawid Weiss commented on MAHOUT-1225:
-

Take a look at this test:
{code}
@Test
public void testClearTable() throws Exception {
    OpenObjectIntHashMap<Integer> m = new OpenObjectIntHashMap<Integer>();
    m.clear(); // rehash from the default capacity to the next prime after
               // 1 (3).
    m.put(1, 2);
    m.clear(); // Should clear internal references.

    Field tableField = m.getClass().getDeclaredField("table");
    tableField.setAccessible(true);
    Object[] table = (Object[]) tableField.get(m);

    assertEquals(
        new HashSet<Object>(Arrays.asList(new Object[] { null })),
        new HashSet<Object>(Arrays.asList(table)));
}
{code}

This fails because clear() does not explicitly erase the table of references. 
It does call rehash, but not always (not when there's no need), in which case 
the references stay hard-linked. The fix is:

{code}
   public void clear() {
     Arrays.fill(this.state, FREE);
+    Arrays.fill(this.table, null);
+
     distinct = 0;
     freeEntries = table.length; // delta
     trimToSize();
   }
{code}

You could avoid this by returning a boolean from trimToSize() and checking 
whether internal buffers have been reallocated (and thus references freed).
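The leak pattern is easy to reproduce outside Mahout with a stripped-down open-addressing table. This is a hedged sketch: the field names mirror the Mahout ones, but it is not the actual class.

```java
// Hedged sketch: minimal open-addressing table showing why clear() must
// null the reference array, not just reset the state flags. Field names
// mimic OpenObjectIntHashMap, but this is not the Mahout implementation.
import java.util.Arrays;

public class ClearSketch {
    static final byte FREE = 0;
    static final byte FULL = 1;

    Object[] table = new Object[3]; // key slots
    byte[] state = new byte[3];     // slot markers

    void put(Object key) {
        int slot = (key.hashCode() & Integer.MAX_VALUE) % table.length;
        while (state[slot] == FULL && !table[slot].equals(key)) {
            slot = (slot + 1) % table.length; // linear probing
        }
        table[slot] = key;
        state[slot] = FULL;
    }

    // buggy variant: key objects stay strongly reachable after clear()
    void clearStateOnly() {
        Arrays.fill(state, FREE);
    }

    // fixed variant: also release the references, as in the patch above
    void clearBoth() {
        Arrays.fill(state, FREE);
        Arrays.fill(table, null);
    }

    // true if every slot of the reference array is null
    boolean tableIsEmpty() {
        for (Object o : table) {
            if (o != null) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ClearSketch m = new ClearSketch();
        m.put("key");
        m.clearStateOnly();
        System.out.println(m.tableIsEmpty()); // false: "key" still referenced
        m.clearBoth();
        System.out.println(m.tableIsEmpty()); // true
    }
}
```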



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673071#comment-13673071
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

Matt, out of curiosity, what's your use case for clusterpp?  [~robinanil] 
and I are both looking at this code and wondering why it is useful to separate 
out the clusters into their own directories.  MAHOUT-843 doesn't shed any light 
on it for us either.

Also, I don't think the current patch partitions correctly.  For instance, try 
a numPartitions of 2 and cluster ids of 153 and 53.  Then, with 10^1, you get 
153 % 10 = 3 and 53 % 10 = 3, and you have a collision.  So I think I'm back 
to my original thought, which is that in the mappers and reducers we need to 
load up the cluster ids and just map them there.
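The near-collision Matt reported can also be reproduced without Hadoop. Text hashes the UTF-8 bytes with a 31-based rolling hash seeded at 1 (via WritableComparator.hashBytes), and the default HashPartitioner takes (hashCode & Integer.MAX_VALUE) % numReduceTasks. A hedged standalone sketch (a reimplementation for illustration, not the Hadoop classes themselves):

```java
// Hedged sketch: reimplements the 31-based byte hash Hadoop's Text uses
// (WritableComparator.hashBytes) and the default HashPartitioner formula,
// showing both reported cluster names landing in the same partition.
import java.nio.charset.StandardCharsets;

public class PartitionCollision {
    // rolling hash over the raw bytes: seed 1, multiply by 31 per byte
    static int textHash(String s) {
        int hash = 1;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // default HashPartitioner: (hashCode & Integer.MAX_VALUE) % numReduceTasks
    static int partition(String key, int numReduceTasks) {
        return (textHash(key) & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(textHash("VL-3742464")); // -685560454, as reported
        System.out.println(textHash("VL-3742466")); // -685560452
        // with 2 reducers, both keys hit the same partition, so one reducer
        // gets both clusters and the other emits an empty part-r-* file
        System.out.println(partition("VL-3742464", 2) == partition("VL-3742466", 2));
    }
}
```

The two hashes differ only by 2 (the cluster names differ only in the last digit), and both are even after masking, so they collide modulo 2.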



Re: Performance of primitive collections

2013-06-03 Thread Ted Dunning
Master/trunk is the place to test.


On Mon, Jun 3, 2013 at 7:32 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

  Dawid, do you have your existing benchmark code against
  fastutil/hppc/trove. Since the performance improvements we made in last
  couple of months I am itching to revisit the numbers.

 I can rerun those that I have -- will let you know! Should I use the
 master branch or the official latest release?

 Dawid



[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)

2013-06-03 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673096#comment-13673096
 ] 

Dawid Weiss commented on MAHOUT-1225:
-

Nope, only that. It's fun to see how everything else goes bust when you run 
those tests on that dead collections branch though.
I'll run those microbenchmarks when I get a spare minute.



[jira] [Commented] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673108#comment-13673108
 ] 

Robin Anil commented on MAHOUT-1238:


There is a getNumNonZeroElements() method in AbstractVector; try using that.



[jira] [Commented] (MAHOUT-976) Implement Multilayer Perceptron

2013-06-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673111#comment-13673111
 ] 

Robin Anil commented on MAHOUT-976:
---

I see a few System.out.println() calls; please remove those. Also, use the 
Mahout Eclipse code formatter to format the files. [~chrisberlin], will you be 
able to work on these quickly? I am pushing it off the 0.8 list. If you can 
work on it, please update it and we will review it.

 Implement Multilayer Perceptron
 ---

 Key: MAHOUT-976
 URL: https://issues.apache.org/jira/browse/MAHOUT-976
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.7
Reporter: Christian Herta
Assignee: Ted Dunning
Priority: Minor
  Labels: multilayer, networks, neural, perceptron
 Fix For: 0.8

 Attachments: MAHOUT-976.patch, MAHOUT-976.patch, MAHOUT-976.patch, 
 MAHOUT-976.patch

   Original Estimate: 80h
  Remaining Estimate: 80h

 Implement a multilayer perceptron
  * via matrix multiplication
  * learning by backpropagation; implementing tricks by Yann LeCun et al.: 
 Efficient Backprop
  * arbitrary number of hidden layers (also 0 - just the linear model)
  * connections between proximate layers only 
  * different cost and activation functions (different activation function in 
 each layer) 
  * test of backprop by gradient checking 
  * normalization of the inputs (storeable) as part of the model
  
 First:
  * implementation of stochastic gradient descent, like gradient machine
  * simple gradient descent incl. momentum
 Later (new JIRA issues):  
  * Distributed batch learning (see below)  
  * Stacked (Denoising) Autoencoder - feature learning
  * advanced cost minimization like 2nd-order methods, conjugate gradient, etc.
 Distribution of learning can be done by (batch learning):
  1. Partitioning of the data into x chunks 
  2. Learning the weight changes as matrices in each chunk
  3. Combining the matrices and updating the weights - back to 2
 Maybe this procedure can be done with random parts of the chunks (distributed 
 quasi-online learning). 
 Batch learning with delta-bar-delta heuristics for adapting the learning 
 rates.
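The "test of backprop by gradient checking" item above is the standard trick of comparing the analytic gradient against a central finite difference of the cost. A minimal sketch on a made-up one-weight model (not the patch's actual code):

```java
// Gradient checking sketch: verify an analytic backprop gradient against a
// central finite difference. Toy model: y = w*x, cost C = 0.5*(y - t)^2,
// so the analytic gradient is dC/dw = (w*x - t)*x.
class GradientCheckSketch {

    static double cost(double w, double x, double t) {
        double y = w * x;
        return 0.5 * (y - t) * (y - t);
    }

    static double analyticGradient(double w, double x, double t) {
        return (w * x - t) * x;  // what backprop computes for this model
    }

    static double numericGradient(double w, double x, double t) {
        double eps = 1e-6;
        return (cost(w + eps, x, t) - cost(w - eps, x, t)) / (2 * eps);
    }

    public static void main(String[] args) {
        double w = 0.7, x = 2.0, t = 1.0;
        double a = analyticGradient(w, x, t);
        double n = numericGradient(w, x, t);
        // The two should agree to roughly 1e-6; a mismatch signals a backprop bug.
        System.out.println("analytic=" + a + " numeric=" + n);
    }
}
```

The same comparison, applied weight by weight, scales to a full multilayer network.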
  



[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated MAHOUT-1238:


Attachment: MAHOUT-1238.patch

The attached fixes the bug using AbstractVector#getNumNonZeroElements suggested 
by [~robinanil]

 VectorWritable's bug with VectorView of sparse vectors
 --

 Key: MAHOUT-1238
 URL: https://issues.apache.org/jira/browse/MAHOUT-1238
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7, 0.8
Reporter: Maysam Yabandeh
Assignee: Robin Anil
  Labels: reduce, test
 Fix For: 0.7, 0.8

 Attachments: MAHOUT-1238.patch, MAHOUT-1238.patch


 VectorWritable raises an exception if it is used on a VectorView of a sparse 
 vector. The reason is that the sparse vector writes only the non-zero 
 elements, while VectorView's implementation of getNumNondefaultElements() 
 returns the size of the entire data. Later, when reading the vector, 
 VectorWritable expects to read more items than were written.



[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1238:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Tested and Submitted 

 VectorWritable's bug with VectorView of sparse vectors
 --

 Key: MAHOUT-1238
 URL: https://issues.apache.org/jira/browse/MAHOUT-1238
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7, 0.8
Reporter: Maysam Yabandeh
Assignee: Robin Anil
  Labels: reduce, test
 Fix For: 0.8, 0.7

 Attachments: MAHOUT-1238.patch, MAHOUT-1238.patch


 VectorWritable raises an exception if it is used on a VectorView of a sparse 
 vector. The reason is that the sparse vector writes only the non-zero 
 elements, while VectorView's implementation of getNumNondefaultElements() 
 returns the size of the entire data. Later, when reading the vector, 
 VectorWritable expects to read more items than were written.



[jira] [Updated] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.

2013-06-03 Thread Ikumasa Mukai (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ikumasa Mukai updated MAHOUT-961:
-

Attachment: MAHOUT-961.patch

Thank you for checking!!!
I have recreated the patch for the latest codebase.

The patch is a little big because I applied Eclipse-Lucene-Formatter.xml to the 
sources, as described in How To Contribute.

 Modify the Tree/Forest Visualizer on DF.
 

 Key: MAHOUT-961
 URL: https://issues.apache.org/jira/browse/MAHOUT-961
 Project: Mahout
  Issue Type: Bug
Reporter: Ikumasa Mukai
Assignee: Sebastian Schelter
  Labels: RandomForest
 Fix For: 0.8

 Attachments: MAHOUT-961.patch, MAHOUT-961.patch, MAHOUT-961.patch, 
 MAHOUT-961.patch


 The Tree/Forest visualizer (MAHOUT-926) has problems:
 1) an un-complemented stem which has no leaf or node is shown.
 2) not all stems are shown when the data doesn't have all categories.



Build failed in Jenkins: Mahout-Examples-Cluster-Reuters #328

2013-06-03 Thread Apache Jenkins Server
See https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/328/changes

Changes:

[robinanil] MAHOUT-1238 VectorWritable's bug with VectorView of sparse vectors 
(Maysam Yabandeh)

--
[...truncated 836 lines...]
[WARNING] 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayClustering.java:
 Some input files use or override a deprecated API.
[WARNING] 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayClustering.java:
 Recompile with -Xlint:deprecation for details.
[WARNING] 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterDataModel.java:
 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterDataModel.java
 uses unchecked or unsafe operations.
[WARNING] 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterDataModel.java:
 Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
mahout-examples ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 4 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
mahout-examples ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 5 source files to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/test-classes
[WARNING] 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/test/java/org/apache/mahout/classifier/sgd/TrainLogisticTest.java:
 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/test/java/org/apache/mahout/classifier/sgd/TrainLogisticTest.java
 uses or overrides a deprecated API.
[WARNING] 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/test/java/org/apache/mahout/classifier/sgd/TrainLogisticTest.java:
 Recompile with -Xlint:deprecation for details.
[INFO] 
[INFO] --- maven-surefire-plugin:2.14.1:test (default-test) @ mahout-examples 
---
[INFO] Tests are skipped.
[INFO] 
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-examples ---
[INFO] Building jar: 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/mahout-examples-0.8-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-dependency-plugin:2.7:copy-dependencies (copy-dependencies) @ 
mahout-examples ---
[INFO] Copying servlet-api-2.5-20081211.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/servlet-api-2.5-20081211.jar
[INFO] Copying netty-3.5.9.Final.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/netty-3.5.9.Final.jar
[INFO] Copying lucene-facet-4.2.1.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/lucene-facet-4.2.1.jar
[INFO] Copying jackson-core-asl-1.9.12.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/jackson-core-asl-1.9.12.jar
[INFO] Copying stax-api-1.0.1.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/stax-api-1.0.1.jar
[INFO] Copying commons-net-1.4.1.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/commons-net-1.4.1.jar
[INFO] Copying json-simple-1.1.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/json-simple-1.1.jar
[INFO] Copying jakarta-regexp-1.4.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/jakarta-regexp-1.4.jar
[INFO] Copying commons-dbcp-1.4.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/commons-dbcp-1.4.jar
[INFO] Copying mongo-java-driver-2.11.1.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/mongo-java-driver-2.11.1.jar
[INFO] Copying mahout-core-0.8-SNAPSHOT-tests.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/mahout-core-0.8-SNAPSHOT-tests.jar
[INFO] Copying libthrift-0.7.0.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/libthrift-0.7.0.jar
[INFO] Copying commons-beanutils-1.7.0.jar to 
https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/commons-beanutils-1.7.0.jar
[INFO] Copying jbcrypt-0.3m.jar to 

[jira] [Commented] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors

2013-06-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673231#comment-13673231
 ] 

Hudson commented on MAHOUT-1238:


Integrated in Mahout-Quality #2034 (See 
[https://builds.apache.org/job/Mahout-Quality/2034/])
MAHOUT-1238 VectorWritable's bug with VectorView of sparse vectors (Maysam 
Yabandeh) (Revision 1489001)

 Result = SUCCESS
robinanil : 
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/VectorWritable.java
* 
/mahout/trunk/core/src/test/java/org/apache/mahout/math/VectorWritableTest.java


 VectorWritable's bug with VectorView of sparse vectors
 --

 Key: MAHOUT-1238
 URL: https://issues.apache.org/jira/browse/MAHOUT-1238
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.7, 0.8
Reporter: Maysam Yabandeh
Assignee: Robin Anil
  Labels: reduce, test
 Fix For: 0.7, 0.8

 Attachments: MAHOUT-1238.patch, MAHOUT-1238.patch


 VectorWritable raises an exception if it is used on a VectorView of a sparse 
 vector. The reason is that the sparse vector writes only the non-zero 
 elements, while VectorView's implementation of getNumNondefaultElements() 
 returns the size of the entire data. Later, when reading the vector, 
 VectorWritable expects to read more items than were written.



[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass

2013-06-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Angel López Otero updated MAHOUT-663:
--

Attachment: MAHOUT-663.patch

 Rationalize hadoop job creation with respect to setJarByClass
 -

 Key: MAHOUT-663
 URL: https://issues.apache.org/jira/browse/MAHOUT-663
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.4, 0.5
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.6

 Attachments: MAHOUT-663.patch, MAHOUT-663.patch


 Mahout includes a series of driver classes that create hadoop jobs via static 
 methods.
 Each one of these calls job.setJarByClass(itself.class).
 Unfortunately, this subverts the hadoop support for putting additional jars 
 in the lib directory of a job jar, since the class passed in is not a class 
 that lives in the ordinary section of the job jar.
 The effect of this is to force users of Mahout (and Mahout's own example job 
 jar) to unpack the mahout-core jar into the main section, instead of just 
 treating it as a 'lib' dependency.
 It seems to me that all the static job creators should be refactored into a 
 public function that returns a job object (and does NOT call 
 waitForCompletion), and then the existing wrapper. Users could call the new 
 functions, and make their own call to setJarByClass.



[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass

2013-06-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Angel López Otero updated MAHOUT-663:
--

Attachment: (was: MAHOUT-663.patch)

 Rationalize hadoop job creation with respect to setJarByClass
 -

 Key: MAHOUT-663
 URL: https://issues.apache.org/jira/browse/MAHOUT-663
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.4, 0.5
Reporter: Benson Margulies
Assignee: Sean Owen
 Fix For: 0.6

 Attachments: MAHOUT-663.patch


 Mahout includes a series of driver classes that create hadoop jobs via static 
 methods.
 Each one of these calls job.setJarByClass(itself.class).
 Unfortunately, this subverts the hadoop support for putting additional jars 
 in the lib directory of a job jar, since the class passed in is not a class 
 that lives in the ordinary section of the job jar.
 The effect of this is to force users of Mahout (and Mahout's own example job 
 jar) to unpack the mahout-core jar into the main section, instead of just 
 treating it as a 'lib' dependency.
 It seems to me that all the static job creators should be refactored into a 
 public function that returns a job object (and does NOT call 
 waitForCompletion), and then the existing wrapper. Users could call the new 
 functions, and make their own call to setJarByClass.



Re: mutual information issue in logLikelihoodRatio

2013-06-03 Thread Ted Dunning
Thanks Sean!



Re: mutual information issue in logLikelihoodRatio

2013-06-03 Thread Sean Owen
I glanced at this and I am confused too.

Ted I double-checked your blog post and it seems fine -- you popped
the minus sign out of the entropy expression and reversed the args in
the mutual info term, which will be relevant in a second. This is
computing the value of the G test, right, and you are computing
regular entropy and multiplying by the sum later. For the matrix [1 0
; 0 1] I get an unnormalized LLR of 2.772, yes.

In the Java code, the expression for unnormalized entropy looks
correct. This is how it gets the N term in there explicitly. It
hasn't omitted the minus sign in entropy. But then the final
expression should have a minus sign in front of H(k) (matrix entropy)
right? and it looks like it does the opposite.

The proposed change in this thread doesn't quite work as it results in
0. But I somehow suspect it is prevented from working directly by the
previous point. Indeed, if you negate all the entropy calculation (or,
flip around the mutual information expression) the tests pass. (Except
when it comes to how root LLR is handled for negative LLR, but that's
a detail)

I suppose it would be best to make the code reflect Ted's nice clear
post. It is actually a little faster too.

I am still not clear on why the current expression works, though it
evidently does. I don't know its history or whether it's just an alternate
formulation.

Since I'm already here let me see if I can sort out a patch that also
addresses negative LLR correctly.

On Mon, Jun 3, 2013 at 2:58 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 So looking at the tests, this makes things look less horrifying.

 org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood

 assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1),
 0.01);
 assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10),
 0.1);
 assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0,
 10), 0.1);
 assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995,
 1000, 10), 0.001);
 assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000,
 1000, 10), 0.001);
 assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000,
 1000, 99000), 0.001);

 Next step is to determine whether these values are correct.  I recognize
 the first two.

 I put these values into my R script and got a successful load.  I think
 that this means that the code is somehow correct, regardless of your
 reading of it.  I don't have time right now to read the code in detail, but
 I think that things are working.

 You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R



 On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning ted.dunn...@gmail.com wrote:


 This is a horrifying possibility.  I thought we had several test cases in
 place to verify this code.

 Let me look.  I wonder if the code you have found is not referenced
 somehow.


 On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 qzche...@gmail.com wrote:

 The definition of
 org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
 long k12, long k21, long k22):

 public static double logLikelihoodRatio(long k11, long k12, long k21,
 long k22) {
   Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
   // note that we have counts here, not probabilities, and that the
   // entropy is not normalized.
   double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
   double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
   double matrixEntropy = entropy(k11, k12, k21, k22);
   if (rowEntropy + columnEntropy < matrixEntropy) {
     // round off error
     return 0.0;
   }
   return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
 }

 The rowEntropy and columnEntropy computed here might be wrong; I
 think it should be:

   double rowEntropy = entropy(k11+k12, k21+k22);
   double columnEntropy = entropy(k11+k21, k12+k22);

 which is the same as LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) -
 H(colSums(k))), referred from
 http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .

 LLR = G² = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in
 this example) and I is the mutual information.

 [image: inline image 1] where x is eventA (value can be 1 or 2) and y is
 eventB (value can be 1 or 2). p(x,y) = kxy/N, p(x) = p(x,1) + p(x,2), e.g.
 p(1,1) = k11/N.

 [image: inline image 2] here we get mutual_information = H(k) - H(rowSums(k)) -
 H(colSums(k))

 The Mahout version of unnormalized entropy(k11,k12,k21,k22) = N * H(k), so
 we get:

   entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) -
   entropy(k11+k21, k12+k22) = N * (H(k) - H(rowSums(k)) - H(colSums(k)))

 which, multiplied by 2.0, is just the LLR.

 Is the org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong,
 or have I misunderstood something?
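For what it's worth, the identity under discussion can be checked numerically against the unit-test values Ted quoted. A stand-alone sketch (not Mahout's implementation) of the blog-post form, LLR = 2 * N * (H(k) - H(rowSums(k)) - H(colSums(k))), expanded into x*log(x) terms over raw counts:

```java
// Stand-alone sketch of the log-likelihood ratio (G-test) for a 2x2
// contingency table, in the closed form
//   LLR = 2 * (xlogx(N) + sum xlogx(k_ij) - sum xlogx(rowSums) - sum xlogx(colSums)),
// which is algebraically 2 * N * (H(k) - H(rowSums(k)) - H(colSums(k))).
class LlrSketch {

    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        long n = k11 + k12 + k21 + k22;
        return 2.0 * (xLogX(n)
                + xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22)
                - xLogX(k11 + k12) - xLogX(k21 + k22)   // row sums
                - xLogX(k11 + k21) - xLogX(k12 + k22)); // column sums
    }

    public static void main(String[] args) {
        // Reproduces the first two values from the LogLikelihoodTest quoted above.
        System.out.println(logLikelihoodRatio(1, 0, 0, 1));    // ~2.772589
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));  // ~27.72589
    }
}
```

This form reproduces the quoted test values (including 5714.932 for (1000, 1000, 1000, 99000)), consistent with Ted's R-script check that the committed code is numerically correct however its entropy terms are grouped.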





[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-03 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673255#comment-13673255
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Grant,

As I understand it, the only blocker for this issue is a small, self-contained 
example which users can run in a reasonable amount of time and see the 
results. The parts-of-speech tagger example which I originally adapted for this 
trainer can take hours to converge, and sometimes it fails with arithmetic 
underflow due to an unusually large set of states for the observations 
(the observed states are the words of the corpus in the POS tagger's model). 

When is 0.8 due? I can chip away at this issue in the evenings over the next 
few days and hunt for a short example from the book mentioned above. It should 
take at least a week or two to sign off from my side. 

There are also unit tests with the trainer which demonstrate that it works: the 
results of Map-Reduce-based training are identical to the ones obtained from 
the sequential version.

 Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
 -

 Key: MAHOUT-627
 URL: https://issues.apache.org/jira/browse/MAHOUT-627
 Project: Mahout
  Issue Type: Task
  Components: Classification
Affects Versions: 0.4, 0.5
Reporter: Dhruv Kumar
Assignee: Grant Ingersoll
  Labels: gsoc, gsoc2011, mahout-gsoc-11
 Fix For: 0.8

 Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch


 Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
 Model Training. 
 Student Name: Dhruv Kumar 
 Student E-mail: dku...@ecs.umass.edu 
 Organization/Project: Apache Mahout 
 Assigned Mentor: 
 Proposal Abstract: 
 The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
 because of its superior numerical stability and its ability to guarantee the 
 discovery of a locally maximal Maximum Likelihood Estimator in the 
 presence of incomplete training data. Currently, Apache Mahout has a 
 sequential implementation of the Baum-Welch which cannot be scaled to train 
 over large data sets. This restriction reduces the quality of training and 
 constrains generalization of the learned model when used for prediction. This 
 project proposes to extend Mahout's Baum-Welch to a parallel, distributed 
 version using the Map-Reduce programming framework for enhanced model fitting 
 over large data sets. 
 Detailed Description: 
 Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
 for applications generating temporal or spatial sequential data. Relative 
 simplicity of implementation, combined with their ability to discover latent 
 domain knowledge have made them very popular in diverse fields such as DNA 
 sequence alignment, gene discovery, handwriting analysis, voice recognition, 
 computer vision, language translation and parts-of-speech tagging. 
 An HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
 unobservable, hidden states emitting symbols from a finite observable 
 vocabulary set O according to a probabilistic model Theta. The parameters of 
 the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
 transition matrix of the hidden states of size |S| X |S|. The elements 
 a_(i,j) of A specify the probability of transitioning from a state i to state 
 j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
 elements b_(s, o) provide the probability that a symbol o will be emitted 
 from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
 determine the probability that the system starts in the hidden state s. The 
 transitions of hidden states are unobservable and follow the Markov property 
 of memorylessness. 
 Rabiner [1] defined three main problems for HMMs: 
 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
 observation sequence, determine the probability that the model generated the 
 observed sequence. This is useful for evaluating the quality of the model and 
 is solved using the so called Forward algorithm. 
 2. Decoding: Given the complete model (S, O, Theta) and an observation 
 sequence, determine the hidden state sequence which generated the observed 
 sequence. This can be viewed as an inference problem where the model and 
 observed sequence are used to predict the value of the unobservable random 
 variables. The backward algorithm, also known as the Viterbi decoding 
 algorithm is used for predicting the hidden state 
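The Forward algorithm from problem 1 above can be sketched in a few lines. This is an illustrative toy with a made-up two-state model, not the proposal's code; A, B, and Pi follow the tuple notation in the text:

```java
// Forward algorithm sketch: alpha[s] holds the joint probability of the
// observations seen so far with the system in hidden state s. The final
// answer is the sum over states, i.e. P(observations | model).
class ForwardSketch {

    static double observationProbability(double[][] a, double[][] b,
                                         double[] pi, int[] obs) {
        int numStates = pi.length;
        double[] alpha = new double[numStates];

        // initialization: alpha_0(s) = pi(s) * b(s, o_0)
        for (int s = 0; s < numStates; s++) {
            alpha[s] = pi[s] * b[s][obs[0]];
        }

        // induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a(i,j)) * b(j, o_t)
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[numStates];
            for (int j = 0; j < numStates; j++) {
                double sum = 0.0;
                for (int i = 0; i < numStates; i++) {
                    sum += alpha[i] * a[i][j];
                }
                next[j] = sum * b[j][obs[t]];
            }
            alpha = next;
        }

        // termination: P(observations | model) = sum_s alpha_T(s)
        double total = 0.0;
        for (double v : alpha) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};   // hidden-state transitions
        double[][] b = {{0.9, 0.1}, {0.2, 0.8}};   // symbol emissions
        double[] pi = {0.5, 0.5};                  // initial distribution
        System.out.println(observationProbability(a, b, pi, new int[]{0, 1, 0}));
    }
}
```

A production version works in log space (or rescales alpha at each step) to avoid exactly the arithmetic underflow mentioned in the comment above.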

Re: [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-03 Thread Ted Dunning
On Mon, Jun 3, 2013 at 12:11 PM, Dhruv Kumar (JIRA) j...@apache.org wrote:

 When is 0.8 due? I can chip away on this issue for the next few days in
 the evenings and hunt for a short example from the book mentioned above.
 Should require a week or two at least to sign off from my side.

 There are also unit tests with the trainer which demonstrate that it
 works--the results of Map Reduce based training are identical to the ones
 obtained in the sequential version.


Code freeze is the 10th.  If you run, you might make it.


Re: Suggested 0.8 Code Freeze Date

2013-06-03 Thread Jake Mannix
+1

Although does anyone else want to take a crack at the release, so that more
of us get some experience with that?


On Mon, Jun 3, 2013 at 2:14 AM, Dan Filimon dangeorge.fili...@gmail.comwrote:

 +1


 On Jun 3, 2013, at 0:26, Grant Ingersoll gsing...@apache.org wrote:

  I'd like to suggest a code freeze of June 10th 2013 for finishing 0.8
 bugs.
 
  If they aren't in by then, they will get pushed, unless they are
 blockers.
 
  After that, I will create the release candidates.
 
  -Grant




-- 

  -jake


[jira] [Commented] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)

2013-06-03 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673392#comment-13673392
 ] 

Suneel Marthi commented on MAHOUT-1052:
---

Cleaned up the patch to be compatible with present codebase. Uploading new 
patch.

 Add an option to MinHashDriver that specifies the dimension of vector to hash 
 (indexes or values)
 -

 Key: MAHOUT-1052
 URL: https://issues.apache.org/jira/browse/MAHOUT-1052
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.6
Reporter: Elena Smirnova
Assignee: Suneel Marthi
Priority: Minor
  Labels: minhash
 Fix For: Backlog

 Attachments: MAHOUT-1052.patch


 Add a parameter to MinHash clustering that specifies the dimension of the
 vector to hash (indexes or values). The current version of MinHash clustering
 only hashes the values of vectors. Based on discussion on the dev-mahout list,
 both use-cases are possible and frequently encountered in practice.
 Preserve backward compatibility with the default dimension set to values. Add
 new unit tests.
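For illustration only, here is a rough Python sketch of the difference between hashing a sparse vector's indexes and hashing its values (the helper function, hash family, and example vector are hypothetical, not the MinHashDriver API):

```python
import random

def minhash_signature(elements, num_hashes=4, seed=0):
    """MinHash signature of a set of ints: one min over a random affine hash per slot."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]
    items = list(elements)
    return [min((a * x + b) % prime for x in items) for a, b in params]

# A sparse vector as index -> value; the proposed option chooses which set is hashed.
vector = {3: 1.0, 17: 2.0, 42: 1.0}
sig_by_index = minhash_signature(vector.keys())                     # hash the dimensions
sig_by_value = minhash_signature(int(v) for v in vector.values())   # hash the values
```

Two vectors with the same active dimensions collide under index hashing regardless of their values, which is the behaviour the new option would expose.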

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)

2013-06-03 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1052:
--

Attachment: MAHOUT-1052.patch

 Add an option to MinHashDriver that specifies the dimension of vector to hash 
 (indexes or values)
 -

 Key: MAHOUT-1052
 URL: https://issues.apache.org/jira/browse/MAHOUT-1052
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.6
Reporter: Elena Smirnova
Assignee: Suneel Marthi
Priority: Minor
  Labels: minhash
 Fix For: Backlog

 Attachments: MAHOUT-1052.patch, MAHOUT-1052.patch


 Add a parameter to MinHash clustering that specifies the dimension of the
 vector to hash (indexes or values). The current version of MinHash clustering
 only hashes the values of vectors. Based on discussion on the dev-mahout list,
 both use-cases are possible and frequently encountered in practice.
 Preserve backward compatibility with the default dimension set to values. Add
 new unit tests.



[jira] [Created] (MAHOUT-1239) Standardize form of log-likelihood computation

2013-06-03 Thread Sean Owen (JIRA)
Sean Owen created MAHOUT-1239:
-

 Summary: Standardize form of log-likelihood computation
 Key: MAHOUT-1239
 URL: https://issues.apache.org/jira/browse/MAHOUT-1239
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.7
Reporter: Sean Owen
Priority: Minor
 Fix For: 0.8
 Attachments: MAHOUT-1239.patch

qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked like 
its formula was incorrect, at least with respect to 
http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html

It appears that the calculation is correct, but in a different form that is not
immediately recognizable as correct. The proposal here is to change the code to
match the blog post and avoid confusion (and it ends up avoiding 2 method calls).

(Along the way, I think this fixes a tiny other problem in a related test. We
have a test case that detects when round-off would produce a negative LLR and
should be clamped to 0, but the test asserts that the result is > 0, not >= 0.)
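For concreteness, the entropy form of the LLR from the blog post can be sketched in Python as follows (a hedged illustration under my own function names, not the actual LogLikelihood code):

```python
import math

def h(counts):
    """Unnormalized entropy term: sum of k * log(k / N), dropping 0 * log(0) terms."""
    n = sum(counts)
    return sum(k * math.log(k / n) for k in counts if k > 0)

def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR of a 2x2 contingency table: 2 * (H(cells) - H(row sums) - H(column sums))."""
    row = h([k11 + k12, k21 + k22])
    col = h([k11 + k21, k12 + k22])
    mat = h([k11, k12, k21, k22])
    # Clamp round-off negatives to 0; mathematically the LLR is never negative.
    return max(0.0, 2.0 * (mat - row - col))
```

An independent table such as (10, 10, 10, 10) yields an LLR of approximately zero, while strongly associated counts yield large positive values.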



[jira] [Updated] (MAHOUT-1239) Standardize form of log-likelihood computation

2013-06-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-1239:
--

Assignee: Sean Owen
  Status: Patch Available  (was: Open)

 Standardize form of log-likelihood computation
 --

 Key: MAHOUT-1239
 URL: https://issues.apache.org/jira/browse/MAHOUT-1239
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.7
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1239.patch


 qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked 
 like its formula was incorrect, at least with respect to 
 http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html
 It appears that the calculation is correct, but in a different form that is
 not immediately recognizable as correct. The proposal here is to change the
 code to match the blog post and avoid confusion (and it ends up avoiding 2
 method calls).
 (Along the way, I think this fixes a tiny other problem in a related test. We
 have a test case that detects when round-off would produce a negative LLR and
 should be clamped to 0, but the test asserts that the result is > 0, not >= 0.)



Re: Really want to contribute to mahout

2013-06-03 Thread Yexi Jiang
Certainly, I always keep an eye on the issue tracker. It is not easy to
find an open issue; most of them are assigned shortly after they are created.


2013/6/2 Ted Dunning ted.dunn...@gmail.com

 Yexi,

 It is really good that you just spoke up.  The density based clustering
 issue that you filed didn't find a fertile audience, that is true.

 Can you provide a pointer to the other issue?




 On Sat, Jun 1, 2013 at 9:06 PM, Yexi Jiang yexiji...@gmail.com wrote:

  Hi,
 
  I have joined the mailing list for a while and intend to contribute my
 code
  to mahout. However, I tried two issues but didn't get the permission to
  work on them.
 
  I'm wondering how I can contribute to mahout. As a graduate student
  working on data mining, I really want to do something to make mahout
  better.
 
  Regards,
  Yexi
 




-- 
--
Yexi Jiang,
ECS 251,  yjian...@cs.fiu.edu
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/


[jira] [Commented] (MAHOUT-1239) Standardize form of log-likelihood computation

2013-06-03 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673617#comment-13673617
 ] 

Ted Dunning commented on MAHOUT-1239:
-

Looks fine to me.  Go ahead and drop it in.

 Standardize form of log-likelihood computation
 --

 Key: MAHOUT-1239
 URL: https://issues.apache.org/jira/browse/MAHOUT-1239
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.7
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1239.patch


 qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked 
 like its formula was incorrect, at least with respect to 
 http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html
 It appears that the calculation is correct, but in a different form that is
 not immediately recognizable as correct. The proposal here is to change the
 code to match the blog post and avoid confusion (and it ends up avoiding 2
 method calls).
 (Along the way, I think this fixes a tiny other problem in a related test. We
 have a test case that detects when round-off would produce a negative LLR and
 should be clamped to 0, but the test asserts that the result is > 0, not >= 0.)



[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673689#comment-13673689
 ] 

Grant Ingersoll commented on MAHOUT-627:


Hi Dhruv,

Thanks for the response.  We are trying to get 0.8 out in the next week or two.
Any help on a short example as well as updating the code to trunk would be 
awesome.

Thanks,
Grant

 Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
 -

 Key: MAHOUT-627
 URL: https://issues.apache.org/jira/browse/MAHOUT-627
 Project: Mahout
  Issue Type: Task
  Components: Classification
Affects Versions: 0.4, 0.5
Reporter: Dhruv Kumar
Assignee: Grant Ingersoll
  Labels: gsoc, gsoc2011, mahout-gsoc-11
 Fix For: 0.8

 Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
 MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch


 Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
 Model Training. 
 Student Name: Dhruv Kumar 
 Student E-mail: dku...@ecs.umass.edu 
 Organization/Project: Apache Mahout 
 Assigned Mentor: 
 Proposal Abstract: 
 The Baum-Welch algorithm is commonly used for training a Hidden Markov Model
 because of its superior numerical stability and its ability to guarantee
 convergence to a locally maximal Maximum Likelihood Estimate in the presence
 of incomplete training data. Currently, Apache Mahout has a sequential
 implementation of Baum-Welch which cannot be scaled to train over large data
 sets. This restriction reduces the quality of training and constrains
 generalization of the learned model when used for prediction. This project
 proposes to extend Mahout's Baum-Welch to a parallel, distributed version
 using the Map-Reduce programming framework for enhanced model fitting over
 large data sets.
 Detailed Description: 
 Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
 for applications generating temporal or spatial sequential data. Their
 relative simplicity of implementation, combined with their ability to
 discover latent domain knowledge, has made them very popular in diverse
 fields such as DNA sequence alignment, gene discovery, handwriting analysis,
 voice recognition, computer vision, language translation and parts-of-speech
 tagging.
 An HMM is defined as a tuple (S, O, Theta) where S is a finite set of
 unobservable, hidden states emitting symbols from a finite observable 
 vocabulary set O according to a probabilistic model Theta. The parameters of 
 the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
 transition matrix of the hidden states of size |S| X |S|. The elements 
 a_(i,j) of A specify the probability of transitioning from a state i to state 
 j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
 elements b_(s, o) provide the probability that a symbol o will be emitted 
 from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
 determine the probability that the system starts in the hidden state s. The 
 transitions of hidden states are unobservable and follow the Markov property 
 of memorylessness. 
 Rabiner [1] defined three main problems for HMMs: 
 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
 observation sequence, determine the probability that the model generated the 
 observed sequence. This is useful for evaluating the quality of the model and 
 is solved using the so-called Forward algorithm.
 2. Decoding: Given the complete model (S, O, Theta) and an observation 
 sequence, determine the hidden state sequence which generated the observed 
 sequence. This can be viewed as an inference problem where the model and 
 observed sequence are used to predict the value of the unobservable random 
 variables. The Viterbi decoding algorithm is used for predicting the hidden
 state sequence.
 3. Training: Given the set of hidden states S, the set of observation 
 vocabulary O and the observation sequence, determine the parameters (A, B, 
 Pi) of the model Theta. This problem can be viewed as a statistical machine 
 learning problem of model fitting to a large set of training data. The 
 Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
 the Viterbi training algorithm are commonly used for model fitting. 
 In general, the quality of HMM training can be improved by employing large
 training vectors, but currently Mahout only supports sequential versions of
 HMM trainers, which are incapable of scaling. Among
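As a minimal illustration of the evaluation and decoding problems described above, here is a plain-Python sketch in the proposal's (A, B, Pi) notation (my own toy functions, not Mahout's HmmModel API):

```python
def forward_prob(A, B, pi, obs):
    """Evaluation: P(observation sequence | model) via the Forward algorithm."""
    alpha = [pi[s] * B[s][obs[0]] for s in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(alpha))) * B[j][o]
                 for j in range(len(alpha))]
    return sum(alpha)

def viterbi(A, B, pi, obs):
    """Decoding: most likely hidden state sequence, by dynamic programming."""
    n = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(n)]
    backptr = []
    for o in obs[1:]:
        ptrs, new_delta = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] * A[i][j])
            ptrs.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][o])
        backptr.append(ptrs)
        delta = new_delta
    path = [max(range(n), key=lambda s: delta[s])]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

Summing forward_prob over every possible observation sequence of a fixed length gives 1, which is a handy sanity check for any implementation.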

[jira] [Updated] (MAHOUT-1239) Standardize form of log-likelihood computation

2013-06-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-1239:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Standardize form of log-likelihood computation
 --

 Key: MAHOUT-1239
 URL: https://issues.apache.org/jira/browse/MAHOUT-1239
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.7
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1239.patch


 qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked 
 like its formula was incorrect, at least with respect to 
 http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html
 It appears that the calculation is correct, but in a different form that is
 not immediately recognizable as correct. The proposal here is to change the
 code to match the blog post and avoid confusion (and it ends up avoiding 2
 method calls).
 (Along the way, I think this fixes a tiny other problem in a related test. We
 have a test case that detects when round-off would produce a negative LLR and
 should be clamped to 0, but the test asserts that the result is > 0, not >= 0.)



Jenkins build is back to normal : mahout-nightly #1251

2013-06-03 Thread Apache Jenkins Server
See https://builds.apache.org/job/mahout-nightly/1251/changes



[jira] [Commented] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)

2013-06-03 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13674001#comment-13674001
 ] 

Suneel Marthi commented on MAHOUT-1052:
---

Patch committed to trunk

 Add an option to MinHashDriver that specifies the dimension of vector to hash 
 (indexes or values)
 -

 Key: MAHOUT-1052
 URL: https://issues.apache.org/jira/browse/MAHOUT-1052
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.6
Reporter: Elena Smirnova
Assignee: Suneel Marthi
Priority: Minor
  Labels: minhash
 Fix For: Backlog

 Attachments: MAHOUT-1052.patch, MAHOUT-1052.patch


 Add a parameter to MinHash clustering that specifies the dimension of the
 vector to hash (indexes or values). The current version of MinHash clustering
 only hashes the values of vectors. Based on discussion on the dev-mahout list,
 both use-cases are possible and frequently encountered in practice.
 Preserve backward compatibility with the default dimension set to values. Add
 new unit tests.



[jira] [Updated] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)

2013-06-03 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1052:
--

   Resolution: Fixed
Fix Version/s: (was: Backlog)
               0.8
       Status: Resolved  (was: Patch Available)

 Add an option to MinHashDriver that specifies the dimension of vector to hash 
 (indexes or values)
 -

 Key: MAHOUT-1052
 URL: https://issues.apache.org/jira/browse/MAHOUT-1052
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.6
Reporter: Elena Smirnova
Assignee: Suneel Marthi
Priority: Minor
  Labels: minhash
 Fix For: 0.8

 Attachments: MAHOUT-1052.patch, MAHOUT-1052.patch


 Add a parameter to MinHash clustering that specifies the dimension of the
 vector to hash (indexes or values). The current version of MinHash clustering
 only hashes the values of vectors. Based on discussion on the dev-mahout list,
 both use-cases are possible and frequently encountered in practice.
 Preserve backward compatibility with the default dimension set to values. Add
 new unit tests.
