[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)
[ https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672852#comment-13672852 ] Dawid Weiss commented on MAHOUT-1225: - Ehm. I've closed this issue as per Robin's comment above, but I don't think this was the right way to go -- it should have been left open (with a Fixed resolution) until a release is made. Apologies for the noise. I can't reopen it now -- probably missing some Jira karma to do this. Please correct my mistake if you have admin rights: reopen and then bulk close at release time. Thanks! Sets and maps incorrectly clear() their state arrays (potential endless loops) -- Key: MAHOUT-1225 URL: https://issues.apache.org/jira/browse/MAHOUT-1225 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.7 Environment: Eclipse, Linux Fedora 17, Java 1.7, Mahout Math collections (Set) 0.7, hppc 0.4.3 Reporter: Sophie Sperner Assignee: Dawid Weiss Labels: hashset, java, mahout, test Fix For: 0.7 Attachments: hppc-0.4.3.jar, MAHOUT-1225.patch, MAHOUT-1225.patch, MAHOUT-1225.patch, mahout-math-0.8-SNAPSHOT.jar Original Estimate: 48h Remaining Estimate: 48h The code I attached hangs forever; Eclipse does not print a stack trace because the program never terminates. So I made a small test.java file that you can easily run. Its main function simply runs the getItemList() method, which successfully executes the getDataset() method (please download the mushroom.dat dataset and set its full path in the filePath string variable) and then hangs (the problem happens on the fourth columnValues.add() call). After the dataset is read into the X array, the code simply goes through X column by column and searches for distinct items in it. If you uncomment IntSet columnValues = new IntOpenHashSet(); and the corresponding import headers, then everything works just fine (you will also need to include the hppc jar file found at http://labs.carrotsearch.com/hppc.html or below in the attachment).
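A minimal sketch of the reported failure mode -- hypothetical, not the reporter's attached test -- assuming mahout-math 0.7's OpenIntHashSet, where reusing one set across columns via clear() leaves stale internal state behind and a later add() can probe forever:

{code}
import org.apache.mahout.math.set.OpenIntHashSet;

public class ClearLoopRepro {
  public static void main(String[] args) {
    OpenIntHashSet columnValues = new OpenIntHashSet();
    for (int col = 0; col < 10; col++) {
      // On an affected build, clear() does not fully reset the state arrays...
      columnValues.clear();
      for (int row = 0; row < 1000; row++) {
        // ...so a later add() can loop endlessly once no slot reads as FREE.
        columnValues.add((row * 31 + col) % 23);
      }
    }
    System.out.println("done (never reached on an affected build)");
  }
}
{code}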
Re: mutual information issue in logLikelihoodRatio
So looking at the tests, this makes things look less horrifying. org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood

assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.01);
assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.1);
assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.1);
assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);

Next step is to determine whether these values are correct. I recognize the first two. I put these values into my R script and got a successful load. I think that this means that the code is somehow correct, regardless of your reading of it. I don't have time right now to read the code in detail, but I think that things are working. You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R

On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning ted.dunn...@gmail.com wrote: This is a horrifying possibility. I thought we had several test cases in place to verify this code. Let me look. I wonder if the code you have found is not referenced somehow.

On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 qzche...@gmail.com wrote: The definition of org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11, long k12, long k21, long k22):

public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
  Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
  // note that we have counts here, not probabilities, and that the entropy is not normalized.
  double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
  double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
  double matrixEntropy = entropy(k11, k12, k21, k22);
  if (rowEntropy + columnEntropy > matrixEntropy) {
    // round off error
    return 0.0;
  }
  return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
}

The rowEntropy and columnEntropy computed here might be wrong; I think it should be:

double rowEntropy = entropy(k11 + k12, k21 + k22);
double columnEntropy = entropy(k11 + k21, k12 + k22);

which is the same as LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k))), referred from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html . LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in this example) and I is the mutual information. [inline image 1: the definition of I in terms of p(x,y)] where x is the eventA value (1 or 2) and y is the eventB value (1 or 2); p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2), e.g. p(1,1) = k11/N. [inline image 2: rewriting I in terms of entropies] Here we get mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)). Since the Mahout version of the unnormalized entropy is entropy(k11,k12,k21,k22) = N * H(k), we get entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) - entropy(k11+k21, k12+k22) = N * (H(k) - H(rowSums(k)) - H(colSums(k))), which multiplied by 2.0 is just the LLR. Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong, or have I misunderstood something?
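For reference, a standalone sketch of the formulation the thread converges on -- not Mahout's implementation. With the conventional minus-sign Shannon entropy, the mutual information is I = H(rowSums(k)) + H(colSums(k)) - H(k) (the negation of the blog post's arrangement, which pops the minus sign out of H), and LLR = 2 * N * I reproduces the test values quoted above:

{code}
public final class LlrCheck {

  // Shannon entropy (in nats) of a distribution given as counts.
  private static double h(long... counts) {
    double n = 0.0;
    for (long k : counts) {
      n += k;
    }
    double h = 0.0;
    for (long k : counts) {
      if (k > 0) {
        double p = k / n;
        h -= p * Math.log(p);
      }
    }
    return h;
  }

  public static double llr(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    double i = h(k11 + k12, k21 + k22)   // H(rowSums(k))
        + h(k11 + k21, k12 + k22)        // H(colSums(k))
        - h(k11, k12, k21, k22);         // H(k)
    return 2.0 * n * i;
  }

  public static void main(String[] args) {
    System.out.println(llr(1, 0, 0, 1));         // ~2.772589
    System.out.println(llr(10, 0, 0, 10));       // ~27.72589
    System.out.println(llr(5, 1995, 0, 100000)); // ~39.33052
  }
}
{code}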
[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)
[ https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672909#comment-13672909 ] Robin Anil commented on MAHOUT-1225: Could you elaborate on the buggy scenario? I don't see an option to reopen myself.
[jira] [Commented] (MAHOUT-950) Change BtJob to use new MultipleOutputs API
[ https://issues.apache.org/jira/browse/MAHOUT-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672918#comment-13672918 ] Tom White commented on MAHOUT-950: -- MAPREDUCE-3607 is in Hadoop 1.0.1, so this patch should allow Mahout to work against that version of Hadoop or a later one. Change BtJob to use new MultipleOutputs API --- Key: MAHOUT-950 URL: https://issues.apache.org/jira/browse/MAHOUT-950 Project: Mahout Issue Type: Improvement Components: Math Reporter: Tom White Fix For: 1.0 Attachments: MAHOUT-950.patch BtJob uses a mixture of the old and new MapReduce API to allow it to use MultipleOutputs (which isn't available in Hadoop 0.20/1.0). This fails when run against 0.23 (see MAHOUT-822), so we should change BtJob to use the new MultipleOutputs API. (Hopefully the new MultipleOutputs API will be made available in a 1.x release - see MAPREDUCE-3607.)
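For reference, a sketch of what the new-API usage looks like (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs, available once MAPREDUCE-3607 is in). The reducer and the named output "Q" are illustrative, not BtJob's actual wiring:

{code}
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public class SideOutputReducer
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private MultipleOutputs<IntWritable, VectorWritable> mos;

  // Driver side: register the extra named output once on the Job.
  public static void registerSideOutput(Job job) {
    MultipleOutputs.addNamedOutput(job, "Q", SequenceFileOutputFormat.class,
        IntWritable.class, VectorWritable.class);
  }

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<IntWritable, VectorWritable>(context);
  }

  @Override
  protected void reduce(IntWritable key, Iterable<VectorWritable> values, Context context)
      throws IOException, InterruptedException {
    for (VectorWritable value : values) {
      context.write(key, value);  // default output
      mos.write("Q", key, value); // side output through the new API
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}
{code}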
[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass
[ https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Angel López Otero updated MAHOUT-663: -- Attachment: MAHOUT-663.patch Rationalize hadoop job creation with respect to setJarByClass - Key: MAHOUT-663 URL: https://issues.apache.org/jira/browse/MAHOUT-663 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.4, 0.5 Reporter: Benson Margulies Assignee: Sean Owen Fix For: 0.6 Attachments: MAHOUT-663.patch Mahout includes a series of driver classes that create hadoop jobs via static methods. Each one of these calls job.setJarByClass(itself.class). Unfortunately, this subverts the hadoop support for putting additional jars in the lib directory of a job jar, since the class passed in is not a class that lives in the ordinary section of the job jar. The effect of this is to force users of Mahout (and Mahout's own example job jar) to unpack the mahout-core jar into the main section, instead of just treating it as a 'lib' dependency. It seems to me that all the static job creators should be refactored into a public function that returns a job object (and does NOT call waitForCompletion), and then the existing wrapper. Users could call the new functions, and make their own call to setJarByClass.
[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass
[ https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Angel López Otero updated MAHOUT-663: -- Attachment: (was: MAHOUT-663.patch)
Re: Suggested 0.8 Code Freeze Date
+1 on that. On 03.06.2013 00:26, Grant Ingersoll wrote: I'd like to suggest a code freeze of June 10th 2013 for finishing 0.8 bugs. If they aren't in by then, they will get pushed, unless they are blockers. After that, I will create the release candidates. -Grant
Re: Suggested 0.8 Code Freeze Date
+1 On Jun 3, 2013, at 0:26, Grant Ingersoll gsing...@apache.org wrote: I'd like to suggest a code freeze of June 10th 2013 for finishing 0.8 bugs. If they aren't in by then, they will get pushed, unless they are blockers. After that, I will create the release candidates. -Grant
[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters
[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672981#comment-13672981 ] Grant Ingersoll commented on MAHOUT-1103: - OK, I read up on partitioners and I'd agree, Matt: this is effectively hadoop's way of doing what I proposed and doesn't pollute the M/R code, so I'm going to go forward w/ your patch. clusterpp is not writing directories for all clusters - Key: MAHOUT-1103 URL: https://issues.apache.org/jira/browse/MAHOUT-1103 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.8 Reporter: Matt Molek Assignee: Grant Ingersoll Labels: clusterpp Fix For: 0.8 Attachments: MAHOUT-1103.patch After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is. I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2. Even with k=2, only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory. Here is my command sequence for the k=2 run: {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} The output of clusterdump shows two clusters: VL-3742464 and VL-3742466, containing 2585843 and 1156624 points respectively. Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and calling hashCode() gives: VL-3742464 -> -685560454, VL-3742466 -> -685560452. Finally, when running with -xm sequential, everything performs as expected.
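The same-partition outcome is easy to verify with the default partitioner's arithmetic; a small sketch using the hash values quoted above (Hadoop's default HashPartitioner computes (hash & Integer.MAX_VALUE) % numPartitions):

{code}
import org.apache.hadoop.io.Text;

public class PartitionCollision {
  public static void main(String[] args) {
    int numPartitions = 2;
    for (String key : new String[] {"VL-3742464", "VL-3742466"}) {
      int hash = new Text(key).hashCode(); // -685560454 and -685560452 per the report
      int partition = (hash & Integer.MAX_VALUE) % numPartitions;
      System.out.println(key + " -> " + hash + " -> partition " + partition);
    }
    // Both hashes are even, so with 2 reducers both clusters land in partition 0
    // and the other reducer emits an empty part-r-* file.
  }
}
{code}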
Re: Performance of primitive collections
Dawid, do you have your existing benchmark code against fastutil/hppc/trove? Since the performance improvements we made in the last couple of months, I am itching to revisit the numbers. Robin
[jira] [Created] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
Maysam Yabandeh created MAHOUT-1238: --- Summary: VectorWritable's bug with VectorView of sparse vectors Key: MAHOUT-1238 URL: https://issues.apache.org/jira/browse/MAHOUT-1238 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.7, 0.8 Reporter: Maysam Yabandeh Fix For: 0.8, 0.7 VectorWritable raises an exception if it is used on a VectorView of a sparse vector. The reason is that the sparse vector writes only the non-zero elements, while VectorView's implementation of getNumNondefaultElements() returns the size of the entire data. Later, when reading the vector, VectorWritable expects to read more items than were written.
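A minimal round-trip sketch of the reported failure (illustrative, not the reporter's code; the failing behavior is as described against an affected 0.7/0.8 build):

{code}
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorView;
import org.apache.mahout.math.VectorWritable;

public class VectorViewRoundTrip {
  public static void main(String[] args) throws Exception {
    Vector sparse = new RandomAccessSparseVector(100);
    sparse.set(3, 1.0);
    Vector view = new VectorView(sparse, 0, 50); // a view over a sparse vector

    DataOutputBuffer out = new DataOutputBuffer();
    new VectorWritable(view).write(out); // only the non-zero element is written

    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    VectorWritable read = new VectorWritable();
    read.readFields(in); // expects getNumNondefaultElements() entries and fails
  }
}
{code}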
[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1238: Attachment: MAHOUT-1238.patch I am attaching the patch that fixes the bug.
[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1238: Status: Patch Available (was: Open) Submitting the patch to get Hudson comments.
[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-1238: --- Assignee: Robin Anil
Re: Performance of primitive collections
There have been some improvements in the hashmaps, so I would like to re-run these tests. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Mon, Jun 3, 2013 at 5:46 AM, Sebastiano Vigna vi...@di.unimi.it wrote: On 3 Jun 2013, at 12:43 PM, Robin Anil robin.a...@gmail.com wrote: Dawid, do you have your existing benchmark code against fastutil/hppc/trove? Since the performance improvements we made in the last couple of months, I am itching to revisit the numbers. If you're interested, they did a thorough job here: http://blog.aggregateknowledge.com/2011/12/12/big-memory-part-4/ Ciao, seba
Re: Performance of primitive collections
Dawid, do you have your existing benchmark code against fastutil/hppc/trove? Since the performance improvements we made in the last couple of months, I am itching to revisit the numbers. I can rerun those that I have -- will let you know! Should I use the master branch or the official latest release? Dawid
[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)
[ https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673031#comment-13673031 ] Dawid Weiss commented on MAHOUT-1225: - Take a look at this test:

{code}
@Test
public void testClearTable() throws Exception {
  OpenObjectIntHashMap<Integer> m = new OpenObjectIntHashMap<Integer>();
  m.clear(); // rehash from the default capacity to the next prime after 1 (3).
  m.put(1, 2);
  m.clear(); // Should clear internal references.

  Field tableField = m.getClass().getDeclaredField("table");
  tableField.setAccessible(true);
  Object[] table = (Object[]) tableField.get(m);
  assertEquals(
      new HashSet<Object>(Arrays.asList(new Object[] { null })),
      new HashSet<Object>(Arrays.asList(table)));
}
{code}

This fails because clear() does not explicitly erase the table of references. It does call rehash, but not always (not if there's no need), in which case the references stay hard-linked. The fix is to:

{code}
 public void clear() {
   Arrays.fill(this.state, FREE);
+  Arrays.fill(this.table, null);
+
   distinct = 0;
   freeEntries = table.length; // delta
   trimToSize();
{code}

You could avoid this by returning a boolean from trimToSize() and checking whether internal buffers have been reallocated (and thus references freed).
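A sketch of the alternative mentioned in the last sentence, with a hypothetical boolean-returning trimToSize(); the fields are those of the snippet above, and this is an illustration, not the committed patch:

{code}
public void clear() {
  Arrays.fill(this.state, FREE);
  distinct = 0;
  freeEntries = table.length; // delta
  // Hypothetical signature: trimToSize() returns true if it reallocated buffers.
  if (!trimToSize()) {
    // The old table survived the trim, so scrub its references explicitly.
    Arrays.fill(this.table, null);
  }
}
{code}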
[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters
[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673071#comment-13673071 ] Grant Ingersoll commented on MAHOUT-1103: - Matt, out of curiosity, what's your use case for the clusterpp? [~robinanil] and I are both looking at this code and wondering why it is useful to separate out the clusters into their own directory. MAHOUT-843 doesn't shed any light on it for us either. Also, I don't think the current patch partitions correctly. For instance, try a numPartitions of 2 and cluster ids of 153 and 53. Then, with 10^1 you get 153 % 10 = 3 and 53 % 10 = 3, and you have a collision. So, I think I'm back to my original thought, which is that in the mappers and reducers we need to load up the cluster ids and just map them there.
Re: Performance of primitive collections
Master/trunk is the place to test. On Mon, Jun 3, 2013 at 7:32 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: Dawid, do you have your existing benchmark code against fastutil/hppc/trove? Since the performance improvements we made in the last couple of months, I am itching to revisit the numbers. I can rerun those that I have -- will let you know! Should I use the master branch or the official latest release? Dawid
[jira] [Commented] (MAHOUT-1225) Sets and maps incorrectly clear() their state arrays (potential endless loops)
[ https://issues.apache.org/jira/browse/MAHOUT-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673096#comment-13673096 ] Dawid Weiss commented on MAHOUT-1225: - Nope, only that. It's fun to see how everything else goes bust when you run those tests on that dead collections branch though. I'll run those microbenchmarks when I get a spare minute.
[jira] [Commented] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673108#comment-13673108 ] Robin Anil commented on MAHOUT-1238: There is a getNumNonZeroElements() method in AbstractVector; try using that.
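A hedged sketch of what that suggestion amounts to in VectorWritable's sparse write path (illustrative, not the committed patch): size the stream by the true non-zero count so that reader and writer agree.

{code}
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;
import org.apache.mahout.math.AbstractVector;
import org.apache.mahout.math.Vector;

public final class SparseWriteSketch {
  static void writeSparse(DataOutput out, Vector vector) throws IOException {
    int nonZeros = vector instanceof AbstractVector
        ? ((AbstractVector) vector).getNumNonZeroElements()
        : vector.getNumNondefaultElements();
    out.writeInt(nonZeros); // the reader will expect exactly this many entries
    Iterator<Vector.Element> it = vector.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      out.writeInt(e.index());
      out.writeDouble(e.get());
    }
  }
}
{code}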
[jira] [Commented] (MAHOUT-976) Implement Multilayer Perceptron
[ https://issues.apache.org/jira/browse/MAHOUT-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673111#comment-13673111 ] Robin Anil commented on MAHOUT-976: --- I see a few System.out.println() calls; please remove those. Also use the Mahout Eclipse code formatter to format the files. [~chrisberlin], will you be able to work on these quickly? I am pushing it off the 0.8 list. If you can work on it, please update it and we will review it. Implement Multilayer Perceptron --- Key: MAHOUT-976 URL: https://issues.apache.org/jira/browse/MAHOUT-976 Project: Mahout Issue Type: New Feature Affects Versions: 0.7 Reporter: Christian Herta Assignee: Ted Dunning Priority: Minor Labels: multilayer, networks, neural, perceptron Fix For: 0.8 Attachments: MAHOUT-976.patch, MAHOUT-976.patch, MAHOUT-976.patch, MAHOUT-976.patch Original Estimate: 80h Remaining Estimate: 80h Implement a multilayer perceptron * via matrix multiplication * learning by backpropagation; implementing tricks by Yann LeCun et al.: "Efficient BackProp" * arbitrary number of hidden layers (also 0 - just the linear model) * connections between proximate layers only * different cost and activation functions (a different activation function in each layer) * test of backprop by gradient checking * normalization of the inputs (storeable) as part of the model First: * implementation of stochastic gradient descent, like gradient machine * simple gradient descent incl. momentum Later (new JIRA issues): * distributed batch learning (see below) * Stacked (Denoising) Autoencoder - feature learning * advanced cost minimization like 2nd-order methods, conjugate gradient, etc. Distribution of learning can be done by (batch learning): 1 Partitioning of the data in x chunks 2 Learning the weight changes as matrices in each chunk 3 Combining the matrices and updating the weights - back to 2 Maybe this procedure can be done with random parts of the chunks (distributed quasi-online learning). Batch learning with delta-bar-delta heuristics for adapting the learning rates.
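Since the issue lists gradient checking among its items, here is a tiny illustration of that technique on a toy cost (not code from the attached patches): the analytic gradient should agree with a central finite difference up to roughly eps^2.

{code}
public class GradCheck {
  static double cost(double w) { return (w - 3.0) * (w - 3.0); }   // toy cost J(w)
  static double analyticGrad(double w) { return 2.0 * (w - 3.0); } // dJ/dw

  public static void main(String[] args) {
    double w = 1.25;
    double eps = 1.0e-6;
    // Central-difference approximation of dJ/dw at w.
    double numeric = (cost(w + eps) - cost(w - eps)) / (2.0 * eps);
    System.out.printf("analytic=%.8f numeric=%.8f%n", analyticGrad(w), numeric);
  }
}
{code}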
[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated MAHOUT-1238: Attachment: MAHOUT-1238.patch The attached fixes the bug using AbstractVector#getNumNonZeroElements as suggested by [~robinanil].
[jira] [Updated] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-1238: --- Resolution: Fixed Status: Resolved (was: Patch Available) Tested and submitted.
[jira] [Updated] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.
[ https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ikumasa Mukai updated MAHOUT-961: - Attachment: MAHOUT-961.patch Thank you for checking!!! I have recreated the patch for the latest codebase. This patch is a little big because I applied Eclipse-Lucene-Formatter.xml to the sources, per "How To Contribute". Modify the Tree/Forest Visualizer on DF. Key: MAHOUT-961 URL: https://issues.apache.org/jira/browse/MAHOUT-961 Project: Mahout Issue Type: Bug Reporter: Ikumasa Mukai Assignee: Sebastian Schelter Labels: RandomForest Fix For: 0.8 Attachments: MAHOUT-961.patch, MAHOUT-961.patch, MAHOUT-961.patch, MAHOUT-961.patch The Tree/Forest visualizer (MAHOUT-926) has problems: 1) an un-complemented stem which has no leaf or node is shown; 2) not all stems are shown when the data doesn't have all categories.
Build failed in Jenkins: Mahout-Examples-Cluster-Reuters #328
See https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/328/changes Changes: [robinanil] MAHOUT-1238 VectorWritable's bug with VectorView of sparse vectors (Maysam Yabandeh) -- [...truncated 836 lines...] [WARNING] https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayClustering.java: Some input files use or override a deprecated API. [WARNING] https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayClustering.java: Recompile with -Xlint:deprecation for details. [WARNING] https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterDataModel.java: https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterDataModel.java uses unchecked or unsafe operations. [WARNING] https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/JesterDataModel.java: Recompile with -Xlint:unchecked for details. [INFO] [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ mahout-examples --- [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] Copying 4 resources [INFO] [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ mahout-examples --- [INFO] Changes detected - recompiling the module! [INFO] Compiling 5 source files to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/test-classes [WARNING] https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/test/java/org/apache/mahout/classifier/sgd/TrainLogisticTest.java: https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/test/java/org/apache/mahout/classifier/sgd/TrainLogisticTest.java uses or overrides a deprecated API. [WARNING] https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/src/test/java/org/apache/mahout/classifier/sgd/TrainLogisticTest.java: Recompile with -Xlint:deprecation for details. [INFO] [INFO] --- maven-surefire-plugin:2.14.1:test (default-test) @ mahout-examples --- [INFO] Tests are skipped. 
[INFO] [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-examples --- [INFO] Building jar: https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/mahout-examples-0.8-SNAPSHOT.jar [INFO] [INFO] --- maven-dependency-plugin:2.7:copy-dependencies (copy-dependencies) @ mahout-examples --- [INFO] Copying servlet-api-2.5-20081211.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/servlet-api-2.5-20081211.jar [INFO] Copying netty-3.5.9.Final.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/netty-3.5.9.Final.jar [INFO] Copying lucene-facet-4.2.1.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/lucene-facet-4.2.1.jar [INFO] Copying jackson-core-asl-1.9.12.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/jackson-core-asl-1.9.12.jar [INFO] Copying stax-api-1.0.1.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/stax-api-1.0.1.jar [INFO] Copying commons-net-1.4.1.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/commons-net-1.4.1.jar [INFO] Copying json-simple-1.1.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/json-simple-1.1.jar [INFO] Copying jakarta-regexp-1.4.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/jakarta-regexp-1.4.jar [INFO] Copying commons-dbcp-1.4.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/commons-dbcp-1.4.jar [INFO] Copying mongo-java-driver-2.11.1.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/mongo-java-driver-2.11.1.jar [INFO] Copying mahout-core-0.8-SNAPSHOT-tests.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/mahout-core-0.8-SNAPSHOT-tests.jar [INFO] Copying libthrift-0.7.0.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/libthrift-0.7.0.jar [INFO] Copying commons-beanutils-1.7.0.jar to https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/ws/trunk/examples/target/dependency/commons-beanutils-1.7.0.jar [INFO] Copying jbcrypt-0.3m.jar to
[jira] [Commented] (MAHOUT-1238) VectorWritable's bug with VectorView of sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673231#comment-13673231 ] Hudson commented on MAHOUT-1238: Integrated in Mahout-Quality #2034 (See [https://builds.apache.org/job/Mahout-Quality/2034/]) MAHOUT-1238 VectorWritable's bug with VectorView of sparse vectors (Maysam Yabandeh) (Revision 1489001) Result = SUCCESS robinanil : Files : * /mahout/trunk/core/src/main/java/org/apache/mahout/math/VectorWritable.java * /mahout/trunk/core/src/test/java/org/apache/mahout/math/VectorWritableTest.java
[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass
[ https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Angel López Otero updated MAHOUT-663: -- Attachment: MAHOUT-663.patch
[jira] [Updated] (MAHOUT-663) Rationalize hadoop job creation with respect to setJarByClass
[ https://issues.apache.org/jira/browse/MAHOUT-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Angel López Otero updated MAHOUT-663: -- Attachment: (was: MAHOUT-663.patch)
Re: mutual information issue in logLikelihoodRatio
Thanks Sean! On Mon, Jun 3, 2013 at 12:14 PM, Sean Owen sro...@gmail.com wrote: I glanced at this and I am confused too. Ted, I double-checked your blog post and it seems fine -- you popped the minus sign out of the entropy expression and reversed the args in the mutual info term, which will be relevant in a second. This is computing the value of the G test, right, and you are computing regular entropy and multiplying by the sum later. For the matrix [1 0 ; 0 1] I get an unnormalized LLR of 2.772, yes. In the Java code, the expression for unnormalized entropy looks correct. This is how it gets the N term in there explicitly. It hasn't omitted the minus sign in entropy. But then the final expression should have a minus sign in front of H(k) (matrix entropy), right? And it looks like it does the opposite. The proposed change in this thread doesn't quite work as it results in 0. But I somehow suspect it is prevented from working directly by the previous point. Indeed, if you negate all the entropy calculation (or flip around the mutual information expression) the tests pass. (Except when it comes to how root LLR is handled for negative LLR, but that's a detail.) I suppose it would be best to make the code reflect Ted's nice clear post. It is actually a little faster too. I am still not clear on why the current expression works, though it evidently does. I don't know its history or if it's just an alternate formulation. Since I'm already here, let me see if I can sort out a patch that also addresses negative LLR correctly.

On Mon, Jun 3, 2013 at 2:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: So looking at the tests, this makes things look less horrifying. org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood

assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.01);
assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.1);
assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.1);
assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);

Next step is to determine whether these values are correct. I recognize the first two. I put these values into my R script and got a successful load. I think that this means that the code is somehow correct, regardless of your reading of it. I don't have time right now to read the code in detail, but I think that things are working. You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R

On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning ted.dunn...@gmail.com wrote: This is a horrifying possibility. I thought we had several test cases in place to verify this code. Let me look. I wonder if the code you have found is not referenced somehow.

On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 qzche...@gmail.com wrote: The definition of org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11, long k12, long k21, long k22):

public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
  Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
  // note that we have counts here, not probabilities, and that the entropy is not normalized.
  double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
  double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
  double matrixEntropy = entropy(k11, k12, k21, k22);
  if (rowEntropy + columnEntropy > matrixEntropy) {
    // round off error
    return 0.0;
  }
  return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
}

The rowEntropy and columnEntropy computed here might be wrong; I think it should be:

double rowEntropy = entropy(k11 + k12, k21 + k22);
double columnEntropy = entropy(k11 + k21, k12 + k22);

which is the same as LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k))), referred from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html . LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in this example) and I is the mutual information. [inline image 1: the definition of I in terms of p(x,y)] where x is the eventA value (1 or 2) and y is the eventB value (1 or 2); p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2), e.g. p(1,1) = k11/N. [inline image 2: rewriting I in terms of entropies] Here we get mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)). Since the Mahout version of the unnormalized entropy is entropy(k11,k12,k21,k22) = N * H(k), we get entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) - entropy(k11+k21, k12+k22) = N * (H(k) - H(rowSums(k)) - H(colSums(k))), which multiplied by 2.0 is just the LLR. Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong, or have I misunderstood something?
Re: mutual information issue in logLikelihoodRatio
I glanced at this and I am confused too. Ted, I double-checked your blog post and it seems fine -- you popped the minus sign out of the entropy expression and reversed the args in the mutual info term, which will be relevant in a second. This is computing the value of the G test, right, and you are computing regular entropy and multiplying by the sum later. For the matrix [1 0 ; 0 1] I get an unnormalized LLR of 2.772, yes. In the Java code, the expression for unnormalized entropy looks correct. This is how it gets the N term in there explicitly. It hasn't omitted the minus sign in entropy. But then the final expression should have a minus sign in front of H(k) (matrix entropy), right? And it looks like it does the opposite. The proposed change in this thread doesn't quite work as it results in 0. But I somehow suspect it is prevented from working directly by the previous point. Indeed, if you negate all the entropy calculation (or flip around the mutual information expression) the tests pass. (Except when it comes to how root LLR is handled for negative LLR, but that's a detail.) I suppose it would be best to make the code reflect Ted's nice clear post. It is actually a little faster too. I am still not clear on why the current expression works, though it evidently does. I don't know its history or if it's just an alternate formulation. Since I'm already here, let me see if I can sort out a patch that also addresses negative LLR correctly.

On Mon, Jun 3, 2013 at 2:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: So looking at the tests, this makes things look less horrifying. org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood

assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.01);
assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.1);
assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.1);
assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);

Next step is to determine whether these values are correct. I recognize the first two. I put these values into my R script and got a successful load. I think that this means that the code is somehow correct, regardless of your reading of it. I don't have time right now to read the code in detail, but I think that things are working. You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R

On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning ted.dunn...@gmail.com wrote: This is a horrifying possibility. I thought we had several test cases in place to verify this code. Let me look. I wonder if the code you have found is not referenced somehow.

On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 qzche...@gmail.com wrote: The definition of org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11, long k12, long k21, long k22):

public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
  Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
  // note that we have counts here, not probabilities, and that the entropy is not normalized.
  double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
  double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
  double matrixEntropy = entropy(k11, k12, k21, k22);
  if (rowEntropy + columnEntropy > matrixEntropy) {
    // round off error
    return 0.0;
  }
  return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
}

The rowEntropy and columnEntropy computed here might be wrong; I think it should be:

double rowEntropy = entropy(k11 + k12, k21 + k22);
double columnEntropy = entropy(k11 + k21, k12 + k22);

which is the same as LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k))), referred from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html . LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in this example) and I is the mutual information. [inline image 1: the definition of I in terms of p(x,y)] where x is the eventA value (1 or 2) and y is the eventB value (1 or 2); p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2), e.g. p(1,1) = k11/N. [inline image 2: rewriting I in terms of entropies] Here we get mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)). Since the Mahout version of the unnormalized entropy is entropy(k11,k12,k21,k22) = N * H(k), we get entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) - entropy(k11+k21, k12+k22) = N * (H(k) - H(rowSums(k)) - H(colSums(k))), which multiplied by 2.0 is just the LLR. Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong, or have I misunderstood something?
[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
[ https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673255#comment-13673255 ] Dhruv Kumar commented on MAHOUT-627: Hi Grant, As I understand it, the only blocker for this issue is a small, self-contained example which users can run in a reasonable amount of time and see the results. The parts-of-speech tagger example which I originally adapted for this trainer can take hours to converge, and sometimes it fails with arithmetic underflow due to an unusually large set of states for the observations (observed states are the words of the corpus in the POS tagger's model). When is 0.8 due? I can chip away at this issue for the next few days in the evenings and hunt for a short example from the book mentioned above. It should take at least a week or two to sign off from my side. There are also unit tests with the trainer which demonstrate that it works -- the results of MapReduce-based training are identical to the ones obtained in the sequential version. Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training. - Key: MAHOUT-627 URL: https://issues.apache.org/jira/browse/MAHOUT-627 Project: Mahout Issue Type: Task Components: Classification Affects Versions: 0.4, 0.5 Reporter: Dhruv Kumar Assignee: Grant Ingersoll Labels: gsoc, gsoc2011, mahout-gsoc-11 Fix For: 0.8 Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training. Student Name: Dhruv Kumar Student E-mail: dku...@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework for enhanced model fitting over large data sets. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge, has made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging. An HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j.
Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S|-length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so-called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence.
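As a concrete reading of this definition, here is a minimal sketch of the parameter tuple (A, B, Pi) and the row-stochastic constraints it implies. The class, field, and method names are invented for the example and are not Mahout's actual HMM classes.

public final class HmmParamsSketch {
  final double[][] a;  // |S| x |S|: a[i][j] = P(next state = j | current state = i)
  final double[][] b;  // |S| x |O|: b[s][o] = P(emit symbol o | state = s)
  final double[] pi;   // |S|: pi[s] = P(start in state s)

  HmmParamsSketch(double[][] a, double[][] b, double[] pi) {
    this.a = a;
    this.b = b;
    this.pi = pi;
  }

  // Every row of A and B, and Pi itself, must sum to 1 within tolerance.
  boolean isStochastic(double eps) {
    for (double[] row : a) {
      if (Math.abs(sum(row) - 1.0) > eps) {
        return false;
      }
    }
    for (double[] row : b) {
      if (Math.abs(sum(row) - 1.0) > eps) {
        return false;
      }
    }
    return Math.abs(sum(pi) - 1.0) <= eps;
  }

  private static double sum(double[] v) {
    double s = 0.0;
    for (double x : v) {
      s += x;
    }
    return s;
  }
}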
Re: [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
On Mon, Jun 3, 2013 at 12:11 PM, Dhruv Kumar (JIRA) j...@apache.org wrote: When is 0.8 due? I can chip away at this issue for the next few days in the evenings and hunt for a short example from the book mentioned above. It should take at least a week or two to sign off from my side. There are also unit tests with the trainer which demonstrate that it works--the results of MapReduce-based training are identical to the ones obtained from the sequential version. Code freeze is the 10th. If you run, you might make it.
Re: Suggested 0.8 Code Freeze Date
+1 Although does anyone else want to take a crack at the release, so that more of us get some experience with that? On Mon, Jun 3, 2013 at 2:14 AM, Dan Filimon dangeorge.fili...@gmail.com wrote: +1 On Jun 3, 2013, at 0:26, Grant Ingersoll gsing...@apache.org wrote: I'd like to suggest a code freeze of June 10th, 2013 for finishing 0.8 bugs. If they aren't in by then, they will get pushed, unless they are blockers. After that, I will create the release candidates. -Grant -- -jake
[jira] [Commented] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)
[ https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673392#comment-13673392 ] Suneel Marthi commented on MAHOUT-1052: --- Cleaned up the patch to be compatible with the present codebase. Uploading a new patch. Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values) - Key: MAHOUT-1052 URL: https://issues.apache.org/jira/browse/MAHOUT-1052 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.6 Reporter: Elena Smirnova Assignee: Suneel Marthi Priority: Minor Labels: minhash Fix For: Backlog Attachments: MAHOUT-1052.patch Add a parameter to MinHash clustering that specifies the dimension of vector to hash (indexes or values). The current version of MinHash clustering only hashes the values of vectors. Based on discussion on the dev-mahout list, both use cases are possible and frequently met in practice. Preserve backward compatibility with the default dimension set to values. Add new unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
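To illustrate the distinction the issue draws, here is a hedged sketch of computing a single min-hash over either the non-zero indexes or the stored values of a sparse vector. The class, the enum, and the mixing function are invented for this example and are not Mahout's actual MinHash API.

public final class MinHashSketch {
  enum HashDimension { INDEXES, VALUES }

  // One min-hash over the chosen dimension of a sparse vector, given as
  // parallel arrays of non-zero indexes and their stored values.
  static int minHash(int[] indexes, double[] values, int seed, HashDimension dim) {
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < indexes.length; i++) {
      int feature = (dim == HashDimension.INDEXES)
          ? indexes[i]
          : (int) Double.doubleToLongBits(values[i]);
      // Simple multiplicative mixing; a real driver would use a proper hash
      // family with one seed per hash function.
      int h = (seed ^ feature) * 0x9E3779B9;
      min = Math.min(min, h);
    }
    return min;
  }
}

Hashing the indexes treats the vector as a set of present features, while hashing the values clusters on the stored values themselves; the proposed option selects between the two, defaulting to values for backward compatibility.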
[jira] [Updated] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)
[ https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi updated MAHOUT-1052: -- Attachment: MAHOUT-1052.patch Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values) - Key: MAHOUT-1052 URL: https://issues.apache.org/jira/browse/MAHOUT-1052 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.6 Reporter: Elena Smirnova Assignee: Suneel Marthi Priority: Minor Labels: minhash Fix For: Backlog Attachments: MAHOUT-1052.patch, MAHOUT-1052.patch Add a parameter to MinHash clustering that specifies the dimension of vector to hash (indexes or values). The current version of MinHash clustering only hashes the values of vectors. Based on discussion on the dev-mahout list, both use cases are possible and frequently met in practice. Preserve backward compatibility with the default dimension set to values. Add new unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAHOUT-1239) Standardize form of log-likelihood computation
Sean Owen created MAHOUT-1239: - Summary: Standardize form of log-likelihood computation Key: MAHOUT-1239 URL: https://issues.apache.org/jira/browse/MAHOUT-1239 Project: Mahout Issue Type: Improvement Affects Versions: 0.7 Reporter: Sean Owen Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1239.patch qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked like its formula was incorrect, at least with respect to http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html It appears that the calculation is correct, but in a different form that is not immediately recognizable as correct. The proposal here is to change the code to match the blog post and avoid confusion (and it ends up avoiding two method calls). (Along the way, I think this fixes a tiny other problem in a related test. We have a test case that detects when round-off would produce a negative LLR and should be clamped to 0, but the test asserts that the result is < 0, not >= 0.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
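For reference, a minimal sketch of the blog post's form of the computation, with the round-off clamp the description mentions. The class and helper names are illustrative only; this is not the committed patch.

public final class LlrSketch {
  // x * log(x), with the convention that 0 * log(0) = 0.
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy of a list of counts: N * H, where N is the total count.
  static double entropy(long... counts) {
    long total = 0;
    double sumXLogX = 0.0;
    for (long count : counts) {
      total += count;
      sumXLogX += xLogX(count);
    }
    return xLogX(total) - sumXLogX;
  }

  // LLR = 2 * N * I, written with unnormalized entropies of the 2x2 table,
  // its row sums, and its column sums.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy < matrixEntropy) {
      return 0.0; // clamp round-off error rather than return a negative LLR
    }
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }
}

As a sanity check, for the table (1, 0, 0, 1) this gives 2 * (2 ln 2 + 2 ln 2 - 2 ln 2) = 4 ln 2, approximately 2.77.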
[jira] [Updated] (MAHOUT-1239) Standardize form of log-likelihood computation
[ https://issues.apache.org/jira/browse/MAHOUT-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-1239: -- Assignee: Sean Owen Status: Patch Available (was: Open) Standardize form of log-likelihood computation -- Key: MAHOUT-1239 URL: https://issues.apache.org/jira/browse/MAHOUT-1239 Project: Mahout Issue Type: Improvement Affects Versions: 0.7 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1239.patch qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked like its formula was incorrect, at least with respect to http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html It appears that the calculation is correct, but in a different form that is not immediately recognizable as correct. The proposal here is to change the code to match the blog post and avoid confusion (and it ends up avoiding two method calls). (Along the way, I think this fixes a tiny other problem in a related test. We have a test case that detects when round-off would produce a negative LLR and should be clamped to 0, but the test asserts that the result is < 0, not >= 0.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Really want to contribute to mahout
Certainly, I always keep an eye on the issue tracker. It is not easy to find an open issue; most of them are assigned shortly after they are created. 2013/6/2 Ted Dunning ted.dunn...@gmail.com Yexi, It is really good that you just spoke up. The density-based clustering issue that you filed didn't find a fertile audience, that is true. Can you provide a pointer to the other issue? On Sat, Jun 1, 2013 at 9:06 PM, Yexi Jiang yexiji...@gmail.com wrote: Hi, I have been on the mailing list for a while and intend to contribute my code to Mahout. However, I tried two issues but didn't get permission to work on them. I'm wondering how I can contribute to Mahout. As a graduate student working on data mining, I really want to do something to make Mahout better. Regards, Yexi -- -- Yexi Jiang, ECS 251, yjian...@cs.fiu.edu School of Computer and Information Science, Florida International University Homepage: http://users.cis.fiu.edu/~yjian004/
[jira] [Commented] (MAHOUT-1239) Standardize form of log-likelihood computation
[ https://issues.apache.org/jira/browse/MAHOUT-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673617#comment-13673617 ] Ted Dunning commented on MAHOUT-1239: - Looks fine to me. Go ahead and drop it in. Standardize form of log-likelihood computation -- Key: MAHOUT-1239 URL: https://issues.apache.org/jira/browse/MAHOUT-1239 Project: Mahout Issue Type: Improvement Affects Versions: 0.7 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1239.patch qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked like its formula was incorrect, at least with respect to http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html It appears that the calculation is correct, but in a different form that is not immediately recognizable as correct. The proposal here is to change the code to match the blog post and avoid confusion (and it ends up avoiding two method calls). (Along the way, I think this fixes a tiny other problem in a related test. We have a test case that detects when round-off would produce a negative LLR and should be clamped to 0, but the test asserts that the result is < 0, not >= 0.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
[ https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673689#comment-13673689 ] Grant Ingersoll commented on MAHOUT-627: Hi Dhruv, Thanks for the response. We are trying to get 0.8 out in the next week or two. Any help on a short example as well as updating the code to trunk would be awesome. Thanks, Grant Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training. - Key: MAHOUT-627 URL: https://issues.apache.org/jira/browse/MAHOUT-627 Project: Mahout Issue Type: Task Components: Classification Affects Versions: 0.4, 0.5 Reporter: Dhruv Kumar Assignee: Grant Ingersoll Labels: gsoc, gsoc2011, mahout-gsoc-11 Fix For: 0.8 Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training. Student Name: Dhruv Kumar Student E-mail: dku...@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally optimal Maximum Likelihood Estimate in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of Baum-Welch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework for enhanced model fitting over large data sets. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation, combined with their ability to discover latent domain knowledge, has made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging. An HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S|-length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so-called Forward algorithm. 2.
Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors, but currently Mahout only supports sequential versions of HMM trainers, which are incapable of scaling. Among
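To make the evaluation problem concrete, here is a minimal, hedged sketch of the Forward algorithm in plain Java. The class, method, and parameter names are illustrative and do not correspond to Mahout's HMM classes, and no rescaling of the alpha values is done; a real implementation would rescale or work in log space to avoid the kind of arithmetic underflow mentioned in the comment above.

public final class ForwardSketch {
  // a: |S| x |S| transition matrix, b: |S| x |O| emission matrix,
  // pi: initial state distribution, obs: observed symbol indices.
  // Returns P(observation sequence | model).
  static double sequenceProbability(double[][] a, double[][] b, double[] pi, int[] obs) {
    int numStates = pi.length;
    double[] alpha = new double[numStates];

    // Initialization: alpha_1(s) = pi(s) * b(s, o_1)
    for (int s = 0; s < numStates; s++) {
      alpha[s] = pi[s] * b[s][obs[0]];
    }

    // Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a(i, j)) * b(j, o_{t+1})
    for (int t = 1; t < obs.length; t++) {
      double[] next = new double[numStates];
      for (int j = 0; j < numStates; j++) {
        double sum = 0.0;
        for (int i = 0; i < numStates; i++) {
          sum += alpha[i] * a[i][j];
        }
        next[j] = sum * b[j][obs[t]];
      }
      alpha = next;
    }

    // Termination: P(observations | model) = sum_s alpha_T(s)
    double total = 0.0;
    for (double v : alpha) {
      total += v;
    }
    return total;
  }
}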
[jira] [Updated] (MAHOUT-1239) Standardize form of log-likelihood computation
[ https://issues.apache.org/jira/browse/MAHOUT-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-1239: -- Resolution: Fixed Status: Resolved (was: Patch Available) Standardize form of log-likelihood computation -- Key: MAHOUT-1239 URL: https://issues.apache.org/jira/browse/MAHOUT-1239 Project: Mahout Issue Type: Improvement Affects Versions: 0.7 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1239.patch qzche...@gmail.com reported that LogLikelihood.logLikelihoodRatio() looked like its formula was incorrect, at least with respect to http://tdunning.blogspot.mx/2008/03/surprise-and-coincidence.html It appears that the calculation is correct, but in a different form that is not immediately recognizable as correct. The proposal here is to change the code to match the blog post and avoid confusion (and it ends up avoiding two method calls). (Along the way, I think this fixes a tiny other problem in a related test. We have a test case that detects when round-off would produce a negative LLR and should be clamped to 0, but the test asserts that the result is < 0, not >= 0.) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Jenkins build is back to normal : mahout-nightly #1251
See https://builds.apache.org/job/mahout-nightly/1251/changes
[jira] [Commented] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)
[ https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13674001#comment-13674001 ] Suneel Marthi commented on MAHOUT-1052: --- Patch committed to trunk. Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values) - Key: MAHOUT-1052 URL: https://issues.apache.org/jira/browse/MAHOUT-1052 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.6 Reporter: Elena Smirnova Assignee: Suneel Marthi Priority: Minor Labels: minhash Fix For: Backlog Attachments: MAHOUT-1052.patch, MAHOUT-1052.patch Add a parameter to MinHash clustering that specifies the dimension of vector to hash (indexes or values). The current version of MinHash clustering only hashes the values of vectors. Based on discussion on the dev-mahout list, both use cases are possible and frequently met in practice. Preserve backward compatibility with the default dimension set to values. Add new unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)
[ https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi updated MAHOUT-1052: -- Resolution: Fixed Fix Version/s: (was: Backlog) 0.8 Status: Resolved (was: Patch Available) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values) - Key: MAHOUT-1052 URL: https://issues.apache.org/jira/browse/MAHOUT-1052 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.6 Reporter: Elena Smirnova Assignee: Suneel Marthi Priority: Minor Labels: minhash Fix For: 0.8 Attachments: MAHOUT-1052.patch, MAHOUT-1052.patch Add a parameter to MinHash clustering that specifies the dimension of vector to hash (indexes or values). The current version of MinHash clustering only hashes the values of vectors. Based on discussion on the dev-mahout list, both use cases are possible and frequently met in practice. Preserve backward compatibility with the default dimension set to values. Add new unit tests. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira