0.8 and bug squashing on June 1
A few of us are at Berlin Buzzwords hanging out and working on Mahout, so if you are interested, feel free to jump on IRC (#mahout on freenode) for some discussion. Not all of our conversation will be translated to IRC, but we are happy to interact w/ others if interested. Also, sounds like maybe we are ready for 0.8? Or at least close? I volunteered to do the release, so I'm going to start going through the 0.8 JIRA issues and triaging them. If you want something in for 0.8, speak now (or relatively soon). I'd like to suggest trying to get an RC out this coming week or the following. -Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Updated] (MAHOUT-1201) Some Mahout jobs do not pass user supplied Configuration object to sub jobs
[ https://issues.apache.org/jira/browse/MAHOUT-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-1201: Fix Version/s: 0.8 Some Mahout jobs do not pass user supplied Configuration object to sub jobs --- Key: MAHOUT-1201 URL: https://issues.apache.org/jira/browse/MAHOUT-1201 Project: Mahout Issue Type: Bug Components: Clustering, Frequent Itemset/Association Rule Mining, Math Affects Versions: 0.7 Reporter: Isabel Drost-Fromm Fix For: 0.8 Attachments: MAHOUT-1201-clustering.patch, MAHOUT-1201-entropy.patch, MAHOUT-1201-pfpgrowth.patch, MAHOUT-1201-solver.patch Some (see patch) of our Hadoop jobs do not pass a user supplied configuration object down to sub jobs created. As a result some Hadoop related settings may not be honored. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
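To illustrate the kind of bug MAHOUT-1201 describes, here is a minimal sketch. It is not taken from the attached patches. It uses java.util.Properties as a stand-in for Hadoop's Configuration so that it runs without Hadoop on the classpath; the class name, method names, and the property key are all hypothetical.

```java
import java.util.Properties;

// Stand-in sketch: java.util.Properties plays the role of Hadoop's Configuration.
// All class, method, and key names here are hypothetical, not Mahout code.
final class ConfPropagationSketch {

  // Buggy pattern: the sub job builds a fresh configuration and loses user settings.
  static Properties subJobConfIgnoringUser(Properties userConf) {
    return new Properties();
  }

  // Fixed pattern: the sub job starts from a copy of the user supplied configuration.
  static Properties subJobConfFromUser(Properties userConf) {
    Properties copy = new Properties();
    copy.putAll(userConf);
    return copy;
  }

  public static void main(String[] args) {
    Properties userConf = new Properties();
    userConf.setProperty("example.queue.name", "analytics"); // hypothetical setting
    // The first call drops the setting, the second one keeps it.
    System.out.println(subJobConfIgnoringUser(userConf).getProperty("example.queue.name"));
    System.out.println(subJobConfFromUser(userConf).getProperty("example.queue.name"));
  }
}
```

In a real Hadoop job the same idea would presumably mean constructing each sub job from the caller's configuration rather than from a new, empty one.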
[jira] [Updated] (MAHOUT-1154) Implementing Streaming KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1154: Fix Version/s: 0.8 Implementing Streaming KMeans - Key: MAHOUT-1154 URL: https://issues.apache.org/jira/browse/MAHOUT-1154 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 0.8 Reporter: Dan Filimon Fix For: 0.8
An implementation of Streaming KMeans as mentioned in [1] is available here [2].
[1] http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3ccaowb3goyf9zufrgxhsucpkjxk6cw0nnr8gwg__jsey+kvab...@mail.gmail.com%3E
[2] https://github.com/dfilimon/mahout
Since there will be more than one patch, there will be specific JIRA issues that address each one. The description of the code being added is:
The main classes are in o.a.m.clustering.streaming [3], under the core/ project. These are subdivided into 2 packages:
- cluster: contains the BallKMeans and StreamingKMeans classes that can be used standalone. BallKMeans is exactly what it sounds like (uses k-means++ for the initialization, then does a normal k-means pass, ignoring outliers). StreamingKMeans implements the online clustering that doesn't return exactly k clusters (it returns an estimate). This is used to approximate the data.
- mapreduce: contains the CentroidWritable, StreamingKMeansDriver, StreamingKMeansMapper and StreamingKMeansReducer classes. CentroidWritable serializes Centroids (sort of like AbstractCluster). StreamingKMeansDriver provides the driver for the job. StreamingKMeansMapper runs StreamingKMeans in the mappers to produce sketches of the data for the reducer. StreamingKMeansReducer collects the centroids produced by the mappers into one set of weighted points and runs BallKMeans on them, producing the final results.
Additionally the searchers are in o.a.m.math.neighborhood:
- neighborhood: various searcher classes that implement nearest-neighbor search using different strategies.
  Searcher, UpdatableSearcher: abstract classes that define how to search through collections of vectors.
  BruteSearch: does a brute search (looks at every point...)
  ProjectionSearch: uses random projections for searching.
  FastProjectionSearch: also uses random projections (but not binary search trees as in ProjectionSearch).
  HashedVector, LocalitySensitiveHashSearch: implement locality sensitive hash search.
All the tools that I used are in o.a.m.clustering.streaming [4], under the examples/ project. There are a bunch of classes here, covering everything from vectorizing 20 newsgroups data to various IO utils. The more important ones are:
  utils.ExperimentUtils: convenience methods.
  tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
[3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
[4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
The relevant issues are:
- MAHOUT-1155 (Centroid, WeightedVector)
- MAHOUT-1156 (searchers)
- MAHOUT-1162 (clustering, non map-reduce)
- MAHOUT-1181 (map-reduce, command-line changes, pom.xml)
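For readers unfamiliar with the searcher idea, the brute-force strategy mentioned above (look at every point) can be sketched in a few lines. This is only an illustration using plain double[] vectors and squared Euclidean distance; it does not use the Searcher or BruteSearch classes from the patch, and the class and method names below are made up.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative brute-force nearest-neighbor search; not the BruteSearch class from the patch.
final class BruteSearchSketch {

  // Squared Euclidean distance between two vectors of equal length.
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  // Returns the index of the point closest to the query, or -1 for an empty collection.
  static int nearestIndex(List<double[]> points, double[] query) {
    int bestIndex = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < points.size(); i++) {
      double d = squaredDistance(points.get(i), query);
      if (d < bestDistance) {
        bestDistance = d;
        bestIndex = i;
      }
    }
    return bestIndex;
  }

  public static void main(String[] args) {
    List<double[]> points = Arrays.asList(
        new double[] {0.0, 0.0}, new double[] {3.0, 4.0}, new double[] {1.0, 1.0});
    System.out.println(nearestIndex(points, new double[] {0.9, 1.2}));
  }
}
```

Every query costs one distance computation per stored point, which is presumably why the projection-based and hash-based searchers exist as faster approximate alternatives.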
[jira] [Created] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner
Sebastian Schelter created MAHOUT-1235: -- Summary: ParallelALSFactorizationJob does not use VectorSumCombiner Key: MAHOUT-1235 URL: https://issues.apache.org/jira/browse/MAHOUT-1235 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Reporter: Sebastian Schelter Assignee: Sebastian Schelter Priority: Trivial
[jira] [Updated] (MAHOUT-1162) Adding BallKMeans and StreamingKMeans classes
[ https://issues.apache.org/jira/browse/MAHOUT-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1162: Fix Version/s: 0.8 Adding BallKMeans and StreamingKMeans classes - Key: MAHOUT-1162 URL: https://issues.apache.org/jira/browse/MAHOUT-1162 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 0.8 Reporter: Dan Filimon Fix For: 0.8 Attachments: MAHOUT_1162_with_test.patch
Adding BallKMeans and StreamingKMeans clustering algorithms. These both implement IterableCentroid and thus return the resulting centroids after clustering.
BallKMeans implements:
- kmeans++ initialization;
- a normal k-means pass;
- a trimming threshold so that points that are too far from the cluster they were assigned to are not used in the new centroid computation.
StreamingKMeans implements [http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf]:
- an online clustering algorithm that takes each point into account one by one
- for each point, it computes the distance to the nearest existing cluster
- if the distance is greater than a set distanceCutoff, it will create a new cluster, otherwise it might be added to the cluster it's closest to (proportional to the value of the distance / distanceCutoff)
- if there are too many clusters, the clusters will be *collapsed* (the same method gets called, but the number of clusters is re-adjusted)
- finally, *about as many* clusters as requested are returned (not precise!); this represents a sketch of the original points.
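The StreamingKMeans steps listed above can be sketched roughly as follows. This is a one-dimensional toy under stated assumptions, not the code in the patch: the Centroid1D class, the method names, the factor 1.5 used to grow the cutoff, and the exact rule "open a new cluster with probability distance / distanceCutoff" are illustrative guesses at the behaviour described.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// One-dimensional toy version of the streaming step described above.
// Names and constants are illustrative assumptions, not Mahout's API.
final class StreamingKMeansSketch {

  static final class Centroid1D {
    double mean;
    double weight;
    Centroid1D(double mean, double weight) { this.mean = mean; this.weight = weight; }
  }

  // Adds one weighted point: either opens a new cluster or merges into the nearest one.
  static void add(List<Centroid1D> clusters, double x, double w, double cutoff, Random rnd) {
    Centroid1D nearest = null;
    double best = Double.POSITIVE_INFINITY;
    for (Centroid1D c : clusters) {
      double d = Math.abs(c.mean - x);
      if (d < best) { best = d; nearest = c; }
    }
    // A far point always opens a cluster; a near point does so with probability best / cutoff.
    if (nearest == null || best > cutoff || rnd.nextDouble() < best / cutoff) {
      clusters.add(new Centroid1D(x, w));
    } else {
      nearest.mean = (nearest.mean * nearest.weight + x * w) / (nearest.weight + w);
      nearest.weight += w;
    }
  }

  // Streams over the points once; collapses the sketch whenever it exceeds maxClusters.
  static List<Centroid1D> cluster(double[] points, int maxClusters, double cutoff, long seed) {
    if (maxClusters < 1 || !(cutoff > 0.0)) {
      throw new IllegalArgumentException("need maxClusters >= 1 and cutoff > 0");
    }
    Random rnd = new Random(seed);
    List<Centroid1D> clusters = new ArrayList<>();
    for (double x : points) {
      add(clusters, x, 1.0, cutoff, rnd);
      while (clusters.size() > maxClusters) {
        cutoff *= 1.5; // loosen the cutoff, then re-cluster the centroids themselves
        List<Centroid1D> collapsed = new ArrayList<>();
        for (Centroid1D c : clusters) {
          add(collapsed, c.mean, c.weight, cutoff, rnd);
        }
        clusters = collapsed;
      }
    }
    return clusters;
  }
}
```

The total weight of the returned centroids equals the number of input points, which is what lets a later pass (BallKMeans in the description) treat them as a weighted summary of the data.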
[jira] [Commented] (MAHOUT-1126) Mac builds won't unjar
[ https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672126#comment-13672126 ] Grant Ingersoll commented on MAHOUT-1126: - When I build the examples job jar, I don't see a META-INF/LICENSES directory anymore. There is a /META-INF/LICENSE file. There is also a /licenses directory, but it is not in /META-INF. Mac builds won't unjar -- Key: MAHOUT-1126 URL: https://issues.apache.org/jira/browse/MAHOUT-1126 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.8 Environment: Builds on the Mac Reporter: Pat Ferrel Labels: build Fix For: 0.8
On the Mac you have to remove the licenses in the mahout jar or hadoop can't unjar mahout. The Mac has a case insensitive file system and so can't tell the difference between LICENSE and license. This was fixed at one point: https://issues.apache.org/jira/browse/MAHOUT-780
zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/license/
zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/LICENSE/
Looks like, as is mentioned in https://issues.apache.org/jira/browse/MAHOUT-780,
mv target/maven-shared-archive-resources/META-INF/LICENSE target/maven-shared-archive-resources/META-INF/LICENSES
works too. Can this get a permanent fix?
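The underlying problem is that two archive paths that differ only by case cannot both be unpacked on a case-insensitive file system. A small check along these lines could flag such a jar before it reaches Hadoop. This is a sketch, not part of the Mahout build: the class and method names are hypothetical, and it works on a plain list of entry names rather than opening a jar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: finds archive paths that would clash on a case-insensitive file system.
final class CaseCollisionSketch {

  // Returns groups of paths (entries and their parent directories) that differ only by case.
  static List<List<String>> collisions(Collection<String> entryNames) {
    Map<String, TreeSet<String>> byLowerCase = new TreeMap<>();
    for (String name : entryNames) {
      String path = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
      // Register the entry itself and every parent directory it implies.
      for (int end = path.indexOf('/'); ; end = path.indexOf('/', end + 1)) {
        String prefix = end < 0 ? path : path.substring(0, end);
        byLowerCase.computeIfAbsent(prefix.toLowerCase(Locale.ROOT), k -> new TreeSet<>()).add(prefix);
        if (end < 0) {
          break;
        }
      }
    }
    List<List<String>> result = new ArrayList<>();
    for (TreeSet<String> group : byLowerCase.values()) {
      if (group.size() > 1) {
        result.add(new ArrayList<>(group));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Entry names invented for illustration.
    System.out.println(collisions(Arrays.asList(
        "META-INF/LICENSE", "META-INF/license/", "META-INF/license/foo.txt", "org/A.class")));
  }
}
```

To run it against a real jar, the entry names could be collected from java.util.jar.JarFile.entries() and passed in unchanged.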
[jira] [Updated] (MAHOUT-1132) fpgrowth2 crash when have not unique items in one line
[ https://issues.apache.org/jira/browse/MAHOUT-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1132: Fix Version/s: Backlog fpgrowth2 crash when have not unique items in one line -- Key: MAHOUT-1132 URL: https://issues.apache.org/jira/browse/MAHOUT-1132 Project: Mahout Issue Type: Bug Reporter: Kirill A. Korinskiy Fix For: Backlog Attachments: MAHOUT-1132.patch
I created the following file as input for fpgrowth2:
0, 0, 0
0, 0, 0
0, 0, 0
and when I run ./bin/mahout -i kv -o output -2 --method mapreduce I get a crash:
java.lang.IllegalStateException: mismatched counts for targetAttr=0, (3 != 9); thisTree=[FPTree -{attr:-1, cnt:0}-1--{attr:0, cnt:3} ]
 at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPTree.createMoreFreqConditionalTree(FPTree.java:259)
 at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.growth(FPGrowthIds.java:238)
 at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.fpGrowth(FPGrowthIds.java:163)
 at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.generateTopKFrequentPatterns(FPGrowthIds.java:220)
 at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.generateTopKFrequentPatterns(FPGrowthIds.java:115)
 at org.apache.mahout.fpm.pfpgrowth.ParallelFPGrowthReducer.reduce(ParallelFPGrowthReducer.java:99)
 at org.apache.mahout.fpm.pfpgrowth.ParallelFPGrowthReducer.reduce(ParallelFPGrowthReducer.java:48)
 at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
The attached patch fixes it.
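The "3 != 9" in the exception is at least consistent with one count treating each line as a single occurrence of item 0 while another counts every repetition. The sketch below shows that difference and one possible way to avoid it, namely de-duplicating the items of a transaction before counting. It is an assumption about the cause, not the content of MAHOUT-1132.patch, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of per-transaction de-duplication; not the attached patch.
final class TransactionDedupSketch {

  // Splits a comma separated transaction line; optionally keeps each item only once.
  static List<String> items(String line, boolean dedup) {
    List<String> all = new ArrayList<>();
    for (String token : line.split(",")) {
      String item = token.trim();
      if (!item.isEmpty()) {
        all.add(item);
      }
    }
    return dedup ? new ArrayList<>(new LinkedHashSet<>(all)) : all;
  }

  // Counts how often each item occurs over all transactions.
  static Map<String, Integer> support(List<String> lines, boolean dedup) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      for (String item : items(line, dedup)) {
        counts.merge(item, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("0, 0, 0", "0, 0, 0", "0, 0, 0");
    System.out.println(support(lines, true));  // item 0 counted once per line
    System.out.println(support(lines, false)); // item 0 counted once per occurrence
  }
}
```

With the three-line input from the report, the de-duplicated count for item 0 is 3 and the raw count is 9, matching the two numbers in the error message.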
[jira] [Commented] (MAHOUT-684) Topics regularization for LDA
[ https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672128#comment-13672128 ] Grant Ingersoll commented on MAHOUT-684: Any update on this? Topics regularization for LDA - Key: MAHOUT-684 URL: https://issues.apache.org/jira/browse/MAHOUT-684 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Vasil Vasilev Priority: Minor Labels: LDA. Attachments: MAHOUT-684.patch, MAHOUT-684.patch, MAHOUT-684.patch Implementation provided for the alpha parameters estimation as described in the paper of Blei, Ng and Jordan (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf). Remark: there is a mistake in the last formula in A.4.2 (the signs are wrong). The correct version is described here: http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).
[jira] [Resolved] (MAHOUT-670) Provide a performance measurement framework for Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-670. Resolution: Won't Fix People who want this can get it off of Github, as there isn't a patch and GH is likely fine for this stuff. Provide a performance measurement framework for Mahout -- Key: MAHOUT-670 URL: https://issues.apache.org/jira/browse/MAHOUT-670 Project: Mahout Issue Type: New Feature Components: Integration Reporter: Oliver B. Fischer Assignee: Grant Ingersoll Priority: Minor Labels: framework, performance, test, testing, testsuite Fix For: Backlog At the moment Mahout lacks a performance test framework. The framework should be able to execute user defined performance tests of distributed and non-distributed algorithms, generate reports, and detect regressions in the performance of Mahout.
[jira] [Commented] (MAHOUT-1126) Mac builds won't unjar
[ https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672129#comment-13672129 ] Pat Ferrel commented on MAHOUT-1126: Right you are, and so the solution has changed to deleting the file, not the directory. Still, it's a post-build step and new people have to figure out the solution over and over. There used to be a special exclude in examples/src/main/assembly/job.xml, shown below, but I don't think that works anymore. Maybe that could be the source of a permanent fix? I'm not a Maven expert. BTW I don't build in examples, but I do use it as an example of how to create a separate build and end up with the same problem because it includes the same deps and license. The problem is obviously not Mahout, but that is the infection vector...
<excludes>
  <exclude>org.apache.hadoop:hadoop-core</exclude>
  <!-- This jar contains a LICENSE file in the combined package. Another JAR includes a licenses/ directory. That's OK except when unpacked on case-insensitive file systems like Mac HFS+. Since this isn't really needed, we just remove it. -->
  <exclude>com.github.stephenc.high-scale-lib:high-scale-lib</exclude>
</excludes>
[jira] [Updated] (MAHOUT-775) L2 does not work with TrainAdaptiveLogisticRegression
[ https://issues.apache.org/jira/browse/MAHOUT-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-775: --- Fix Version/s: 0.8 L2 does not work with TrainAdaptiveLogisticRegression - Key: MAHOUT-775 URL: https://issues.apache.org/jira/browse/MAHOUT-775 Project: Mahout Issue Type: Bug Components: Classification Affects Versions: 0.6 Reporter: XiaoboGu Fix For: 0.8 Attachments: MAHOUT-775.patch I have posted the problem to the dev list; see the following message: http://mail-archives.apache.org/mod_mbox/mahout-dev/201106.mbox/%3cbanlktik6153pjgcfnayuprwbv9jzcxp...@mail.gmail.com%3e
[jira] [Updated] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner
[ https://issues.apache.org/jira/browse/MAHOUT-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1235: --- Fix Version/s: 0.8 ParallelALSFactorizationJob does not use VectorSumCombiner -- Key: MAHOUT-1235 URL: https://issues.apache.org/jira/browse/MAHOUT-1235 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Reporter: Sebastian Schelter Assignee: Sebastian Schelter Priority: Trivial Fix For: 0.8
[jira] [Resolved] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner
[ https://issues.apache.org/jira/browse/MAHOUT-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1235. Resolution: Fixed
[jira] [Commented] (MAHOUT-804) Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles and search behaviours
[ https://issues.apache.org/jira/browse/MAHOUT-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672132#comment-13672132 ] Grant Ingersoll commented on MAHOUT-804: Not sure what to do, perhaps we should move to the ASF CMS? Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles and search behaviours -- Key: MAHOUT-804 URL: https://issues.apache.org/jira/browse/MAHOUT-804 Project: Mahout Issue Type: Improvement Components: Website Reporter: Dan Brickley Labels: atlassian, confluence, wiki There are two styles of URL in circulation for URLs into Mahout's Wiki (presumably an Apache-wide configuration issue): https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html vs https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition They appear to be the self-same Confluence 3.4.9 installation (or its raw filetree). Each has a different search box at the top of the page. The version with 'confluence/' in the path does a Confluence search, and returns similar URLs as results. The one with '.html' suffixes does a domain-constrained Google search. Despite markup canonicalising the Confluence variant, i.e. <link rel="canonical" href="https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition"> appearing in the Confluence pages, it seems the Google search results typically throw people into the other version of the Wiki site. This is all mildly confusing, mildly annoying but overall mostly harmless. It could be having some negative impact on Google rank and suchlike, since incoming links will be split between the two styles. Maybe this could be passed along to the Wiki admins? Which version does the Mahout team consider canonical URLs (for external links etc)?
[jira] [Commented] (MAHOUT-836) On donating my Robust PCA Java code to Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672133#comment-13672133 ] Grant Ingersoll commented on MAHOUT-836: Hi Sujit, This is interesting, do you have a patch? On donating my Robust PCA Java code to Mahout - Key: MAHOUT-836 URL: https://issues.apache.org/jira/browse/MAHOUT-836 Project: Mahout Issue Type: New JIRA Project Components: Classification Environment: Platform independent Reporter: Sujit Nair Labels: newbie Original Estimate: 672h Remaining Estimate: 672h Hi All, I have an implementation of Robust PCA (a.k.a. low rank and sparse decomposition) in Java which I would like to donate to Mahout. I am a MATLAB expert, comfortable with C++ and have just started with Java. I am completely new to Mahout but am very excited to participate and contribute. I have tested my code exhaustively and there do not seem to be any issues. The results are very good but the code definitely needs some optimization. Please let me know if there is interest. Thanks, Sujit
[jira] [Resolved] (MAHOUT-865) Refactor Sequential Clustering algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-865. Resolution: Won't Fix We should open issues for individual instances as desired. Refactor Sequential Clustering algorithms - Key: MAHOUT-865 URL: https://issues.apache.org/jira/browse/MAHOUT-865 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor We have a lot of implementations of sequential clustering algorithms that are kind of treated as an afterthought by sticking them into the *Driver classes. We should pull them out into their own classes with real APIs so that people can use them.
[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies
[ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672143#comment-13672143 ] Ted Dunning commented on MAHOUT-874: Jake, Can you confirm that changing Hadoop to provided solved this for you? I would like to mark this as fixed. Extract Writables into a separate module to allow smaller dependencies -- Key: MAHOUT-874 URL: https://issues.apache.org/jira/browse/MAHOUT-874 Project: Mahout Issue Type: Improvement Reporter: Ted Dunning The theory is that we can have a smaller jar if we only include writable classes and their exact dependencies. I have a prototype, but it has some funky characteristics which I would like to discuss.
[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility
[ https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672144#comment-13672144 ] Ted Dunning commented on MAHOUT-884: Suneel, can you commit this if you think it is good? Matrix Concatenate utility -- Key: MAHOUT-884 URL: https://issues.apache.org/jira/browse/MAHOUT-884 Project: Mahout Issue Type: New Feature Components: Integration Reporter: Lance Norskog Priority: Minor Fix For: 0.8 Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch Utility to concatenate matrices stored as SequenceFiles of vectors. Each pair in the SequenceFile is the IntWritable row number and a VectorWritable. The input and output files may skip rows.
[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility
[ https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672145#comment-13672145 ] Sebastian Schelter commented on MAHOUT-884: --- Regarding the patch: please make sure to always close readers in finally blocks and don't throw an InterruptedException if the job fails.
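Both review points can be shown in a small, Hadoop-free sketch. The reader here is a plain java.io.Reader rather than a SequenceFile reader, the job is modelled as a BooleanSupplier, and the choice of IllegalStateException for a failed job is an assumption about what a reasonable replacement might look like. None of this is code from the MAHOUT-884 patch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.BooleanSupplier;

// Hypothetical illustration of the two review comments; not code from the patch.
final class CloseInFinallySketch {

  // Counts lines and closes the reader in a finally block, whether or not reading fails.
  static int countLines(Reader source) throws IOException {
    BufferedReader reader = new BufferedReader(source);
    try {
      int count = 0;
      while (reader.readLine() != null) {
        count++;
      }
      return count;
    } finally {
      reader.close(); // runs on normal return and when an exception propagates
    }
  }

  // Signals job failure with an exception that describes the failure,
  // instead of InterruptedException, which means the thread was interrupted.
  static void runOrFail(BooleanSupplier job) {
    if (!job.getAsBoolean()) {
      throw new IllegalStateException("Job failed");
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countLines(new StringReader("row 0\nrow 1\nrow 2")));
    runOrFail(() -> true);
  }
}
```

On Java 7 and later, try-with-resources gives the same guarantee as the explicit finally block with less code.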
[jira] [Commented] (MAHOUT-1206) Add density-based clustering algorithms to mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672147#comment-13672147 ] Yexi Jiang commented on MAHOUT-1206: Still no comments? Add density-based clustering algorithms to mahout - Key: MAHOUT-1206 URL: https://issues.apache.org/jira/browse/MAHOUT-1206 Project: Mahout Issue Type: Improvement Reporter: Yexi Jiang Labels: clustering The existing clustering algorithms (kmeans, fuzzy kmeans, dirichlet clustering, and spectral clustering) cluster data by assuming that the data can be grouped into regular hyperspheres or ellipsoids. However, in practice, not all data can be clustered in this way. To enable data to be clustered into arbitrary shapes, clustering algorithms like DBSCAN, BIRCH, and CLARANS (http://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering) have been proposed. It would be good to implement one or more of these clustering algorithms to enrich the clustering library.
[jira] [Resolved] (MAHOUT-942) Improve the way to process the missing value for DF.
[ https://issues.apache.org/jira/browse/MAHOUT-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-942. Resolution: Later Please reopen when you have a patch. Improve the way to process the missing value for DF. Key: MAHOUT-942 URL: https://issues.apache.org/jira/browse/MAHOUT-942 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Ikumasa Mukai Labels: DecisionForest If we process data that contains a missing value (?), the tree cannot be created because DataConverter.convert inserts a null value into the list of Instances. Of course we can fix this issue by preventing DataConverter.convert from inserting the null value, but I noticed that the rows which have a missing value (?) could potentially also be used to make the tree. We can use them for making all stems on the edge where we use the missing value.
[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies
[ https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672149#comment-13672149 ] Jake Mannix commented on MAHOUT-874: So marking hadoop as provided is nice, and a smaller jar is great, but as I mentioned above, the size was never my primary concern; it was the dependency graph. It's really nice that mahout-math is a nice little non-hadoop-depending package which just does stats, linear algebra, and ML and doesn't have to think about hadoop stuff, even at compile time. -core is big, because it's what mahout is. What I have been wanting is something a little in between, that depends on hadoop (but with provided scope) and mahout-math, but has the writables so that someone can work with mahout data inputs/outputs without actually linking to -core. Essentially, it's the distinction between a mahout-api and a mahout-impl package. Since our API is file-format, the mahout-api module is really just the set of writables needed to be able to marshal/unmarshal our binary data.
[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility
[ https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672150#comment-13672150 ] Suneel Marthi commented on MAHOUT-884: -- Agree with Sebastian. I can work on this later today.
[jira] [Commented] (MAHOUT-950) Change BtJob to use new MultipleOutputs API
[ https://issues.apache.org/jira/browse/MAHOUT-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672151#comment-13672151 ] Grant Ingersoll commented on MAHOUT-950: I think we still need to support 1.0.X, so I'm not sure how to handle this. Change BtJob to use new MultipleOutputs API --- Key: MAHOUT-950 URL: https://issues.apache.org/jira/browse/MAHOUT-950 Project: Mahout Issue Type: Improvement Components: Math Reporter: Tom White Attachments: MAHOUT-950.patch BtJob uses a mixture of the old and new MapReduce API to allow it to use MultipleOutputs (which isn't available in Hadoop 0.20/1.0). This fails when run against 0.23 (see MAHOUT-822), so we should change BtJob to use the new MultipleOutputs API. (Hopefully the new MultipleOutputs API will be made available in a 1.x release - see MAPREDUCE-3607.)
[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility
[ https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672154#comment-13672154 ] Suneel Marthi commented on MAHOUT-884: -- Also will be adding unit tests as part of committing this patch. Matrix Concatenate utility -- Key: MAHOUT-884 URL: https://issues.apache.org/jira/browse/MAHOUT-884 Project: Mahout Issue Type: New Feature Components: Integration Reporter: Lance Norskog Assignee: Suneel Marthi Priority: Minor Fix For: 0.8 Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch Utility to concatenate matrices stored as SequenceFiles of vectors. Each pair in the SequenceFile is the IntWritable row number and a VectorWritable. The input and output files may skip rows.
[jira] [Updated] (MAHOUT-952) ARFFVectorIterable/MapBackedArffModel doesn't handle question mark '?', other ARFF issues
[ https://issues.apache.org/jira/browse/MAHOUT-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-952: --- Fix Version/s: 0.8 I think we can add this to 0.8. Joe or Stuart, can you update this issue? ARFFVectorIterable/MapBackedArffModel doesn't handle question mark '?', other ARFF issues - Key: MAHOUT-952 URL: https://issues.apache.org/jira/browse/MAHOUT-952 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.6 Environment: Latest SVN on ubuntu Reporter: Stuart Smith Priority: Minor Labels: ARFF Fix For: 0.8 Attachments: MAHOUT-952.patch Whatever is parsing the ARFF file for the ARFFVectorIterable (as far as I can tell, it's the class itself) doesn't handle '?' as a marker for an unknown value. See: http://www.cs.waikato.ac.nz/~ml/weka/arff.html I just started looking at Mahout classifiers this week, so I'm not sure how to handle this yet. If I figure it out, I'll post a patch, but until then, guidance would be helpful!
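As a rough illustration of the problem described in MAHOUT-952, here is a minimal sketch of one way a parser could treat the ARFF '?' marker. The class ArffMissingValueSketch and its methods are hypothetical and are not Mahout code; mapping a missing numeric value to NaN is an assumption made for this example, not necessarily what a patch for this issue would do.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: shows one way to treat the ARFF
// missing-value marker '?' when parsing a dense row of numeric attributes.
final class ArffMissingValueSketch {

  // Assumption for this sketch: a missing numeric value is represented as NaN.
  static double parseNumeric(String token) {
    String trimmed = token.trim();
    if ("?".equals(trimmed)) {
      return Double.NaN;
    }
    return Double.parseDouble(trimmed);
  }

  // Parses one comma-separated dense data row into an array of doubles.
  static double[] parseDenseRow(String line) {
    String[] tokens = line.split(",");
    double[] values = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      values[i] = parseNumeric(tokens[i]);
    }
    return values;
  }

  public static void main(String[] args) {
    // The second attribute is unknown in this row.
    System.out.println(Arrays.toString(parseDenseRow("1.5, ?, 3")));
  }
}
```

Whether NaN, a sentinel value, or simply skipping the entry is the right choice depends on how the downstream classifier handles missing data, which the issue leaves open.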
[jira] [Updated] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-953: --- Fix Version/s: 0.8 ArffVectorIterable does not gracefully handle duplicate attribute name -- Key: MAHOUT-953 URL: https://issues.apache.org/jira/browse/MAHOUT-953 Project: Mahout Issue Type: Improvement Components: Integration Affects Versions: 0.6 Reporter: Stuart Smith Priority: Trivial Fix For: 0.8 If you have duplicate attribute names in your ARFF file, and you have non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size of your attribute labels (duplicates removed), but your ARFF vectors could have more values (if they reference the attribute at both indexes). This is a somewhat pathological ARFF file. Not sure if I should note the error (throw an exception) in computeNext() when it's out of bounds, or when someone tries to add a duplicate label to the MapBackedArffModel. My first impulse would be to check in computeNext(), but addLabel() in MapBackedArffModel will do something rather pathological in the case of duplicate attributes: it overwrites the label map with the new index, but the idxLabel map will hold a mapping from both indexes to the attribute name, so the two maps are out of sync. So it may be best to disallow duplicate attribute names altogether (with an IllegalArgumentException). For example: @attribute my_attribute NUMERIC @attribute my_attribute NUMERIC addLabel() addLabel() labelBindings - ('my_attribute', 1) idxLabel - (0, 'my_attribute'), (1, 'my_attribute') I'll happily submit a patch, just wondering if it should be in computeNext() or addLabel()
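The addLabel() option discussed in MAHOUT-953 can be sketched roughly as follows. This is a simplified stand-in, not the actual MapBackedArffModel: the class LabelBindingsSketch, its fields and its accessor methods are illustrative names, and the two maps only mimic the labelBindings and idxLabel maps mentioned in the report.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the two maps described in the issue; names are
// illustrative and do not reproduce Mahout's MapBackedArffModel.
final class LabelBindingsSketch {
  private final Map<String, Integer> labelBindings = new HashMap<>();
  private final Map<Integer, String> idxLabel = new HashMap<>();

  // Rejects a duplicate attribute name up front, so the name-to-index and
  // index-to-name maps can never disagree.
  void addLabel(String label, int idx) {
    if (labelBindings.containsKey(label)) {
      throw new IllegalArgumentException("Duplicate attribute name: " + label);
    }
    labelBindings.put(label, idx);
    idxLabel.put(idx, label);
  }

  int labelCount() {
    return labelBindings.size();
  }

  int indexCount() {
    return idxLabel.size();
  }

  public static void main(String[] args) {
    LabelBindingsSketch model = new LabelBindingsSketch();
    model.addLabel("my_attribute", 0);
    try {
      model.addLabel("my_attribute", 1); // second @attribute with the same name
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Failing in addLabel() reports the problem while the ARFF header is being read, rather than later as an out-of-bounds error in computeNext(); which of the two the project prefers is the open question in the issue.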
[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672158#comment-13672158 ] Grant Ingersoll commented on MAHOUT-953: Stuart, any chance you can get a patch for this to add in 0.8? ArffVectorIterable does not gracefully handle duplicate attribute name -- Key: MAHOUT-953 URL: https://issues.apache.org/jira/browse/MAHOUT-953 Project: Mahout Issue Type: Improvement Components: Integration Affects Versions: 0.6 Reporter: Stuart Smith Priority: Trivial If you have duplicate attribute names in your ARFF file, and you have non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size of your attribute labels (duplicates removed), but your ARFF vectors could have more values (if they reference the attribute at both indexes). This is a somewhat pathological ARFF file. Not sure if I should note the error (throw an exception) in computeNext() when it's out of bounds, or when someone tries to add a duplicate label to the MapBackedArffModel. My first impulse would be to check in computeNext(), but addLabel() in MapBackedArffModel will do something rather pathological in the case of duplicate attributes: it overwrites the label map with the new index, but the idxLabel map will hold a mapping from both indexes to the attribute name, so the two maps are out of sync. So it may be best to disallow duplicate attribute names altogether (with an IllegalArgumentException). For example: @attribute my_attribute NUMERIC @attribute my_attribute NUMERIC addLabel() addLabel() labelBindings - ('my_attribute', 1) idxLabel - (0, 'my_attribute'), (1, 'my_attribute') I'll happily submit a patch, just wondering if it should be in computeNext() or addLabel()
[jira] [Commented] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor
[ https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672161#comment-13672161 ] Grant Ingersoll commented on MAHOUT-966: Any update on this? Seems like it should be fixed for 0.8 Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor -- Key: MAHOUT-966 URL: https://issues.apache.org/jira/browse/MAHOUT-966 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.6 Environment: hadoop 0.20.2 mahout 0.6 Reporter: Gaurav Redkar Priority: Minor Attachments: cluster-dumper-output.txt, clusterpp-output.txt, mtestdata.txt, points100dCCNorm.txt After running the post processor, the number of points that each cluster contains does not match the number of points each cluster should contain as stated by clusterdumper. MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...} MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..} The n mentioned in clusters-n-final against each cluster is different from the number of points actually contained in the directory for each cluster. Any idea why this is happening?
[jira] [Updated] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor
[ https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-966: --- Fix Version/s: 0.8 Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor -- Key: MAHOUT-966 URL: https://issues.apache.org/jira/browse/MAHOUT-966 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.6 Environment: hadoop 0.20.2 mahout 0.6 Reporter: Gaurav Redkar Priority: Minor Fix For: 0.8 Attachments: cluster-dumper-output.txt, clusterpp-output.txt, mtestdata.txt, points100dCCNorm.txt After running the post processor, the number of points that each cluster contains does not match the number of points each cluster should contain as stated by clusterdumper. MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...} MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..} The n mentioned in clusters-n-final against each cluster is different from the number of points actually contained in the directory for each cluster. Any idea why this is happening?
[jira] [Updated] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId
[ https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-974: -- Affects Version/s: (was: 0.6) 0.8 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId --- Key: MAHOUT-974 URL: https://issues.apache.org/jira/browse/MAHOUT-974 Project: Mahout Issue Type: Wish Components: Collaborative Filtering Affects Versions: 0.8 Reporter: Han Hui Wen Assignee: Sebastian Schelter Labels: CF,recommendation,als Original Estimate: 2h Remaining Estimate: 2h org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses integer as userId and itemId, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and itemId. It's best that ParallelALSFactorizationJob also uses Long as userId and itemId, so that the same dataset can use all the recommendation arithmetic.
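To illustrate why the ID width in MAHOUT-974 matters, here is a small sketch of what can go wrong when a dataset keyed by long IDs is fed to code that expects int IDs. The class IdWidthSketch is hypothetical and does not show how Mahout itself converts or maps IDs; it only demonstrates plain Java narrowing behaviour.

```java
// Hypothetical illustration, not Mahout code: narrowing long IDs to int.
final class IdWidthSketch {

  // A plain cast keeps only the low 32 bits, so distinct long IDs can collide.
  static int toIntId(long id) {
    return (int) id;
  }

  // Math.toIntExact (Java 8+) throws ArithmeticException instead of
  // silently truncating when the value does not fit in an int.
  static int toIntIdChecked(long id) {
    return Math.toIntExact(id);
  }

  public static void main(String[] args) {
    long first = 5L;
    long second = 5L + (1L << 32); // differs from 'first' only above bit 31
    System.out.println(toIntId(first) == toIntId(second)); // prints "true"
  }
}
```

A dataset that works with the long-based jobs would therefore need either an explicit ID mapping step or a long-based ALS job, which is what the issue requests.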
[jira] [Commented] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId
[ https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672163#comment-13672163 ] Sebastian Schelter commented on MAHOUT-974: --- Saikat, are you still on this? org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId --- Key: MAHOUT-974 URL: https://issues.apache.org/jira/browse/MAHOUT-974 Project: Mahout Issue Type: Wish Components: Collaborative Filtering Affects Versions: 0.8 Reporter: Han Hui Wen Assignee: Sebastian Schelter Labels: CF,recommendation,als Original Estimate: 2h Remaining Estimate: 2h org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses integer as userId and itemId, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and itemId. It's best that ParallelALSFactorizationJob also uses Long as userId and itemId, so that the same dataset can use all the recommendation arithmetic.
[jira] [Resolved] (MAHOUT-978) spectralkmeans utility fails when input filename begins with leading underscore
[ https://issues.apache.org/jira/browse/MAHOUT-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-978. Resolution: Won't Fix I'd say, won't fix, as there is a workaround. Please re-open if there is a specific patch. spectralkmeans utility fails when input filename begins with leading underscore --- Key: MAHOUT-978 URL: https://issues.apache.org/jira/browse/MAHOUT-978 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.6 Environment: Tested on a real Linux-based cluster running Hadoop 0.20.2-cdh3u2 and the 0.6 release; also OSX pseudo cluster running Hadoop 0.20.203.0 running 16 Feb trunk build. Reporter: Dan Brickley Priority: Minor Attachments: jira-underscore-spectral-log.txt The commandline 'bin/mahout spectralkmeans' utility fails with NoSuchElementException after Loading vector from: spectral/output/results2/calculations/diagonal/part-r-0 when input data in hdfs has filename beginning with a leading underscore. This was partially reported in comments for MAHOUT-524 but I believe identified now as a distinct issue (thanks to Shannon for help diagnosing). I have not investigated if there is an equivalent problem for API-based use of this piece of Mahout. Steps to reproduce: 1. put affinity file into hdfs, following https://cwiki.apache.org/MAHOUT/spectral-clustering.html - note that node IDs count from zero etc. Name your file with a leading underscore. For example, try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in spectral/input/_topic_skm.csv (I'll leave that example input file in place unchanged for others to try. It is built from dbpedia data, encoding associations from Wikipedia pages to categories. Whether it is a good use of spectral clustering I'm not sure, but I'd at least hope the job would run to completion.) 2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o spectral/output/results1' 3. 
Wait for it to fail just after printing Loading vector from: spectral/output/results1/calculations/diagonal/part-r-0, with java.util.NoSuchElementException at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152). 4. Rename the file in hdfs to eliminate the leading underscore. Re-run the command (give a different results dir or cleanup from the first run, to avoid mixing the tests). This attempt should succeed and you'll see it proceed deeper into the job, i.e. something like 12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: spectral/output/results2/calculations/diagonal/part-r-0 12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1 12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005 12/02/19 14:38:45 INFO mapred.JobClient: map 0% reduce 0% 12/02/19 14:39:31 INFO mapred.JobClient: map 1% reduce 0% (5. You might get a memory-based failure some time later; that is a separate problem.) I'll attach a more detailed transcript. I've made no attempt to diagnose internals yet, but did make some other tests and can confirm that it does not seem to matter whether the commandline invocation names the file explicitly, or by directory name only. Also trailing slash does not seem to be an issue. Finally, a related 'gotcha': make sure the results directory is not inside the input directory when testing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
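The report does not diagnose the cause, so the following is only one plausible explanation: Hadoop input handling conventionally treats file names beginning with '_' or '.' as hidden and skips them, which would leave a job with no input records. The sketch below mimics that convention in plain Java; HiddenFileFilterSketch and its methods are hypothetical names, and whether this filter is what actually triggers the NoSuchElementException in spectralkmeans is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the "hidden file" naming convention; this is
// plain Java and does not call Hadoop or Mahout.
final class HiddenFileFilterSketch {

  // Names starting with '_' or '.' are treated as hidden and skipped.
  static boolean isVisible(String fileName) {
    return !fileName.startsWith("_") && !fileName.startsWith(".");
  }

  // Returns only the names that such a filter would pass on as job input.
  static List<String> visibleInputs(List<String> fileNames) {
    List<String> visible = new ArrayList<>();
    for (String name : fileNames) {
      if (isVisible(name)) {
        visible.add(name);
      }
    }
    return visible;
  }

  public static void main(String[] args) {
    // The underscore-prefixed file from the report is filtered out.
    System.out.println(visibleInputs(Arrays.asList("_topic_skm.csv", "topic_skm.csv")));
  }
}
```

If this is indeed the cause, it would also explain why renaming the file is a sufficient workaround, as step 4 of the report observes.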
[jira] [Updated] (MAHOUT-992) Audit DistributedCache use to support EMR
[ https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-992: --- Fix Version/s: 0.8 Audit DistributedCache use to support EMR - Key: MAHOUT-992 URL: https://issues.apache.org/jira/browse/MAHOUT-992 Project: Mahout Issue Type: Improvement Affects Versions: 0.6 Reporter: tom pierce Priority: Minor Labels: newbie Fix For: 0.8 Apparently some of our DistributedCache use is not EMR-safe. It would be great if someone could audit our uses of DC, and fix up this problem where it exists. For an example of problematic usage (and the fix), see MAHOUT-980.
[jira] [Resolved] (MAHOUT-1234) Canopy Clustering
[ https://issues.apache.org/jira/browse/MAHOUT-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1234. Resolution: Won't Fix Canopy Clustering - Key: MAHOUT-1234 URL: https://issues.apache.org/jira/browse/MAHOUT-1234 Project: Mahout Issue Type: Question Components: Clustering Reporter: Sameer Sebastian Hello, I'm trying out Canopy clustering. I want to know, how to determine the optimum value for the distance thresholds t1 and t2. Thanks.
[jira] [Resolved] (MAHOUT-1025) Update documentation for LDA before the release.
[ https://issues.apache.org/jira/browse/MAHOUT-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1025. - Resolution: Fixed Update documentation for LDA before the release. Key: MAHOUT-1025 URL: https://issues.apache.org/jira/browse/MAHOUT-1025 Project: Mahout Issue Type: Task Affects Versions: 0.7 Reporter: Robin Anil Assignee: Jake Mannix Fix For: 0.8
[jira] [Updated] (MAHOUT-1231) No input clusters found in error in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1231: --- Affects Version/s: (was: 0.8) (was: 0.7) Backlog No input clusters found in error in kmeans - Key: MAHOUT-1231 URL: https://issues.apache.org/jira/browse/MAHOUT-1231 Project: Mahout Issue Type: Question Components: Clustering Affects Versions: Backlog Reporter: Summer Lee 1.seqdirectory mahout seqdirectory --input /user/hdfs/input/new1.csv --output /user/hdfs/new1/seqdirectory --tempDir /user/hdfs/new1/seqdirectory/tempDir 2.seq2sparse mahout seq2sparse --input /user/hdfs/new1/seqdirectory --output /user/hdfs/new1/seq2sparse -wt tfidf 3.kmeans mahout kmeans --input /user/hdfs/new1/seq2sparse/tfidf-vectors --output /user/hdfs/new1/kmeans -c /user/hdfs/new1/clusters/kmeans -x 3 -k 3 --tempDir /user/hdfs/new1/kmeans/tempDir and then error is occured Failing Oozie Launcher, Main class [org.apache.mahout.driver.MahoutDriver], main() threw exception, No input clusters found in /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c argument. java.lang.IllegalStateException: No input clusters found in /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c argument. 
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217) at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148) at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:467) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149) at org.apache.hadoop.mapred.Child.main(Child.java:249) Oozie Launcher failed, finishing Hadoop job gracefully Oozie Launcher ends === Why kmeans driver can't make clusters in Hadoop with oozie system? In hadoop with not oozie system, it worked. 
[jira] [Resolved] (MAHOUT-1041) Support for PMML
[ https://issues.apache.org/jira/browse/MAHOUT-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1041. - Resolution: Won't Fix Without a patch, I don't see putting this in. Also, I don't see the benefit of storing largish models in XML. I could see a specific issue that can do I/O of PMML into Mahout's, but I don't see anything running natively off of PMML. Support for PMML Key: MAHOUT-1041 URL: https://issues.apache.org/jira/browse/MAHOUT-1041 Project: Mahout Issue Type: Improvement Components: Integration Environment: Software Platform Reporter: Duraimurugan Fix For: Backlog Would like to request support for PMML. With that, once the predictive models are built and provided in PMML format, we should be able to import them into a Hadoop cluster for scoring. This way, models built in external (non-Mahout) systems can be imported to Hadoop/Mahout for a scalable environment.
[jira] [Updated] (MAHOUT-1204) Rewrite Benchmarks using Caliper
[ https://issues.apache.org/jira/browse/MAHOUT-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-1204: --- Affects Version/s: 1.0 Rewrite Benchmarks using Caliper Key: MAHOUT-1204 URL: https://issues.apache.org/jira/browse/MAHOUT-1204 Project: Mahout Issue Type: Improvement Affects Versions: 1.0 Reporter: Robin Anil Assignee: Robin Anil https://code.google.com/p/caliper/
[jira] [Resolved] (MAHOUT-1045) Cluster evaluators returning bad results
[ https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1045. - Resolution: Fixed Looks in and passing Cluster evaluators returning bad results Key: MAHOUT-1045 URL: https://issues.apache.org/jira/browse/MAHOUT-1045 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.6, 0.7, 0.8 Environment: Several environments and data sets Reporter: Pat Ferrel Fix For: 0.8 Attachments: first-time-density-nan.txt, MAHOUT-1045.patch, MAHOUT-1045.patch, MAHOUT-1045.patch, MAHOUT-1045.patch With real world crawl data the Intra-cluster density from ClusterEvaluator is almost always NaN. The CDbw inter-cluster density is almost always 0. I have also seen several cases where CDbw fails to return any results but have not tracked down why yet. I have sent a link to an 8G data set that reproduces these errors to Jeff Eastman.
[jira] [Commented] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId
[ https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672172#comment-13672172 ] Saikat Kanjilal commented on MAHOUT-974: Yes, although I could use some general guidance, being a newbie on this codebase. I've not had time to research this further; can you respond to my comments above? Thanks org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId --- Key: MAHOUT-974 URL: https://issues.apache.org/jira/browse/MAHOUT-974 Project: Mahout Issue Type: Wish Components: Collaborative Filtering Affects Versions: 0.8 Reporter: Han Hui Wen Assignee: Sebastian Schelter Labels: CF,recommendation,als Original Estimate: 2h Remaining Estimate: 2h org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses integer as userId and itemId, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and itemId. It's best that ParallelALSFactorizationJob also uses Long as userId and itemId, so that the same dataset can use all the recommendation arithmetic.
[jira] [Resolved] (MAHOUT-1053) Use KMeans++ for cluster Initialization
[ https://issues.apache.org/jira/browse/MAHOUT-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1053. - Resolution: Fixed This is resolved by the new streaming k-means stuff. Use KMeans++ for cluster Initialization --- Key: MAHOUT-1053 URL: https://issues.apache.org/jira/browse/MAHOUT-1053 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Paritosh Ranjan Fix For: 0.8 Use KMeans++ for cluster initialization. Ted has already implemented a similar version. http://github.com/tdunning/knn
[jira] [Resolved] (MAHOUT-1054) Use ball KMeans for clustering
[ https://issues.apache.org/jira/browse/MAHOUT-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1054. - Resolution: Fixed This is resolved by the new streaming k-means stuff. Use ball KMeans for clustering -- Key: MAHOUT-1054 URL: https://issues.apache.org/jira/browse/MAHOUT-1054 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Paritosh Ranjan Fix For: 0.8 Use ball KMeans for clustering. Ted has already implemented a similar version. http://github.com/tdunning/knn
[jira] [Commented] (MAHOUT-1117) Vectors are not hashable
[ https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672176#comment-13672176 ] Robin Anil commented on MAHOUT-1117: There is no single good way to hash a vector; most methods are heavy, plus there is the additional overhead of caching the hash. If you do want to hash vectors, you can override the hash codes for your specific use cases. This is a design choice we should write down. Vectors are not hashable Key: MAHOUT-1117 URL: https://issues.apache.org/jira/browse/MAHOUT-1117 Project: Mahout Issue Type: Improvement Affects Versions: 1.0 Reporter: Dan Filimon Priority: Minor No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode(). In working on improving clustering in Mahout, Ted Dunning wrote prototype code for Streaming KMeans and Ball KMeans, that I'm working with him on. These need to be used together in the MapReduce version. However, in Ball KMeans, we initialize the clusters using a probabilistic approach similar to k-means++. This however requires a Multinomial<WeightedVector> distribution of the points we want to cluster to pick the centroids. Internally, the Multinomial<T> uses a HashMap to keep track of the values it can sample from. Since Vectors don't override Object's hashCode(), it is possible to get the same value multiple times in the map (as long as the references differ). This is less of an issue because of how we're adding the vectors to the multinomial (we can guarantee that the references will be unique) and once MAHOUT-1116 is resolved the hashing will work okay for our needs. It still seems that it would be useful to have hashable vectors. What do you think? And what would a hash function look like?
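The behaviour described in MAHOUT-1117, and the per-use-case workaround suggested in the comment, can be sketched as follows. PlainVector is a stand-in for a vector class that inherits Object's identity-based hashCode(), and ContentKey is a hypothetical wrapper; neither is a Mahout class, and hashing every element as shown is just one possible choice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not Mahout code.
final class VectorKeySketch {

  // Stand-in for a vector type that does not override hashCode()/equals().
  static final class PlainVector {
    final double[] values;

    PlainVector(double... values) {
      this.values = values.clone();
    }
  }

  // Wrapper key with content-based equality for one specific use case.
  // hashCode() walks every element, so it costs O(n) on each call.
  static final class ContentKey {
    private final double[] values;

    ContentKey(PlainVector vector) {
      this.values = vector.values;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof ContentKey
          && Arrays.equals(values, ((ContentKey) other).values);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(values);
    }
  }

  // Counts map entries when vectors are keyed by reference identity.
  static int distinctByIdentity(PlainVector... vectors) {
    Map<PlainVector, Double> weights = new HashMap<>();
    for (PlainVector v : vectors) {
      weights.put(v, 1.0);
    }
    return weights.size();
  }

  // Counts map entries when vectors are keyed by their contents.
  static int distinctByContent(PlainVector... vectors) {
    Map<ContentKey, Double> weights = new HashMap<>();
    for (PlainVector v : vectors) {
      weights.put(new ContentKey(v), 1.0);
    }
    return weights.size();
  }

  public static void main(String[] args) {
    PlainVector a = new PlainVector(1.0, 2.0);
    PlainVector b = new PlainVector(1.0, 2.0); // same contents, different reference
    System.out.println(distinctByIdentity(a, b)); // prints 2
    System.out.println(distinctByContent(a, b));  // prints 1
  }
}
```

Content-based keys are only safe while the underlying values are not mutated after insertion, which is one reason a general-purpose hashCode() on mutable vectors is a debatable design.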
[jira] [Resolved] (MAHOUT-1117) Vectors are not hashable
[ https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-1117. Resolution: Won't Fix Vectors are not hashable Key: MAHOUT-1117 URL: https://issues.apache.org/jira/browse/MAHOUT-1117 Project: Mahout Issue Type: Improvement Affects Versions: 1.0 Reporter: Dan Filimon Priority: Minor No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode(). In working on improving clustering in Mahout, Ted Dunning wrote prototype code for Streaming KMeans and Ball KMeans, that I'm working with him on. These need to be used together in the MapReduce version. However, in Ball KMeans, we initialize the clusters using a probabilistic approach similar to k-means++. This however requires a Multinomial<WeightedVector> distribution of the points we want to cluster to pick the centroids. Internally, the Multinomial<T> uses a HashMap to keep track of the values it can sample from. Since Vectors don't override Object's hashCode(), it is possible to get the same value multiple times in the map (as long as the references differ). This is less of an issue because of how we're adding the vectors to the multinomial (we can guarantee that the references will be unique) and once MAHOUT-1116 is resolved the hashing will work okay for our needs. It still seems that it would be useful to have hashable vectors. What do you think? And what would a hash function look like?
[jira] [Commented] (MAHOUT-1065) Add CassandraDataModelTest
[ https://issues.apache.org/jira/browse/MAHOUT-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672173#comment-13672173 ] Grant Ingersoll commented on MAHOUT-1065: - [~eduardo.gurgel] [~srowen] any update on this one? In or out for 0.8? Add CassandraDataModelTest -- Key: MAHOUT-1065 URL: https://issues.apache.org/jira/browse/MAHOUT-1065 Project: Mahout Issue Type: Test Components: Collaborative Filtering, Integration Affects Versions: 0.8 Reporter: Eduardo Gurgel Pinho Priority: Minor Labels: cassandra, collaborative-filtering, datamodel, hector, taste, test Attachments: 0001-Add-CassandraDataModelTest.patch The test class for the CassandraDataModel class.
[jira] [Updated] (MAHOUT-1070) DisplayKMeans example has transposed/mislabelled arguments
[ https://issues.apache.org/jira/browse/MAHOUT-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-1070: Fix Version/s: 0.8 DisplayKMeans example has transposed/mislabelled arguments Key: MAHOUT-1070 URL: https://issues.apache.org/jira/browse/MAHOUT-1070 Project: Mahout Issue Type: Bug Components: Examples Affects Versions: 0.7 Reporter: Gabriel Reid Assignee: Paritosh Ranjan Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1070.patch The org.apache.mahout.clustering.display.DisplayKMeans example class uses a value for k (numClusters) and maximum number of iterations to come to convergence, but their use is transposed (i.e. the numClusters is used as max iterations, and max iterations is used for numClusters). Furthermore, a second hard-coded version of the value is used. The end result is that it's not directly possible to experiment with different values of numClusters and maxIterations.
[jira] [Resolved] (MAHOUT-1060) Search for nearest neighbor
[ https://issues.apache.org/jira/browse/MAHOUT-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1060. Resolution: Fixed All of this capability has been added by Dan's streaming k-means clustering work except for the knn stuff. Search for nearest neighbor Key: MAHOUT-1060 URL: https://issues.apache.org/jira/browse/MAHOUT-1060 Project: Mahout Issue Type: Bug Components: Math Reporter: Ted Dunning Fix For: 0.8 Attachments: 0001-MAHOUT-1059-Added-Centroid-WeightedVector-Delegating.patch, 0001-MAHOUT-1059-Added-Centroid-WeightedVector-Delegating.patch, 0002-MAHOUT-1059-Stylistic-cleanups.patch, 0002-MAHOUT-1059-Stylistic-cleanups.patch, 0003-MAHOUT-1059-Add-generic-vector-test.patch, 0003-MAHOUT-1060-Move-distance-measures-to-math-as-much-a.patch, 0004-MAHOUT-1059-Indentation.patch, 0004-MAHOUT-1060-Add-basic-knn-capabilities.patch, 0005-MAHOUT-1059-Abstract-the-idea-of-a-cached-length.patch, 0006-MAHOUT-1059-Additional-test-for-weighted-vectors.patch, 0007-MAHOUT-1060-Move-distance-measures-to-math-as-much-a.patch, 0008-MAHOUT-1060-Add-basic-knn-capabilities.patch, 0009-MAHOUT-1060-shorten-test-sizes.patch This will contain a patch for sequential nearest neighbor search routines that underpin new clustering algorithms.
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672182#comment-13672182 ] Grant Ingersoll commented on MAHOUT-1080: Here's a thought: kill NamedVector, and move the single name string to Vector. It seems to me naming a Vector is very, very common. A possible issue, however, is dealing with older Vectors that don't have a name, but we could just treat it as an empty string. IMO, this should be fixed before 1.0. Kmeans clustered output losses vectorId given in the input Key: MAHOUT-1080 URL: https://issues.apache.org/jira/browse/MAHOUT-1080 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Reporter: Smita Wadhwa Fix For: 0.8 Attachments: kMeansClusterVectorId.diff The input to KMeans is IntWritable and VectorWritable, and the output of clustered points is clusterId and WeightedVectorWritable (vector, distance from the centre). The id of the vector is lost in this processing.
[jira] [Commented] (MAHOUT-1070) DisplayKMeans example has transposed/mislabelled arguments
[ https://issues.apache.org/jira/browse/MAHOUT-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672180#comment-13672180 ] Suneel Marthi commented on MAHOUT-1070: Is someone looking at this patch? I can take a crack at this if there are no takers.
[jira] [Commented] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)
[ https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672183#comment-13672183 ] Suneel Marthi commented on MAHOUT-1052: I can get this patch in for the 0.8 release, but the quality of clusters is still questionable. Nevertheless, this patch is still needed; I can open another JIRA for MinHash clustering itself (based on Broder's paper). Thoughts? Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values) Key: MAHOUT-1052 URL: https://issues.apache.org/jira/browse/MAHOUT-1052 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.6 Reporter: Elena Smirnova Assignee: Suneel Marthi Priority: Minor Labels: minhash Fix For: Backlog Attachments: MAHOUT-1052.patch Add a parameter to MinHash clustering that specifies the dimension of the vector to hash (indexes or values). The current version of MinHash clustering only hashes the values of vectors. Based on discussion on the dev-mahout list, both use cases are possible and frequently met in practice. Preserve backward compatibility with the default dimension set to values. Add new unit tests.
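To make the "indexes or values" distinction in MAHOUT-1052 concrete, here is a minimal sketch of a single min-hash computed over either dimension of a sparse vector. Everything in it is an assumption made for illustration: the class, the Dimension enum, the linear hash function and the truncation of values to integers are hypothetical and are not taken from MinHashDriver or the attached patch.

```java
final class MinHashDimensionSketch {
  // Which part of each non-zero vector element is fed to the hash function.
  enum Dimension { INDEX, VALUE }

  // Illustrative linear hash h(x) = (a * x + b) mod p; not Mahout's hash family.
  static long hash(long x, long a, long b) {
    long p = 2147483647L; // the prime 2^31 - 1
    return Math.floorMod(a * x + b, p);
  }

  // Minimum hash over the non-zero elements of a sparse vector, given as
  // parallel arrays of indices and values.
  static long minHash(int[] indices, double[] values, Dimension dim, long a, long b) {
    long min = Long.MAX_VALUE;
    for (int i = 0; i < indices.length; i++) {
      // Assumption: values are truncated to integers before hashing.
      long element = dim == Dimension.INDEX ? indices[i] : (long) values[i];
      min = Math.min(min, hash(element, a, b));
    }
    return min;
  }

  public static void main(String[] args) {
    int[] indices = {3, 7, 11};
    double[] weightsA = {1.0, 5.0, 2.0};
    double[] weightsB = {9.0, 4.0, 8.0};
    // Same indices, different values: compare the signatures per dimension.
    System.out.println(minHash(indices, weightsA, Dimension.INDEX, 31, 17)
        == minHash(indices, weightsB, Dimension.INDEX, 31, 17));
    System.out.println(minHash(indices, weightsA, Dimension.VALUE, 31, 17)
        == minHash(indices, weightsB, Dimension.VALUE, 31, 17));
  }
}
```

Hashing indexes treats a vector as the set of features it contains, while hashing values treats it as the set of numbers stored in it, so two vectors with the same non-zero positions but different weights collide only in the first mode.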
[jira] [Resolved] (MAHOUT-1070) DisplayKMeans example has transposed/mislabelled arguments
[ https://issues.apache.org/jira/browse/MAHOUT-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-1070. Resolution: Fixed Committed
[jira] [Commented] (MAHOUT-1047) CVB hangs after completion
[ https://issues.apache.org/jira/browse/MAHOUT-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672186#comment-13672186 ] Suneel Marthi commented on MAHOUT-1047: Tested this patch and committing to trunk. CVB hangs after completion Key: MAHOUT-1047 URL: https://issues.apache.org/jira/browse/MAHOUT-1047 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.7 Environment: Ubuntu Reporter: seth boyles Assignee: Suneel Marthi Priority: Minor Labels: cvb, lda Fix For: 0.7, 0.8 Attachments: MAHOUT-1047.patch, MAHOUT-1047-Show-Leak.patch After running the new LDA CVB implementation, it hangs and does not terminate the process like every other time I run Mahout.
Terminal output:
12/07/19 11:38:49 INFO mapred.LocalJobRunner:
12/07/19 11:38:49 INFO mapred.Task: Task 'attempt_local_0022_m_00_0' done.
12/07/19 11:38:49 INFO mapred.JobClient: map 100% reduce 0%
12/07/19 11:38:49 INFO mapred.JobClient: Job complete: job_local_0022
12/07/19 11:38:49 INFO mapred.JobClient: Counters: 8
12/07/19 11:38:49 INFO mapred.JobClient: File Output Format Counters
12/07/19 11:38:49 INFO mapred.JobClient: Bytes Written=2247793
12/07/19 11:38:49 INFO mapred.JobClient: File Input Format Counters
12/07/19 11:38:49 INFO mapred.JobClient: Bytes Read=1920337
12/07/19 11:38:49 INFO mapred.JobClient: FileSystemCounters
12/07/19 11:38:49 INFO mapred.JobClient: FILE_BYTES_READ=1342812616
12/07/19 11:38:49 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1326092302
12/07/19 11:38:49 INFO mapred.JobClient: Map-Reduce Framework
12/07/19 11:38:49 INFO mapred.JobClient: Map input records=2772
12/07/19 11:38:49 INFO mapred.JobClient: Spilled Records=0
12/07/19 11:38:49 INFO mapred.JobClient: SPLIT_RAW_BYTES=140
12/07/19 11:38:49 INFO mapred.JobClient: Map output records=2772
12/07/19 11:38:49 INFO driver.MahoutDriver: Program took 4089950 ms (Minutes: 68.165834)
$MAHOUT_HOME/mahout cvb -i /home/seth/Scripted/mahout_data/vectors/vectors/vectors-for-cvb/ -o /home/seth/Scripted/mahout_data/clusters/ -ow -k 90 -dt /home/seth/Scripted/mahout_data/distributions -dict /home/seth/Scripted/mahout_data/vectors/vectors/dictionary.file-0 -mt /home/seth/Scripted/mahout_data/temp/ -x 20 -cd 0.05
[jira] [Assigned] (MAHOUT-1047) CVB hangs after completion
[ https://issues.apache.org/jira/browse/MAHOUT-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi reassigned MAHOUT-1047: Assignee: Suneel Marthi
[jira] [Updated] (MAHOUT-1206) Add density-based clustering algorithms to mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-1206: Fix Version/s: Backlog Add density-based clustering algorithms to mahout Key: MAHOUT-1206 URL: https://issues.apache.org/jira/browse/MAHOUT-1206 Project: Mahout Issue Type: Improvement Reporter: Yexi Jiang Labels: clustering Fix For: Backlog The clustering algorithms (kmeans, fuzzy kmeans, dirichlet clustering, and spectral clustering) cluster data by assuming that the data can be grouped into regular hyperspheres or ellipsoids. However, in practice, not all data can be clustered in this way. To enable data to be clustered into arbitrary shapes, clustering algorithms like DBSCAN, BIRCH, and CLARANS (http://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering) have been proposed. It would be good to implement one or more of these clustering algorithms to enrich the clustering library.
[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out
[ https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1220: Fix Version/s: (was: 0.7) Affects Version/s: (was: 0.7) Backlog seqdirectory brings empty files out Key: MAHOUT-1220 URL: https://issues.apache.org/jira/browse/MAHOUT-1220 Project: Mahout Issue Type: Bug Affects Versions: Backlog Reporter: Summer Lee Priority: Minor I ran mahout seqdirectory on my input file.
Command:
mahout seqdirectory --input user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output /user/hdfs/mahout_test/output/final3/seqdirectory/
But the result file, chunk-0, contains only this:
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
I heard that chunk-0 files should contain numbers, like:
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
I thought something was wrong with my input file, so I tried other input files, but the results are the same. How can I fix this?
[jira] [Updated] (MAHOUT-1228) Cleanup .gitignore
[ https://issues.apache.org/jira/browse/MAHOUT-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1228: Affects Version/s: (was: 0.7) 0.8 Cleanup .gitignore Key: MAHOUT-1228 URL: https://issues.apache.org/jira/browse/MAHOUT-1228 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.8 Reporter: Stevo Slavic Priority: Trivial Labels: eclipse, git, maven Attachments: mahout-gitignore.patch .gitignore unnecessarily has duplicate entries for ignoring eclipse IDE specific files and directories, as well as Maven build output directory. For distribution module Maven build output directory is not ignored.
[jira] [Updated] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId
[ https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-974: Fix Version/s: 0.8 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId Key: MAHOUT-974 URL: https://issues.apache.org/jira/browse/MAHOUT-974 Project: Mahout Issue Type: Wish Components: Collaborative Filtering Affects Versions: 0.8 Reporter: Han Hui Wen Assignee: Sebastian Schelter Labels: CF,recommendation,als Fix For: 0.8 Original Estimate: 2h Remaining Estimate: 2h org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses integer as userId and itemId, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and itemId. It would be best if ParallelALSFactorizationJob also used Long as userId and itemId, so that the same dataset can be used with all the recommendation algorithms.
[jira] [Updated] (MAHOUT-1228) Cleanup .gitignore
[ https://issues.apache.org/jira/browse/MAHOUT-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1228: Fix Version/s: 0.8
[jira] [Created] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
Ted Dunning created MAHOUT-1236: Summary: Need a cleaned up serialized format for Vectors to handle names and all other kinds of things Key: MAHOUT-1236 URL: https://issues.apache.org/jira/browse/MAHOUT-1236 Project: Mahout Issue Type: Bug Reporter: Ted Dunning Our current serialization is subject to several ills:
a) it breaks alignment by having a 1 byte flag field (evil, generic)
b) it doesn't handle any kind of extensible format like protobufs so it isn't future-proof
c) it doesn't handle named vectors very well
d) it totally breaks with any other kind of decoration as with Centroids or WeightedVector or ... (see b)
I propose that we use the current tag byte on the current serialization with a new flag bit that indicates that the vector will use a protobuf encoding. Then 3 bytes will be skipped to restore alignment. Then there will be a protobuf encoding for the vector.
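The byte layout proposed in MAHOUT-1236 can be sketched roughly as follows. This is an illustration only, not the eventual format: the flag value FLAG_PROTOBUF is hypothetical (the issue does not say which bit would be used), the payload is treated as opaque bytes standing in for a protobuf-encoded vector, and the 4-byte length prefix is an added assumption so that a reader knows where the payload ends.

```java
import java.nio.ByteBuffer;

final class TaggedVectorLayout {
  // Hypothetical flag bit; the issue does not name the actual bit.
  static final int FLAG_PROTOBUF = 0x10;
  // Bytes skipped after the 1-byte tag so the next field is 4-byte aligned.
  static final int PADDING = 3;

  // Layout: tag byte | 3 pad bytes | int length (assumption) | payload bytes.
  static byte[] write(int existingFlags, byte[] encodedVector) {
    ByteBuffer buf = ByteBuffer.allocate(1 + PADDING + 4 + encodedVector.length);
    buf.put((byte) (existingFlags | FLAG_PROTOBUF));
    buf.position(buf.position() + PADDING); // pad bytes stay zero
    buf.putInt(encodedVector.length);
    buf.put(encodedVector);
    return buf.array();
  }

  // Returns the payload, or null when the flag bit is absent, in which case
  // a real reader would fall back to the existing serialization.
  static byte[] read(byte[] serialized) {
    ByteBuffer buf = ByteBuffer.wrap(serialized);
    int tag = buf.get() & 0xFF;
    if ((tag & FLAG_PROTOBUF) == 0) {
      return null;
    }
    buf.position(buf.position() + PADDING);
    byte[] payload = new byte[buf.getInt()];
    buf.get(payload);
    return payload;
  }

  public static void main(String[] args) {
    byte[] framed = write(0x01, new byte[] {7, 8, 9});
    System.out.println(framed.length + " bytes, payload length " + read(framed).length);
  }
}
```

Reusing the existing tag byte keeps old readers able to recognize the record type, and the new bit lets both formats coexist in one stream.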
[jira] [Updated] (MAHOUT-1047) CVB hangs after completion
[ https://issues.apache.org/jira/browse/MAHOUT-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi updated MAHOUT-1047: Resolution: Fixed Fix Version/s: (was: 0.7) Status: Resolved (was: Patch Available)
[jira] [Assigned] (MAHOUT-1026) Add LDA (CVB implementation) to the cluster_reuters.sh example script
[ https://issues.apache.org/jira/browse/MAHOUT-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi reassigned MAHOUT-1026: Assignee: Suneel Marthi (was: Jake Mannix) Add LDA (CVB implementation) to the cluster_reuters.sh example script Key: MAHOUT-1026 URL: https://issues.apache.org/jira/browse/MAHOUT-1026 Project: Mahout Issue Type: Task Components: Clustering Affects Versions: 0.8 Reporter: Sebastian Schelter Assignee: Suneel Marthi Fix For: 0.8 Attachments: MAHOUT-1026.patch, MAHOUT-1026.patch
[jira] [Updated] (MAHOUT-1153) Implement streaming random forests
[ https://issues.apache.org/jira/browse/MAHOUT-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-1153: Fix Version/s: Backlog Affects Version/s: (was: 0.7) Implement streaming random forests Key: MAHOUT-1153 URL: https://issues.apache.org/jira/browse/MAHOUT-1153 Project: Mahout Issue Type: New Feature Components: Classification Reporter: Andy Twigg Labels: features Fix For: Backlog The current random forest implementations are in-core and not scalable. This issue is to add an out-of-core, scalable, streaming implementation. Initially it could be based on [1], and using mappers in a master-worker style. [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method
[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1214: Fix Version/s: Backlog Improve the accuracy of the Spectral KMeans Method Key: MAHOUT-1214 URL: https://issues.apache.org/jira/browse/MAHOUT-1214 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.7 Environment: Mahout 0.7 Reporter: Yiqun Hu Labels: clustering, improvement Fix For: Backlog The current implementation of the spectral KMeans algorithm (Andrew Ng et al., NIPS 2002) in version 0.7 has two serious issues. These two incorrect implementations make it fail even for a very obvious trivial dataset. We have implemented a solution to resolve these two issues and hope to contribute back to the community.
# Issue 1: The EigenVerificationJob in version 0.7 does not check the orthogonality of eigenvectors, which is necessary to obtain the correct clustering results for the case of K > 1. We have an idea and implementation to select based on cosAngle/orthogonality.
# Issue 2: The random seed initialization of the KMeans algorithm is not optimal, and sometimes a bad initialization will generate a wrong clustering result. In this case, the selected K eigenvectors actually provide a better way to initialize cluster centroids because each selected eigenvector is a relaxed indicator of the memberships of one cluster. For every selected eigenvector, we use the data point whose eigen component achieves the maximum absolute value.
We have already verified our improvement on a synthetic dataset, and it shows that the improved version gets the optimal clustering result while the current 0.7 version obtains the wrong result.
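The seeding rule described under Issue 2 of MAHOUT-1214 can be sketched as below. This is a reading of the description, not the reporter's patch: the class and method names are hypothetical, and the eigenvectors are assumed to be given as plain arrays with one entry per data point.

```java
final class EigenSeedSketch {
  // For each selected eigenvector (one row per eigenvector, one column per
  // data point), returns the index of the data point whose component has the
  // largest absolute value. That point seeds the corresponding cluster.
  static int[] seedIndices(double[][] eigenvectors) {
    int[] seeds = new int[eigenvectors.length];
    for (int k = 0; k < eigenvectors.length; k++) {
      int best = 0;
      for (int i = 1; i < eigenvectors[k].length; i++) {
        if (Math.abs(eigenvectors[k][i]) > Math.abs(eigenvectors[k][best])) {
          best = i;
        }
      }
      seeds[k] = best;
    }
    return seeds;
  }

  public static void main(String[] args) {
    double[][] eigenvectors = {
        {0.1, -0.9, 0.2},  // strongest component at point 1
        {0.7, 0.1, -0.3},  // strongest component at point 0
    };
    for (int seed : seedIndices(eigenvectors)) {
      System.out.println(seed);
    }
  }
}
```

Because each eigenvector acts as a relaxed membership indicator for one cluster, the point with the largest absolute component is the one most strongly associated with that cluster, which makes the choice deterministic where random seeding is not.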
[jira] [Resolved] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-1080. Resolution: Duplicate MAHOUT-1236 addresses this in the more general case.
[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
[ https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672188#comment-13672188 ] Jake Mannix commented on MAHOUT-1236: Why protobufs? Why not thrift or avro? Maybe we should make this pluggable?
[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
[ https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672190#comment-13672190 ] Sean Owen commented on MAHOUT-1236: There has always been a tension between all of the above, and size. Protobufs are probably the best option since it has a lean means of dealing with optional fields (and variable-length ints right?) but it will still increase the size of each vector. That's probably worth it. There still has to be this VectorWritable factory thing to work with Hadoop, but hey. Why does alignment matter here?
[jira] [Commented] (MAHOUT-1065) Add CassandraDataModelTest
[ https://issues.apache.org/jira/browse/MAHOUT-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672191#comment-13672191 ] Sean Owen commented on MAHOUT-1065: AFAIK this is on hold until the dependencies are available. I would ice it for the foreseeable future.
[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
[ https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672192#comment-13672192 ]

Jake Mannix commented on MAHOUT-1236:

Thrift leaves off optional fields pretty well too, right? I've never seen much difference in the sizes of Thrift vs. protobufs vs. raw Writables here at Twitter (we've got some pretty heterogeneous sources).

What do you mean about a VectorWritable factory thing to work with Hadoop? You mean something like ProtobufWritable<ProtoVector> or ThriftWritable<ThriftVector> (where ProtoVector extends Message, and ThriftVector extends TBase)?

ElephantBird has some good utilities for this kind of thing, e.g.
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ProtobufWritable.java
and
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ThriftWritable.java

Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
Key: MAHOUT-1236
URL: https://issues.apache.org/jira/browse/MAHOUT-1236
Project: Mahout
Issue Type: Bug
Reporter: Ted Dunning

Our current serialization is subject to several ills:
a) it breaks alignment by having a 1 byte flag field (evil, generic)
b) it doesn't handle any kind of extensible format like protobufs, so it isn't future-proof
c) it doesn't handle named vectors very well
d) it totally breaks with any other kind of decoration, as with Centroids or WeightedVector or ... (see b)

I propose that we use the current tag byte on the current serialization with a new flag bit that indicates that the vector will use a protobuf encoding. Then 3 bytes will be skipped to restore alignment. Then there will be a protobuf encoding for the vector.
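The framing proposed in the MAHOUT-1236 description (existing tag byte with a new flag bit, three skipped bytes, then the protobuf payload) could look roughly like the sketch below. This is not Mahout code: the class name, the method name and the flag value 0x80 are all hypothetical, since the issue does not say which bit would be used, and `payload` only stands in for whatever bytes a protobuf encoder would produce.

```java
import java.io.ByteArrayOutputStream;

final class AlignedHeaderSketch {
  // Hypothetical flag bit; the issue does not specify which bit would be chosen.
  static final int FLAG_PROTOBUF = 0x80;

  /** Writes: 1 tag byte, 3 padding bytes, then the (already encoded) payload. */
  static byte[] frame(int existingFlags, byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(existingFlags | FLAG_PROTOBUF); // tag byte with the new bit set
    out.write(new byte[3], 0, 3);             // skip 3 bytes so the payload starts at offset 4
    out.write(payload, 0, payload.length);    // stand-in for the protobuf-encoded vector
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] framed = frame(0x01, new byte[] {10, 20, 30});
    System.out.println("framed length = " + framed.length);
  }
}
```

A reader would check the flag bit in the first byte and, if it is set, skip the three padding bytes before handing the rest to the protobuf decoder; if it is clear, it would fall back to the current format.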
[jira] [Updated] (MAHOUT-1026) Add LDA (CVB implementation) to the cluster_reuters.sh example script
[ https://issues.apache.org/jira/browse/MAHOUT-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1026:
Attachment: MAHOUT-1026.patch

Add LDA (CVB implementation) to the cluster_reuters.sh example script
Key: MAHOUT-1026
URL: https://issues.apache.org/jira/browse/MAHOUT-1026
Project: Mahout
Issue Type: Task
Components: Clustering
Affects Versions: 0.8
Reporter: Sebastian Schelter
Assignee: Suneel Marthi
Fix For: 0.8
Attachments: MAHOUT-1026.patch, MAHOUT-1026.patch, MAHOUT-1026.patch
[jira] [Commented] (MAHOUT-1026) Add LDA (CVB implementation) to the cluster_reuters.sh example script
[ https://issues.apache.org/jira/browse/MAHOUT-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672194#comment-13672194 ]

Suneel Marthi commented on MAHOUT-1026:

Jake,

Attached patch takes care of (a) and (b) above. I committed the code for MAHOUT-1047 to trunk (CVB0 clustering wouldn't work without the patch). I am not sure about the cluster quality in (c) - not too familiar with CVB0 to test that.

Add LDA (CVB implementation) to the cluster_reuters.sh example script
Key: MAHOUT-1026
URL: https://issues.apache.org/jira/browse/MAHOUT-1026
Project: Mahout
Issue Type: Task
Components: Clustering
Affects Versions: 0.8
Reporter: Sebastian Schelter
Assignee: Suneel Marthi
Fix For: 0.8
Attachments: MAHOUT-1026.patch, MAHOUT-1026.patch, MAHOUT-1026.patch
[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
[ https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672197#comment-13672197 ]

Sean Owen commented on MAHOUT-1236:

Yes it's probably very similar. The comment was more about size being an important concern here too. For example, simpler still is to use Java serialization. But it would serialize the class name with every instance, for example. For a billion small vectors that's huge overhead. That's no issue with these other options, where the reader/writer already know the type and format anyway.

The current 'format' is the ultimate in lean, really. The size increase from a change to protobufs/Thrift/Avro would come from having to represent optional fields with additional bytes some other way, but that's still relatively minor. The big deal is representing integers compactly, I think. I don't know Thrift/Avro but assume they probably have some variable-length encoding too.

FWIW I don't think it's necessarily useful to support N serialization mechanisms, that's not what I was referring to. But it's similar in the sense that the problem is that the serialized format isn't polymorphic. You have to write this generic all-encompassing format and then have some object make (polymorphic, OOP) Java objects correctly from them. That's what VectorWritable does. It's OK because with Hadoop we have to declare the concrete type of the value upfront, and so we're always going to need this level of indirection in order to fake polymorphism. That is, this lets you run a job that consumes VectorWritable and actually send it sparse or dense vectors.

Now, vectors aren't really going to change. They're indices and numbers. Decorators may change, and while decorators fit cleanly into OOP, they make the mismatch above worse. Right now it works fine with the 'named' extension (what doesn't work well there?). But if you want 10 other decorations to be represented, it will be unwieldy. That's the motivation for wanting a different format.

But are there 10 other extensions that are really necessary? How many times do you want to transparently handle either a plain Vector or DecoratedVector? If you actually want and need to know the difference, then you don't need to model this as a 'decoration' and don't have the problem above. Names? OK I can see not caring about whether it's named. Weights? Yeah, maybe. Centroids? What's special about centroids, for example? Anyway I think that's the real question.

Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
Key: MAHOUT-1236
URL: https://issues.apache.org/jira/browse/MAHOUT-1236
Project: Mahout
Issue Type: Bug
Reporter: Ted Dunning
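To make the point in the comment above about representing integers compactly more concrete, here is a minimal sketch of a base-128 variable-length integer encoding, the general scheme protobuf uses for its varints. The class and method names are made up for illustration, and this is not a claim about how VectorWritable, Thrift or Avro actually lay out their bytes.

```java
import java.io.ByteArrayOutputStream;

final class VarIntSketch {
  /** Encodes an int in 7-bit groups, low-order group first; the high bit marks "more bytes follow". */
  static byte[] encode(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80); // emit 7 payload bits plus a continuation bit
      value >>>= 7;
    }
    out.write(value); // last group, continuation bit clear
    return out.toByteArray();
  }

  /** Decodes a value produced by encode(). */
  static int decode(byte[] bytes) {
    int result = 0;
    int shift = 0;
    for (byte b : bytes) {
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        break;
      }
      shift += 7;
    }
    return result;
  }

  public static void main(String[] args) {
    for (int v : new int[] {1, 300, 1_000_000}) {
      System.out.println(v + " takes " + encode(v).length + " byte(s)");
    }
  }
}
```

Small indices, which dominate in sparse vectors, then cost one or two bytes instead of a fixed four, which is where most of the savings discussed in this thread would come from.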
[jira] [Updated] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.
[ https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-1084:
Fix Version/s: 0.8

We should make sure the examples work, so adding this to 0.8. My env. is messed up right now, so I can't reproduce it at the moment.

Kmeans for synthetic control example--there are 12 cluster during iterations.
Key: MAHOUT-1084
URL: https://issues.apache.org/jira/browse/MAHOUT-1084
Project: Mahout
Issue Type: Bug
Reporter: liutengfei
Fix For: 0.8

In the Mahout k-means syntheticcontrol example, the default parameters should produce 6 clusters at the end. But why are there 12 clusters during the k-means iterations? According to my observation, the former 6 clusters and the latter 6 clusters are the same before the first iteration; those 6 clusters are generated by RandomSeedGenerator.java. The CIMapper then assigns its own points to these 12 clusters. Is there a logic error here?

The 12 clusters are created by the setup function in CIMapper.java, more specifically by the line

classifier.readFromSeqFiles(conf, new Path(priorClustersPath));

Here priorClustersPath is the HDFS directory output/clusters-0/. There are 8 files in this directory: _policy, part-randomSeed (one file recording six clusters) and part-0 to part-5 (six files, each recording one cluster). While reading this directory, _policy is filtered out, so the program reads part-0 to part-5 to create six clusters and then reads part-randomSeed to create the other six clusters. This is the reason why there are 12 clusters before the first iteration.

Solution: delete the code that creates the duplicate clusters in output/clusters-0/. Here I commented out the code that creates the files part-0 to part-5 in ClusterClassifier.java:

public void writeToSeqFiles(Path path) throws IOException {
  writePolicy(policy, path);
  /*
  Configuration config = new Configuration();
  FileSystem fs = FileSystem.get(path.toUri(), config);
  SequenceFile.Writer writer = null;
  ClusterWritable cw = new ClusterWritable();
  for (int i = 0; i < models.size(); i++) {
    try {
      Cluster cluster = models.get(i);
      cw.setValue(cluster);
      writer = new SequenceFile.Writer(fs, config,
          new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d", i)),
          IntWritable.class, ClusterWritable.class);
      Writable key = new IntWritable(i);
      writer.append(key, cw);
    } finally {
      Closeables.closeQuietly(writer);
    }
  }
  */
}

I don't know if this is still okay for other programs that use this file, but for k-means in the syntheticcontrol example, the program creates 6 clusters during every iteration, as I expected.
[jira] [Resolved] (MAHOUT-1208) Not able to get the distance from the cluster.
[ https://issues.apache.org/jira/browse/MAHOUT-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter resolved MAHOUT-1208.
Resolution: Won't Fix

Not able to get the distance from the cluster.
Key: MAHOUT-1208
URL: https://issues.apache.org/jira/browse/MAHOUT-1208
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.7
Environment: Ubuntu
Reporter: Sameer Sebastian

Hello,

After clustering, when I run the clusterdump mahout command, the result doesn't have the distance. Is https://issues.apache.org/jira/browse/MAHOUT-1073 the only reason why this is happening? If there is a workaround without a patch, please tell.

Thanks,
[jira] [Updated] (MAHOUT-1204) Rewrite Benchmarks using Caliper
[ https://issues.apache.org/jira/browse/MAHOUT-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-1204:
Fix Version/s: Backlog

Rewrite Benchmarks using Caliper
Key: MAHOUT-1204
URL: https://issues.apache.org/jira/browse/MAHOUT-1204
Project: Mahout
Issue Type: Improvement
Affects Versions: 1.0
Reporter: Robin Anil
Assignee: Robin Anil
Fix For: Backlog

https://code.google.com/p/caliper/
[jira] [Resolved] (MAHOUT-1092) MultiNormal is slow in common case
[ https://issues.apache.org/jira/browse/MAHOUT-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-1092.
Resolution: Fixed
Fix Version/s: 0.8

Ted says it's fixed on 4944dcc7

MultiNormal is slow in common case
Key: MAHOUT-1092
URL: https://issues.apache.org/jira/browse/MAHOUT-1092
Project: Mahout
Issue Type: Bug
Reporter: Ted Dunning
Priority: Minor
Fix For: 0.8

The multinormal generator unnecessarily uses matrix arithmetic for some simple cases.
[jira] [Commented] (MAHOUT-1094) when i am giving the testing data from the new set of data without using split ..it is giving the completely wrong confusion matrix
[ https://issues.apache.org/jira/browse/MAHOUT-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672206#comment-13672206 ]

Grant Ingersoll commented on MAHOUT-1094:

Please provide more details and reproducible input

when i am giving the testing data from the new set of data without using split ..it is giving the completely wrong confusion matrix
Key: MAHOUT-1094
URL: https://issues.apache.org/jira/browse/MAHOUT-1094
Project: Mahout
Issue Type: Question
Components: Classification
Affects Versions: 0.7
Environment: hadoop 0.20
Reporter: Priyadarshan raj
Labels: newbie
Original Estimate: 336h
Remaining Estimate: 336h

Hi, I am able to successfully create a model with the command below:

bin/mahout trainnb -i /user/cloudera/MahoutWeighted/1_FactWt-train-vectors -el -o /user/cloudera/MahoutWeighted/model -li /user/cloudera/MahoutWeighted/labelindex -ow

but I am unable to use that model. When I feed the model test data prepared in the same way as the training data, I do not get the correct confusion matrix, and the number of map output records equals the number of files I am feeding in. Can anyone tell me why it is not equal to the number of lines?
[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters
[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-1103:
Fix Version/s: 0.8

clusterpp is not writing directories for all clusters
Key: MAHOUT-1103
URL: https://issues.apache.org/jira/browse/MAHOUT-1103
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Paritosh Ranjan
Labels: clusterpp
Fix For: 0.8
Attachments: MAHOUT-1103.patch

After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is. I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2. Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:

{noformat}
bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
{noformat}

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466, containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default Hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and calling hashCode() gives:

VL-3742464 -> -685560454
VL-3742466 -> -685560452

Finally, when running with -xm sequential, everything performs as expected.
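The partitioner hypothesis in the MAHOUT-1103 description can be checked with plain arithmetic. Hadoop's default HashPartitioner assigns a key to `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch below applies that formula to the two hash codes reported in the issue. The class name is made up, and the reducer count of 2 is an assumption (the issue does not state how many reducers the job used).

```java
final class PartitionSketch {
  /** Same arithmetic as Hadoop's default HashPartitioner.getPartition(). */
  static int partition(int hashCode, int numReduceTasks) {
    return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    int h1 = -685560454; // Text("VL-3742464").hashCode(), as reported in the issue
    int h2 = -685560452; // Text("VL-3742466").hashCode(), as reported in the issue
    int reducers = 2;    // assumption: one reducer per cluster
    System.out.println("VL-3742464 -> partition " + partition(h1, reducers));
    System.out.println("VL-3742466 -> partition " + partition(h2, reducers));
  }
}
```

Both hash codes are even after masking off the sign bit, so with two reducers both keys map to the same partition and the other reducer receives nothing. That would be consistent with the empty part-r-* file described in the issue, although it does not prove this is the only cause.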
[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters
[ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672209#comment-13672209 ]

Grant Ingersoll commented on MAHOUT-1103:

[~dlyubimov] or [~mmolek] any updates on this?

clusterpp is not writing directories for all clusters
Key: MAHOUT-1103
URL: https://issues.apache.org/jira/browse/MAHOUT-1103
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Paritosh Ranjan
Labels: clusterpp
Attachments: MAHOUT-1103.patch
[jira] [Updated] (MAHOUT-1200) Mahout tests depend on writing to /tmp/hadoop-$user
[ https://issues.apache.org/jira/browse/MAHOUT-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-1200:
Fix Version/s: 0.8

Mahout tests depend on writing to /tmp/hadoop-$user
Key: MAHOUT-1200
URL: https://issues.apache.org/jira/browse/MAHOUT-1200
Project: Mahout
Issue Type: Bug
Components: build
Affects Versions: 0.7
Reporter: Isabel Drost-Fromm
Fix For: 0.8
Attachments: MAHOUT-1200.patch, MAHOUT-1200.patch, MAHOUT-1200.patch

Running the Mahout test suite creates the temp directory /tmp/hadoop-$user, which is used by all Hadoop-related tests that pull up a local cluster. The directory is not removed after running the tests. In particular, when running multiple tests in parallel on the same machine as the same user, this can lead to problems.

To reproduce, issue the following commands prior to running the full test suite:

mkdir /tmp/hadoop-$USER
chmod 000 /tmp/hadoop-$USER
mvn test
[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input
[ https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672210#comment-13672210 ]

Pat Ferrel commented on MAHOUT-1080:

+10

As a frequent user of named vectors I would love to see this supported generally.

Kmeans clustered output losses vectorId given in the input
Key: MAHOUT-1080
URL: https://issues.apache.org/jira/browse/MAHOUT-1080
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.7
Reporter: Smita Wadhwa
Fix For: 0.8
Attachments: kMeansClusterVectorId.diff

The input to k-means is (IntWritable, VectorWritable), and the output of clustered points is (clusterId, WeightedVectorWritable(vector, distance-from-the-centre)). The id of the input vector is lost in this processing.
[jira] [Commented] (MAHOUT-1108) cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true
[ https://issues.apache.org/jira/browse/MAHOUT-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672212#comment-13672212 ]

Grant Ingersoll commented on MAHOUT-1108:

Elmer, can you supply a patch?

cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true
Key: MAHOUT-1108
URL: https://issues.apache.org/jira/browse/MAHOUT-1108
Project: Mahout
Issue Type: Bug
Affects Versions: 0.7
Reporter: Elmer Garduno
Priority: Minor
Original Estimate: 1h
Remaining Estimate: 1h

Got the following exception when running the command with HADOOP_CONF and HADOOP_CONF_DIR:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
  at java.net.URLClassLoader$1.run(Unknown Source)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(Unknown Source)
  at java.lang.ClassLoader.loadClass(Unknown Source)
  at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
  at java.lang.ClassLoader.loadClass(Unknown Source)
  ... 1 more
[jira] [Resolved] (MAHOUT-598) Downstream steps in the seq2sparse job flow looking in wrong location for output from previous steps when running in Elastic MapReduce (EMR) cluster
[ https://issues.apache.org/jira/browse/MAHOUT-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil resolved MAHOUT-598.
Resolution: Cannot Reproduce

Downstream steps in the seq2sparse job flow looking in wrong location for output from previous steps when running in Elastic MapReduce (EMR) cluster
Key: MAHOUT-598
URL: https://issues.apache.org/jira/browse/MAHOUT-598
Project: Mahout
Issue Type: Bug
Components: Integration
Affects Versions: 0.4, 0.5
Environment: seq2sparse, Mahout 0.4, S3, EMR, Hadoop 0.20.2
Reporter: Timothy Potter
Assignee: Grant Ingersoll
Fix For: 0.8

While working on MAHOUT-588, I've discovered an issue with the seq2sparse job running on EMR. From what I can tell this job is made up of multiple MR steps and downstream steps are expecting output from previous steps to be in HDFS, but the output is in S3 (see errors below). For example, the DictionaryVectorizer wrote dictionary.file.0 to S3 but TFPartialVectorReducer is looking for it in HDFS.

To run this job, I spin up an EMR cluster and then add the following step to it (this is using the elastic-mapreduce-ruby tool):

elastic-mapreduce --jar s3n://thelabdude/mahout-core-0.4-job.jar \
  --main-class org.apache.mahout.driver.MahoutDriver \
  --arg seq2sparse \
  --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files-sm/ \
  --arg -o --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/ \
  --arg --weight --arg tfidf \
  --arg --chunkSize --arg 200 \
  --arg --minSupport --arg 2 \
  --arg --minDF --arg 1 \
  --arg --maxDFPercent --arg 90 \
  --arg --norm --arg 2 \
  --arg --maxNGramSize --arg 2 \
  --arg --overwrite \
  -j JOB_ID

With these parameters, I see the following errors in the hadoop logs (the first trace appears three times in a row):

java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
  at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
  at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
  at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
  at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
  at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/partial-vectors-0
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
  at
[jira] [Updated] (MAHOUT-684) Topics regularization for LDA
[ https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-684:
Fix Version/s: 0.8
Assignee: Jake Mannix

Jake, please take a look at this one and commit/close as necessary.

Topics regularization for LDA
Key: MAHOUT-684
URL: https://issues.apache.org/jira/browse/MAHOUT-684
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Vasil Vasilev
Assignee: Jake Mannix
Priority: Minor
Labels: LDA.
Fix For: 0.8
Attachments: MAHOUT-684.patch, MAHOUT-684.patch, MAHOUT-684.patch

Implementation provided for the alpha parameters estimation as described in the paper of Blei, Ng and Jordan (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).

Remark: there is a mistake in the last formula in A.4.2 (the signs are wrong). The correct version is described here: http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).
[jira] [Resolved] (MAHOUT-775) L2 does not work with TrainAdaptiveLogisticRegression
[ https://issues.apache.org/jira/browse/MAHOUT-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil resolved MAHOUT-775.
Resolution: Fixed

L2 does not work with TrainAdaptiveLogisticRegression
Key: MAHOUT-775
URL: https://issues.apache.org/jira/browse/MAHOUT-775
Project: Mahout
Issue Type: Bug
Components: Classification
Affects Versions: 0.6
Reporter: XiaoboGu
Fix For: 0.8
Attachments: MAHOUT-775.patch

I have posted the problem to the dev list; see the following message:
http://mail-archives.apache.org/mod_mbox/mahout-dev/201106.mbox/%3cbanlktik6153pjgcfnayuprwbv9jzcxp...@mail.gmail.com%3e
[jira] [Updated] (MAHOUT-1196) LogisticModelParameters uses csv.getTargetCategories() even if csv is not used.
[ https://issues.apache.org/jira/browse/MAHOUT-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-1196:
Fix Version/s: 0.8

LogisticModelParameters uses csv.getTargetCategories() even if csv is not used.
Key: MAHOUT-1196
URL: https://issues.apache.org/jira/browse/MAHOUT-1196
Project: Mahout
Issue Type: Bug
Components: Classification
Affects Versions: 0.8
Environment: All
Reporter: Vineet Krishnan
Priority: Trivial
Labels: CSV, Classifier, LogisticModelParameters
Fix For: 0.8
Original Estimate: 1h
Remaining Estimate: 1h

saveTo(OutputStream out) tries to get csv.getTargetCategories() even when it has already been set. In a case when CsvRecordFactory is not used, this gives a NullPointerException when saveTo() is called. IMHO a simple null check for targetCategories is sufficient.
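The null check suggested in the MAHOUT-1196 description might look like the sketch below. This is a simplified stand-in, not the real LogisticModelParameters: the class, the `CategorySource` interface and the `categoriesForSave` method are all hypothetical, and only the idea (ask the CSV factory for categories only when none have been set) comes from the issue.

```java
import java.util.Arrays;
import java.util.List;

final class NullCheckSketch {
  /** Hypothetical stand-in for the CSV record factory mentioned in the issue. */
  interface CategorySource {
    List<String> getTargetCategories();
  }

  private List<String> targetCategories; // may be set directly by the caller
  private CategorySource csv;            // stays null when no CSV factory is used

  void setTargetCategories(List<String> categories) {
    this.targetCategories = categories;
  }

  /** Consults the CSV factory only if the categories have not been set already. */
  List<String> categoriesForSave() {
    if (targetCategories == null) {
      targetCategories = csv.getTargetCategories();
    }
    return targetCategories;
  }

  public static void main(String[] args) {
    NullCheckSketch params = new NullCheckSketch();
    params.setTargetCategories(Arrays.asList("spam", "ham"));
    // No CSV factory was configured, yet this no longer dereferences it.
    System.out.println(params.categoriesForSave());
  }
}
```

If neither the categories nor the CSV factory is set, this sketch would still fail; whether that case should raise a clearer error is a design choice the issue leaves open.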
[jira] [Commented] (MAHOUT-1196) LogisticModelParameters uses csv.getTargetCategories() even if csv is not used.
[ https://issues.apache.org/jira/browse/MAHOUT-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672217#comment-13672217 ]

Sebastian Schelter commented on MAHOUT-1196:

Vineet, any progress on this?

LogisticModelParameters uses csv.getTargetCategories() even if csv is not used.
Key: MAHOUT-1196
URL: https://issues.apache.org/jira/browse/MAHOUT-1196
Project: Mahout
Issue Type: Bug
Components: Classification
Affects Versions: 0.8
Environment: All
Reporter: Vineet Krishnan
Priority: Trivial
Labels: CSV, Classifier, LogisticModelParameters
Fix For: 0.8
Original Estimate: 1h
Remaining Estimate: 1h
[jira] [Updated] (MAHOUT-1179) GSOC 2013: Refactor and improve the classification APIs
[ https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-1179:
Fix Version/s: Backlog

GSOC 2013: Refactor and improve the classification APIs
Key: MAHOUT-1179
URL: https://issues.apache.org/jira/browse/MAHOUT-1179
Project: Mahout
Issue Type: New Feature
Reporter: Dan Filimon
Labels: gsoc2013, mentor
Fix For: Backlog

[via Andy Twigg]

Improve and unify the Mahout classification API. Also related to the refactoring of the clustering APIs, MAHOUT-1177. The two APIs should be roughly the same, at least in terms of input/output, so that pipelining etc. is easier (cf. the scikit-learn clustering/classifier/regression API).

Currently Mahout supports:
- logistic regression
- Naive Bayes
- Random Forests
[jira] [Updated] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs
[ https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-1177:
Fix Version/s: Backlog

GSOC 2013: Reform and simplify the clustering APIs
Key: MAHOUT-1177
URL: https://issues.apache.org/jira/browse/MAHOUT-1177
Project: Mahout
Issue Type: Improvement
Reporter: Dan Filimon
Labels: gsoc2013, mentor
Fix For: Backlog

Clustering is one of the most used features in Mahout and has many applications [http://en.wikipedia.org/wiki/Cluster_analysis#Applications].

We have lots of clustering algorithms. There's:
- basic k-means
- canopy clustering
- Dirichlet clustering
- Fuzzy k-means
- Spectral k-means
- Streaming k-means [coming soon]

We want to make them easier to use by updating the APIs and making sure they all work in the same way and have consistent inputs, outputs, diagnostics and documentation.

This is a great way to gain an in-depth understanding of clustering algorithms and to familiarize yourself with Hadoop, Mahout clustering and good software engineering principles.
[jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1178: --- Fix Version/s: Backlog GSOC 2013: Improve Lucene support in Mahout --- Key: MAHOUT-1178 URL: https://issues.apache.org/jira/browse/MAHOUT-1178 Project: Mahout Issue Type: New Feature Reporter: Dan Filimon Labels: gsoc2013, mentor Fix For: Backlog Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch [via Ted Dunning] It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be possible to get the same result as dumping the term vectors for each document, one per line, and converting that result using standard Mahout methods. b) numeric fields ought to work somehow. c) if there are multiple text fields, that ought to work sensibly as well. Two options are dumping multiple matrices or converting the fields into a single row of a single matrix. d) it should be possible to refer back from a row of the matrix to find the correct document. This might be because we remember the Lucene doc number or because a field is named as holding a unique id. e) named vectors and matrices should be used if plausible.
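Constraints (a) and (d) above can be illustrated with a minimal Java sketch. It is only an illustration under stated assumptions: it touches neither Lucene nor Mahout, the class `TermMatrixSketch` and all of its methods are hypothetical names, and each document is assumed to arrive as an already-extracted map from term to frequency. Rows are stored sparsely, columns are assigned in order of first appearance, and a row-to-id list lets a caller refer back from a row to its document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: documents as sparse matrix rows; no Lucene or Mahout types involved. */
final class TermMatrixSketch {
    private final Map<String, Integer> dictionary = new LinkedHashMap<>(); // term -> column index
    private final List<String> rowIds = new ArrayList<>();                 // row index -> document id
    private final List<Map<Integer, Double>> rows = new ArrayList<>();     // sparse rows

    /** Adds one document given its term frequencies and returns its row index. */
    int addDocument(String docId, Map<String, Integer> termFrequencies) {
        Map<Integer, Double> row = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            Integer column = dictionary.get(e.getKey());
            if (column == null) {
                // First time this term is seen: give it the next free column.
                column = dictionary.size();
                dictionary.put(e.getKey(), column);
            }
            row.put(column, e.getValue().doubleValue());
        }
        rowIds.add(docId);
        rows.add(row);
        return rows.size() - 1;
    }

    /** Returns the matrix entry for (row, term); terms absent from the row count as 0. */
    double get(int row, String term) {
        Integer column = dictionary.get(term);
        if (column == null) {
            return 0.0;
        }
        Double value = rows.get(row).get(column);
        return value == null ? 0.0 : value;
    }

    /** Refers back from a matrix row to the document it came from (constraint d). */
    String docIdOf(int row) {
        return rowIds.get(row);
    }

    int numRows() { return rows.size(); }

    int numCols() { return dictionary.size(); }
}
```

Keeping the dictionary and the row-id list next to the rows is one way to get something like the named vectors of constraint (e); numeric fields and multiple text fields (constraints b and c) are deliberately left out of this sketch.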
[jira] [Resolved] (MAHOUT-804) Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles and search behaviours
[ https://issues.apache.org/jira/browse/MAHOUT-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-804. --- Resolution: Fixed Seems to be exporting correctly now. Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles and search behaviours -- Key: MAHOUT-804 URL: https://issues.apache.org/jira/browse/MAHOUT-804 Project: Mahout Issue Type: Improvement Components: Website Reporter: Dan Brickley Labels: atlassian, confluence, wiki There are two styles of URL in circulation for URLs into Mahout's Wiki (presumably an Apache-wide configuration issue): https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html vs https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition They appear to be the self-same confluence 3.4.9 installation (or its raw filetree). Each has a different search box at the top of the page. The version with 'confluence/' in the path does a confluence search, and returns similar URLs as results. The one with '.html' suffixes does a domain-constrained Google search. Despite markup canonicalising the confluence variant, i.e. link rel="canonical" href="https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition" appearing in the confluence pages, it seems the Google search results typically throw people into the other version of the Wiki site. This is all mildly confusing, mildly annoying but overall mostly harmless. It could be having some negative impact on Google rank and suchlike, since incoming links will be split between the two styles. Maybe this could be passed along to the Wiki admins? Which version does the Mahout team consider canonical URLs (for external links etc.)?
[jira] [Updated] (MAHOUT-1065) Add CassandraDataModelTest
[ https://issues.apache.org/jira/browse/MAHOUT-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-1065: --- Affects Version/s: (was: 0.8) Fix Version/s: Backlog Going with what Sean said. Add CassandraDataModelTest -- Key: MAHOUT-1065 URL: https://issues.apache.org/jira/browse/MAHOUT-1065 Project: Mahout Issue Type: Test Components: Collaborative Filtering, Integration Reporter: Eduardo Gurgel Pinho Priority: Minor Labels: cassandra, collaborative-filtering, datamodel, hector, taste, test Fix For: Backlog Attachments: 0001-Add-CassandraDataModelTest.patch The test class for the CassandraDataModel class.
[jira] [Updated] (MAHOUT-1193) We may want a BlockSparseMatrix
[ https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1193: --- Fix Version/s: Backlog We may want a BlockSparseMatrix --- Key: MAHOUT-1193 URL: https://issues.apache.org/jira/browse/MAHOUT-1193 Project: Mahout Issue Type: Bug Reporter: Ted Dunning Fix For: Backlog Attachments: MAHOUT-1193.patch Here is an implementation. Is it good enough to commit? Is it useful? Is it redundant?
[jira] [Commented] (MAHOUT-1108) cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true
[ https://issues.apache.org/jira/browse/MAHOUT-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672224#comment-13672224 ] Elmer Garduno commented on MAHOUT-1108: --- I will submit it later today. cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true --- Key: MAHOUT-1108 URL: https://issues.apache.org/jira/browse/MAHOUT-1108 Project: Mahout Issue Type: Bug Affects Versions: 0.7 Reporter: Elmer Garduno Priority: Minor Original Estimate: 1h Remaining Estimate: 1h Got the following exception when running the command with HADOOP_CONF and HADOOP_CONF_DIR: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) ... 1 more
[jira] [Updated] (MAHOUT-1175) IllegalStateException and FileNotFoundException occures when running mahout inbuilt mapreduce implementation of frequent pattern mining.
[ https://issues.apache.org/jira/browse/MAHOUT-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1175: --- Fix Version/s: Backlog IllegalStateException and FileNotFoundException occures when running mahout inbuilt mapreduce implementation of frequent pattern mining. Key: MAHOUT-1175 URL: https://issues.apache.org/jira/browse/MAHOUT-1175 Project: Mahout Issue Type: Improvement Components: Frequent Itemset/Association Rule Mining Affects Versions: 0.6 Reporter: Afsal Thaj Priority: Minor Fix For: Backlog We cannot integrate the code for parallel frequent pattern mining into a project that is supposed to run on an external server that connects to the cluster. The program works fine only inside the cluster (from the command line, to be specific). IllegalStateException and FileNotFoundException can occur otherwise.
[jira] [Resolved] (MAHOUT-1162) Adding BallKMeans and StreamingKMeans classes
[ https://issues.apache.org/jira/browse/MAHOUT-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1162. - Resolution: Fixed This has been checked in. Adding BallKMeans and StreamingKMeans classes - Key: MAHOUT-1162 URL: https://issues.apache.org/jira/browse/MAHOUT-1162 Project: Mahout Issue Type: New Feature Components: Clustering Affects Versions: 0.8 Reporter: Dan Filimon Fix For: 0.8 Attachments: MAHOUT_1162_with_test.patch Adding BallKMeans and StreamingKMeans clustering algorithms. These both implement IterableCentroid and thus return the resulting centroids after clustering. BallKMeans implements: - kmeans++ initialization; - a normal k-means pass; - a trimming threshold so that points that are too far from the cluster they were assigned to are not used in the new centroid computation. StreamingKMeans implements [http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf]: - an online clustering algorithm that takes each point into account one by one - for each point, it computes the distance to the nearest existing cluster - if the distance is greater than a set distanceCutoff, it will create a new cluster, otherwise it might be added to the cluster it's closest to (proportional to the value of the distance / distanceCutoff) - if there are too many clusters, the clusters will be *collapsed* (the same method gets called, but the number of clusters is re-adjusted) - finally, *about as many* clusters as requested are returned (not precise!); this represents a sketch of the original points.
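The StreamingKMeans steps listed above can be sketched in plain Java. This is a rough, single-threaded illustration and not the code that was checked in: the class `StreamingSketch`, its constructor parameters and the growth factor of 1.5 are all made up for the example. It also assumes one particular reading of the "proportional" rule, namely that a point starts a new cluster with probability min(1, distance / distanceCutoff), and it assumes that a collapse re-inserts the existing centroids under a larger cutoff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; this is not Mahout's StreamingKMeans class. */
final class StreamingSketch {
    /** A centroid plus the total weight of the points folded into it. */
    static final class Centroid {
        final double[] mean;
        double weight;

        Centroid(double[] mean, double weight) {
            this.mean = mean.clone();
            this.weight = weight;
        }
    }

    private final List<Centroid> centroids = new ArrayList<>();
    private final int maxClusters;   // hypothetical cap that triggers a collapse
    private double distanceCutoff;   // grows each time the sketch is collapsed
    private final Random random;

    StreamingSketch(int maxClusters, double distanceCutoff, long seed) {
        if (maxClusters < 1 || !(distanceCutoff > 0)) {
            throw new IllegalArgumentException("need maxClusters >= 1 and distanceCutoff > 0");
        }
        this.maxClusters = maxClusters;
        this.distanceCutoff = distanceCutoff;
        this.random = new Random(seed);
    }

    /** Takes one point into account, collapsing the sketch if it grew too large. */
    void add(double[] point) {
        insert(new Centroid(point, 1.0));
        while (centroids.size() > maxClusters) {
            collapse();
        }
    }

    /** Starts a new cluster or folds c into its nearest centroid. */
    private void insert(Centroid c) {
        Centroid nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Centroid k : centroids) {
            double d = distance(k.mean, c.mean);
            if (d < best) {
                best = d;
                nearest = k;
            }
        }
        // Assumption: a new cluster is created with probability min(1, d / cutoff),
        // so a point farther away than the cutoff always starts a new cluster.
        if (nearest == null || random.nextDouble() < best / distanceCutoff) {
            centroids.add(c);
            return;
        }
        // Otherwise merge: the nearest centroid becomes the weighted mean of both.
        double total = nearest.weight + c.weight;
        for (int i = 0; i < nearest.mean.length; i++) {
            nearest.mean[i] = (nearest.mean[i] * nearest.weight + c.mean[i] * c.weight) / total;
        }
        nearest.weight = total;
    }

    /** Re-inserts the current centroids under a larger cutoff (growth factor is arbitrary). */
    private void collapse() {
        distanceCutoff *= 1.5;
        List<Centroid> old = new ArrayList<>(centroids);
        centroids.clear();
        for (Centroid c : old) {
            insert(c);
        }
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    List<Centroid> centroids() {
        return centroids;
    }

    double totalWeight() {
        double w = 0.0;
        for (Centroid c : centroids) {
            w += c.weight;
        }
        return w;
    }
}
```

The number of centroids returned is bounded by the cap but otherwise not fixed, which matches the "about as many clusters as requested" caveat. The weights always sum to the number of points seen, so the centroids can be handed to a weighted k-means pass as a sketch of the data.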
[jira] [Updated] (MAHOUT-1152) mRMR feature selection algorithm
[ https://issues.apache.org/jira/browse/MAHOUT-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1152: --- Component/s: (was: Integration) mRMR feature selection algorithm Key: MAHOUT-1152 URL: https://issues.apache.org/jira/browse/MAHOUT-1152 Project: Mahout Issue Type: Improvement Affects Versions: 0.7 Reporter: Claudio Reggiani Priority: Minor Labels: algorithm, feature Fix For: 0.8 Original Estimate: 336h Remaining Estimate: 336h Proposal Title: mRMR Feature Selection Algorithm on Map-Reduce. Student Name: Claudio Reggiani Student E-mail: nop...@gmail.com Proposal Abstract: The mRMR algorithm, described in [1], is a feature selection algorithm that uses mutual information to select features. At each iteration, mRMR selects a new feature based on both how strongly it is correlated with the target output and how weakly it is correlated with the features already selected. The correlation is measured by means of mutual information. The project proposes to provide the mRMR algorithm in the MapReduce programming framework. Additional information: 1. *The code is already available* with some tests, because I'm working on my master's thesis and an initial milestone of my research was to implement the mRMR algorithm in MapReduce. 2. I'm figuring out if it's possible for me to apply to Google Summer of Code 2013. References: [1] Hanchuan Peng, Fuhui Long, and Chris Ding, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005. Link: http://penglab.janelia.org/papersall/docpdf/2005_TPAMI_FeaSel.pdf
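The greedy selection loop described in the abstract can be sketched on a single machine in Java. This is an illustration under stated assumptions, not the MapReduce code the proposal refers to: the class `MrmrSketch` and its methods are hypothetical, features and target are assumed to be discrete (integer-coded), and relevance and redundancy are combined as a difference (relevance minus mean redundancy), which is one common choice. The proposal itself does not say which combination it uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative single-machine sketch; not the MapReduce implementation from the proposal. */
final class MrmrSketch {
    /** Mutual information (in nats) between two equally long discrete sequences. */
    static double mutualInformation(int[] x, int[] y) {
        int n = x.length;
        Map<Integer, Integer> countX = new HashMap<>();
        Map<Integer, Integer> countY = new HashMap<>();
        Map<Long, Integer> countXY = new HashMap<>();
        for (int i = 0; i < n; i++) {
            countX.merge(x[i], 1, Integer::sum);
            countY.merge(y[i], 1, Integer::sum);
            // Pack the (x, y) pair into one long so it can serve as a map key.
            long pair = ((long) x[i] << 32) | (y[i] & 0xffffffffL);
            countXY.merge(pair, 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : countXY.entrySet()) {
            long pair = e.getKey();
            int xv = (int) (pair >> 32);
            int yv = (int) pair;
            double pxy = e.getValue() / (double) n;
            double px = countX.get(xv) / (double) n;
            double py = countY.get(yv) / (double) n;
            mi += pxy * Math.log(pxy / (px * py));
        }
        return mi;
    }

    /**
     * Greedily picks up to k feature indices; features[j] holds feature j for every sample.
     * Score = relevance to the target minus mean redundancy with the features picked so far.
     */
    static List<Integer> select(int[][] features, int[] target, int k) {
        int m = features.length;
        double[] relevance = new double[m];
        for (int j = 0; j < m; j++) {
            relevance[j] = mutualInformation(features[j], target);
        }
        List<Integer> selected = new ArrayList<>();
        boolean[] used = new boolean[m];
        while (selected.size() < Math.min(k, m)) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < m; j++) {
                if (used[j]) {
                    continue;
                }
                double redundancy = 0.0;
                for (int s : selected) {
                    redundancy += mutualInformation(features[j], features[s]);
                }
                if (!selected.isEmpty()) {
                    redundancy /= selected.size();
                }
                double score = relevance[j] - redundancy;
                if (score > bestScore) {
                    bestScore = score;
                    best = j;
                }
            }
            used[best] = true;
            selected.add(best);
        }
        return selected;
    }
}
```

With a target built from two independent binary features, plus an exact copy of the first feature as a third candidate, `select` with k = 2 picks the two independent features and skips the copy, because the copy's redundancy cancels its relevance.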
[jira] [Resolved] (MAHOUT-1210) Fix URLs in mahout-collection-codegen-plugin pom
[ https://issues.apache.org/jira/browse/MAHOUT-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1210. - Resolution: Fixed Fix Version/s: 0.8 Committed this. Great (and obscure) catch, Stevo! Fix URLs in mahout-collection-codegen-plugin pom Key: MAHOUT-1210 URL: https://issues.apache.org/jira/browse/MAHOUT-1210 Project: Mahout Issue Type: Bug Components: build, collections, Math Affects Versions: collections-1.0 Reporter: Stevo Slavic Assignee: Benson Margulies Priority: Minor Labels: maven Fix For: 0.8 Attachments: mahout-collection-codegen-plugin-MAHOUT-1210.patch URLs in mahout-collection-codegen-plugin trunk POM still point to Lucene project and Lucene SVN repository.
[jira] [Updated] (MAHOUT-1152) mRMR feature selection algorithm
[ https://issues.apache.org/jira/browse/MAHOUT-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter updated MAHOUT-1152: --- Fix Version/s: (was: 0.8) Backlog mRMR feature selection algorithm Key: MAHOUT-1152 URL: https://issues.apache.org/jira/browse/MAHOUT-1152 Project: Mahout Issue Type: Improvement Affects Versions: 0.7 Reporter: Claudio Reggiani Priority: Minor Labels: algorithm, feature Fix For: Backlog Original Estimate: 336h Remaining Estimate: 336h Proposal Title: mRMR Feature Selection Algorithm on Map-Reduce. Student Name: Claudio Reggiani Student E-mail: nop...@gmail.com Proposal Abstract: The mRMR algorithm, described in [1], is a feature selection algorithm that uses mutual information to select features. At each iteration, mRMR selects a new feature based on both how strongly it is correlated with the target output and how weakly it is correlated with the features already selected. The correlation is measured by means of mutual information. The project proposes to provide the mRMR algorithm in the MapReduce programming framework. Additional information: 1. *The code is already available* with some tests, because I'm working on my master's thesis and an initial milestone of my research was to implement the mRMR algorithm in MapReduce. 2. I'm figuring out if it's possible for me to apply to Google Summer of Code 2013. References: [1] Hanchuan Peng, Fuhui Long, and Chris Ding, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005. Link: http://penglab.janelia.org/papersall/docpdf/2005_TPAMI_FeaSel.pdf
[jira] [Updated] (MAHOUT-950) Change BtJob to use new MultipleOutputs API
[ https://issues.apache.org/jira/browse/MAHOUT-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-950: -- Fix Version/s: 1.0 Change BtJob to use new MultipleOutputs API --- Key: MAHOUT-950 URL: https://issues.apache.org/jira/browse/MAHOUT-950 Project: Mahout Issue Type: Improvement Components: Math Reporter: Tom White Fix For: 1.0 Attachments: MAHOUT-950.patch BtJob uses a mixture of the old and new MapReduce API to allow it to use MultipleOutputs (which isn't available in Hadoop 0.20/1.0). This fails when run against 0.23 (see MAHOUT-822), so we should change BtJob to use the new MultipleOutputs API. (Hopefully the new MultipleOutputs API will be made available in a 1.x release - see MAPREDUCE-3607.)