date:20130601


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1201:


Fix Version/s: 0.8

 Some Mahout jobs do not pass user supplied Configuration object to sub jobs
 ---

 Key: MAHOUT-1201
 URL: https://issues.apache.org/jira/browse/MAHOUT-1201
 Project: Mahout
  Issue Type: Bug
  Components: Clustering, Frequent Itemset/Association Rule Mining, 
 Math
Affects Versions: 0.7
Reporter: Isabel Drost-Fromm
 Fix For: 0.8

 Attachments: MAHOUT-1201-clustering.patch, MAHOUT-1201-entropy.patch, 
 MAHOUT-1201-pfpgrowth.patch, MAHOUT-1201-solver.patch


 Some of our Hadoop jobs (see the patches) do not pass the user-supplied 
 Configuration object down to the sub-jobs they create. As a result, some 
 Hadoop-related settings may not be honored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1154) Implementing Streaming KMeans

[
https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ted Dunning updated MAHOUT-1154:

Fix Version/s: 0.8

Implementing Streaming KMeans
-

Key: MAHOUT-1154
URL: https://issues.apache.org/jira/browse/MAHOUT-1154
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.8
Reporter: Dan Filimon
Fix For: 0.8

An implementation of Streaming KMeans as mentioned in [1] is available here
[2].
[1]http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3ccaowb3goyf9zufrgxhsucpkjxk6cw0nnr8gwg__jsey+kvab...@mail.gmail.com%3E
[2] https://github.com/dfilimon/mahout
Since there will be more than one patch, there will be specific JIRA issues
that address each one.
The description of the code being added is:
The main classes are in o.a.m.clustering.streaming [3], under the
core/ project. These are subdivided into 2 packages:
- cluster: contains the BallKMeans and StreamingKMeans classes that
can be used standalone.
BallKMeans is exactly what it sounds like (uses k-means++ for the
initialization, then does a normal k-means pass while ignoring
outliers).
StreamingKMeans implements the online clustering that doesn't return
exactly k clusters (it returns an estimate). This is used to
approximate the data.
- mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
StreamingKMeansMapper and StreamingKMeansReducer classes.
CentroidWritable serializes Centroids (sort of like AbstractCluster).
StreamingKMeansDriver provides the driver for the job.
StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
sketches of the data for the reducer.
StreamingKMeansReducer collects the centroids produced by the
mappers into one set of weighted points and runs BallKMeans on them
producing the final results.
Additionally the searchers are in o.a.m.math.neighborhood
- neighborhood: various searcher classes that implement nearest-neighbor
search using different strategies.
Searcher, UpdatableSearcher: abstract classes that define how to
search through collections of vectors.
BruteSearch: does a brute search (looks at every point...)
ProjectionSearch: uses random projections for searching.
FastProjectionSearch: also uses random projections (but not binary
search trees as in ProjectionSearch).
HashedVector, LocalitySensitiveHashSearch: implement locality
sensitive hash search.
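
As a point of reference for the searcher descriptions above, here is a minimal sketch of what a brute-force nearest-neighbor search does: it compares the query against every stored point. The class and method names (BruteForceSketch, add, nearest) are hypothetical and are not the API of the BruteSearch class in the patch; plain double[] arrays stand in for Mahout vectors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a brute-force searcher; not Mahout's BruteSearch API. */
final class BruteForceSketch {
    private final List<double[]> points = new ArrayList<>();

    /** Stores a defensive copy of the point. */
    void add(double[] point) {
        points.add(point.clone());
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the index of the closest stored point, or -1 if nothing is stored. */
    int nearest(double[] query) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        // Looks at every point: O(n) distance computations per query.
        for (int i = 0; i < points.size(); i++) {
            double d = squaredDistance(points.get(i), query);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

The projection-based and hash-based searchers trade this exhaustive scan for approximate answers with fewer distance computations.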
All the tools that I used are in o.a.m.clustering.streaming [4], under
the examples/ project.
There are a bunch of classes here, covering everything from
vectorizing 20 newsgroups data to various IO utils. The more important
ones are:
utils.ExperimentUtils: convenience methods.
tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
[3]
https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
[4]
https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
The relevant issues are:
- MAHOUT-1155 (Centroid, WeightedVector)
- MAHOUT-1156 (searchers)
- MAHOUT-1162 (clustering, non map-reduce)
- MAHOUT-1181 (map-reduce, command-line changes, pom.xml)


[jira] [Created] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner

Sebastian Schelter created MAHOUT-1235:
--

 Summary: ParallelALSFactorizationJob does not use VectorSumCombiner
 Key: MAHOUT-1235
 URL: https://issues.apache.org/jira/browse/MAHOUT-1235
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
Priority: Trivial





[jira] [Updated] (MAHOUT-1162) Adding BallKMeans and StreamingKMeans classes


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-1162:


Fix Version/s: 0.8

 Adding BallKMeans and StreamingKMeans classes
 -

 Key: MAHOUT-1162
 URL: https://issues.apache.org/jira/browse/MAHOUT-1162
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.8
Reporter: Dan Filimon
 Fix For: 0.8

 Attachments: MAHOUT_1162_with_test.patch


 Adding BallKMeans and StreamingKMeans clustering algorithms.
 These both implement Iterable<Centroid> and thus return the resulting 
 centroids after clustering.
 BallKMeans implements:
 - kmeans++ initialization;
 - a normal k-means pass;
 - a trimming threshold so that points that are too far from the cluster they 
 were assigned to are not used in the new centroid computation.
 StreamingKMeans implements 
 [http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf]:
 - an online clustering algorithm that takes each point into account one by one
   - for each point, it computes the distance to the nearest existing cluster
   - if the distance is greater than a set distanceCutoff, it will create a 
 new cluster, otherwise it might be added to the cluster it's closest to 
 (proportional to the value of the distance / distanceCutoff)
   - if there are too many clusters, the clusters will be *collapsed* (the 
 same method gets called, but the number of clusters is re-adjusted)
 - finally, *about as many* clusters as requested are returned (not precise!); 
 this represents a sketch of the original points.
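
 The StreamingKMeans steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the attached patch: the class name StreamingSketch, the 1.5 growth factor for the cutoff during a collapse, and the reading that a new cluster is created with probability distance / distanceCutoff are all assumptions, and plain double[] arrays stand in for Mahout vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; not the Mahout StreamingKMeans class. */
final class StreamingSketch {
    /** A running mean together with the total weight merged into it. */
    static final class Centroid {
        final double[] mean;
        double weight;

        Centroid(double[] mean, double weight) {
            this.mean = mean.clone();
            this.weight = weight;
        }
    }

    private List<Centroid> centroids = new ArrayList<>();
    private double distanceCutoff;
    private final int maxClusters;
    private final Random random;

    StreamingSketch(double initialCutoff, int maxClusters, long seed) {
        if (initialCutoff <= 0 || maxClusters < 1) {
            throw new IllegalArgumentException("cutoff must be > 0 and maxClusters >= 1");
        }
        this.distanceCutoff = initialCutoff;
        this.maxClusters = maxClusters;
        this.random = new Random(seed);
    }

    /** Takes one point into account, collapsing if there are too many clusters. */
    void add(double[] point) {
        insert(point, 1.0);
        while (centroids.size() > maxClusters) {
            collapse();
        }
    }

    int size() {
        return centroids.size();
    }

    double totalWeight() {
        double sum = 0.0;
        for (Centroid c : centroids) {
            sum += c.weight;
        }
        return sum;
    }

    private void insert(double[] point, double weight) {
        Centroid nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {
            double d = distance(c.mean, point);
            if (d < best) {
                best = d;
                nearest = c;
            }
        }
        // Assumption: beyond the cutoff always start a new cluster; otherwise
        // start one with probability distance / distanceCutoff.
        if (nearest == null || best > distanceCutoff
                || random.nextDouble() < best / distanceCutoff) {
            centroids.add(new Centroid(point, weight));
        } else {
            // Merge into the nearest cluster with a weighted mean update.
            double total = nearest.weight + weight;
            for (int i = 0; i < point.length; i++) {
                nearest.mean[i] += (point[i] - nearest.mean[i]) * weight / total;
            }
            nearest.weight = total;
        }
    }

    /** Re-runs the same insertion over the existing centroids with a larger cutoff. */
    private void collapse() {
        distanceCutoff *= 1.5; // growth factor is an assumption of this sketch
        List<Centroid> old = centroids;
        centroids = new ArrayList<>();
        for (Centroid c : old) {
            insert(c.mean, c.weight);
        }
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

 In this sketch the total weight of the centroids always equals the number of points seen, which is what lets the result serve as a weighted summary of the input for a later BallKMeans pass.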


[jira] [Commented] (MAHOUT-1126) Mac builds won't unjar


[ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672126#comment-13672126
 ] 

Grant Ingersoll commented on MAHOUT-1126:
-

When I build the examples job jar, I don't see a META-INF/LICENSES directory 
anymore.  There is a /META-INF/LICENSE file.  There is also a /licenses 
directory, but it is not in /META-INF

 Mac builds won't unjar
 --

 Key: MAHOUT-1126
 URL: https://issues.apache.org/jira/browse/MAHOUT-1126
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.8
 Environment: Builds on the Mac
Reporter: Pat Ferrel
  Labels: build
 Fix For: 0.8


 On the Mac you have to remove the licenses in the mahout jar or hadoop can't 
 unjar mahout. The Mac has a case-insensitive file system and so can't tell 
 the difference between LICENSE and license. This was fixed at one point in 
 https://issues.apache.org/jira/browse/MAHOUT-780
 zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/license/
 zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/LICENSE/
 It looks like, as mentioned in https://issues.apache.org/jira/browse/MAHOUT-780,
 mv target/maven-shared-archive-resources/META-INF/LICENSE target/maven-shared-archive-resources/META-INF/LICENSES
 works too.
 Can this get a permanent fix?
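
 To make the failure mode concrete, here is a small sketch that flags archive entry names which differ only by case (or by a trailing slash), which is the situation a case-insensitive file system cannot unpack. The class and method names are hypothetical, and the input is just a list of names; in practice they could be read from the job jar, for example via java.util.jar.JarFile.entries().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper; not part of Mahout or its build. */
final class CaseCollisionSketch {
    /** Returns "first <-> second" for each pair of names that collide when case is ignored. */
    static List<String> findCollisions(List<String> entryNames) {
        Map<String, String> seen = new HashMap<>();
        List<String> collisions = new ArrayList<>();
        for (String name : entryNames) {
            // Drop a trailing slash so the directory "license/" collides with the file "LICENSE".
            String normalized = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
            String key = normalized.toLowerCase(Locale.ROOT);
            String previous = seen.putIfAbsent(key, name);
            if (previous != null && !previous.equals(name)) {
                collisions.add(previous + " <-> " + name);
            }
        }
        return collisions;
    }
}
```

 A check like this only covers explicit entries; it would not notice a directory that is implied by a nested file without its own entry.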


[jira] [Updated] (MAHOUT-1132) fpgrowth2 crash when have not unique items in one line


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-1132:


Fix Version/s: Backlog

 fpgrowth2 crash when have not unique items in one line
 --

 Key: MAHOUT-1132
 URL: https://issues.apache.org/jira/browse/MAHOUT-1132
 Project: Mahout
  Issue Type: Bug
Reporter: Kirill A. Korinskiy
 Fix For: Backlog

 Attachments: MAHOUT-1132.patch


 I created the following file as input for fpgrowth2:
 0, 0, 0
 0, 0, 0
 0, 0, 0
 and when I run ./bin/mahout -i kv -o output -2 --method mapreduce I get a 
 crash:
 java.lang.IllegalStateException: mismatched counts for targetAttr=0, (3 != 9); thisTree=[FPTree
   -{attr:-1, cnt:0}-1--{attr:0, cnt:3}
 ]
   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPTree.createMoreFreqConditionalTree(FPTree.java:259)
   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.growth(FPGrowthIds.java:238)
   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.fpGrowth(FPGrowthIds.java:163)
   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.generateTopKFrequentPatterns(FPGrowthIds.java:220)
   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.generateTopKFrequentPatterns(FPGrowthIds.java:115)
   at org.apache.mahout.fpm.pfpgrowth.ParallelFPGrowthReducer.reduce(ParallelFPGrowthReducer.java:99)
   at org.apache.mahout.fpm.pfpgrowth.ParallelFPGrowthReducer.reduce(ParallelFPGrowthReducer.java:48)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
 The attached patch fixes it.


[jira] [Commented] (MAHOUT-684) Topics regularization for LDA


[ 
https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672128#comment-13672128
 ] 

Grant Ingersoll commented on MAHOUT-684:


Any update on this?

 Topics regularization for LDA
 -

 Key: MAHOUT-684
 URL: https://issues.apache.org/jira/browse/MAHOUT-684
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Vasil Vasilev
Priority: Minor
  Labels: LDA.
 Attachments: MAHOUT-684.patch, MAHOUT-684.patch, MAHOUT-684.patch


 Implementation provided for the alpha parameters estimation as described in 
 the paper of Blei, Ng and Jordan 
 (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
 Remark: there is a mistake in the last formula in A.4.2 (the signs are 
 wrong). The correct version is described here: 
 http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).


[jira] [Resolved] (MAHOUT-670) Provide a performance measurement framework for Mahout


 [ 
https://issues.apache.org/jira/browse/MAHOUT-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-670.


Resolution: Won't Fix

People who want this can get it off of Github, as there isn't a patch and GH is 
likely fine for this stuff

 Provide a performance measurement framework for Mahout
 --

 Key: MAHOUT-670
 URL: https://issues.apache.org/jira/browse/MAHOUT-670
 Project: Mahout
  Issue Type: New Feature
  Components: Integration
Reporter: Oliver B. Fischer
Assignee: Grant Ingersoll
Priority: Minor
  Labels: framework, performance, test, testing, testsuite
 Fix For: Backlog


 At the moment Mahout lacks a performance test framework. The 
 framework should be able to execute user-defined performance tests of 
 distributed and non-distributed algorithms, generate reports, and detect 
 regressions in the performance of Mahout.


[jira] [Commented] (MAHOUT-1126) Mac builds won't unjar

2013-06-01 Thread Pat Ferrel (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672129#comment-13672129
]

Pat Ferrel commented on MAHOUT-1126:

Right you are, and so the solution has changed to deleting the file, not the
directory. Still, it's a post-build step and new people have to figure
out the solution over and over. There used to be a special exclude in
examples/src/main/assembly/job.xml, shown below, but I don't think that works
anymore. Maybe that could be the source of a permanent fix? I'm not a Maven
expert.

BTW I don't build in examples, but I do use it as an example of how to create a
separate build, and I end up with the same problem because it includes the same
deps and license. The problem is obviously not Mahout, but that is the
infection vector...

<excludes>
  <exclude>org.apache.hadoop:hadoop-core</exclude>
  <!-- This jar contains a LICENSE file in the combined package. Another JAR includes
       a licenses/ directory. That's OK except when unpacked on case-insensitive file
       systems like Mac HFS+. Since this isn't really needed, we just remove it. -->
  <exclude>com.github.stephenc.high-scale-lib:high-scale-lib</exclude>
</excludes>

Mac builds won't unjar
--

Key: MAHOUT-1126
URL: https://issues.apache.org/jira/browse/MAHOUT-1126
Project: Mahout
Issue Type: Bug
Components: build
Affects Versions: 0.8
Environment: Builds on the Mac
Reporter: Pat Ferrel
Labels: build
Fix For: 0.8

On the Mac you have to remove the licenses in the mahout jar or hadoop can't
unjar mahout. The Mac has a case-insensitive file system and so can't tell
the difference between LICENSE and license. This was fixed at one point in
https://issues.apache.org/jira/browse/MAHOUT-780
zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/license/
zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/LICENSE/
It looks like, as mentioned in https://issues.apache.org/jira/browse/MAHOUT-780,
mv target/maven-shared-archive-resources/META-INF/LICENSE target/maven-shared-archive-resources/META-INF/LICENSES
works too.
Can this get a permanent fix?

[jira] [Updated] (MAHOUT-775) L2 does not work with TrainAdaptiveLogisticRegression


 [ 
https://issues.apache.org/jira/browse/MAHOUT-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-775:
---

Fix Version/s: 0.8

 L2 does not work with TrainAdaptiveLogisticRegression
 -

 Key: MAHOUT-775
 URL: https://issues.apache.org/jira/browse/MAHOUT-775
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.6
Reporter: XiaoboGu
 Fix For: 0.8

 Attachments: MAHOUT-775.patch


 I have posted the problem to the dev list; see the following message:
 http://mail-archives.apache.org/mod_mbox/mahout-dev/201106.mbox/%3cbanlktik6153pjgcfnayuprwbv9jzcxp...@mail.gmail.com%3e


[jira] [Updated] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1235:
---

Fix Version/s: 0.8

 ParallelALSFactorizationJob does not use VectorSumCombiner
 --

 Key: MAHOUT-1235
 URL: https://issues.apache.org/jira/browse/MAHOUT-1235
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
Priority: Trivial
 Fix For: 0.8





[jira] [Resolved] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1235.


Resolution: Fixed

 ParallelALSFactorizationJob does not use VectorSumCombiner
 --

 Key: MAHOUT-1235
 URL: https://issues.apache.org/jira/browse/MAHOUT-1235
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
Priority: Trivial




[jira] [Commented] (MAHOUT-804) Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles and search behaviours


[ 
https://issues.apache.org/jira/browse/MAHOUT-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672132#comment-13672132
 ] 

Grant Ingersoll commented on MAHOUT-804:


Not sure what to do, perhaps we should move to the ASF CMS?

 Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles 
 and search behaviours
 --

 Key: MAHOUT-804
 URL: https://issues.apache.org/jira/browse/MAHOUT-804
 Project: Mahout
  Issue Type: Improvement
  Components: Website
Reporter: Dan Brickley
  Labels: atlassian, confluence, wiki

 There are two styles of URL in circulation for URLs into Mahout's Wiki 
 (presumably an Apache-wide configuration issue):
 https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html vs
 https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition
 They appear to be the self-same confluence 3.4.9 installation (or its raw 
 filetree). Each has a different search box at the top of the page. The 
 version with 'confluence/' in the path does a confluence search, and returns 
 similar URLs as results. The one with '.html' suffixes does a 
 domain-constrained Google search. 
 Despite markup canonicalising the confluence variant, i.e. 
 <link rel="canonical" href="https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition">
 appearing in the confluence pages, it seems the Google search results 
 typically throw people into the other version of the Wiki site.
 This is all mildly confusing, mildly annoying but overall mostly harmless. It 
 could be having some negative impact on Google rank & suchlike, since 
 incoming links will be split between the two styles. Maybe this could be 
 passed along to the Wiki admins? 
 Which version does the Mahout team consider canonical URLs (for external 
 links etc)?


[jira] [Commented] (MAHOUT-836) On donating my Robust PCA Java code to Mahout


[ 
https://issues.apache.org/jira/browse/MAHOUT-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672133#comment-13672133
 ] 

Grant Ingersoll commented on MAHOUT-836:


Hi Sujit,

This is interesting, do you have a patch?

 On donating my Robust PCA Java code to Mahout
 -

 Key: MAHOUT-836
 URL: https://issues.apache.org/jira/browse/MAHOUT-836
 Project: Mahout
  Issue Type: New JIRA Project
  Components: Classification
 Environment: Platform independent
Reporter: Sujit Nair
  Labels: newbie
   Original Estimate: 672h
  Remaining Estimate: 672h

 Hi All,
 I have an implementation of Robust PCA (a.k.a low rank and sparse 
 decomposition) in Java which I would like to donate to Mahout. I am a MATLAB 
 expert, comfortable with C++ and have just started with Java. I am completely 
 new to Mahout but am very excited to participate and contribute. 
 I have tested my code exhaustively and there does not seem to be any issues. 
 The results are very good but the code definitely needs some optimization. 
 Please let me know if there is interest. 
 Thanks,
 Sujit


[jira] [Resolved] (MAHOUT-865) Refactor Sequential Clustering algorithms


 [ 
https://issues.apache.org/jira/browse/MAHOUT-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-865.


Resolution: Won't Fix

We should open issues for individual instances as desired.

 Refactor Sequential Clustering algorithms
 -

 Key: MAHOUT-865
 URL: https://issues.apache.org/jira/browse/MAHOUT-865
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Priority: Minor

 We have a lot of implementations of sequential clustering algorithms that are 
 kind of treated as an afterthought by sticking them into the *Driver classes. 
  We should pull them out into their own classes with real APIs so that people 
 can use them.


[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies


[ 
https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672143#comment-13672143
 ] 

Ted Dunning commented on MAHOUT-874:


Jake,

Can you confirm that changing Hadoop to provided solved this for you?

I would like to mark this as fixed.

 Extract Writables into a separate module to allow smaller dependencies
 --

 Key: MAHOUT-874
 URL: https://issues.apache.org/jira/browse/MAHOUT-874
 Project: Mahout
  Issue Type: Improvement
Reporter: Ted Dunning

 The theory is that we can have a smaller jar if we only include writable 
 classes and their exact dependencies.
 I have a prototype, but it has some funky characteristics which I would like 
 to discuss.


[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility


[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672144#comment-13672144
 ] 

Ted Dunning commented on MAHOUT-884:


Suneel, can you commit this if you think it is good?

 Matrix Concatenate utility
 --

 Key: MAHOUT-884
 URL: https://issues.apache.org/jira/browse/MAHOUT-884
 Project: Mahout
  Issue Type: New Feature
  Components: Integration
Reporter: Lance Norskog
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch


 Utility to concatenate matrices stored as SequenceFiles of vectors.
 Each pair in the SequenceFile is the IntWritable row number and a 
 VectorWritable.
 The input and output files may skip rows. 


[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility


[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672145#comment-13672145
 ] 

Sebastian Schelter commented on MAHOUT-884:
---

regarding the patch: please make sure to always close readers in finally blocks 
and don't throw an InterruptedException if the job fails.
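
The two review points above can be illustrated with a minimal sketch. It is not taken from the MAHOUT-884 patch: the class name, the line-counting task, and the choice of IllegalStateException for a failed job are assumptions made for illustration, and a plain java.io.Reader stands in for a SequenceFile reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical illustration; not code from the patch under review. */
final class CloseInFinallySketch {
    /** Counts lines and closes the reader even if reading throws. */
    static int countLines(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        try {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } finally {
            // Runs on both the normal and the exceptional path.
            reader.close();
        }
    }

    /**
     * Reports a failed job with an unchecked exception (an assumed choice)
     * rather than InterruptedException, which signals thread interruption.
     */
    static void checkSucceeded(boolean succeeded, String jobName) {
        if (!succeeded) {
            throw new IllegalStateException("Job failed: " + jobName);
        }
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee as the explicit finally block.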

 Matrix Concatenate utility
 --

 Key: MAHOUT-884
 URL: https://issues.apache.org/jira/browse/MAHOUT-884
 Project: Mahout
  Issue Type: New Feature
  Components: Integration
Reporter: Lance Norskog
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch


 Utility to concatenate matrices stored as SequenceFiles of vectors.
 Each pair in the SequenceFile is the IntWritable row number and a 
 VectorWritable.
 The input and output files may skip rows. 


[jira] [Commented] (MAHOUT-1206) Add density-based clustering algorithms to mahout

2013-06-01 Thread Yexi Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672147#comment-13672147
 ] 

Yexi Jiang commented on MAHOUT-1206:


Are there still no comments?

 Add density-based clustering algorithms to mahout
 -

 Key: MAHOUT-1206
 URL: https://issues.apache.org/jira/browse/MAHOUT-1206
 Project: Mahout
  Issue Type: Improvement
Reporter: Yexi Jiang
  Labels: clustering

 The existing clustering algorithms (kmeans, fuzzy kmeans, dirichlet clustering, 
 and spectral clustering) cluster data by assuming that the data can be grouped 
 into regular hyperspheres or ellipsoids. However, in practice, not all 
 data can be clustered in this way. 
 To enable data to be clustered into arbitrary shapes, clustering algorithms 
 like DBSCAN, BIRCH, CLARANCE 
 (http://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering) have 
 been proposed.
 It would be good to implement one or more of these clustering algorithms 
 to enrich the clustering library. 


[jira] [Resolved] (MAHOUT-942) Improbe the way to process the missing value for DF.


 [ 
https://issues.apache.org/jira/browse/MAHOUT-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-942.


Resolution: Later

Please reopen when you have a patch

 Improbe the way to process the missing value for DF.
 

 Key: MAHOUT-942
 URL: https://issues.apache.org/jira/browse/MAHOUT-942
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Reporter: Ikumasa Mukai
  Labels: DecisionForest

 If we process data that contains missing values (?),
 the tree cannot be created because DataConverter.convert inserts a null 
 value
 into the list of Instances.
 Of course we can fix this issue by preventing DataConverter.convert from inserting
 the null value, but I noticed that the rows
 that have missing values (?) could also be used to build the tree.
 We can use them for making all stems on the edge where we use the missing 
 value.


[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

2013-06-01 Thread Jake Mannix (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672149#comment-13672149
 ] 

Jake Mannix commented on MAHOUT-874:


So marking hadoop as provided is nice, and a smaller jar is great, but as I 
mentioned above, the size was never my primary concern; it was the dependency 
graph. It's really nice that mahout-math is a nice little non-hadoop-depending 
package which just does stats, linear algebra, and ml and doesn't have to think 
about hadoop stuff, even at compile time.  -core is big, because it's what 
mahout is.  What I have been wanting is something a little in between, that 
depends on hadoop (but with provided scope) and mahout-math, but has the 
writables so that someone can work with mahout data inputs/outputs without 
actually linking to -core.

Essentially, it's the distinction between a mahout-api and a mahout-impl 
package.  Since our API is the file format, the mahout-api module is really 
just the set of writables needed to be able to marshall/unmarshall our binary 
data.

 Extract Writables into a separate module to allow smaller dependencies
 --

 Key: MAHOUT-874
 URL: https://issues.apache.org/jira/browse/MAHOUT-874
 Project: Mahout
  Issue Type: Improvement
Reporter: Ted Dunning

 The theory is that we can have a smaller jar if we only include writable 
 classes and their exact dependencies.
 I have a prototype, but it has some funky characteristics which I would like 
 to discuss.


[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility


[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672150#comment-13672150
 ] 

Suneel Marthi commented on MAHOUT-884:
--

Agree with Sebastian. I can work on this later today.

 Matrix Concatenate utility
 --

 Key: MAHOUT-884
 URL: https://issues.apache.org/jira/browse/MAHOUT-884
 Project: Mahout
  Issue Type: New Feature
  Components: Integration
Reporter: Lance Norskog
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch


 Utility to concatenate matrices stored as SequenceFiles of vectors.
 Each pair in the SequenceFile is the IntWritable row number and a 
 VectorWritable.
 The input and output files may skip rows. 


[jira] [Commented] (MAHOUT-950) Change BtJob to use new MultipleOutputs API


[ 
https://issues.apache.org/jira/browse/MAHOUT-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672151#comment-13672151
 ] 

Grant Ingersoll commented on MAHOUT-950:


I think we still need to support 1.0.X, so I'm not sure how to handle this.

 Change BtJob to use new MultipleOutputs API
 ---

 Key: MAHOUT-950
 URL: https://issues.apache.org/jira/browse/MAHOUT-950
 Project: Mahout
  Issue Type: Improvement
  Components: Math
Reporter: Tom White
 Attachments: MAHOUT-950.patch


 BtJob uses a mixture of the old and new MapReduce API to allow it to use 
 MultipleOutputs (which isn't available in Hadoop 0.20/1.0). This fails when 
 run against 0.23 (see MAHOUT-822), so we should change BtJob to use the new 
 MultipleOutputs API. (Hopefully the new MultipleOutputs API will be made 
 available in a 1.x release - see MAPREDUCE-3607.)


[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility


[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672154#comment-13672154
 ] 

Suneel Marthi commented on MAHOUT-884:
--

Also will be adding unit tests as part of committing this patch.

 Matrix Concatenate utility
 --

 Key: MAHOUT-884
 URL: https://issues.apache.org/jira/browse/MAHOUT-884
 Project: Mahout
  Issue Type: New Feature
  Components: Integration
Reporter: Lance Norskog
Assignee: Suneel Marthi
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch


 Utility to concatenate matrices stored as SequenceFiles of vectors.
 Each pair in the SequenceFile is the IntWritable row number and a 
 VectorWritable.
 The input and output files may skip rows. 


[jira] [Updated] (MAHOUT-952) ARFFVectorIterable/MapBackedArffModel doesn't handle question mark '?', other ARFF issues


 [ 
https://issues.apache.org/jira/browse/MAHOUT-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-952:
---

Fix Version/s: 0.8

I think we can add this to 0.8.  Joe or Stuart, can you update this issue?

 ARFFVectorIterable/MapBackedArffModel doesn't handle question mark '?', other 
 ARFF issues
 -

 Key: MAHOUT-952
 URL: https://issues.apache.org/jira/browse/MAHOUT-952
 Project: Mahout
  Issue Type: Bug
  Components: Integration
Affects Versions: 0.6
 Environment: Latest SVN on ubuntu
Reporter: Stuart Smith
Priority: Minor
  Labels: ARFF
 Fix For: 0.8

 Attachments: MAHOUT-952.patch


 Whatever is parsing the ARFF file for the ARFFVectorIterable (As far as I can 
 tell, it's the class itself) doesn't handle '?' as a marker for unknown 
 value. See: http://www.cs.waikato.ac.nz/~ml/weka/arff.html  
 I just started looking at Mahout classifiers this week, so I'm not sure how 
 to handle this yet. If I figure it out, I'll post a patch, but until then, 
 guidance would be helpful!
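
 One possible way to treat the '?' marker, sketched in plain Java with no Mahout dependencies. The class name ArffMissingValueSketch and the choice to represent an unknown value as Double.NaN are assumptions made for illustration; this is not taken from the attached patch.

```java
final class ArffMissingValueSketch {
  /** Parses one dense, comma-separated ARFF data row of numeric attributes. */
  static double[] parseDenseRow(String line) {
    String[] tokens = line.split(",");
    double[] values = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      String token = tokens[i].trim();
      // ARFF writes a lone '?' for an unknown value; NaN is an assumed stand-in.
      values[i] = "?".equals(token) ? Double.NaN : Double.parseDouble(token);
    }
    return values;
  }
}
```

 Calling parseDenseRow("5.1, ?, 1.4") yields a three-element array whose middle entry is NaN. Whether NaN, zero, or a skipped entry is the right encoding depends on the downstream classifier.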


[jira] [Updated] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name


 [ 
https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-953:
---

Fix Version/s: 0.8

 ArffVectorIterable does not gracefully handle duplicate attribute name
 --

 Key: MAHOUT-953
 URL: https://issues.apache.org/jira/browse/MAHOUT-953
 Project: Mahout
  Issue Type: Improvement
  Components: Integration
Affects Versions: 0.6
Reporter: Stuart Smith
Priority: Trivial
 Fix For: 0.8


 If you have duplicate attribute names in your ARFF file, and you have 
 non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an 
 ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size 
 of your attribute labels (duplicates removed), but your ARFF vectors could 
 have more values (if they reference the attribute at both indexes). This is a 
 somewhat pathological ARFF file.
 Not sure if I should note the error (throw an exception) in computeNext() 
 when it's out of bounds, or when someone tries to add a duplicate label to the 
 MapBackedArffModel.
 My first impulse would be to check in computeNext(), but addLabel() in 
 MapBackedArffModel will do something rather pathological in the case of 
 duplicate attributes: it overwrites the label map with the new index, but the 
 idxLabel map will hold a mapping from both indexes to the attribute name, so 
 the two maps are out of sync. So it may be best to disallow duplicate 
 attribute names altogether by throwing an IllegalArgumentException.
 For example:
 @attribute my_attribute NUMERIC
 @attribute my_attribute NUMERIC
 addLabel()
 addLabel()
 labelBindings -> ('my_attribute', 1)
 idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
 I'll happily submit a patch, just wondering if it should be in computeNext() 
 or addLabel().
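
 A minimal sketch of the addLabel() option, assuming a model that keeps the two maps described above. LabelModelSketch is a hypothetical stand-in written for this note, not the real MapBackedArffModel, and the method signature is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class LabelModelSketch {
  private final Map<String, Integer> labelBindings = new HashMap<>();
  private final Map<Integer, String> idxLabel = new HashMap<>();

  /** Registers an attribute name at an index, rejecting duplicate names up front. */
  void addLabel(String label, int index) {
    if (labelBindings.containsKey(label)) {
      // Failing here keeps labelBindings and idxLabel from drifting out of sync.
      throw new IllegalArgumentException("Duplicate attribute name: " + label);
    }
    labelBindings.put(label, index);
    idxLabel.put(index, label);
  }

  int size() {
    return labelBindings.size();
  }
}
```

 With this check, the second "@attribute my_attribute NUMERIC" line fails immediately instead of surfacing later as an out-of-bounds error in computeNext().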


[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name


[ 
https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672158#comment-13672158
 ] 

Grant Ingersoll commented on MAHOUT-953:


Stuart, any chance you can get a patch for this to add in 0.8?

 ArffVectorIterable does not gracefully handle duplicate attribute name
 --

 Key: MAHOUT-953
 URL: https://issues.apache.org/jira/browse/MAHOUT-953
 Project: Mahout
  Issue Type: Improvement
  Components: Integration
Affects Versions: 0.6
Reporter: Stuart Smith
Priority: Trivial

 If you have duplicate attribute names in your ARFF file, and you have 
 non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an 
 ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size 
 of your attribute labels (duplicates removed), but your ARFF vectors could 
 have more values (if they reference the attribute at both indexes). This is a 
 somewhat pathological ARFF file.
 Not sure if I should note the error (throw an exception) in computeNext() 
 when it's out of bounds, or when someone tries to add a duplicate label to the 
 MapBackedArffModel.
 My first impulse would be to check in computeNext(), but addLabel() in 
 MapBackedArffModel will do something rather pathological in the case of 
 duplicate attributes: it overwrites the label map with the new index, but the 
 idxLabel map will hold a mapping from both indexes to the attribute name, so 
 the two maps are out of sync. So it may be best to disallow duplicate 
 attribute names altogether by throwing an IllegalArgumentException.
 For example:
 @attribute my_attribute NUMERIC
 @attribute my_attribute NUMERIC
 addLabel()
 addLabel()
 labelBindings -> ('my_attribute', 1)
 idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
 I'll happily submit a patch, just wondering if it should be in computeNext() 
 or addLabel().


[jira] [Commented] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor


[ 
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672161#comment-13672161
 ] 

Grant Ingersoll commented on MAHOUT-966:


Any update on this?  Seems like it should be fixed for 0.8

 Mismatch in the number of points given by the clusterDumper and 
 ClusterOutputPostProcessor
 --

 Key: MAHOUT-966
 URL: https://issues.apache.org/jira/browse/MAHOUT-966
 Project: Mahout
  Issue Type: Bug
  Components: Integration
Affects Versions: 0.6
 Environment: hadoop 0.20.2 mahout 0.6 
Reporter: Gaurav Redkar
Priority: Minor
 Attachments: cluster-dumper-output.txt, clusterpp-output.txt, 
 mtestdata.txt, points100dCCNorm.txt


  After running the post processor, the number of points that each cluster 
 contains does not match the number of points each cluster should contain as 
 stated by clusterdumper.
  
 MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
 MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
 The n reported in clusters-n-final for each cluster is different from 
 the number of points actually contained in d directory for each cluster. Any 
 idea why this is happening?


[jira] [Updated] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor


 [ 
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-966:
---

Fix Version/s: 0.8

 Mismatch in the number of points given by the clusterDumper and 
 ClusterOutputPostProcessor
 --

 Key: MAHOUT-966
 URL: https://issues.apache.org/jira/browse/MAHOUT-966
 Project: Mahout
  Issue Type: Bug
  Components: Integration
Affects Versions: 0.6
 Environment: hadoop 0.20.2 mahout 0.6 
Reporter: Gaurav Redkar
Priority: Minor
 Fix For: 0.8

 Attachments: cluster-dumper-output.txt, clusterpp-output.txt, 
 mtestdata.txt, points100dCCNorm.txt


  After running the post processor, the number of points that each cluster 
 contains does not match the number of points each cluster should contain as 
 stated by clusterdumper.
  
 MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
 MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
 The n reported in clusters-n-final for each cluster is different from 
 the number of points actually contained in d directory for each cluster. Any 
 idea why this is happening?


[jira] [Updated] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId


 [ 
https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-974:
--

Affects Version/s: (was: 0.6)
   0.8

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob  use 
 integer as userId and itemId
 ---

 Key: MAHOUT-974
 URL: https://issues.apache.org/jira/browse/MAHOUT-974
 Project: Mahout
  Issue Type: Wish
  Components: Collaborative Filtering
Affects Versions: 0.8
Reporter: Han Hui Wen 
Assignee: Sebastian Schelter
  Labels: CF,recommendation,als
   Original Estimate: 2h
  Remaining Estimate: 2h

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses 
 integers as userId and itemId, but 
 org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and 
 org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and 
 itemId.
 It would be best if ParallelALSFactorizationJob also used Long as userId and 
 itemId, so that the same dataset can be used with all the recommendation 
 algorithms.
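
 A small illustration of why the ID width matters when one dataset is shared between jobs. The class and method names are hypothetical, and this is not how ParallelALSFactorizationJob converts IDs; it only shows that a narrowing cast from long to int lets distinct IDs collide.

```java
final class IdNarrowingSketch {
  /** Narrowing cast: keeps only the low 32 bits of the ID. */
  static int toIntId(long id) {
    return (int) id;
  }

  /** True when two different long IDs become the same int ID. */
  static boolean collide(long a, long b) {
    return a != b && toIntId(a) == toIntId(b);
  }
}
```

 For instance, the IDs 1 and 4294967297 (2^32 + 1) differ as longs but share the same low 32 bits, so collide(1L, 4294967297L) is true.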


[jira] [Commented] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId


[ 
https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672163#comment-13672163
 ] 

Sebastian Schelter commented on MAHOUT-974:
---

Saikat, are you still on this?

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob  use 
 integer as userId and itemId
 ---

 Key: MAHOUT-974
 URL: https://issues.apache.org/jira/browse/MAHOUT-974
 Project: Mahout
  Issue Type: Wish
  Components: Collaborative Filtering
Affects Versions: 0.8
Reporter: Han Hui Wen 
Assignee: Sebastian Schelter
  Labels: CF,recommendation,als
   Original Estimate: 2h
  Remaining Estimate: 2h

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses 
 integers as userId and itemId, but 
 org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and 
 org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and 
 itemId.
 It would be best if ParallelALSFactorizationJob also used Long as userId and 
 itemId, so that the same dataset can be used with all the recommendation 
 algorithms.


[jira] [Resolved] (MAHOUT-978) spectralkmeans utility fails when input filename begins with leading underscore

[
https://issues.apache.org/jira/browse/MAHOUT-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Grant Ingersoll resolved MAHOUT-978.

Resolution: Won't Fix

I'd say, won't fix, as there is a workaround. Please re-open if there is a
specific patch.

spectralkmeans utility fails when input filename begins with leading
underscore
---

Key: MAHOUT-978
URL: https://issues.apache.org/jira/browse/MAHOUT-978
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.6
Environment: Tested on a real Linux-based cluster running Hadoop
0.20.2-cdh3u2 and the 0.6 release; also OSX pseudo cluster running Hadoop
0.20.203.0 running 16 Feb trunk build.
Reporter: Dan Brickley
Priority: Minor
Attachments: jira-underscore-spectral-log.txt

The commandline 'bin/mahout spectralkmeans' utility fails with
NoSuchElementException after Loading vector from:
spectral/output/results2/calculations/diagonal/part-r-0 when input data
in hdfs has filename beginning with a leading underscore.
This was partially reported in comments for MAHOUT-524, but I believe it is
now identified as a distinct issue (thanks to Shannon for help diagnosing). I
have not investigated if there is an equivalent problem for API-based use of
this piece of Mahout.
Steps to reproduce:
1. put affinity file into hdfs, following
https://cwiki.apache.org/MAHOUT/spectral-clustering.html - note that node IDs
count from zero etc. Name your file with a leading underscore. For example,
try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in
spectral/input/_topic_skm.csv
(I'll leave that example input file in place unchanged for others to try. It
is built from dbpedia data, encoding associations from Wikipedia pages to
categories. Whether it is a good use of spectral clustering I'm not sure, but
I'd at least hope the job would run to completion.)
2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o
spectral/output/results1'
3. Wait for it to fail just after printing Loading vector from:
spectral/output/results1/calculations/diagonal/part-r-0, with
java.util.NoSuchElementException at
com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).
4. Rename the file in hdfs to eliminate the leading underscore. Re-run the
command (give a different results dir or cleanup from the first run, to avoid
mixing the tests). This attempt should succeed and you'll see it proceed
deeper into the job, i.e. something like
12/02/19 14:38:32 INFO common.VectorCache: Loading vector from:
spectral/output/results2/calculations/diagonal/part-r-0
12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing
the arguments. Applications should implement Tool for the same.
12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
12/02/19 14:38:45 INFO mapred.JobClient: map 0% reduce 0%
12/02/19 14:39:31 INFO mapred.JobClient: map 1% reduce 0%
(5. You might get a memory-based failure some time later; that is a separate
problem.)
I'll attach a more detailed transcript. I've made no attempt to diagnose
internals yet, but did make some other tests and can confirm that it does not
seem to matter whether the commandline invocation names the file explicitly,
or by directory name only. Also trailing slash does not seem to be an issue.
Finally, a related 'gotcha': make sure the results directory is not inside
the input directory when testing.
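
A possible explanation, not confirmed anywhere in this report: Hadoop's FileInputFormat skips input paths whose names start with an underscore or a dot, treating them as hidden files. If the spectral k-means jobs list their input that way, a file named _topic_skm.csv would be silently ignored and a later stage would find nothing to read. The sketch below mirrors that naming rule in plain Java; the class and method names are hypothetical and no Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.List;

final class HiddenInputSketch {
  /** Same naming rule as Hadoop's hidden-file filter: skip "_" and "." prefixes. */
  static boolean isVisible(String fileName) {
    return !fileName.startsWith("_") && !fileName.startsWith(".");
  }

  /** Returns the file names a job applying that rule would actually read. */
  static List<String> visibleInputs(List<String> fileNames) {
    List<String> visible = new ArrayList<>();
    for (String name : fileNames) {
      if (isVisible(name)) {
        visible.add(name);
      }
    }
    return visible;
  }
}
```

Under this rule an input directory holding only _topic_skm.csv contributes no files, while the renamed topic_skm.csv is picked up, which would match the behaviour seen in steps 3 and 4.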

[jira] [Updated] (MAHOUT-992) Audit DistributedCache use to support EMR


 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-992:
---

Fix Version/s: 0.8

 Audit DistributedCache use to support EMR
 -

 Key: MAHOUT-992
 URL: https://issues.apache.org/jira/browse/MAHOUT-992
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.6
Reporter: tom pierce
Priority: Minor
  Labels: newbie
 Fix For: 0.8


 Apparently some of our DistributedCache use is not EMR-safe.  It would be 
 great if someone could audit our uses of DC, and fix up this problem where it 
 exists.
 For an example of problematic usage (and the fix), see MAHOUT-980.  


[jira] [Resolved] (MAHOUT-1234) Canopy Clustering


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1234.


Resolution: Won't Fix

 Canopy Clustering
 -

 Key: MAHOUT-1234
 URL: https://issues.apache.org/jira/browse/MAHOUT-1234
 Project: Mahout
  Issue Type: Question
  Components: Clustering
Reporter: Sameer Sebastian

 Hello,
 I'm trying out Canopy clustering.
 I want to know, how to determine the optimum value for the distance 
 thresholds t1 and t2.
 Thanks.
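
 The sketch below does not give optimum values; it only shows what the two thresholds control in the textbook canopy procedure, using one-dimensional points and absolute difference as the distance. The class name CanopySketch is hypothetical, and this is not Mahout's CanopyDriver.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class CanopySketch {
  /** Builds canopies over 1-D points; t1 is the loose threshold, t2 the tight one. */
  static List<List<Double>> canopies(List<Double> points, double t1, double t2) {
    if (t1 <= t2) {
      throw new IllegalArgumentException("t1 must be greater than t2");
    }
    List<Double> candidates = new ArrayList<>(points);
    List<List<Double>> result = new ArrayList<>();
    while (!candidates.isEmpty()) {
      double center = candidates.remove(0); // next remaining point starts a canopy
      List<Double> canopy = new ArrayList<>();
      canopy.add(center);
      Iterator<Double> it = candidates.iterator();
      while (it.hasNext()) {
        double p = it.next();
        double d = Math.abs(p - center);
        if (d < t1) {
          canopy.add(p); // within the loose threshold: joins this canopy
        }
        if (d < t2) {
          it.remove(); // within the tight threshold: cannot start its own canopy
        }
      }
      result.add(canopy);
    }
    return result;
  }
}
```

 A larger t2 removes more points per pass and so produces fewer canopies; a larger t1 makes canopies overlap more. In practice the values are usually tuned against the distance measure and the data rather than derived.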


[jira] [Resolved] (MAHOUT-1025) Update documentation for LDA before the release.


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1025.
-

Resolution: Fixed

 Update documentation for LDA before the release.
 

 Key: MAHOUT-1025
 URL: https://issues.apache.org/jira/browse/MAHOUT-1025
 Project: Mahout
  Issue Type: Task
Affects Versions: 0.7
Reporter: Robin Anil
Assignee: Jake Mannix
 Fix For: 0.8





[jira] [Updated] (MAHOUT-1231) No input clusters found in error in kmeans


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1231:
---

Affects Version/s: (was: 0.8)
   (was: 0.7)
   Backlog

 No input clusters found in  error in kmeans
 -

 Key: MAHOUT-1231
 URL: https://issues.apache.org/jira/browse/MAHOUT-1231
 Project: Mahout
  Issue Type: Question
  Components: Clustering
Affects Versions: Backlog
Reporter: Summer Lee

 1.seqdirectory
  mahout seqdirectory --input /user/hdfs/input/new1.csv --output
  /user/hdfs/new1/seqdirectory --tempDir
  /user/hdfs/new1/seqdirectory/tempDir
 2.seq2sparse 
  mahout seq2sparse --input /user/hdfs/new1/seqdirectory --output
  /user/hdfs/new1/seq2sparse -wt tfidf
 3.kmeans 
  mahout kmeans --input /user/hdfs/new1/seq2sparse/tfidf-vectors
  --output /user/hdfs/new1/kmeans -c /user/hdfs/new1/clusters/kmeans -x 3 -k 
  3 --tempDir /user/hdfs/new1/kmeans/tempDir
 and then an error occurred:
 Failing Oozie Launcher, Main class [org.apache.mahout.driver.MahoutDriver], 
 main() threw exception, No input clusters found in 
 /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c 
 argument.
 java.lang.IllegalStateException: No input clusters found in 
 /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c 
 argument.
   at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217)
   at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148)
   at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:467)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
   at org.apache.hadoop.mapred.Child.main(Child.java:249)
 Oozie Launcher failed, finishing Hadoop job gracefully
 Oozie Launcher ends
 ===
 Why can't the kmeans driver create clusters when run on Hadoop through Oozie?
 On Hadoop without Oozie, it worked.


[jira] [Resolved] (MAHOUT-1041) Support for PMML


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1041.
-

Resolution: Won't Fix

Without a patch, I don't see putting this in.  Also, I don't see the benefit of 
storing largish models in XML.  I could see a specific issue that does I/O of 
PMML into Mahout's own format, but I don't see anything running natively off of 
PMML.

 Support for PMML
 

 Key: MAHOUT-1041
 URL: https://issues.apache.org/jira/browse/MAHOUT-1041
 Project: Mahout
  Issue Type: Improvement
  Components: Integration
 Environment: Software Platform
Reporter: Duraimurugan
 Fix For: Backlog


 I would like to request support for PMML. With it, once predictive 
 models are built and provided in PMML format, we should be able to import 
 them into a Hadoop cluster for scoring. This way, models built in external 
 (non-Mahout) systems can be imported into Hadoop/Mahout for a scalable 
 environment.


[jira] [Updated] (MAHOUT-1204) Rewrite Benchmarks using Caliper


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1204:
---

Affects Version/s: 1.0

 Rewrite Benchmarks using Caliper
 

 Key: MAHOUT-1204
 URL: https://issues.apache.org/jira/browse/MAHOUT-1204
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0
Reporter: Robin Anil
Assignee: Robin Anil

 https://code.google.com/p/caliper/


[jira] [Resolved] (MAHOUT-1045) Cluster evaluators returning bad results


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1045.
-

Resolution: Fixed

Looks like this is in and passing.

 Cluster evaluators returning bad results
 

 Key: MAHOUT-1045
 URL: https://issues.apache.org/jira/browse/MAHOUT-1045
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.6, 0.7, 0.8
 Environment: Several environments and data sets
Reporter: Pat Ferrel
 Fix For: 0.8

 Attachments: first-time-density-nan.txt, MAHOUT-1045.patch, 
 MAHOUT-1045.patch, MAHOUT-1045.patch, MAHOUT-1045.patch


 With real world crawl data the Intra-cluster density from ClusterEvaluator is 
 almost always NaN. The CDbw inter-cluster density is almost always 0. I have 
 also seen several cases where CDbw fails to return any results but have not 
 tracked down why yet.
 I have sent a link to an 8G data set that reproduces these errors to Jeff 
 Eastman.


[jira] [Commented] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId

2013-06-01 Thread Saikat Kanjilal (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672172#comment-13672172
 ] 

Saikat Kanjilal commented on MAHOUT-974:


Yes, although as a newbie on this codebase I could use some general guidance. 
I have not had time to research this further. Can you respond to my 
comments above?

Thanks

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob  use 
 integer as userId and itemId
 ---

 Key: MAHOUT-974
 URL: https://issues.apache.org/jira/browse/MAHOUT-974
 Project: Mahout
  Issue Type: Wish
  Components: Collaborative Filtering
Affects Versions: 0.8
Reporter: Han Hui Wen 
Assignee: Sebastian Schelter
  Labels: CF,recommendation,als
   Original Estimate: 2h
  Remaining Estimate: 2h

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses 
 integers as userId and itemId, but 
 org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and 
 org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and 
 itemId.
 It would be best if ParallelALSFactorizationJob also used Long as userId and 
 itemId, so that the same dataset can be used with all the recommendation 
 algorithms.


[jira] [Resolved] (MAHOUT-1053) Use KMeans++ for cluster Initialization


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1053.
-

Resolution: Fixed

This is resolved by the new streaming k-means stuff.

 Use KMeans++ for cluster Initialization
 ---

 Key: MAHOUT-1053
 URL: https://issues.apache.org/jira/browse/MAHOUT-1053
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Paritosh Ranjan
 Fix For: 0.8


 Use KMeans++ for cluster initialization.
 Ted has already implemented a similar version. http://github.com/tdunning/knn
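
 For reference, a minimal sketch of the standard k-means++ seeding rule: the first center is picked uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The class KMeansPlusPlusSketch and its method names are hypothetical; this is not the code in the knn repository or in Mahout's streaming k-means.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class KMeansPlusPlusSketch {
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Picks k seeds; each new seed is drawn with probability proportional to D(x)^2. */
  static List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
    if (k < 1 || k > points.size()) {
      throw new IllegalArgumentException("k must be between 1 and the number of points");
    }
    List<double[]> seeds = new ArrayList<>();
    seeds.add(points.get(rng.nextInt(points.size()))); // first seed: uniform pick
    double[] minDist = new double[points.size()];
    Arrays.fill(minDist, Double.POSITIVE_INFINITY);
    while (seeds.size() < k) {
      double[] latest = seeds.get(seeds.size() - 1);
      double total = 0.0;
      for (int i = 0; i < minDist.length; i++) {
        // squared distance from point i to its nearest seed so far
        minDist[i] = Math.min(minDist[i], squaredDistance(points.get(i), latest));
        total += minDist[i];
      }
      if (total == 0.0) {
        // every point coincides with a seed; fall back to a uniform pick
        seeds.add(points.get(rng.nextInt(points.size())));
        continue;
      }
      double r = rng.nextDouble() * total;
      double cumulative = 0.0;
      int chosen = -1;
      for (int i = 0; i < minDist.length; i++) {
        if (minDist[i] == 0.0) {
          continue; // already a seed, or a duplicate of one
        }
        cumulative += minDist[i];
        chosen = i; // last eligible index doubles as a floating-point fallback
        if (cumulative > r) {
          break;
        }
      }
      seeds.add(points.get(chosen));
    }
    return seeds;
  }
}
```

 Because a point at distance zero from an existing seed is never drawn, the seeds are spread out: with k equal to the number of distinct points, every point is chosen exactly once.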


[jira] [Resolved] (MAHOUT-1054) Use ball KMeans for clustering


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1054.
-

Resolution: Fixed

This is resolved by the new streaming k-means stuff.

 Use ball KMeans for clustering
 --

 Key: MAHOUT-1054
 URL: https://issues.apache.org/jira/browse/MAHOUT-1054
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Paritosh Ranjan
 Fix For: 0.8


 Use ball KMeans for clustering.
 Ted has already implemented a similar version. http://github.com/tdunning/knn


[jira] [Commented] (MAHOUT-1117) Vectors are not hashable

[
https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672176#comment-13672176
]

Robin Anil commented on MAHOUT-1117:

There is no single good way to hash a vector. Most methods are heavy, and there
is the additional overhead of caching the hash. If you do want to hash vectors,
you can override hashCode() for your specific use cases. This is a design choice
we should write down.

Vectors are not hashable

Key: MAHOUT-1117
URL: https://issues.apache.org/jira/browse/MAHOUT-1117
Project: Mahout
Issue Type: Improvement
Affects Versions: 1.0
Reporter: Dan Filimon
Priority: Minor

No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode().
In working on improving clustering in Mahout, Ted Dunning wrote prototype
code for Streaming KMeans and Ball KMeans, that I'm working with him on.
These need to be used together in the MapReduce version.
However, in Ball KMeans, we initialize the clusters using a probabilistic
approach similar to k-means++. This however requires a
Multinomial<WeightedVector> distribution of the points we want to cluster to
pick the centroids.
Internally, the Multinomial<T> uses a HashMap to keep track of the values it
can sample from.
Since Vectors don't override Object's hashCode(), it is possible to get the
same value multiple times in the map (as long as the references differ).
This is less of an issue because of how we're adding the vectors to the
multinomial (we can guarantee that the references will be unique) and once
MAHOUT-1116 is resolved the hashing will work okay for our needs.
It still seems that it would be useful to have hashable vectors.
What do you think? And what would a hash function look like?
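
To make the HashMap behaviour concrete, here is a sketch with two stand-in classes written for this note (neither is a Mahout type). PlainVector inherits Object's identity-based equals() and hashCode(); HashableVector wraps the values and hashes by content using java.util.Arrays.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class VectorHashSketch {
  /** Stand-in for a vector that inherits Object's identity-based equals/hashCode. */
  static final class PlainVector {
    final double[] values;
    PlainVector(double... values) {
      this.values = values;
    }
  }

  /** Wrapper that compares and hashes by content; costs O(n) per hash. */
  static final class HashableVector {
    final double[] values;
    HashableVector(double... values) {
      this.values = values.clone();
    }
    @Override
    public boolean equals(Object o) {
      return o instanceof HashableVector
          && Arrays.equals(values, ((HashableVector) o).values);
    }
    @Override
    public int hashCode() {
      return Arrays.hashCode(values);
    }
  }

  /** Two equal-valued PlainVectors are distinct keys, so the map keeps both. */
  static int plainKeyCount() {
    Map<PlainVector, Double> weights = new HashMap<>();
    weights.put(new PlainVector(1, 2), 1.0);
    weights.put(new PlainVector(1, 2), 1.0);
    return weights.size();
  }

  /** Two equal-valued HashableVectors collapse into a single key. */
  static int hashableKeyCount() {
    Map<HashableVector, Double> weights = new HashMap<>();
    weights.put(new HashableVector(1, 2), 1.0);
    weights.put(new HashableVector(1, 2), 1.0);
    return weights.size();
  }
}
```

The content-based hash walks every element, which is the cost raised in the comment above, and it is only safe while the vector is not mutated after being used as a key.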

[jira] [Resolved] (MAHOUT-1117) Vectors are not hashable

[
https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Robin Anil resolved MAHOUT-1117.

Resolution: Won't Fix

Vectors are not hashable

Key: MAHOUT-1117
URL: https://issues.apache.org/jira/browse/MAHOUT-1117
Project: Mahout
Issue Type: Improvement
Affects Versions: 1.0
Reporter: Dan Filimon
Priority: Minor

[jira] [Commented] (MAHOUT-1065) Add CassandraDataModelTest


[ 
https://issues.apache.org/jira/browse/MAHOUT-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672173#comment-13672173
 ] 

Grant Ingersoll commented on MAHOUT-1065:
-

[~eduardo.gurgel] [~srowen] any update on this one?  In or out for 0.8?

 Add CassandraDataModelTest
 --

 Key: MAHOUT-1065
 URL: https://issues.apache.org/jira/browse/MAHOUT-1065
 Project: Mahout
  Issue Type: Test
  Components: Collaborative Filtering, Integration
Affects Versions: 0.8
Reporter: Eduardo Gurgel Pinho
Priority: Minor
  Labels: cassandra, collaborative-filtering, datamodel, hector, 
 taste, test
 Attachments: 0001-Add-CassandraDataModelTest.patch


 The test class for the CassandraDataModel class.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1070) DisplayKMeans example has transposed/mislabelled arguments


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1070:


Fix Version/s: 0.8

 DisplayKMeans example has transposed/mislabelled arguments
 --

 Key: MAHOUT-1070
 URL: https://issues.apache.org/jira/browse/MAHOUT-1070
 Project: Mahout
  Issue Type: Bug
  Components: Examples
Affects Versions: 0.7
Reporter: Gabriel Reid
Assignee: Paritosh Ranjan
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1070.patch


 The org.apache.mahout.clustering.display.DisplayKMeans example class uses a 
 value for k (numClusters) and maximum number of iterations to come to 
 convergence, but their use is transposed (i.e. the numClusters is used as max 
 iterations, and max iterations is used for numClusters). Furthermore, a 
 second hard-coded version of the value is used. The end result is that it's 
 not directly possible to experiment with different values of numClusters and 
 maxIterations.
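
 The bug pattern described above can be sketched as follows. This is an 
 illustration only: describeRun is a hypothetical stand-in, not the actual 
 DisplayKMeans code. Because both parameters are ints, the transposed call 
 compiles without any warning.

```java
final class TransposedArgsSketch {
    // Hypothetical stand-in for a clustering entry point.
    static String describeRun(int numClusters, int maxIterations) {
        return "k=" + numClusters + ", maxIterations=" + maxIterations;
    }

    public static void main(String[] args) {
        int numClusters = 3;
        int maxIterations = 10;
        // Transposed call: still compiles, but runs with k=10 and 3 iterations.
        System.out.println(describeRun(maxIterations, numClusters));
        // Intended call: arguments in declared order.
        System.out.println(describeRun(numClusters, maxIterations));
    }
}
```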


[jira] [Resolved] (MAHOUT-1060) Search for nearest neighbor


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1060.
-

Resolution: Fixed

All of this capability has been added by Dan's streaming k-means clustering 
work except for the knn stuff.

 Search for nearest neighbor
 ---

 Key: MAHOUT-1060
 URL: https://issues.apache.org/jira/browse/MAHOUT-1060
 Project: Mahout
  Issue Type: Bug
  Components: Math
Reporter: Ted Dunning
 Fix For: 0.8

 Attachments: 
 0001-MAHOUT-1059-Added-Centroid-WeightedVector-Delegating.patch, 
 0001-MAHOUT-1059-Added-Centroid-WeightedVector-Delegating.patch, 
 0002-MAHOUT-1059-Stylistic-cleanups.patch, 
 0002-MAHOUT-1059-Stylistic-cleanups.patch, 
 0003-MAHOUT-1059-Add-generic-vector-test.patch, 
 0003-MAHOUT-1060-Move-distance-measures-to-math-as-much-a.patch, 
 0004-MAHOUT-1059-Indentation.patch, 
 0004-MAHOUT-1060-Add-basic-knn-capabilities.patch, 
 0005-MAHOUT-1059-Abstract-the-idea-of-a-cached-length.patch, 
 0006-MAHOUT-1059-Additional-test-for-weighted-vectors.patch, 
 0007-MAHOUT-1060-Move-distance-measures-to-math-as-much-a.patch, 
 0008-MAHOUT-1060-Add-basic-knn-capabilities.patch, 
 0009-MAHOUT-1060-shorten-test-sizes.patch


 This will contain a patch for sequential nearest neighbor search routines 
 that underpin new clustering algorithms.


[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input


[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672182#comment-13672182
 ] 

Grant Ingersoll commented on MAHOUT-1080:
-

Here's a thought: kill NamedVector, and move the single name string to 
Vector.  It seems to me naming a Vector is very, very common.  A possible 
issue, however, is dealing with older Vectors that don't have a name, but we 
could just treat it as an empty string.

IMO, this should be fixed before 1.0

 Kmeans clustered output losses vectorId given in the input
 --

 Key: MAHOUT-1080
 URL: https://issues.apache.org/jira/browse/MAHOUT-1080
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
Reporter: Smita Wadhwa
 Fix For: 0.8

 Attachments: kMeansClusterVectorId.diff


 The input to KMeans is (IntWritable, VectorWritable) pairs, and the output 
 of clustered points is (clusterId, 
 WeightedVectorWritable(vector, distance-from-the-centre)).
 The id of the input vector is lost in this processing.


[jira] [Commented] (MAHOUT-1070) DisplayKMeans example has transposed/mislabelled arguments


[ 
https://issues.apache.org/jira/browse/MAHOUT-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672180#comment-13672180
 ] 

Suneel Marthi commented on MAHOUT-1070:
---

Is someone looking at this patch? I can take a crack at this if there are no 
takers.

 DisplayKMeans example has transposed/mislabelled arguments
 --

 Key: MAHOUT-1070
 URL: https://issues.apache.org/jira/browse/MAHOUT-1070
 Project: Mahout
  Issue Type: Bug
  Components: Examples
Affects Versions: 0.7
Reporter: Gabriel Reid
Assignee: Paritosh Ranjan
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1070.patch



[jira] [Commented] (MAHOUT-1052) Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672183#comment-13672183
 ] 

Suneel Marthi commented on MAHOUT-1052:
---

I can get this patch in for the 0.8 release, but the quality of the clusters is 
still questionable. Nevertheless, this patch is still needed; I can open another 
JIRA for MinHash clustering itself (based on Broder's paper). Thoughts?

 Add an option to MinHashDriver that specifies the dimension of vector to hash 
 (indexes or values)
 -

 Key: MAHOUT-1052
 URL: https://issues.apache.org/jira/browse/MAHOUT-1052
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.6
Reporter: Elena Smirnova
Assignee: Suneel Marthi
Priority: Minor
  Labels: minhash
 Fix For: Backlog

 Attachments: MAHOUT-1052.patch


 Add a parameter to MinHash clustering that specifies the dimension of the 
 vector to hash (indexes or values). The current version of MinHash clustering 
 hashes only the values of vectors. Based on discussion on the dev-mahout 
 list, both use cases are possible and frequently met in practice. 
 Preserve backward compatibility by defaulting the dimension to values. Add 
 new unit tests.
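
 The difference between the two dimensions can be sketched as below. This is 
 a hedged illustration, not the MinHashDriver code: the class name, the 
 Dimension enum, the linear hash and the truncation of values to long are all 
 assumptions made for the example.

```java
final class MinHashDimensionSketch {
    enum Dimension { INDEX, VALUE }

    // Hypothetical linear hash; the hash functions used by Mahout may differ.
    static long hash(long x, long a, long b) {
        final long prime = 2147483647L; // 2^31 - 1
        return Math.floorMod(a * x + b, prime);
    }

    // Minimum hash over either the indexes or the (truncated) values of a
    // sparse vector given as parallel arrays.
    static long minHash(int[] indexes, double[] values, Dimension dim, long a, long b) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < indexes.length; i++) {
            long element = dim == Dimension.INDEX ? indexes[i] : (long) values[i];
            min = Math.min(min, hash(element, a, b));
        }
        return min;
    }

    public static void main(String[] args) {
        int[] indexes = {5, 9};
        double[] values = {1.0, 1.0};
        System.out.println(minHash(indexes, values, Dimension.INDEX, 3, 1));
        System.out.println(minHash(indexes, values, Dimension.VALUE, 3, 1));
    }
}
```

 In this sketch, a vector whose non-zero values are all 1.0 yields the same 
 value-based hash regardless of which indexes are set, which is one case 
 where hashing the indexes is the more useful choice.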


[jira] [Resolved] (MAHOUT-1070) DisplayKMeans example has transposed/mislabelled arguments


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1070.


Resolution: Fixed

Committed

 DisplayKMeans example has transposed/mislabelled arguments
 --

 Key: MAHOUT-1070
 URL: https://issues.apache.org/jira/browse/MAHOUT-1070
 Project: Mahout
  Issue Type: Bug
  Components: Examples
Affects Versions: 0.7
Reporter: Gabriel Reid
Assignee: Paritosh Ranjan
Priority: Minor
 Fix For: 0.8

 Attachments: MAHOUT-1070.patch



[jira] [Commented] (MAHOUT-1047) CVB hangs after completion


[ 
https://issues.apache.org/jira/browse/MAHOUT-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672186#comment-13672186
 ] 

Suneel Marthi commented on MAHOUT-1047:
---

Tested this patch and committing to trunk.

 CVB hangs after completion
 --

 Key: MAHOUT-1047
 URL: https://issues.apache.org/jira/browse/MAHOUT-1047
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.7
 Environment: Ubuntu
Reporter: seth boyles
Assignee: Suneel Marthi
Priority: Minor
  Labels: cvb, lda
 Fix For: 0.7, 0.8

 Attachments: MAHOUT-1047.patch, MAHOUT-1047-Show-Leak.patch


 After running the new LDA CVB implementation, the process hangs instead of 
 terminating as it does every other time I run Mahout.
 Terminal output:
 12/07/19 11:38:49 INFO mapred.LocalJobRunner: 
 12/07/19 11:38:49 INFO mapred.Task: Task 'attempt_local_0022_m_00_0' done.
 12/07/19 11:38:49 INFO mapred.JobClient:  map 100% reduce 0%
 12/07/19 11:38:49 INFO mapred.JobClient: Job complete: job_local_0022
 12/07/19 11:38:49 INFO mapred.JobClient: Counters: 8
 12/07/19 11:38:49 INFO mapred.JobClient:   File Output Format Counters 
 12/07/19 11:38:49 INFO mapred.JobClient: Bytes Written=2247793
 12/07/19 11:38:49 INFO mapred.JobClient:   File Input Format Counters 
 12/07/19 11:38:49 INFO mapred.JobClient: Bytes Read=1920337
 12/07/19 11:38:49 INFO mapred.JobClient:   FileSystemCounters
 12/07/19 11:38:49 INFO mapred.JobClient: FILE_BYTES_READ=1342812616
 12/07/19 11:38:49 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1326092302
 12/07/19 11:38:49 INFO mapred.JobClient:   Map-Reduce Framework
 12/07/19 11:38:49 INFO mapred.JobClient: Map input records=2772
 12/07/19 11:38:49 INFO mapred.JobClient: Spilled Records=0
 12/07/19 11:38:49 INFO mapred.JobClient: SPLIT_RAW_BYTES=140
 12/07/19 11:38:49 INFO mapred.JobClient: Map output records=2772
 12/07/19 11:38:49 INFO driver.MahoutDriver: Program took 4089950 ms (Minutes: 
 68.165834)
 $MAHOUT_HOME/mahout cvb -i 
 /home/seth/Scripted/mahout_data/vectors/vectors/vectors-for-cvb/ -o 
 /home/seth/Scripted/mahout_data/clusters/ -ow -k 90 -dt 
 /home/seth/Scripted/mahout_data/distributions -dict 
 /home/seth/Scripted/mahout_data/vectors/vectors/dictionary.file-0 -mt 
 /home/seth/Scripted/mahout_data/temp/ -x 20 -cd 0.05


[jira] [Assigned] (MAHOUT-1047) CVB hangs after completion


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1047:
-

Assignee: Suneel Marthi

 CVB hangs after completion
 --

 Key: MAHOUT-1047
 URL: https://issues.apache.org/jira/browse/MAHOUT-1047
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.7
 Environment: Ubuntu
Reporter: seth boyles
Assignee: Suneel Marthi
Priority: Minor
  Labels: cvb, lda
 Fix For: 0.7, 0.8

 Attachments: MAHOUT-1047.patch, MAHOUT-1047-Show-Leak.patch



[jira] [Updated] (MAHOUT-1206) Add density-based clustering algorithms to mahout


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1206:
---

Fix Version/s: Backlog

 Add density-based clustering algorithms to mahout
 -

 Key: MAHOUT-1206
 URL: https://issues.apache.org/jira/browse/MAHOUT-1206
 Project: Mahout
  Issue Type: Improvement
Reporter: Yexi Jiang
  Labels: clustering
 Fix For: Backlog


 The existing clustering algorithms (kmeans, fuzzy kmeans, dirichlet 
 clustering, and spectral clustering) assume that the data can be clustered 
 into regular hyperspheres or ellipsoids. In practice, however, not all data 
 can be clustered in this way. 
 To allow data to be clustered into arbitrary shapes, clustering algorithms 
 like DBSCAN, BIRCH, and CLARANS 
 (http://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering) have 
 been proposed.
 It would be good to implement one or more of these clustering algorithms 
 to enrich the clustering library. 


[jira] [Updated] (MAHOUT-1220) seqdirectory brings empty files out


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1220:
---

Fix Version/s: (was: 0.7)
Affects Version/s: (was: 0.7)
   Backlog

 seqdirectory brings empty files out
 ---

 Key: MAHOUT-1220
 URL: https://issues.apache.org/jira/browse/MAHOUT-1220
 Project: Mahout
  Issue Type: Bug
Affects Versions: Backlog
Reporter: Summer Lee
Priority: Minor

 I ran mahout seqdirectory on my input file:
 -- command
 mahout seqdirectory --input 
 user/hdfs/mahout_test/input2/mahout_input_final3_0.csv --output 
 /user/hdfs/mahout_test/output/final3/seqdirectory/
 but the result file, chunk-0, contains only this:
 -- chunk-0
 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
 I heard that chunk-0 files should contain numbers, like 
 SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text 1.0 2.0 ...
 I thought something was wrong with my input file, so I tried several other 
 input files, but the results are the same.
 How can I fix this? 


[jira] [Updated] (MAHOUT-1228) Cleanup .gitignore


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1228:
---

Affects Version/s: (was: 0.7)
   0.8

 Cleanup .gitignore
 --

 Key: MAHOUT-1228
 URL: https://issues.apache.org/jira/browse/MAHOUT-1228
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.8
Reporter: Stevo Slavic
Priority: Trivial
  Labels: eclipse, git, maven
 Attachments: mahout-gitignore.patch


 .gitignore unnecessarily has duplicate entries for ignoring Eclipse IDE 
 specific files and directories, as well as the Maven build output directory. 
 For the distribution module, the Maven build output directory is not ignored.


[jira] [Updated] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId


 [ 
https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-974:
--

Fix Version/s: 0.8

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob  use 
 integer as userId and itemId
 ---

 Key: MAHOUT-974
 URL: https://issues.apache.org/jira/browse/MAHOUT-974
 Project: Mahout
  Issue Type: Wish
  Components: Collaborative Filtering
Affects Versions: 0.8
Reporter: Han Hui Wen 
Assignee: Sebastian Schelter
  Labels: CF,recommendation,als
 Fix For: 0.8

   Original Estimate: 2h
  Remaining Estimate: 2h

 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses 
 integer as userId and itemId, but 
 org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and 
 org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and 
 itemId.
 It would be best if ParallelALSFactorizationJob also used Long as userId and 
 itemId, so that the same dataset can be used with all the recommendation 
 algorithms.
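
 One reason the mismatch matters can be sketched as follows. This is an 
 illustration under the assumption that long IDs would have to be narrowed 
 to int to be used with an int-keyed job; IdWidthSketch and narrow are 
 hypothetical names, not Mahout code.

```java
final class IdWidthSketch {
    // Narrowing cast: keeps only the low 32 bits of the ID.
    static int narrow(long id) {
        return (int) id;
    }

    public static void main(String[] args) {
        long smallId = 1L;
        long largeId = 4294967297L; // 2^32 + 1
        // Two distinct long IDs collapse to the same int ID.
        System.out.println(narrow(smallId) == narrow(largeId));
    }
}
```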


[jira] [Updated] (MAHOUT-1228) Cleanup .gitignore


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1228:
---

Fix Version/s: 0.8

 Cleanup .gitignore
 --

 Key: MAHOUT-1228
 URL: https://issues.apache.org/jira/browse/MAHOUT-1228
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.8
Reporter: Stevo Slavic
Priority: Trivial
  Labels: eclipse, git, maven
 Fix For: 0.8

 Attachments: mahout-gitignore.patch



[jira] [Created] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things

Ted Dunning created MAHOUT-1236:
---

 Summary: Need a cleaned up serialized format for Vectors to handle 
names and all other kinds of things
 Key: MAHOUT-1236
 URL: https://issues.apache.org/jira/browse/MAHOUT-1236
 Project: Mahout
  Issue Type: Bug
Reporter: Ted Dunning


Our current serialization is subject to several ills:

a) it breaks alignment by having a 1 byte flag field (evil, generic)

b) it doesn't handle any kind of extensible format like protobufs so it isn't 
future-proof

c) it doesn't handle named vectors very well

d) it totally breaks with any other kind of decoration as with Centroids or 
WeightedVector or ... (see b)

I propose that we use the current tag byte on the current serialization with a 
new flag bit that indicates that the vector will use a protobuf encoding.  Then 
3 bytes will be skipped to restore alignment.  Then there will be a protobuf 
encoding for the vector. 
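
The proposed layout (tag byte, 3 skipped bytes, then the encoded vector) can be sketched as below. This is a minimal illustration under stated assumptions: the flag bit value 0x80, the class name and the method names are hypothetical, and the payload is treated as opaque bytes rather than a real protobuf message.

```java
import java.nio.ByteBuffer;

final class AlignedHeaderSketch {
    // Hypothetical flag bit; the proposal does not say which bit would be used.
    static final int FLAG_PROTOBUF = 0x80;

    static byte[] write(int existingFlags, byte[] encodedVector) {
        ByteBuffer buf = ByteBuffer.allocate(4 + encodedVector.length);
        buf.put((byte) (existingFlags | FLAG_PROTOBUF)); // 1-byte tag with the new bit set
        buf.put(new byte[3]);                            // 3 padding bytes, so the payload starts at offset 4
        buf.put(encodedVector);                          // opaque payload (protobuf bytes in the proposal)
        return buf.array();
    }

    static boolean usesProtobuf(byte[] record) {
        return (record[0] & FLAG_PROTOBUF) != 0;
    }

    public static void main(String[] args) {
        byte[] record = write(0x01, new byte[] {42});
        System.out.println(record.length + " bytes, protobuf flag set: " + usesProtobuf(record));
    }
}
```

A reader that sees the flag bit unset could fall back to the existing format, which is how the old tag byte would keep both encodings distinguishable in this sketch.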


[jira] [Updated] (MAHOUT-1047) CVB hangs after completion


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1047:
--

   Resolution: Fixed
Fix Version/s: (was: 0.7)
   Status: Resolved  (was: Patch Available)

 CVB hangs after completion
 --

 Key: MAHOUT-1047
 URL: https://issues.apache.org/jira/browse/MAHOUT-1047
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.7
 Environment: Ubuntu
Reporter: seth boyles
Assignee: Suneel Marthi
Priority: Minor
  Labels: cvb, lda
 Fix For: 0.8

 Attachments: MAHOUT-1047.patch, MAHOUT-1047-Show-Leak.patch



[jira] [Assigned] (MAHOUT-1026) Add LDA (CVB implementation) to the cluster_reuters.sh example script


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1026:
-

Assignee: Suneel Marthi  (was: Jake Mannix)

 Add LDA (CVB implementation) to the cluster_reuters.sh example script
 -

 Key: MAHOUT-1026
 URL: https://issues.apache.org/jira/browse/MAHOUT-1026
 Project: Mahout
  Issue Type: Task
  Components: Clustering
Affects Versions: 0.8
Reporter: Sebastian Schelter
Assignee: Suneel Marthi
 Fix For: 0.8

 Attachments: MAHOUT-1026.patch, MAHOUT-1026.patch





[jira] [Updated] (MAHOUT-1153) Implement streaming random forests


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1153:
---

Fix Version/s: Backlog
Affects Version/s: (was: 0.7)

 Implement streaming random forests
 --

 Key: MAHOUT-1153
 URL: https://issues.apache.org/jira/browse/MAHOUT-1153
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Reporter: Andy Twigg
  Labels: features
 Fix For: Backlog


 The current random forest implementations are in-core and not scalable. This 
 issue is to add an out-of-core, scalable, streaming implementation. Initially 
 it could be based on [1], and using mappers in a master-worker style.
 [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf


[jira] [Updated] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

[
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Schelter updated MAHOUT-1214:
---

Fix Version/s: Backlog

Improve the accuracy of the Spectral KMeans Method
--

Key: MAHOUT-1214
URL: https://issues.apache.org/jira/browse/MAHOUT-1214
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.7
Environment: Mahout 0.7
Reporter: Yiqun Hu
Labels: clustering, improvement
Fix For: Backlog

The current implementation of the spectral KMeans algorithm (Ng et al.,
NIPS 2002) in version 0.7 has two serious issues. These two implementation
errors make it fail even on a very simple dataset. We have
implemented a solution that resolves both issues and hope to contribute it
back to the community.
# Issue 1:
The EigenVerificationJob in version 0.7 does not check the orthogonality of
eigenvectors, which is necessary to obtain the correct clustering results for
the case of K > 1. We have an idea and an implementation that selects
eigenvectors based on cosAngle/orthogonality.
# Issue 2:
The random seed initialization of the KMeans algorithm is not optimal, and
a bad initialization will sometimes generate a wrong clustering result. In this
case, the selected K eigenvectors actually provide a better way to initialize
the cluster centroids, because each selected eigenvector is a relaxed indicator
of the memberships of one cluster. For every selected eigenvector, we use the
data point whose eigen component achieves the maximum absolute value.
We have already verified our improvement on a synthetic dataset; it shows
that the improved version gets the optimal clustering result while the current
0.7 version obtains the wrong result.
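
The seeding rule described under Issue 2 can be sketched as follows. This is only an illustration of the rule as stated above, not the contributed patch: the class name, method name and array layout (eigenvectors[k][i] is the component of eigenvector k for data point i) are assumptions.

```java
final class EigenSeedSketch {
    // For each selected eigenvector, return the index of the data point whose
    // component has the largest absolute value.
    static int[] seedIndexes(double[][] eigenvectors) {
        int[] seeds = new int[eigenvectors.length];
        for (int k = 0; k < eigenvectors.length; k++) {
            int best = 0;
            for (int i = 1; i < eigenvectors[k].length; i++) {
                if (Math.abs(eigenvectors[k][i]) > Math.abs(eigenvectors[k][best])) {
                    best = i;
                }
            }
            seeds[k] = best;
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] eigenvectors = {
            {0.1, -0.9, 0.2},
            {0.7, 0.1, -0.3}
        };
        int[] seeds = seedIndexes(eigenvectors);
        System.out.println(seeds[0] + ", " + seeds[1]);
    }
}
```

The data points at the returned indexes would then serve as the initial centroids instead of randomly chosen ones.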

[jira] [Resolved] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1080.
-

Resolution: Duplicate

MAHOUT-1236 addresses this in the more general case

 Kmeans clustered output losses vectorId given in the input
 --

 Key: MAHOUT-1080
 URL: https://issues.apache.org/jira/browse/MAHOUT-1080
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
Reporter: Smita Wadhwa
 Fix For: 0.8

 Attachments: kMeansClusterVectorId.diff



[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things

2013-06-01 Thread Jake Mannix (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672188#comment-13672188
 ] 

Jake Mannix commented on MAHOUT-1236:
-

Why protobufs?  Why not thrift or avro?  Maybe we should make this pluggable?

 Need a cleaned up serialized format for Vectors to handle names and all other 
 kinds of things
 -

 Key: MAHOUT-1236
 URL: https://issues.apache.org/jira/browse/MAHOUT-1236
 Project: Mahout
  Issue Type: Bug
Reporter: Ted Dunning


[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things

2013-06-01 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672190#comment-13672190
]

Sean Owen commented on MAHOUT-1236:
---

There has always been a tension between all of the above and size. Protobufs
are probably the best option since they have a lean means of dealing with
optional fields (and variable-length ints, right?), but they will still increase
the size of each vector. That's probably worth it. There still has to be this
VectorWritable factory thing to work with Hadoop, but hey. Why does alignment
matter here?
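
On the variable-length ints: protobufs do encode integers as base-128 varints, where each byte carries 7 bits of the value and the high bit marks that more bytes follow. A minimal sketch of that encoding for a 32-bit value treated as unsigned is below; the class and method names are hypothetical, and this is not the protobuf library's own code.

```java
import java.io.ByteArrayOutputStream;

final class VarintSketch {
    // Encodes the low 7 bits first; the high bit of each byte signals continuation.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode(1).length);   // small values fit in one byte
        System.out.println(encode(300).length); // larger values take more bytes
    }
}
```

Small values such as low vector indexes take one byte instead of four, which is where the size savings relative to fixed-width ints would come from.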

Need a cleaned up serialized format for Vectors to handle names and all other
kinds of things
-

Key: MAHOUT-1236
URL: https://issues.apache.org/jira/browse/MAHOUT-1236
Project: Mahout
Issue Type: Bug
Reporter: Ted Dunning


[jira] [Commented] (MAHOUT-1065) Add CassandraDataModelTest

2013-06-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672191#comment-13672191
 ] 

Sean Owen commented on MAHOUT-1065:
---

AFAIK this is on hold until the dependencies are available. I would ice it for 
the foreseeable future.

 Add CassandraDataModelTest
 --

 Key: MAHOUT-1065
 URL: https://issues.apache.org/jira/browse/MAHOUT-1065
 Project: Mahout
  Issue Type: Test
  Components: Collaborative Filtering, Integration
Affects Versions: 0.8
Reporter: Eduardo Gurgel Pinho
Priority: Minor
  Labels: cassandra, collaborative-filtering, datamodel, hector, 
 taste, test
 Attachments: 0001-Add-CassandraDataModelTest.patch


 The test class for the CassandraDataModel class.


[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things

2013-06-01 Thread Jake Mannix (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672192#comment-13672192
]

Jake Mannix commented on MAHOUT-1236:
-

Thrift leaves off optional fields pretty well too, right? I've never seen much
difference in the sizes of the thrifts vs. protobufs vs. raw writables here at
Twitter (we've got some pretty heterogeneous sources).

What do you mean about a VectorWritable factory thing to work with hadoop?
You mean something like ProtobufWritableProtoVector or
ThriftWritableThriftVector, (where ProtoVector extends Message, and
ThriftVector extends TBase) ? ElephantBird has some good utilities for this
kind of thing. (e.g.
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ProtobufWritable.java
and
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ThriftWritable.java
)

Need a cleaned up serialized format for Vectors to handle names and all other
kinds of things
-

Key: MAHOUT-1236
URL: https://issues.apache.org/jira/browse/MAHOUT-1236
Project: Mahout
Issue Type: Bug
Reporter: Ted Dunning

[jira] [Updated] (MAHOUT-1026) Add LDA (CVB implementation) to the cluster_reuters.sh example script


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1026:
--

Attachment: MAHOUT-1026.patch

 Add LDA (CVB implementation) to the cluster_reuters.sh example script
 -

 Key: MAHOUT-1026
 URL: https://issues.apache.org/jira/browse/MAHOUT-1026
 Project: Mahout
  Issue Type: Task
  Components: Clustering
Affects Versions: 0.8
Reporter: Sebastian Schelter
Assignee: Suneel Marthi
 Fix For: 0.8

 Attachments: MAHOUT-1026.patch, MAHOUT-1026.patch, MAHOUT-1026.patch





[jira] [Commented] (MAHOUT-1026) Add LDA (CVB implementation) to the cluster_reuters.sh example script


[ 
https://issues.apache.org/jira/browse/MAHOUT-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672194#comment-13672194
 ] 

Suneel Marthi commented on MAHOUT-1026:
---

Jake,

Attached patch takes care of (a) and (b) above. I committed the code for 
MAHOUT-1047 to trunk (CVB0 clustering wouldn't work without the patch).  

I am not sure about the cluster quality in (c) - not too familiar with CVB0 to 
test that. 



 Add LDA (CVB implementation) to the cluster_reuters.sh example script
 -

 Key: MAHOUT-1026
 URL: https://issues.apache.org/jira/browse/MAHOUT-1026
 Project: Mahout
  Issue Type: Task
  Components: Clustering
Affects Versions: 0.8
Reporter: Sebastian Schelter
Assignee: Suneel Marthi
 Fix For: 0.8

 Attachments: MAHOUT-1026.patch, MAHOUT-1026.patch, MAHOUT-1026.patch





[jira] [Commented] (MAHOUT-1236) Need a cleaned up serialized format for Vectors to handle names and all other kinds of things

2013-06-01 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672197#comment-13672197
]

Sean Owen commented on MAHOUT-1236:
---

Yes it's probably very similar. The comment was more about size being an
important concern here too. For example, simpler still is to use Java
serialization. But it would serialize the class name with every instance, for
example. For a billion small vectors that's huge overhead.

That's no issue with these other options, where the reader/writer already know
the type and format anyway. The current 'format' is the ultimate in lean,
really. The size increase from using protobufs/Thrift/Avro would come
from having to represent optional fields with additional bytes some other way,
but that's still relatively minor. The big deal is representing integers
compactly, I think. I don't know Thrift/Avro but assume they probably have some
variable-length encoding too.
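
To make the point about compact integers concrete, here is a small sketch of
the base-128 "varint" scheme that protobufs use for unsigned integers: 7
payload bits per byte, with the high bit marking that more bytes follow. The
class and method names are made up for illustration and are not Mahout's or
protobuf's API.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of protobuf-style base-128 varint encoding for ints treated as unsigned. */
final class VarintSketch {
    /** Encodes the value 7 bits at a time, least significant group first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);  // low 7 bits, continuation bit set
            value >>>= 7;                      // unsigned shift so the loop always terminates
        }
        out.write(value);                      // final group, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a single varint from the start of the array. */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 300, 1_000_000}) {
            System.out.println(v + " takes " + encode(v).length + " byte(s)");
        }
    }
}
```

Small indices cost one or two bytes instead of a fixed four, which is where
most of the saving for sparse vectors with many small indices would come from.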

FWIW I don't think it's necessarily useful to support N serialization
mechanisms, that's not what I was referring to.
But it's similar in the sense that the problem is that the serialized format
isn't polymorphic. You have to write this generic all-encompassing format and
then have some object make (polymorphic, OOP) Java objects correctly from them.
That's what VectorWritable does. It's OK because with Hadoop we have to declare
the concrete type of the value upfront, and so we're always going to need this
level of indirection in order to fake polymorphism. That is, this lets you run
a job that consumes VectorWritable and actually send it sparse or dense
vectors.
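
A rough sketch of that indirection, assuming a one-byte tag that selects the
encoding. This is not the actual VectorWritable wire format; the tag values and
field order are invented for illustration, and the reader returns a plain
double[] where real code would build the matching Vector subclass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

/** Sketch: one declared value type whose first byte says which encoding follows. */
final class TaggedVectorSketch {
    static final int DENSE = 0;   // hypothetical tag values
    static final int SPARSE = 1;

    static void writeDense(DataOutput out, double[] values) throws IOException {
        out.writeByte(DENSE);
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    static void writeSparse(DataOutput out, int size, int[] indices, double[] values)
            throws IOException {
        out.writeByte(SPARSE);
        out.writeInt(size);
        out.writeInt(indices.length);
        for (int i = 0; i < indices.length; i++) {
            out.writeInt(indices[i]);
            out.writeDouble(values[i]);
        }
    }

    /** The reader does not know in advance which kind it gets; it dispatches on the tag. */
    static double[] read(DataInput in) throws IOException {
        int tag = in.readByte();
        int size = in.readInt();
        double[] result = new double[size];
        if (tag == DENSE) {
            for (int i = 0; i < size; i++) {
                result[i] = in.readDouble();
            }
        } else if (tag == SPARSE) {
            int nonZeros = in.readInt();
            for (int i = 0; i < nonZeros; i++) {
                int index = in.readInt();
                result[index] = in.readDouble();
            }
        } else {
            throw new IOException("unknown tag " + tag);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeSparse(new DataOutputStream(bytes), 4, new int[] {1, 3}, new double[] {2.5, 7.0});
        double[] v = read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(Arrays.toString(v));
    }
}
```

Every new decoration would need another tag or flag plus another branch in the
reader, which is the unwieldiness discussed in the next paragraph.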

Now, vectors aren't really going to change. They're indices and numbers.
Decorators may change, and while decorators fit cleanly into OOP, they make
the mismatch above worse. Right now it works fine with the 'named' extension
(what doesn't work well there?). But if you want 10 other decorations to be
represented, it will be unwieldy. That's the motivation for wanting a different
format. But are there 10 other extensions that are really necessary?

How many times do you want to transparently handle either a plain Vector or
DecoratedVector? If you actually want and need to know the difference, then you
don't need to model this as a 'decoration' and don't have the problem above.
Names? OK I can see not caring about whether it's named. Weights? yeah maybe.
Centroids? what's special about centroids for example?

Anyway I think that's the real question.

Need a cleaned up serialized format for Vectors to handle names and all other
kinds of things
-

Key: MAHOUT-1236
URL: https://issues.apache.org/jira/browse/MAHOUT-1236
Project: Mahout
Issue Type: Bug
Reporter: Ted Dunning

[jira] [Updated] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1084:


Fix Version/s: 0.8

We should make sure the examples work, so adding this to 0.8.  My env. is 
messed up right now, so I can't reproduce it at the moment.

 Kmeans for synthetic control example--there are 12 cluster during iterations.
 -

 Key: MAHOUT-1084
 URL: https://issues.apache.org/jira/browse/MAHOUT-1084
 Project: Mahout
  Issue Type: Bug
Reporter: liutengfei
 Fix For: 0.8


 In the Mahout k-means synthetic control example, the default parameters should
 produce 6 clusters. Why, then, are there 12 clusters during the k-means
 iterations? From what I observe, the first 6 clusters and the last 6 clusters
 are identical before the first iteration; those 6 clusters are generated by
 RandomSeedGenerator.java. CIMapper then assigns its points to these 12
 clusters. Is this a logic error?
 The 12 clusters are created by the setup function in CIMapper.java, more
 specifically by the line classifier.readFromSeqFiles(conf, new
 Path(priorClustersPath));. Here priorClustersPath is the HDFS directory
 output/clusters-0/, which contains 8 files: _policy, part-randomSeed (one
 file recording six clusters), and part-0 to part-5 (six files, each recording
 one cluster). When this directory is read, _policy is filtered out, so the
 program reads part-0 to part-5 to create six clusters and then reads
 part-randomSeed to create the other six. This is why there are 12 clusters
 before the first iteration.
   Solution: delete the code that creates the duplicate clusters in
 output/clusters-0/. Here I removed the code that creates the files part-0 to
 part-5 in ClusterClassifier.java:
   public void writeToSeqFiles(Path path) throws IOException {
     writePolicy(policy, path);
     /*
     Configuration config = new Configuration();
     FileSystem fs = FileSystem.get(path.toUri(), config);
     SequenceFile.Writer writer = null;
     ClusterWritable cw = new ClusterWritable();
     for (int i = 0; i < models.size(); i++) {
       try {
         Cluster cluster = models.get(i);
         cw.setValue(cluster);
         writer = new SequenceFile.Writer(fs, config,
             new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d", i)),
             IntWritable.class, ClusterWritable.class);
         Writable key = new IntWritable(i);
         writer.append(key, cw);
       } finally {
         Closeables.closeQuietly(writer);
       }
     }
     */
   }
 I don't know whether this is still okay for other programs that use this file,
 but for k-means in the synthetic control example the program now creates 6
 clusters in every iteration, as I expected.
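
 The double counting described in this report can be modelled without Hadoop.
 The sketch below only mirrors the arithmetic: the file names and per-file
 cluster counts are the ones given in the report, while the rule that names
 starting with an underscore are skipped is an assumption about how _policy
 gets filtered out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: count the clusters a reader would load from output/clusters-0/. */
final class PriorClustersSketch {
    /** Sums clusters over all files, skipping names that start with "_" (assumed filter rule). */
    static int countLoadedClusters(Map<String, Integer> clustersPerFile) {
        int total = 0;
        for (Map.Entry<String, Integer> file : clustersPerFile.entrySet()) {
            if (file.getKey().startsWith("_")) {
                continue;  // e.g. _policy
            }
            total += file.getValue();
        }
        return total;
    }

    /** Directory listing as described in the report: 8 files. */
    static Map<String, Integer> clusters0Listing() {
        Map<String, Integer> files = new LinkedHashMap<>();
        files.put("_policy", 0);
        files.put("part-randomSeed", 6);      // one file holding six clusters
        for (int i = 0; i <= 5; i++) {
            files.put("part-" + i, 1);        // six files holding one cluster each
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println("clusters loaded: " + countLoadedClusters(clusters0Listing()));
    }
}
```

 Dropping either part-randomSeed or the six per-cluster files from the listing
 brings the count back to 6, which matches the effect of the change proposed
 in the report.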


[jira] [Resolved] (MAHOUT-1208) Not able to get the distance from the cluster.


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1208.


Resolution: Won't Fix

 Not able to get the distance from the cluster.
 --

 Key: MAHOUT-1208
 URL: https://issues.apache.org/jira/browse/MAHOUT-1208
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.7
 Environment: Ubuntu
Reporter: Sameer Sebastian

 Hello,
 After clustering, when I run the clusterdump mahout command, the
 result doesn't include the distance.
 Is https://issues.apache.org/jira/browse/MAHOUT-1073 the only reason why this
 is happening?
 If there is a workaround without a patch, please tell me.
 Thanks,


[jira] [Updated] (MAHOUT-1204) Rewrite Benchmarks using Caliper


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1204:
---

Fix Version/s: Backlog

 Rewrite Benchmarks using Caliper
 

 Key: MAHOUT-1204
 URL: https://issues.apache.org/jira/browse/MAHOUT-1204
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: Backlog


 https://code.google.com/p/caliper/


[jira] [Resolved] (MAHOUT-1092) MultiNormal is slow in common case


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1092.
-

   Resolution: Fixed
Fix Version/s: 0.8

Ted says it's fixed on 4944dcc7

 MultiNormal is slow in common case
 --

 Key: MAHOUT-1092
 URL: https://issues.apache.org/jira/browse/MAHOUT-1092
 Project: Mahout
  Issue Type: Bug
Reporter: Ted Dunning
Priority: Minor
 Fix For: 0.8


 The multinormal generator unnecessarily uses matrix arithmetic for some 
 simple cases.


[jira] [Commented] (MAHOUT-1094) when i am giving the testing data from the new set of data without using split ..it is giving the completely wrong confusion matrix


[ 
https://issues.apache.org/jira/browse/MAHOUT-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672206#comment-13672206
 ] 

Grant Ingersoll commented on MAHOUT-1094:
-

Please provide more details and reproducible input

 when i am giving the testing data from the new set of data without using 
 split ..it is giving the completely wrong confusion matrix
 ---

 Key: MAHOUT-1094
 URL: https://issues.apache.org/jira/browse/MAHOUT-1094
 Project: Mahout
  Issue Type: Question
  Components: Classification
Affects Versions: 0.7
 Environment: hadoop 0.20
Reporter: Priyadarshan raj
  Labels: newbie
   Original Estimate: 336h
  Remaining Estimate: 336h

 Hi,
 I am able to successfully create a model with the command below:
 bin/mahout trainnb -i /user/cloudera/MahoutWeighted/1_FactWt-train-vectors -el -o /user/cloudera/MahoutWeighted/model -li /user/cloudera/MahoutWeighted/labelindex -ow
 
 But I am unable to use that model. When I feed the model test data in the
 same way I trained it, I do not get the correct confusion matrix, and the
 number of map output records equals the number of files I am feeding in. Can
 anyone tell me why it does not equal the number of lines?


[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1103:


Fix Version/s: 0.8

 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Paritosh Ranjan
  Labels: clusterpp
 Fix For: 0.8

 Attachments: MAHOUT-1103.patch


 After running kmeans clustering on a set of ~3M points, clusterpp fails to 
 populate directories for some clusters, no matter what k is.
 I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
 Even with k=2 only one cluster directory was created. For each reducer that 
 fails to produce directories there is an empty part-r-* file in the output 
 directory.
 Here is my command sequence for the k=2 run:
 bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
 bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
 bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom
 The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
 containing 2585843 and 1156624 points respectively.
 Discussion on the user mailing list suggested that this might be caused by 
 the default hadoop hash partitioner. The hashes of these two clusters aren't 
 identical, but they are close. Putting both cluster names into a Text and 
 calling hashCode() gives:
 VL-3742464 -> -685560454
 VL-3742466 -> -685560452
 Finally, when running with -xm sequential, everything performs as expected.
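
 The partitioner suspicion above can be checked with plain arithmetic. Hadoop's
 default HashPartitioner computes (hashCode & Integer.MAX_VALUE) % numReduceTasks;
 the sketch below applies that formula to the two hash codes quoted in this
 report. The reducer count of 2 is an assumption, since the report does not say
 how many reducers the job ran with.

```java
/** Sketch: where the two cluster ids from the report land under the default hash partitioning. */
final class HashPartitionSketch {
    /** Same arithmetic as Hadoop's default HashPartitioner.getPartition. */
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;  // assumed; not stated in the report
        System.out.println("VL-3742464 -> partition " + partition(-685560454, reducers));
        System.out.println("VL-3742466 -> partition " + partition(-685560452, reducers));
    }
}
```

 Both hash codes are even, so with two reducers both keys map to partition 0
 and the second reducer would receive nothing. That is consistent with the
 empty part-r-* file, though it does not by itself confirm the cause.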


[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters


[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672209#comment-13672209
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

[~dlyubimov] or [~mmolek] any updates on this?

 clusterpp is not writing directories for all clusters
 -

 Key: MAHOUT-1103
 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.8
Reporter: Matt Molek
Assignee: Paritosh Ranjan
  Labels: clusterpp
 Attachments: MAHOUT-1103.patch




[jira] [Updated] (MAHOUT-1200) Mahout tests depend on writing to /tmp/hadoop-$user


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1200:
---

Fix Version/s: 0.8

 Mahout tests depend on writing to /tmp/hadoop-$user
 ---

 Key: MAHOUT-1200
 URL: https://issues.apache.org/jira/browse/MAHOUT-1200
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.7
Reporter: Isabel Drost-Fromm
 Fix For: 0.8

 Attachments: MAHOUT-1200.patch, MAHOUT-1200.patch, MAHOUT-1200.patch


 Running the Mahout test suite creates the temp directory /tmp/hadoop-$user 
 which is used by all Hadoop related tests that pull up a local cluster. The 
 directory is not removed after running the tests. In particular when running 
 multiple tests in parallel on the same machine as the same user this can lead 
 to problems.
 To reproduce, issue the following commands prior to running the full test
 suite:
 mkdir /tmp/hadoop-$USER
 chmod 000 /tmp/hadoop-$USER
 mvn test
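
 One common way to avoid this kind of collision is to give each test run its
 own scratch directory and remove it afterwards. The sketch below shows only
 that part, using the JDK. Pointing Hadoop at the directory (for example via
 the hadoop.tmp.dir property, whose default is /tmp/hadoop-${user.name}) is left
 out, and nothing here is taken from the attached patches; the class and method
 names are made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Sketch: per-run scratch directory that is cleaned up after the tests. */
final class TestTmpDirSketch {
    /** Creates a uniquely named directory under the JVM's default temp location. */
    static Path createIsolatedTmpDir() throws IOException {
        return Files.createTempDirectory("mahout-test-");
    }

    /** Deletes the directory and everything below it, children before parents. */
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = createIsolatedTmpDir();
        Files.createFile(tmp.resolve("scratch.dat"));
        System.out.println("created " + tmp);
        deleteRecursively(tmp);
        System.out.println("still exists: " + Files.exists(tmp));
    }
}
```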


[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2013-06-01 Thread Pat Ferrel (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672210#comment-13672210
 ] 

Pat Ferrel commented on MAHOUT-1080:


+10

As a frequent user of named vectors I would love to see this supported 
generally.

 Kmeans clustered output losses vectorId given in the input
 --

 Key: MAHOUT-1080
 URL: https://issues.apache.org/jira/browse/MAHOUT-1080
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.7
Reporter: Smita Wadhwa
 Fix For: 0.8

 Attachments: kMeansClusterVectorId.diff


 The input to k-means is IntWritable and VectorWritable,
 and the output of clustered points is clusterId and
 WeightedVectorWritable(vector, distance-from-the-centre).
 The id of the input vector is lost in this processing.


[jira] [Commented] (MAHOUT-1108) cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true


[ 
https://issues.apache.org/jira/browse/MAHOUT-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672212#comment-13672212
 ] 

Grant Ingersoll commented on MAHOUT-1108:
-

Elmer, can you supply a patch?

 cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true
 ---

 Key: MAHOUT-1108
 URL: https://issues.apache.org/jira/browse/MAHOUT-1108
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.7
Reporter: Elmer Garduno
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 Got the following exception when running the command with HADOOP_CONF and  
 HADOOP_CONF_DIR
 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
 Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   ... 1 more


[jira] [Resolved] (MAHOUT-598) Downstream steps in the seq2sparse job flow looking in wrong location for output from previous steps when running in Elastic MapReduce (EMR) cluster


 [ 
https://issues.apache.org/jira/browse/MAHOUT-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-598.
---

Resolution: Cannot Reproduce

 Downstream steps in the seq2sparse job flow looking in wrong location for 
 output from previous steps when running in Elastic MapReduce (EMR) cluster
 

 Key: MAHOUT-598
 URL: https://issues.apache.org/jira/browse/MAHOUT-598
 Project: Mahout
  Issue Type: Bug
  Components: Integration
Affects Versions: 0.4, 0.5
 Environment: seq2sparse, Mahout 0.4, S3, EMR, Hadoop 0.20.2
Reporter: Timothy Potter
Assignee: Grant Ingersoll
 Fix For: 0.8


 While working on MAHOUT-588, I've discovered an issue with the seq2sparse job 
 running on EMR. From what I can tell this job is made up of multiple MR steps 
 and downstream steps are expecting output from previous steps to be in HDFS, 
 but the output is in S3 (see errors below). For example, the 
 DictionaryVectorizer wrote dictionary.file.0 to S3 but 
 TFPartialVectorReducer is looking for it in HDFS.
 To run this job, I spin up an EMR cluster and then add the following step to 
 it (this is using the elastic-mapreduce-ruby tool):
 elastic-mapreduce --jar s3n://thelabdude/mahout-core-0.4-job.jar \
 --main-class org.apache.mahout.driver.MahoutDriver \
 --arg seq2sparse \
 --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files-sm/ \
 --arg -o --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/ \
 --arg --weight --arg tfidf \
 --arg --chunkSize --arg 200 \
 --arg --minSupport --arg 2 \
 --arg --minDF --arg 1 \
 --arg --maxDFPercent --arg 90 \
 --arg --norm --arg 2 \
 --arg --maxNGramSize --arg 2 \
 --arg --overwrite \
 -j JOB_ID
 With these parameters, I see the following errors in the hadoop logs:
 java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
   at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)
 java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
   at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)
 java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
   at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/partial-vectors-0
   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
   at

[jira] [Updated] (MAHOUT-684) Topics regularization for LDA


 [ 
https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-684:
--

Fix Version/s: 0.8
 Assignee: Jake Mannix

Jake, please take a look at this one and commit/close as necessary.

 Topics regularization for LDA
 -

 Key: MAHOUT-684
 URL: https://issues.apache.org/jira/browse/MAHOUT-684
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Reporter: Vasil Vasilev
Assignee: Jake Mannix
Priority: Minor
  Labels: LDA.
 Fix For: 0.8

 Attachments: MAHOUT-684.patch, MAHOUT-684.patch, MAHOUT-684.patch


 Implementation provided for the alpha parameters estimation as described in 
 the paper of Blei, Ng and Jordan 
 (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
 Remark: there is a mistake in the last formula in A.4.2 (the signs are 
 wrong). The correct version is described here: 
 http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).


[jira] [Resolved] (MAHOUT-775) L2 does not work with TrainAdaptiveLogisticRegression


 [ 
https://issues.apache.org/jira/browse/MAHOUT-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-775.
---

Resolution: Fixed

 L2 does not work with TrainAdaptiveLogisticRegression
 -

 Key: MAHOUT-775
 URL: https://issues.apache.org/jira/browse/MAHOUT-775
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.6
Reporter: XiaoboGu
 Fix For: 0.8

 Attachments: MAHOUT-775.patch


 I have posted the problem to the dev list; see the following message:
 http://mail-archives.apache.org/mod_mbox/mahout-dev/201106.mbox/%3cbanlktik6153pjgcfnayuprwbv9jzcxp...@mail.gmail.com%3e


[jira] [Updated] (MAHOUT-1196) LogisticModelParameters uses csv.getTargetCategories() even if csv is not used.


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1196:
---

Fix Version/s: 0.8

 LogisticModelParameters uses csv.getTargetCategories() even if csv is not 
 used.
 ---

 Key: MAHOUT-1196
 URL: https://issues.apache.org/jira/browse/MAHOUT-1196
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.8
 Environment: All
Reporter: Vineet Krishnan
Priority: Trivial
  Labels: CSV, Classifier, LogisticModelParameters
 Fix For: 0.8

   Original Estimate: 1h
  Remaining Estimate: 1h

 saveTo(OutputStream out) tries to get csv.getTargetCategories() even when it 
 has already been set. In a case when CsvRecordFactory is not used, this gives 
 a NullPointerException when saveTo() is called.
 IMHO a simple null check for targetCategories is sufficient.
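
 A sketch of the suggested null check, using a simplified stand-in class rather
 than the real LogisticModelParameters. The CsvSource interface, the
 categoriesForSave method and the exception thrown when both values are missing
 are hypothetical names and choices made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in showing the null check; not the real Mahout class. */
final class ModelParametersSketch {
    /** Hypothetical stand-in for the CSV record factory. */
    interface CsvSource {
        List<String> getTargetCategories();
    }

    private List<String> targetCategories;  // may already be set by the caller
    private final CsvSource csv;            // null when no CSV factory is used

    ModelParametersSketch(List<String> targetCategories, CsvSource csv) {
        this.targetCategories = targetCategories;
        this.csv = csv;
    }

    /** Only falls back to the CSV source when the categories were not set explicitly. */
    List<String> categoriesForSave() {
        if (targetCategories == null) {
            if (csv == null) {
                throw new IllegalStateException("no target categories and no CSV source");
            }
            targetCategories = csv.getTargetCategories();
        }
        return targetCategories;
    }

    public static void main(String[] args) {
        ModelParametersSketch params =
            new ModelParametersSketch(Arrays.asList("spam", "ham"), null);
        System.out.println(params.categoriesForSave());  // no NullPointerException without a CSV source
    }
}
```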


[jira] [Commented] (MAHOUT-1196) LogisticModelParameters uses csv.getTargetCategories() even if csv is not used.


[ 
https://issues.apache.org/jira/browse/MAHOUT-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672217#comment-13672217
 ] 

Sebastian Schelter commented on MAHOUT-1196:


Vineet, any progress on this?

 LogisticModelParameters uses csv.getTargetCategories() even if csv is not 
 used.
 ---

 Key: MAHOUT-1196
 URL: https://issues.apache.org/jira/browse/MAHOUT-1196
 Project: Mahout
  Issue Type: Bug
  Components: Classification
Affects Versions: 0.8
 Environment: All
Reporter: Vineet Krishnan
Priority: Trivial
  Labels: CSV, Classifier, LogisticModelParameters
 Fix For: 0.8

   Original Estimate: 1h
  Remaining Estimate: 1h



[jira] [Updated] (MAHOUT-1179) GSOC 2013: Refactor and improve the classification APIs


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1179:
---

Fix Version/s: Backlog

 GSOC 2013: Refactor and improve the classification APIs
 ---

 Key: MAHOUT-1179
 URL: https://issues.apache.org/jira/browse/MAHOUT-1179
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
  Labels: gsoc2013, mentor
 Fix For: Backlog


 [via Andy Twigg]
 Improve and unify the Mahout classification API. Also related to the 
 refactoring of the clustering APIs MAHOUT-1177.
 The two APIs should be roughly the same, at least in
 terms of input/output so that pipelining etc is easier. (cf
 scikit-learn clustering/classifier/regression API)
 Currently Mahout supports:
 - logistic regression
 - Naive Bayes
 - Random Forests


[jira] [Updated] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1177:
---

Fix Version/s: Backlog

 GSOC 2013: Reform and simplify the clustering APIs
 --

 Key: MAHOUT-1177
 URL: https://issues.apache.org/jira/browse/MAHOUT-1177
 Project: Mahout
  Issue Type: Improvement
Reporter: Dan Filimon
  Labels: gsoc2013, mentor
 Fix For: Backlog


 Clustering is one of the most used features in Mahout and has many 
 applications [http://en.wikipedia.org/wiki/Cluster_analysis#Applications].
 We have lots of clustering algorithms. There are:
 - basic k-means
 - canopy clustering
 - Dirichlet clustering
 - Fuzzy k-means
 - Spectral k-means
 - Streaming k-means [coming soon]
 We want to make them easier to use by updating the APIs and making sure they 
 all work in the same way and have consistent inputs, outputs, diagnostics and 
 documentation.
 This is a great way to gain an in-depth understanding of clustering 
 algorithms, familiarize yourself with Hadoop, Mahout clustering and good 
 software engineering principles.


[jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

[
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Schelter updated MAHOUT-1178:
---

Fix Version/s: Backlog

GSOC 2013: Improve Lucene support in Mahout
---

Key: MAHOUT-1178
URL: https://issues.apache.org/jira/browse/MAHOUT-1178
Project: Mahout
Issue Type: New Feature
Reporter: Dan Filimon
Labels: gsoc2013, mentor
Fix For: Backlog

Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch

[via Ted Dunning]
It should be possible to view a Lucene index as a matrix. This would
require that we standardize on a way to convert documents to rows. There
are many choices, the discussion of which should be deferred to the actual
work on the project, but there are a few obvious constraints:
a) it should be possible to get the same result as dumping the term vectors
for each document each to a line and converting that result using standard
Mahout methods.
b) numeric fields ought to work somehow.
c) if there are multiple text fields, that ought to work sensibly as well.
Two options are to dump multiple matrices or to convert the fields
into a single row of a single matrix.
d) it should be possible to refer back from a row of the matrix to find the
correct document. This might be because we remember the Lucene doc number
or because a field is named as holding a unique id.
e) named vectors and matrices should be used if plausible.
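
As a rough illustration of constraints a) and d), the sketch below turns
per-document term frequencies into sparse matrix rows while remembering which
document each row came from. Everything in it is an assumption made for the
example: TermMatrixSketch is a hypothetical class, it uses neither Lucene nor
Mahout types, and the term frequencies are assumed to have been read from the
index already.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: neither Lucene nor Mahout classes are used here.
class TermMatrixSketch {
  private final Map<String, Integer> dictionary = new LinkedHashMap<>(); // term -> column
  private final List<Map<Integer, Double>> rows = new ArrayList<>();     // sparse rows
  private final List<String> rowIds = new ArrayList<>();                 // row -> document id

  // Adds one document, given its unique id and its term frequencies,
  // and returns the index of the matrix row that now represents it.
  int addDocument(String docId, Map<String, Integer> termFrequencies) {
    Map<Integer, Double> row = new TreeMap<>();
    for (Map.Entry<String, Integer> entry : termFrequencies.entrySet()) {
      Integer column = dictionary.get(entry.getKey());
      if (column == null) {
        // First time this term is seen: assign it the next free column.
        column = dictionary.size();
        dictionary.put(entry.getKey(), column);
      }
      row.put(column, entry.getValue().doubleValue());
    }
    rows.add(row);
    rowIds.add(docId);
    return rows.size() - 1;
  }

  // Returns the stored frequency, or 0.0 for terms absent from the row.
  double get(int row, String term) {
    Integer column = dictionary.get(term);
    if (column == null) {
      return 0.0;
    }
    Double value = rows.get(row).get(column);
    return value == null ? 0.0 : value;
  }

  // Constraint d): refer back from a row to the document it came from.
  String documentIdOf(int row) {
    return rowIds.get(row);
  }

  int numRows() {
    return rows.size();
  }

  int numCols() {
    return dictionary.size();
  }
}
```

Keeping the row-to-id list next to the rows is one simple way to satisfy d); a
named vector per row would serve the same purpose.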

[jira] [Resolved] (MAHOUT-804) Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles and search behaviours

[
https://issues.apache.org/jira/browse/MAHOUT-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Robin Anil resolved MAHOUT-804.
---

Resolution: Fixed

Seems to be exporting correctly now.

Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles
and search behaviours
--

Key: MAHOUT-804
URL: https://issues.apache.org/jira/browse/MAHOUT-804
Project: Mahout
Issue Type: Improvement
Components: Website
Reporter: Dan Brickley
Labels: atlassian, confluence, wiki

There are two styles of URL in circulation for URLs into Mahout's Wiki
(presumably an Apache-wide configuration issue):
https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html vs
https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition
They appear to be the self-same confluence 3.4.9 installation (or its raw
filetree). Each has a different search box at the top of the page. The
version with 'confluence/' in the path does a confluence search, and returns
similar URLs as results. The one with '.html' suffixes does a
domain-constrained Google search.
Despite markup canonicalising the confluence variant, i.e.
<link rel="canonical"
href="https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition">
appearing in the confluence pages, it seems the Google search results
typically throw people into the other version of the Wiki site.
This is all mildly confusing, mildly annoying but overall mostly harmless. It
could be having some negative impact on Google rank and suchlike, since
incoming links will be split between the two styles. Maybe this could be
passed along to the Wiki admins?
Which version does the Mahout team consider canonical URLs (for external
links etc)?

[jira] [Updated] (MAHOUT-1065) Add CassandraDataModelTest


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1065:
---

Affects Version/s: (was: 0.8)
Fix Version/s: Backlog

Going with what Sean said.

 Add CassandraDataModelTest
 --

 Key: MAHOUT-1065
 URL: https://issues.apache.org/jira/browse/MAHOUT-1065
 Project: Mahout
  Issue Type: Test
  Components: Collaborative Filtering, Integration
Reporter: Eduardo Gurgel Pinho
Priority: Minor
  Labels: cassandra, collaborative-filtering, datamodel, hector, 
 taste, test
 Fix For: Backlog

 Attachments: 0001-Add-CassandraDataModelTest.patch


 The test class for the CassandraDataModel class.


[jira] [Updated] (MAHOUT-1193) We may want a BlockSparseMatrix


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1193:
---

Fix Version/s: Backlog

 We may want a BlockSparseMatrix
 ---

 Key: MAHOUT-1193
 URL: https://issues.apache.org/jira/browse/MAHOUT-1193
 Project: Mahout
  Issue Type: Bug
Reporter: Ted Dunning
 Fix For: Backlog

 Attachments: MAHOUT-1193.patch


 Here is an implementation.
 Is it good enough to commit?
 Is it useful?
 Is it redundant?


[jira] [Commented] (MAHOUT-1108) cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true

2013-06-01 Thread Elmer Garduno (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13672224#comment-13672224
 ] 

Elmer Garduno commented on MAHOUT-1108:
---

I will submit it later today.

 cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true
 ---

 Key: MAHOUT-1108
 URL: https://issues.apache.org/jira/browse/MAHOUT-1108
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.7
Reporter: Elmer Garduno
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 Got the following exception when running the command with HADOOP_CONF and  
 HADOOP_CONF_DIR
 Exception in thread "main" java.lang.NoClassDefFoundError: 
 org/apache/hadoop/util/ProgramDriver
   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.hadoop.util.ProgramDriver
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   ... 1 more


[jira] [Updated] (MAHOUT-1175) IllegalStateException and FileNotFoundException occures when running mahout inbuilt mapreduce implementation of frequent pattern mining.


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1175:
---

Fix Version/s: Backlog

 IllegalStateException and FileNotFoundException occures when running mahout 
 inbuilt mapreduce implementation of frequent pattern mining.
 

 Key: MAHOUT-1175
 URL: https://issues.apache.org/jira/browse/MAHOUT-1175
 Project: Mahout
  Issue Type: Improvement
  Components: Frequent Itemset/Association Rule Mining
Affects Versions: 0.6
Reporter: Afsal Thaj
Priority: Minor
 Fix For: Backlog


 We cannot integrate the code for parallel frequent pattern mining into a 
 project that is supposed to run on an external server that connects to the 
 cluster. The program works fine only inside the cluster (from the command 
 line, to be specific). IllegalStateException and FileNotFoundException can 
 occur otherwise.


[jira] [Resolved] (MAHOUT-1162) Adding BallKMeans and StreamingKMeans classes


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1162.
-

Resolution: Fixed

This has been checked in.

 Adding BallKMeans and StreamingKMeans classes
 -

 Key: MAHOUT-1162
 URL: https://issues.apache.org/jira/browse/MAHOUT-1162
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.8
Reporter: Dan Filimon
 Fix For: 0.8

 Attachments: MAHOUT_1162_with_test.patch


 Adding BallKMeans and StreamingKMeans clustering algorithms.
 These both implement IterableCentroid and thus return the resulting 
 centroids after clustering.
 BallKMeans implements:
 - kmeans++ initialization;
 - a normal k-means pass;
 - a trimming threshold so that points that are too far from the cluster they 
 were assigned to are not used in the new centroid computation.
 StreamingKMeans implements 
 [http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf]:
 - an online clustering algorithm that takes each point into account one by one
   - for each point, it computes the distance to the nearest existing cluster
   - if the distance is greater than a set distanceCutoff, it will create a 
 new cluster, otherwise it might be added to the cluster it's closest to 
 (proportional to the value of the distance / distanceCutoff)
   - if there are too many clusters, the clusters will be *collapsed* (the 
 same method gets called, but the number of clusters is re-adjusted)
 - finally, *about as many* clusters as requested are returned (not precise!); 
 this represents a sketch of the original points.
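
 The per-point step described above can be sketched as follows. This is not
 the committed StreamingKMeans class: StreamingSketch is a hypothetical name,
 the collapse step for too many clusters is left out, and it assumes one
 reading of the description, namely that a point starts a new cluster with
 probability min(1, distance / distanceCutoff) and is otherwise merged into its
 nearest centroid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the per-point step; not the Mahout StreamingKMeans class.
class StreamingSketch {
  final List<double[]> centroids = new ArrayList<>();
  final List<Double> weights = new ArrayList<>();
  private final double distanceCutoff;
  private final Random random;

  StreamingSketch(double distanceCutoff, long seed) {
    this.distanceCutoff = distanceCutoff;
    this.random = new Random(seed);
  }

  // Euclidean distance between two points of equal dimension.
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  void add(double[] point) {
    // Find the nearest existing centroid, if there is one.
    int nearest = -1;
    double best = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centroids.size(); i++) {
      double d = distance(point, centroids.get(i));
      if (d < best) {
        best = d;
        nearest = i;
      }
    }
    // Assumed rule: new cluster with probability min(1, distance / distanceCutoff),
    // so a point farther away than the cutoff always starts a cluster.
    if (nearest < 0 || random.nextDouble() < best / distanceCutoff) {
      centroids.add(point.clone());
      weights.add(1.0);
      return;
    }
    // Otherwise fold the point into the nearest centroid as a weighted mean.
    double[] centroid = centroids.get(nearest);
    double weight = weights.get(nearest);
    for (int i = 0; i < centroid.length; i++) {
      centroid[i] = (centroid[i] * weight + point[i]) / (weight + 1);
    }
    weights.set(nearest, weight + 1);
  }
}
```

 Because the decision is random, the number of centroids depends on the seed
 and on the order of the points, which fits the remark that only about as many
 clusters as requested come back.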


[jira] [Updated] (MAHOUT-1152) mRMR feature selection algorithm

[
https://issues.apache.org/jira/browse/MAHOUT-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Schelter updated MAHOUT-1152:
---

Component/s: (was: Integration)

mRMR feature selection algorithm

Key: MAHOUT-1152
URL: https://issues.apache.org/jira/browse/MAHOUT-1152
Project: Mahout
Issue Type: Improvement
Affects Versions: 0.7
Reporter: Claudio Reggiani
Priority: Minor
Labels: algorithm, feature
Fix For: 0.8

Original Estimate: 336h
Remaining Estimate: 336h

Proposal Title: mRMR Feature Selection Algorithm on Map-Reduce.
Student Name: Claudio Reggiani
Student E-mail: nop...@gmail.com
Proposal Abstract:
The mRMR algorithm, described in [1], is a feature selection algorithm that
leverages mutual information evaluation to select features. At each
iteration, mRMR selects a new feature based on both how strongly it is
correlated with the target output and how weakly it is correlated with the
features already selected. The correlation is measured by means of mutual
information. The project proposes to provide the mRMR algorithm in the
MapReduce programming framework.
Additional information:
1. *The code is already available* with some tests: I'm working on my
master's thesis, and an initial milestone of my research was to implement the
mRMR algorithm in MapReduce.
2. I'm figuring out if it's possible for me to apply at Google Summer of Code
2013.
References:
[1] Hanchuan Peng, Fuhui Long, and Chris Ding
IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 27, No. 8, pp.1226-1238, 2005.
Link: http://penglab.janelia.org/papersall/docpdf/2005_TPAMI_FeaSel.pdf
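
The greedy selection described in the abstract can be sketched in memory as
below. This is an illustration only, not the proposed MapReduce code: MrmrSketch
is a hypothetical class, the mutual information values are assumed to be
precomputed, and the score used (relevance minus mean redundancy with the
features already selected) is the difference form of the criterion, chosen here
as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, in-memory sketch of greedy mRMR; not the proposed MapReduce code.
class MrmrSketch {
  // relevance[j]: mutual information between feature j and the target.
  // redundancy[i][j]: mutual information between features i and j.
  // Returns the indices of up to k features in the order they were selected.
  static List<Integer> select(double[] relevance, double[][] redundancy, int k) {
    List<Integer> selected = new ArrayList<>();
    boolean[] used = new boolean[relevance.length];
    while (selected.size() < k && selected.size() < relevance.length) {
      int bestFeature = -1;
      double bestScore = Double.NEGATIVE_INFINITY;
      for (int j = 0; j < relevance.length; j++) {
        if (used[j]) {
          continue;
        }
        // Mean mutual information with the features chosen so far.
        double meanRedundancy = 0;
        for (int i : selected) {
          meanRedundancy += redundancy[i][j];
        }
        if (!selected.isEmpty()) {
          meanRedundancy /= selected.size();
        }
        // Assumed criterion: high relevance, low redundancy.
        double score = relevance[j] - meanRedundancy;
        if (score > bestScore) {
          bestScore = score;
          bestFeature = j;
        }
      }
      used[bestFeature] = true;
      selected.add(bestFeature);
    }
    return selected;
  }
}
```

In the first iteration nothing has been selected, so the most relevant feature
wins; from the second iteration on, a feature that is highly redundant with the
current selection can lose to a less relevant but more independent one.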

[jira] [Resolved] (MAHOUT-1210) Fix URLs in mahout-collection-codegen-plugin pom


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1210.
-

   Resolution: Fixed
Fix Version/s: 0.8

Committed this.  Great (and obscure) catch, Stevo!

 Fix URLs in mahout-collection-codegen-plugin pom
 

 Key: MAHOUT-1210
 URL: https://issues.apache.org/jira/browse/MAHOUT-1210
 Project: Mahout
  Issue Type: Bug
  Components: build, collections, Math
Affects Versions: collections-1.0
Reporter: Stevo Slavic
Assignee: Benson Margulies
Priority: Minor
  Labels: maven
 Fix For: 0.8

 Attachments: mahout-collection-codegen-plugin-MAHOUT-1210.patch


 URLs in the mahout-collection-codegen-plugin trunk POM still point to the 
 Lucene project and the Lucene SVN repository.


[jira] [Updated] (MAHOUT-1152) mRMR feature selection algorithm

[
https://issues.apache.org/jira/browse/MAHOUT-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sebastian Schelter updated MAHOUT-1152:
---

Fix Version/s: (was: 0.8)
Backlog

mRMR feature selection algorithm

Original Estimate: 336h
Remaining Estimate: 336h

[jira] [Updated] (MAHOUT-950) Change BtJob to use new MultipleOutputs API