[jira] [Commented] (MAHOUT-1080) Kmeans clustered output losses vectorId given in the input

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672182#comment-13672182
 ] 

Grant Ingersoll commented on MAHOUT-1080:
-

Here's a thought: kill NamedVector, and move the single "name" string to 
Vector.  It seems to me naming a Vector is very, very common.  A possible 
issue, however, is dealing with older Vectors that don't have a name, but we 
could just treat it as an empty string.

IMO, this should be fixed before 1.0
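
A minimal sketch of the idea above, assuming a simplified vector type. SketchVector and its methods are hypothetical names for illustration and are not Mahout's Vector API. It shows a name carried directly on the vector, with vectors that have no name treated as having an empty string.

```java
// Hypothetical illustration only: SketchVector is not a Mahout class.
final class SketchVector {
    private final double[] values;
    private final String name; // never null; "" means the vector is unnamed

    // Older, unnamed vectors map onto the empty-string name.
    SketchVector(double[] values) {
        this(values, "");
    }

    SketchVector(double[] values, String name) {
        this.values = values.clone();
        this.name = (name == null) ? "" : name;
    }

    String getName() {
        return name;
    }

    boolean hasName() {
        return !name.isEmpty();
    }

    double get(int index) {
        return values[index];
    }

    int size() {
        return values.length;
    }

    public static void main(String[] args) {
        SketchVector legacy = new SketchVector(new double[] {1.0, 2.0});
        SketchVector named = new SketchVector(new double[] {3.0, 4.0}, "doc-42");
        System.out.println("legacy has name: " + legacy.hasName());
        System.out.println("named vector: " + named.getName());
    }
}
```

With the name on the vector itself, a clustering step that copies or wraps the vector would have one obvious field to carry along, rather than depending on a wrapper type surviving every transformation.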

> Kmeans clustered output losses vectorId given in the input
> --
>
> Key: MAHOUT-1080
> URL: https://issues.apache.org/jira/browse/MAHOUT-1080
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
>Reporter: Smita Wadhwa
> Fix For: 0.8
>
> Attachments: kMeansClusterVectorId.diff
>
>
> The input to KMeans is IntWritable and VectorWritable, 
> and the clustered-points output is clusterId and 
> WeightedVectorWritable(vector, distance-from-the-centre).
> The id of the input vector is lost in this processing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1070) DisplayKMeans example has transposed/mislabelled arguments

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1070:


Fix Version/s: 0.8

> DisplayKMeans example has transposed/mislabelled arguments
> --
>
> Key: MAHOUT-1070
> URL: https://issues.apache.org/jira/browse/MAHOUT-1070
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.7
>Reporter: Gabriel Reid
>Assignee: Paritosh Ranjan
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1070.patch
>
>
> The org.apache.mahout.clustering.display.DisplayKMeans example class uses a 
> value for k (numClusters) and maximum number of iterations to come to 
> convergence, but their use is transposed (i.e. the numClusters is used as max 
> iterations, and max iterations is used for numClusters). Furthermore, a 
> second hard-coded version of the value is used. The end result is that it's 
> not directly possible to experiment with different values of numClusters and 
> maxIterations.
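
A small sketch of why this kind of transposition goes unnoticed, assuming a hypothetical run(numClusters, maxIterations) method. It is not the real DisplayKMeans code or any Mahout signature. Because both parameters are ints, the swapped call compiles and runs without complaint.

```java
// Hypothetical illustration only: not the actual DisplayKMeans code.
final class KMeansArgsSketch {
    // Stand-in for a clustering entry point that takes k and an iteration cap.
    static String run(int numClusters, int maxIterations) {
        return "k=" + numClusters + ", maxIterations=" + maxIterations;
    }

    public static void main(String[] args) {
        int numClusters = 3;
        int maxIterations = 10;

        // Transposed call: both arguments are ints, so the compiler accepts it.
        System.out.println("transposed: " + run(maxIterations, numClusters));

        // Intended call.
        System.out.println("intended:   " + run(numClusters, maxIterations));
    }
}
```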



[jira] [Resolved] (MAHOUT-1060) Search for nearest neighbor

2013-06-01 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1060.
-

Resolution: Fixed

All of this capability has been added by Dan's streaming k-means clustering 
work except for the knn stuff.

> Search for nearest neighbor
> ---
>
> Key: MAHOUT-1060
> URL: https://issues.apache.org/jira/browse/MAHOUT-1060
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Reporter: Ted Dunning
> Fix For: 0.8
>
> Attachments: 
> 0001-MAHOUT-1059-Added-Centroid-WeightedVector-Delegating.patch, 
> 0001-MAHOUT-1059-Added-Centroid-WeightedVector-Delegating.patch, 
> 0002-MAHOUT-1059-Stylistic-cleanups.patch, 
> 0002-MAHOUT-1059-Stylistic-cleanups.patch, 
> 0003-MAHOUT-1059-Add-generic-vector-test.patch, 
> 0003-MAHOUT-1060-Move-distance-measures-to-math-as-much-a.patch, 
> 0004-MAHOUT-1059-Indentation.patch, 
> 0004-MAHOUT-1060-Add-basic-knn-capabilities.patch, 
> 0005-MAHOUT-1059-Abstract-the-idea-of-a-cached-length.patch, 
> 0006-MAHOUT-1059-Additional-test-for-weighted-vectors.patch, 
> 0007-MAHOUT-1060-Move-distance-measures-to-math-as-much-a.patch, 
> 0008-MAHOUT-1060-Add-basic-knn-capabilities.patch, 
> 0009-MAHOUT-1060-shorten-test-sizes.patch
>
>
> This will contain a patch for sequential nearest neighbor search routines 
> that underpin new clustering algorithms.



[jira] [Resolved] (MAHOUT-1117) Vectors are not hashable

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-1117.


Resolution: Won't Fix

> Vectors are not hashable
> 
>
> Key: MAHOUT-1117
> URL: https://issues.apache.org/jira/browse/MAHOUT-1117
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0
>Reporter: Dan Filimon
>Priority: Minor
>
> No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode().
> In working on improving clustering in Mahout, Ted Dunning wrote prototype 
> code for Streaming KMeans and Ball KMeans, that I'm working with him on. 
> These need to be used together in the MapReduce version.
> However, in Ball KMeans, we initialize the clusters using a probabilistic 
> approach similar to k-means++. This however requires a 
> Multinomial distribution of the points we want to cluster to 
> pick the centroids.
> Internally, the Multinomial uses a HashMap to keep track of the values it 
> can sample from.
> Since Vectors don't override Object's hashCode(), it is possible to get the 
> same value multiple times in the map (as long as the references differ).
> This is less of an issue because of how we're adding the vectors to the 
> multinomial (we can guarantee that the references will be unique) and once 
> MAHOUT-1116 is resolved the hashing will work okay for our needs.
> It still seems that it would be useful to have hashable vectors.
> What do you think? And what would a hash function look like?



[jira] [Commented] (MAHOUT-1117) Vectors are not hashable

2013-06-01 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672176#comment-13672176
 ] 

Robin Anil commented on MAHOUT-1117:


There is no single good way to hash a vector: most methods are heavy, and there is 
the additional overhead of caching the hash. If you do want to hash vectors, you 
can override hashCode() for your specific use cases. This is a design choice 
we should write down. 
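
One way to follow that suggestion for a specific use case is sketched below, assuming a dense array of doubles. HashableVectorSketch is a hypothetical wrapper, not a Mahout class. It hashes by content once at construction and caches the result, which is exactly the cost trade-off mentioned above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a content-hashed wrapper, not Mahout's Vector.
final class HashableVectorSketch {
    private final double[] values;
    private final int hash; // computed once; valid because values never change

    HashableVectorSketch(double[] values) {
        this.values = values.clone(); // defensive copy keeps the hash stable
        this.hash = Arrays.hashCode(this.values);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof HashableVectorSketch
            && Arrays.equals(values, ((HashableVectorSketch) other).values);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    public static void main(String[] args) {
        Map<HashableVectorSketch, Integer> counts = new HashMap<>();
        // Two distinct references with equal contents land on the same key.
        counts.merge(new HashableVectorSketch(new double[] {1.0, 2.0}), 1, Integer::sum);
        counts.merge(new HashableVectorSketch(new double[] {1.0, 2.0}), 1, Integer::sum);
        System.out.println("distinct keys: " + counts.size());
    }
}
```

The sketch only works because the wrapper is immutable. A mutable vector used as a map key would silently break lookups after any element changes, which is one reason a general-purpose hashCode() on vectors is hard to get right.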

> Vectors are not hashable
> 
>
> Key: MAHOUT-1117
> URL: https://issues.apache.org/jira/browse/MAHOUT-1117
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0
>Reporter: Dan Filimon
>Priority: Minor
>
> No *Vector classes (DenseVector, WeightedVector, etc.) implement hashCode().
> In working on improving clustering in Mahout, Ted Dunning wrote prototype 
> code for Streaming KMeans and Ball KMeans, that I'm working with him on. 
> These need to be used together in the MapReduce version.
> However, in Ball KMeans, we initialize the clusters using a probabilistic 
> approach similar to k-means++. This however requires a 
> Multinomial distribution of the points we want to cluster to 
> pick the centroids.
> Internally, the Multinomial uses a HashMap to keep track of the values it 
> can sample from.
> Since Vectors don't override Object's hashCode(), it is possible to get the 
> same value multiple times in the map (as long as the references differ).
> This is less of an issue because of how we're adding the vectors to the 
> multinomial (we can guarantee that the references will be unique) and once 
> MAHOUT-1116 is resolved the hashing will work okay for our needs.
> It still seems that it would be useful to have hashable vectors.
> What do you think? And what would a hash function look like?



[jira] [Commented] (MAHOUT-1065) Add CassandraDataModelTest

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672173#comment-13672173
 ] 

Grant Ingersoll commented on MAHOUT-1065:
-

[~eduardo.gurgel] [~srowen] any update on this one?  In or out for 0.8?

> Add CassandraDataModelTest
> --
>
> Key: MAHOUT-1065
> URL: https://issues.apache.org/jira/browse/MAHOUT-1065
> Project: Mahout
>  Issue Type: Test
>  Components: Collaborative Filtering, Integration
>Affects Versions: 0.8
>Reporter: Eduardo Gurgel Pinho
>Priority: Minor
>  Labels: cassandra, collaborative-filtering, datamodel, hector, 
> taste, test
> Attachments: 0001-Add-CassandraDataModelTest.patch
>
>
> The test class for the CassandraDataModel class.



[jira] [Resolved] (MAHOUT-1053) Use KMeans++ for cluster Initialization

2013-06-01 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1053.
-

Resolution: Fixed

This is resolved by the new streaming k-means stuff.

> Use KMeans++ for cluster Initialization
> ---
>
> Key: MAHOUT-1053
> URL: https://issues.apache.org/jira/browse/MAHOUT-1053
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Reporter: Paritosh Ranjan
> Fix For: 0.8
>
>
> Use KMeans++ for cluster initialization.
> Ted has already implemented a similar version. http://github.com/tdunning/knn



[jira] [Resolved] (MAHOUT-1054) Use ball KMeans for clustering

2013-06-01 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1054.
-

Resolution: Fixed

This is resolved by the new streaming k-means stuff.

> Use ball KMeans for clustering
> --
>
> Key: MAHOUT-1054
> URL: https://issues.apache.org/jira/browse/MAHOUT-1054
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Reporter: Paritosh Ranjan
> Fix For: 0.8
>
>
> Use ball KMeans for clustering.
> Ted has already implemented a similar version. http://github.com/tdunning/knn



[jira] [Commented] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId

2013-06-01 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672172#comment-13672172
 ] 

Saikat Kanjilal commented on MAHOUT-974:


Yes, although I could use some general guidance, being a newbie on this 
codebase. I've not had time to research this further; can you respond to my 
comments above?

Thanks

> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob  use 
> integer as userId and itemId
> ---
>
> Key: MAHOUT-974
> URL: https://issues.apache.org/jira/browse/MAHOUT-974
> Project: Mahout
>  Issue Type: Wish
>  Components: Collaborative Filtering
>Affects Versions: 0.8
>Reporter: Han Hui Wen 
>Assignee: Sebastian Schelter
>  Labels: CF,recommendation,als
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses 
> integer as userId and itemId, but 
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and 
> itemId.
> It would be best if ParallelALSFactorizationJob also used Long as userId and 
> itemId, so that the same dataset can be used with all the recommendation algorithms.
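
To illustrate why the integer/Long mismatch matters for a shared dataset, here is a minimal sketch in plain Java. IdWidthSketch and its methods are hypothetical and do not reflect how any Mahout job actually converts IDs. It shows two distinct long IDs collapsing to the same value when narrowed to int.

```java
// Hypothetical illustration only: not code from any Mahout job.
final class IdWidthSketch {
    // A plain narrowing cast keeps only the low 32 bits of the ID.
    static int narrow(long id) {
        return (int) id;
    }

    // True when the ID survives a round trip through int unchanged.
    static boolean fitsInInt(long id) {
        return id >= Integer.MIN_VALUE && id <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long first = 1L;
        long second = (1L << 32) + 1L; // differs from 'first' only above bit 31

        System.out.println("second fits in int: " + fitsInInt(second));
        System.out.println("narrowed IDs collide: " + (narrow(first) == narrow(second)));
    }
}
```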



[jira] [Resolved] (MAHOUT-1045) Cluster evaluators returning bad results

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1045.
-

Resolution: Fixed

Looks like it is in and passing.

> Cluster evaluators returning bad results
> 
>
> Key: MAHOUT-1045
> URL: https://issues.apache.org/jira/browse/MAHOUT-1045
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.6, 0.7, 0.8
> Environment: Several environments and data sets
>Reporter: Pat Ferrel
> Fix For: 0.8
>
> Attachments: first-time-density-nan.txt, MAHOUT-1045.patch, 
> MAHOUT-1045.patch, MAHOUT-1045.patch, MAHOUT-1045.patch
>
>
> With real world crawl data the Intra-cluster density from ClusterEvaluator is 
> almost always NaN. The CDbw inter-cluster density is almost always 0. I have 
> also seen several cases where CDbw fails to return any results but have not 
> tracked down why yet.
> I have sent a link to an 8G data set that reproduces these errors to Jeff 
> Eastman.



[jira] [Resolved] (MAHOUT-1041) Support for PMML

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1041.
-

Resolution: Won't Fix

Without a patch, I don't see putting this in.  Also, I don't see the benefit of 
storing largish models in XML.  I could see a specific issue covering I/O of 
PMML into Mahout's own formats, but I don't see anything running natively off of PMML.

> Support for PMML
> 
>
> Key: MAHOUT-1041
> URL: https://issues.apache.org/jira/browse/MAHOUT-1041
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
> Environment: Software Platform
>Reporter: Duraimurugan
> Fix For: Backlog
>
>
> Would like to request a support for PMML. With that once the predictive 
> models are built and provided in PMML format, we should be able to import 
> into hadoop cluster for scoring. This way models built in external 
> (non-mahout) systems can be imported to Hadoop/Mahout for scalable 
> environment.



[jira] [Updated] (MAHOUT-1204) Rewrite Benchmarks using Caliper

2013-06-01 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-1204:
---

Affects Version/s: 1.0

> Rewrite Benchmarks using Caliper
> 
>
> Key: MAHOUT-1204
> URL: https://issues.apache.org/jira/browse/MAHOUT-1204
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0
>Reporter: Robin Anil
>Assignee: Robin Anil
>
> https://code.google.com/p/caliper/



[jira] [Updated] (MAHOUT-1231) "No input clusters found in " error in kmeans

2013-06-01 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1231:
---

Affects Version/s: (was: 0.8)
   (was: 0.7)
   Backlog

> "No input clusters found in " error in kmeans
> -
>
> Key: MAHOUT-1231
> URL: https://issues.apache.org/jira/browse/MAHOUT-1231
> Project: Mahout
>  Issue Type: Question
>  Components: Clustering
>Affects Versions: Backlog
>Reporter: Summer Lee
>
> 1.seqdirectory
> > mahout seqdirectory --input /user/hdfs/input/new1.csv --output
> > /user/hdfs/new1/seqdirectory --tempDir
> > /user/hdfs/new1/seqdirectory/tempDir
> 2.seq2sparse 
> > mahout seq2sparse --input /user/hdfs/new1/seqdirectory --output
> > /user/hdfs/new1/seq2sparse -wt tfidf
> 3.kmeans 
> > mahout kmeans --input /user/hdfs/new1/seq2sparse/tfidf-vectors
> > --output /user/hdfs/new1/kmeans -c /user/hdfs/new1/clusters/kmeans -x 3 -k 
> > 3 --tempDir /user/hdfs/new1/kmeans/tempDir
> and then an error occurred:
> Failing Oozie Launcher, Main class [org.apache.mahout.driver.MahoutDriver], 
> main() threw exception, No input clusters found in 
> /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c 
> argument.
> java.lang.IllegalStateException: No input clusters found in 
> /user/oozie/mahout/z3/kmeansCopy/clusters/part-randomSeed. Check your -c 
> argument.
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217)
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148)
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at 
> org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:467)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Oozie Launcher failed, finishing Hadoop job gracefully
> Oozie Launcher ends
> ===
> Why can't the kmeans driver build clusters in Hadoop when run through Oozie?
> In Hadoop without Oozie, it worked.



[jira] [Resolved] (MAHOUT-1025) Update documentation for LDA before the release.

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1025.
-

Resolution: Fixed

> Update documentation for LDA before the release.
> 
>
> Key: MAHOUT-1025
> URL: https://issues.apache.org/jira/browse/MAHOUT-1025
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.7
>Reporter: Robin Anil
>Assignee: Jake Mannix
> Fix For: 0.8
>
>




[jira] [Resolved] (MAHOUT-1234) Canopy Clustering

2013-06-01 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1234.


Resolution: Won't Fix

> Canopy Clustering
> -
>
> Key: MAHOUT-1234
> URL: https://issues.apache.org/jira/browse/MAHOUT-1234
> Project: Mahout
>  Issue Type: Question
>  Components: Clustering
>Reporter: Sameer Sebastian
>
> Hello,
> I'm trying out Canopy clustering.
> I want to know, how to determine the optimum value for the distance 
> thresholds t1 and t2.
> Thanks.



[jira] [Updated] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-992:
---

Fix Version/s: 0.8

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Resolved] (MAHOUT-978) spectralkmeans utility fails when input filename begins with leading underscore

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-978.


Resolution: Won't Fix

I'd say, won't fix, as there is a workaround.  Please re-open if there is a 
specific patch.

> spectralkmeans utility fails when input filename begins with leading 
> underscore
> ---
>
> Key: MAHOUT-978
> URL: https://issues.apache.org/jira/browse/MAHOUT-978
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.6
> Environment: Tested on a real Linux-based cluster running Hadoop 
> 0.20.2-cdh3u2 and the 0.6 release; also OSX pseudo cluster running Hadoop 
> 0.20.203.0 running 16 Feb trunk build.
>Reporter: Dan Brickley
>Priority: Minor
> Attachments: jira-underscore-spectral-log.txt
>
>
> The commandline 'bin/mahout spectralkmeans' utility fails with 
> NoSuchElementException after "Loading vector from: 
> spectral/output/results2/calculations/diagonal/part-r-0"  when input data 
> in hdfs has filename beginning with a leading underscore.
> This was partially reported in comments for MAHOUT-524 but I believe 
> identified now as a distinct issue (thanks to Shannon for help diagnosing). I 
> have not investigated if there is an equivalent problem for API-based use of 
> this piece of Mahout.
> Steps to reproduce: 
> 1. put affinity file into hdfs, following 
> https://cwiki.apache.org/MAHOUT/spectral-clustering.html - note that node IDs 
> count from zero etc. Name your file with a leading underscore. For example, 
> try http://danbri.org/2012/spectral/dbpedia/_topic_skm.csv and store it in 
> spectral/input/_topic_skm.csv
> (I'll leave that example input file in place unchanged for others to try. It 
> is built from dbpedia data, encoding associations from Wikipedia pages to 
> categories. Whether it is a good use of spectral clustering I'm not sure, but 
> I'd at least hope the job would run to completion.)
> 2. Run 'mahout spectralkmeans -k 20 -d 4192499 -x 7 -i spectral/input/ -o 
> spectral/output/results1'
> 3. Wait for it to fail just after printing "Loading vector from: 
> spectral/output/results1/calculations/diagonal/part-r-0", with 
> java.util.NoSuchElementException at 
> com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152).
> 4. Rename the file in hdfs to eliminate the leading underscore. Re-run the 
> command (give a different results dir or cleanup from the first run, to avoid 
> mixing the tests). This attempt should succeed and you'll see it proceed 
> deeper into the job, i.e. something like 
> 12/02/19 14:38:32 INFO common.VectorCache: Loading vector from: 
> spectral/output/results2/calculations/diagonal/part-r-0
> 12/02/19 14:38:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
> the arguments. Applications should implement Tool for the same.
> 12/02/19 14:38:43 INFO input.FileInputFormat: Total input paths to process : 1
> 12/02/19 14:38:44 INFO mapred.JobClient: Running job: job_201202191410_0005
> 12/02/19 14:38:45 INFO mapred.JobClient:  map 0% reduce 0%
> 12/02/19 14:39:31 INFO mapred.JobClient:  map 1% reduce 0%
> (5. You might get a memory-based failure some time later; that is a separate 
> problem.)
> I'll attach a more detailed transcript. I've made no attempt to diagnose 
> internals yet, but did make some other tests and can confirm that it does not 
> seem to matter whether the commandline invocation names the file explicitly, 
> or by directory name only. Also trailing slash does not seem to be an issue. 
> Finally, a related 'gotcha': make sure the results directory is not inside 
> the input directory when testing.
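
A possible explanation, offered as an assumption rather than a confirmed diagnosis of this issue: Hadoop's FileInputFormat skips input files whose names start with "_" or ".", treating them as hidden. The sketch below mirrors that naming rule in plain Java without any Hadoop dependency. HiddenInputSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only: mirrors the hidden-file naming rule,
// and is not taken from the spectralkmeans code path.
final class HiddenInputSketch {
    // Names starting with '_' or '.' are treated as hidden and skipped.
    static boolean isVisible(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    static List<String> visibleInputs(List<String> fileNames) {
        return fileNames.stream()
            .filter(HiddenInputSketch::isVisible)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> inputDir = Arrays.asList("_topic_skm.csv", "topic_skm.csv");
        // Only the file without the leading underscore would be read.
        System.out.println(visibleInputs(inputDir));
    }
}
```

If the only file in the input directory is filtered out this way, a job stage would see no records, and a later stage reading its empty output could plausibly fail with a NoSuchElementException like the one reported.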



[jira] [Commented] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId

2013-06-01 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672163#comment-13672163
 ] 

Sebastian Schelter commented on MAHOUT-974:
---

Saikat, are you still on this?

> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob  use 
> integer as userId and itemId
> ---
>
> Key: MAHOUT-974
> URL: https://issues.apache.org/jira/browse/MAHOUT-974
> Project: Mahout
>  Issue Type: Wish
>  Components: Collaborative Filtering
>Affects Versions: 0.8
>Reporter: Han Hui Wen 
>Assignee: Sebastian Schelter
>  Labels: CF,recommendation,als
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses 
> integer as userId and itemId, but 
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and 
> itemId.
> It would be best if ParallelALSFactorizationJob also used Long as userId and 
> itemId, so that the same dataset can be used with all the recommendation algorithms.



[jira] [Updated] (MAHOUT-974) org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId

2013-06-01 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-974:
--

Affects Version/s: (was: 0.6)
   0.8

> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob  use 
> integer as userId and itemId
> ---
>
> Key: MAHOUT-974
> URL: https://issues.apache.org/jira/browse/MAHOUT-974
> Project: Mahout
>  Issue Type: Wish
>  Components: Collaborative Filtering
>Affects Versions: 0.8
>Reporter: Han Hui Wen 
>Assignee: Sebastian Schelter
>  Labels: CF,recommendation,als
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses 
> integer as userId and itemId, but 
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use Long as userId and 
> itemId.
> It would be best if ParallelALSFactorizationJob also used Long as userId and 
> itemId, so that the same dataset can be used with all the recommendation algorithms.



[jira] [Updated] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-966:
---

Fix Version/s: 0.8

> Mismatch in the number of points given by the clusterDumper and 
> ClusterOutputPostProcessor
> --
>
> Key: MAHOUT-966
> URL: https://issues.apache.org/jira/browse/MAHOUT-966
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.6
> Environment: hadoop 0.20.2 mahout 0.6 
>Reporter: Gaurav Redkar
>Priority: Minor
> Fix For: 0.8
>
> Attachments: cluster-dumper-output.txt, clusterpp-output.txt, 
> mtestdata.txt, points100dCCNorm.txt
>
>
> After running the post processor, the number of points that each cluster 
> contains does not match the number of points each cluster should contain as 
> stated by clusterdumper.
>  
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n reported in clusters-n-final for each cluster differs from 
> the number of points actually contained in the directory for each cluster. Any 
> idea why this is happening?



[jira] [Commented] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672161#comment-13672161
 ] 

Grant Ingersoll commented on MAHOUT-966:


Any update on this?  Seems like it should be fixed for 0.8

> Mismatch in the number of points given by the clusterDumper and 
> ClusterOutputPostProcessor
> --
>
> Key: MAHOUT-966
> URL: https://issues.apache.org/jira/browse/MAHOUT-966
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.6
> Environment: hadoop 0.20.2 mahout 0.6 
>Reporter: Gaurav Redkar
>Priority: Minor
> Attachments: cluster-dumper-output.txt, clusterpp-output.txt, 
> mtestdata.txt, points100dCCNorm.txt
>
>
>  After running the post processor, the number of points that each cluster 
> contains does not match the number of points each cluster should contain as 
> stated by clusterdumper.
>  
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n mentioned in clusters-n-final against each cluster is different from 
> the number of points actually contained in the directory for each cluster. Any 
> idea why this is happening?



[jira] [Updated] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-953:
---

Fix Version/s: 0.8

> ArffVectorIterable does not gracefully handle duplicate attribute name
> --
>
> Key: MAHOUT-953
> URL: https://issues.apache.org/jira/browse/MAHOUT-953
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.6
>Reporter: Stuart Smith
>Priority: Trivial
> Fix For: 0.8
>
>
> If you have duplicate attribute names in your ARFF file, and you have 
> non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an 
> ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size 
> of your attribute labels (duplicates removed), but your ARFF vectors could 
> have more values (if they reference the attribute at both indexes). This is a 
> somewhat pathological ARFF file.
> Not sure if I should note the error (throw an exception) in computeNext() 
> when it's out of bounds, or when someone tries to add a duplicate label to the 
> MapBackedArffModel.
> My first impulse would be to check in computeNext(), but addLabel() in 
> MapBackedArffModel will do something rather pathological in the case of 
> duplicate attributes: it overwrites the label map with the new index, but the 
> idxLabel map will hold a mapping from both indexes to the attribute name, so 
> they are out of sync. So it may be best to disallow duplicate attribute names 
> altogether by throwing an IllegalArgumentException.
> For example:
> @attribute my_attribute NUMERIC
> @attribute my_attribute NUMERIC
> addLabel()
> addLabel()
> labelBindings -> ('my_attribute', 1)
> idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
> I'll happily submit a patch, just wondering if it should be in computeNext() 
> or addLabel()
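
The addLabel() option raised in the description could look roughly like the sketch below. It is only an illustration: LabelModelSketch is a hypothetical stand-in and not the real MapBackedArffModel, and the two-argument addLabel(label, idx) signature and the map names are assumptions modelled on the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a label model; NOT the actual MapBackedArffModel.
final class LabelModelSketch {
  // name -> index, mirroring "labelBindings" in the description
  private final Map<String, Integer> labelBindings = new HashMap<String, Integer>();
  // index -> name, mirroring "idxLabel" in the description
  private final Map<Integer, String> idxLabel = new HashMap<Integer, String>();

  // Registers a label, rejecting names that are already bound so that the
  // two maps cannot drift out of sync.
  void addLabel(String label, int idx) {
    if (labelBindings.containsKey(label)) {
      throw new IllegalArgumentException("Duplicate attribute name '" + label
          + "' at index " + idx + " (already bound to index "
          + labelBindings.get(label) + ")");
    }
    labelBindings.put(label, idx);
    idxLabel.put(idx, label);
  }

  int labelCount() {
    return labelBindings.size();
  }

  int indexCount() {
    return idxLabel.size();
  }

  public static void main(String[] args) {
    LabelModelSketch model = new LabelModelSketch();
    model.addLabel("my_attribute", 0);
    try {
      model.addLabel("my_attribute", 1); // second @attribute with the same name
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

Failing at addLabel() reports the problem while the header is being read, before any DenseVector is sized from the label count.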



[jira] [Commented] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672158#comment-13672158
 ] 

Grant Ingersoll commented on MAHOUT-953:


Stuart, any chance you can get a patch for this to add in 0.8?

> ArffVectorIterable does not gracefully handle duplicate attribute name
> --
>
> Key: MAHOUT-953
> URL: https://issues.apache.org/jira/browse/MAHOUT-953
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.6
>Reporter: Stuart Smith
>Priority: Trivial
>
> If you have duplicate attribute names in your ARFF file, and you have 
> non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an 
> ArrayIndexOutOfBoundsException, as it allocates a DenseVector with the size 
> of your attribute labels (duplicates removed), but your ARFF vectors could 
> have more values (if they reference the attribute at both indexes). This is a 
> somewhat pathological ARFF file.
> Not sure if I should note the error (throw an exception) in computeNext() 
> when it's out of bounds, or when someone tries to add a duplicate label to the 
> MapBackedArffModel.
> My first impulse would be to check in computeNext(), but addLabel() in 
> MapBackedArffModel will do something rather pathological in the case of 
> duplicate attributes: it overwrites the label map with the new index, but the 
> idxLabel map will hold a mapping from both indexes to the attribute name, so 
> they are out of sync. So it may be best to disallow duplicate attribute names 
> altogether by throwing an IllegalArgumentException.
> For example:
> @attribute my_attribute NUMERIC
> @attribute my_attribute NUMERIC
> addLabel()
> addLabel()
> labelBindings -> ('my_attribute', 1)
> idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
> I'll happily submit a patch, just wondering if it should be in computeNext() 
> or addLabel()



[jira] [Updated] (MAHOUT-952) ARFFVectorIterable/MapBackedArffModel doesn't handle question mark '?', other ARFF issues

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-952:
---

Fix Version/s: 0.8

I think we can add this to 0.8.  Joe or Stuart, can you update this issue?

> ARFFVectorIterable/MapBackedArffModel doesn't handle question mark '?', other 
> ARFF issues
> -
>
> Key: MAHOUT-952
> URL: https://issues.apache.org/jira/browse/MAHOUT-952
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.6
> Environment: Latest SVN on ubuntu
>Reporter: Stuart Smith
>Priority: Minor
>  Labels: ARFF
> Fix For: 0.8
>
> Attachments: MAHOUT-952.patch
>
>
> Whatever is parsing the ARFF file for the ARFFVectorIterable (As far as I can 
> tell, it's the class itself) doesn't handle '?' as a marker for unknown 
> value. See: http://www.cs.waikato.ac.nz/~ml/weka/arff.html  
> I just started looking at Mahout classifiers this week, so I'm not sure how 
> to handle this yet. If I figure it out, I'll post a patch, but until then, 
> guidance would be helpful!
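
For illustration, one possible way to treat '?' when reading a dense ARFF data row is sketched below. The class and method names are hypothetical, the sketch assumes purely numeric attributes with no quoted values, and mapping '?' to Double.NaN is an assumption made for the example rather than what the attached patch or Mahout does.

```java
// Hypothetical helper; not part of Mahout's ARFF integration code.
final class ArffMissingValueSketch {

  // Parses one comma-separated dense data row of numeric attributes.
  // The ARFF missing-value marker '?' is mapped to Double.NaN (an assumption).
  static double[] parseDenseRow(String line) {
    String[] tokens = line.split(",");
    double[] values = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      String token = tokens[i].trim();
      values[i] = "?".equals(token) ? Double.NaN : Double.parseDouble(token);
    }
    return values;
  }

  public static void main(String[] args) {
    double[] row = parseDenseRow("1.5, ?, 3");
    for (double value : row) {
      System.out.println(Double.isNaN(value) ? "missing" : String.valueOf(value));
    }
  }
}
```

Downstream code would still have to decide how to handle the marker, for example by skipping the row or imputing a value.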



[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility

2013-06-01 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672154#comment-13672154
 ] 

Suneel Marthi commented on MAHOUT-884:
--

Also will be adding unit tests as part of committing this patch.

> Matrix Concatenate utility
> --
>
> Key: MAHOUT-884
> URL: https://issues.apache.org/jira/browse/MAHOUT-884
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Lance Norskog
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a 
> VectorWritable.
> The input and output files may skip rows. 



[jira] [Commented] (MAHOUT-950) Change BtJob to use new MultipleOutputs API

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672151#comment-13672151
 ] 

Grant Ingersoll commented on MAHOUT-950:


I think we still need to support 1.0.X, so I'm not sure how to handle this.

> Change BtJob to use new MultipleOutputs API
> ---
>
> Key: MAHOUT-950
> URL: https://issues.apache.org/jira/browse/MAHOUT-950
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Reporter: Tom White
> Attachments: MAHOUT-950.patch
>
>
> BtJob uses a mixture of the old and new MapReduce API to allow it to use 
> MultipleOutputs (which isn't available in Hadoop 0.20/1.0). This fails when 
> run against 0.23 (see MAHOUT-822), so we should change BtJob to use the new 
> MultipleOutputs API. (Hopefully the new MultipleOutputs API will be made 
> available in a 1.x release - see MAPREDUCE-3607.)



[jira] [Assigned] (MAHOUT-884) Matrix Concatenate utility

2013-06-01 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-884:


Assignee: Suneel Marthi

> Matrix Concatenate utility
> --
>
> Key: MAHOUT-884
> URL: https://issues.apache.org/jira/browse/MAHOUT-884
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Lance Norskog
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a 
> VectorWritable.
> The input and output files may skip rows. 



[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility

2013-06-01 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672150#comment-13672150
 ] 

Suneel Marthi commented on MAHOUT-884:
--

Agree with Sebastian. I can work on this later today.

> Matrix Concatenate utility
> --
>
> Key: MAHOUT-884
> URL: https://issues.apache.org/jira/browse/MAHOUT-884
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Lance Norskog
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a 
> VectorWritable.
> The input and output files may skip rows. 



[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

2013-06-01 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672149#comment-13672149
 ] 

Jake Mannix commented on MAHOUT-874:


So marking hadoop as provided is nice, and a smaller jar is great, but as I 
mentioned above, the size was never my primary concern; it was the dependency 
graph. It's really nice that mahout-math is a nice little non-hadoop-depending 
package which just does stats, linear algebra, and ml which don't have to think 
about hadoop stuff, even at compile time.  -core is big, because it's what 
mahout "is".  What I have been wanting is something a little in between, that 
depends on hadoop (but with provided scope) and mahout-math, but has the 
writables so that someone can work with mahout data inputs/outputs without 
actually linking to -core.

Essentially, it's the distinction between a "mahout-api" vs "mahout-impl" 
package.  Since our "API" is a file format, the "mahout-api" module is really 
just the set of writables needed to be able to marshall/unmarshall our binary 
data.

> Extract Writables into a separate module to allow smaller dependencies
> --
>
> Key: MAHOUT-874
> URL: https://issues.apache.org/jira/browse/MAHOUT-874
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Ted Dunning
>
> The theory is that we can have a smaller jar if we only include writable 
> classes and their exact dependencies.
> I have a prototype, but it has some funky characteristics which I would like 
> to discuss.



[jira] [Resolved] (MAHOUT-942) Improbe the way to process the missing value for DF.

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-942.


Resolution: Later

Please reopen when you have a patch

> Improbe the way to process the missing value for DF.
> 
>
> Key: MAHOUT-942
> URL: https://issues.apache.org/jira/browse/MAHOUT-942
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification
>Reporter: Ikumasa Mukai
>  Labels: DecisionForest
>
> If we process data which contains the missing value ("?"),
> the tree cannot be created because DataConverter.convert inserts a null 
> value
> into the list of Instances.
> Of course we can fix this issue by prohibiting DataConverter.convert from inserting
> the null value, but I notice that there is a possibility that the rows
> which have the missing value ("?") can also be used to make the tree.
> We can use them for making all stems on the edge where we use the missing 
> value.



[jira] [Commented] (MAHOUT-1206) Add density-based clustering algorithms to mahout

2013-06-01 Thread Yexi Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672147#comment-13672147
 ] 

Yexi Jiang commented on MAHOUT-1206:


Are there still no comments?

> Add density-based clustering algorithms to mahout
> -
>
> Key: MAHOUT-1206
> URL: https://issues.apache.org/jira/browse/MAHOUT-1206
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Yexi Jiang
>  Labels: clustering
>
> The clustering algorithms (kmeans, fuzzy kmeans, dirichlet clustering, and 
> spectral clustering) cluster data by assuming that the data can be grouped 
> into regular hyperspheres or ellipsoids. However, in practice, not all 
> data can be clustered in this way. 
> To enable data to be clustered in arbitrary shapes, clustering algorithms 
> like DBSCAN, BIRCH, CLARANCE 
> (http://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering) have 
> been proposed.
> It would be better if we implemented one or more of these clustering algorithms 
> to enrich the clustering library. 



[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility

2013-06-01 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672145#comment-13672145
 ] 

Sebastian Schelter commented on MAHOUT-884:
---

Regarding the patch: please make sure to always close readers in finally blocks 
and don't throw an InterruptedException if the job fails.
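
A minimal sketch of the two points above, using only the JDK. The class and method names are hypothetical and unrelated to the attached patch, and IllegalStateException is just one assumed alternative for reporting a failed job; the comment does not say which exception should be used instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration; not code from MAHOUT-884.patch.
final class CloseInFinallySketch {

  // Counts lines and closes the reader in a finally block, so it is released
  // even when readLine() throws.
  static int countLines(Reader source) throws IOException {
    BufferedReader reader = new BufferedReader(source);
    try {
      int lines = 0;
      while (reader.readLine() != null) {
        lines++;
      }
      return lines;
    } finally {
      reader.close();
    }
  }

  // Reports a failed job without InterruptedException, which is meant to
  // signal thread interruption. IllegalStateException is an assumption here.
  static void requireSuccess(boolean succeeded, String jobName) {
    if (!succeeded) {
      throw new IllegalStateException("Job failed: " + jobName);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countLines(new StringReader("row0\nrow1\nrow2")));
    requireSuccess(true, "concatenate");
  }
}
```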

> Matrix Concatenate utility
> --
>
> Key: MAHOUT-884
> URL: https://issues.apache.org/jira/browse/MAHOUT-884
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Lance Norskog
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a 
> VectorWritable.
> The input and output files may skip rows. 



[jira] [Commented] (MAHOUT-884) Matrix Concatenate utility

2013-06-01 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672144#comment-13672144
 ] 

Ted Dunning commented on MAHOUT-884:


Suneel, can you commit this if you think it is good?

> Matrix Concatenate utility
> --
>
> Key: MAHOUT-884
> URL: https://issues.apache.org/jira/browse/MAHOUT-884
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Lance Norskog
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-884.patch, MAHOUT-884.patch, MAHOUT-884.patch
>
>
> Utility to concatenate matrices stored as SequenceFiles of vectors.
> Each pair in the SequenceFile is the IntWritable row number and a 
> VectorWritable.
> The input and output files may skip rows. 



[jira] [Commented] (MAHOUT-874) Extract Writables into a separate module to allow smaller dependencies

2013-06-01 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672143#comment-13672143
 ] 

Ted Dunning commented on MAHOUT-874:


Jake,

Can you confirm that changing Hadoop to provided solved this for you?

I would like to mark this as fixed.

> Extract Writables into a separate module to allow smaller dependencies
> --
>
> Key: MAHOUT-874
> URL: https://issues.apache.org/jira/browse/MAHOUT-874
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Ted Dunning
>
> The theory is that we can have a smaller jar if we only include writable 
> classes and their exact dependencies.
> I have a prototype, but it has some funky characteristics which I would like 
> to discuss.



[jira] [Resolved] (MAHOUT-865) Refactor Sequential Clustering algorithms

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-865.


Resolution: Won't Fix

We should open issues for individual instances as desired.

> Refactor Sequential Clustering algorithms
> -
>
> Key: MAHOUT-865
> URL: https://issues.apache.org/jira/browse/MAHOUT-865
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Grant Ingersoll
>Priority: Minor
>
> We have a lot of implementations of sequential clustering algorithms that are 
> kind of treated as an afterthought by sticking them into the *Driver classes. 
>  We should pull them out into their own classes with real APIs so that people 
> can use them.



[jira] [Commented] (MAHOUT-836) On donating my Robust PCA Java code to Mahout

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672133#comment-13672133
 ] 

Grant Ingersoll commented on MAHOUT-836:


Hi Sujit,

This is interesting, do you have a patch?

> On donating my Robust PCA Java code to Mahout
> -
>
> Key: MAHOUT-836
> URL: https://issues.apache.org/jira/browse/MAHOUT-836
> Project: Mahout
>  Issue Type: New JIRA Project
>  Components: Classification
> Environment: Platform independent
>Reporter: Sujit Nair
>  Labels: newbie
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hi All,
> I have an implementation of Robust PCA (a.k.a low rank and sparse 
> decomposition) in Java which I would like to donate to Mahout. I am a MATLAB 
> expert, comfortable with C++ and have just started with Java. I am completely 
> new to Mahout but am very excited to participate and contribute. 
> I have tested my code exhaustively and there does not seem to be any issues. 
> The results are very good but the code definitely needs some optimization. 
> Please let me know if there is interest. 
> Thanks,
> Sujit



[jira] [Commented] (MAHOUT-804) Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles and search behaviours

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672132#comment-13672132
 ] 

Grant Ingersoll commented on MAHOUT-804:


Not sure what to do, perhaps we should move to the ASF CMS?

> Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles 
> and search behaviours
> --
>
> Key: MAHOUT-804
> URL: https://issues.apache.org/jira/browse/MAHOUT-804
> Project: Mahout
>  Issue Type: Improvement
>  Components: Website
>Reporter: Dan Brickley
>  Labels: atlassian, confluence, wiki
>
> There are two styles of URL in circulation for URLs into Mahout's Wiki 
> (presumably an Apache-wide configuration issue):
> https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html vs
> https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition
> They appear to be the self-same confluence 3.4.9 installation (or its raw 
> filetree). Each has a different search box at the top of the page. The 
> version with 'confluence/' in the path does a confluence search, and returns 
> similar URLs as results. The one with '.html' suffixes does a 
> domain-constrained Google search. 
> Despite markup canonicalising the confluence variant, i.e. <link rel="canonical" 
> href="https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition">
>  appearing in the confluence pages, it seems the Google search results 
> typically throw people into the other version of the Wiki site.
> This is all mildly confusing, mildly annoying but overall mostly harmless. It 
> could be having some negative impact on google rank & suchlike, since 
> incoming links will be split between the two styles. Maybe this could be 
> passed along to the Wiki admins? 
> Which version does the Mahout team consider canonical URLs (for external 
> links etc)?



[jira] [Resolved] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner

2013-06-01 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1235.


Resolution: Fixed

> ParallelALSFactorizationJob does not use VectorSumCombiner
> --
>
> Key: MAHOUT-1235
> URL: https://issues.apache.org/jira/browse/MAHOUT-1235
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Trivial
>




[jira] [Updated] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner

2013-06-01 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1235:
---

Fix Version/s: 0.8

> ParallelALSFactorizationJob does not use VectorSumCombiner
> --
>
> Key: MAHOUT-1235
> URL: https://issues.apache.org/jira/browse/MAHOUT-1235
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Trivial
> Fix For: 0.8
>
>




[jira] [Updated] (MAHOUT-775) L2 does not work with TrainAdaptiveLogisticRegression

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-775:
---

Fix Version/s: 0.8

> L2 does not work with TrainAdaptiveLogisticRegression
> -
>
> Key: MAHOUT-775
> URL: https://issues.apache.org/jira/browse/MAHOUT-775
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.6
>Reporter: XiaoboGu
> Fix For: 0.8
>
> Attachments: MAHOUT-775.patch
>
>
> I have posted the problem to the dev list; see the following message:
> http://mail-archives.apache.org/mod_mbox/mahout-dev/201106.mbox/%3cbanlktik6153pjgcfnayuprwbv9jzcxp...@mail.gmail.com%3e



[jira] [Commented] (MAHOUT-1126) Mac builds won't unjar

2013-06-01 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672129#comment-13672129
 ] 

Pat Ferrel commented on MAHOUT-1126:


Right you are, and so the solution has changed to deleting the file, not the 
directory. Still, it's a post-build process thing and new people have to figure 
out the solution over and over. There used to be a special exclude in 
examples/src/main/assembly/job.xml, shown below, but I don't think that works 
anymore. Maybe that could be the source of a permanent fix? I'm not a Maven 
expert.

BTW I don't build in examples but I do use it as an example of how to create a 
separate build and end up with the same problem because it includes the same 
deps and license. The problem is obviously not Mahout, but that is the 
infection vector...

 
org.apache.hadoop:hadoop-core

com.github.stephenc.high-scale-lib:high-scale-lib
  


> Mac builds won't unjar
> --
>
> Key: MAHOUT-1126
> URL: https://issues.apache.org/jira/browse/MAHOUT-1126
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8
> Environment: Builds on the Mac
>Reporter: Pat Ferrel
>  Labels: build
> Fix For: 0.8
>
>
> On the Mac you have to remove the licenses in the mahout jar or hadoop can't 
> unjar mahout. The Mac has a case insensitive file system and so can't tell 
> the difference between LICENSE and license. This was fixed at one point 
> https://issues.apache.org/jira/browse/MAHOUT-780
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/license/
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/LICENSE/
> Looks like as is mentioned in 
> https://issues.apache.org/jira/browse/MAHOUT-780 
> mv target/maven-shared-archive-resources/META-INF/LICENSE 
> target/maven-shared-archive-resources/META-INF/LICENSES
> works too.
> Can this get a permanent fix?
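
To make the collision concrete, here is a small sketch that flags archive entry names differing only by case, which is what a case-insensitive file system cannot keep apart when unpacking. CaseCollisionSketch and findCollisions are hypothetical names, and the entry list in main is illustrative rather than taken from an actual Mahout job jar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of the Mahout build.
final class CaseCollisionSketch {

  // Returns "first <-> second" for every pair of names that differ only by
  // case, ignoring a trailing slash on directory entries.
  static List<String> findCollisions(List<String> entryNames) {
    Map<String, String> seen = new HashMap<String, String>();
    List<String> collisions = new ArrayList<String>();
    for (String name : entryNames) {
      // "META-INF/license/" (directory) must compare equal to "META-INF/LICENSE" (file).
      String normalized = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
      String key = normalized.toLowerCase(Locale.ROOT);
      String previous = seen.get(key);
      if (previous == null) {
        seen.put(key, name);
      } else if (!previous.equals(name)) {
        collisions.add(previous + " <-> " + name);
      }
    }
    return collisions;
  }

  public static void main(String[] args) {
    List<String> entries = Arrays.asList(
        "META-INF/MANIFEST.MF", "META-INF/LICENSE", "META-INF/license/");
    for (String collision : findCollisions(entries)) {
      System.out.println(collision);
    }
  }
}
```

The same check could be fed from java.util.jar.JarFile entries as a post-build sanity test, though wiring it into the Maven build is outside this sketch.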



[jira] [Commented] (MAHOUT-684) Topics regularization for LDA

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672128#comment-13672128
 ] 

Grant Ingersoll commented on MAHOUT-684:


Any update on this?

> Topics regularization for LDA
> -
>
> Key: MAHOUT-684
> URL: https://issues.apache.org/jira/browse/MAHOUT-684
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Reporter: Vasil Vasilev
>Priority: Minor
>  Labels: LDA.
> Attachments: MAHOUT-684.patch, MAHOUT-684.patch, MAHOUT-684.patch
>
>
> Implementation provided for the alpha parameters estimation as described in 
> the paper of Blei, Ng and Jordan 
> (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
> Remark: there is a mistake in the last formula in A.4.2 (the signs are 
> wrong). The correct version is described here: 
> http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).



[jira] [Resolved] (MAHOUT-670) Provide a performance measurement framework for Mahout

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-670.


Resolution: Won't Fix

People who want this can get it off of Github, as there isn't a patch and GH is 
likely fine for this stuff

> Provide a performance measurement framework for Mahout
> --
>
> Key: MAHOUT-670
> URL: https://issues.apache.org/jira/browse/MAHOUT-670
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Reporter: Oliver B. Fischer
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: framework, performance, test, testing, testsuite
> Fix For: Backlog
>
>
> At the moment Mahout lacks a performance test framework. The 
> framework should be able to execute user-defined performance tests of 
> distributed and non-distributed algorithms, generate reports, and detect 
> regressions in the performance of Mahout.



[jira] [Updated] (MAHOUT-1132) fpgrowth2 crash when have not unique items in one line

2013-06-01 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-1132:


Fix Version/s: Backlog

> fpgrowth2 crash when have not unique items in one line
> --
>
> Key: MAHOUT-1132
> URL: https://issues.apache.org/jira/browse/MAHOUT-1132
> Project: Mahout
>  Issue Type: Bug
>Reporter: Kirill A. Korinskiy
> Fix For: Backlog
>
> Attachments: MAHOUT-1132.patch
>
>
> I create follow file as input for fpgrowth2:
> 0, 0, 0
> 0, 0, 0
> 0, 0, 0
> and when I run ./bin/mahout -i kv -o output -2 --mathod mapreduct I take a 
> crash:
> java.lang.IllegalStateException: mismatched counts for targetAttr=0, (3 != 9); thisTree=[FPTree
>   -{attr:-1, cnt:0}-1->-{attr:0, cnt:3}
> ]
>   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPTree.createMoreFreqConditionalTree(FPTree.java:259)
>   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.growth(FPGrowthIds.java:238)
>   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.fpGrowth(FPGrowthIds.java:163)
>   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.generateTopKFrequentPatterns(FPGrowthIds.java:220)
>   at org.apache.mahout.fpm.pfpgrowth.fpgrowth2.FPGrowthIds.generateTopKFrequentPatterns(FPGrowthIds.java:115)
>   at org.apache.mahout.fpm.pfpgrowth.ParallelFPGrowthReducer.reduce(ParallelFPGrowthReducer.java:99)
>   at org.apache.mahout.fpm.pfpgrowth.ParallelFPGrowthReducer.reduce(ParallelFPGrowthReducer.java:48)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> The attached patch fixes it.
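
The numbers in the exception (3 != 9) are consistent with one count taken per occurrence of item 0 and another taken per line that contains it: three lines with three copies each give 9 occurrences but only 3 supporting transactions. The sketch below illustrates that difference and one possible workaround, which is to deduplicate each line before mining. It is a hypothetical Python illustration, not the attached patch and not Mahout code; the names `dedupe_transaction`, `occurrence_count`, and `support_count` are made up for this example.

```python
from collections import Counter


def dedupe_transaction(items):
    """Drop repeated items from one transaction, keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def occurrence_count(transactions):
    """Count every occurrence of every item, duplicates included."""
    counts = Counter()
    for items in transactions:
        counts.update(items)
    return counts


def support_count(transactions):
    """Count the number of transactions that contain each item."""
    counts = Counter()
    for items in transactions:
        counts.update(set(items))
    return counts


# The input from the report: three lines, each "0, 0, 0".
raw_lines = ["0, 0, 0", "0, 0, 0", "0, 0, 0"]
transactions = [[tok.strip() for tok in line.split(",")] for line in raw_lines]

# After deduplication the two ways of counting agree again.
cleaned = [dedupe_transaction(t) for t in transactions]
```

On the raw input the two counters disagree for item "0"; on `cleaned` they agree, which is the invariant the exception message appears to check.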



[jira] [Commented] (MAHOUT-1126) Mac builds won't unjar

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672126#comment-13672126
 ] 

Grant Ingersoll commented on MAHOUT-1126:
-

When I build the examples job jar, I don't see a META-INF/LICENSES directory 
anymore.  There is a /META-INF/LICENSE file.  There is also a /licenses 
directory, but it is not in /META-INF

> Mac builds won't unjar
> --
>
> Key: MAHOUT-1126
> URL: https://issues.apache.org/jira/browse/MAHOUT-1126
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8
> Environment: Builds on the Mac
>Reporter: Pat Ferrel
>  Labels: build
> Fix For: 0.8
>
>
> On the Mac you have to remove the licenses in the Mahout jar or Hadoop can't 
> unjar Mahout. The Mac has a case-insensitive file system and so can't tell 
> the difference between LICENSE and license. This was fixed at one point in 
> https://issues.apache.org/jira/browse/MAHOUT-780
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/license/
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar META-INF/LICENSE/
> As mentioned in https://issues.apache.org/jira/browse/MAHOUT-780, 
> mv target/maven-shared-archive-resources/META-INF/LICENSE target/maven-shared-archive-resources/META-INF/LICENSES
> also works.
> Can this get a permanent fix?
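
To see whether a given jar is affected, one option is to list its entries and look for paths that differ only by letter case. The following is a minimal sketch using Python's standard `zipfile` module; the helper names `case_collisions` and `jar_case_collisions` and the in-memory archive layout are assumptions made for illustration, not part of the Mahout build.

```python
import io
import zipfile
from collections import defaultdict


def case_collisions(names):
    """Return groups of archive paths that differ only by letter case.

    Every path prefix is checked, so a file META-INF/LICENSE and a
    directory META-INF/license/ are reported as one colliding group.
    """
    spellings = defaultdict(set)
    for name in names:
        parts = [p for p in name.split("/") if p]
        for depth in range(1, len(parts) + 1):
            prefix = "/".join(parts[:depth])
            spellings[prefix.lower()].add(prefix)
    return sorted(sorted(group) for group in spellings.values() if len(group) > 1)


def jar_case_collisions(jar):
    """Open a jar (path or file-like object) and list its case collisions."""
    with zipfile.ZipFile(jar) as archive:
        return case_collisions(archive.namelist())


# Build a tiny in-memory archive with an assumed layout: a LICENSE file
# next to a lower-case license directory under META-INF.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("META-INF/LICENSE", "license text")
    archive.writestr("META-INF/license/LICENSE.txt", "bundled license")
buffer.seek(0)
collisions = jar_case_collisions(buffer)
```

Passing the path of a real job jar to `jar_case_collisions` would return an empty list if no two entries collide on a case-insensitive file system.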



[jira] [Created] (MAHOUT-1235) ParallelALSFactorizationJob does not use VectorSumCombiner

2013-06-01 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1235:
--

 Summary: ParallelALSFactorizationJob does not use VectorSumCombiner
 Key: MAHOUT-1235
 URL: https://issues.apache.org/jira/browse/MAHOUT-1235
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter
Priority: Trivial






[jira] [Updated] (MAHOUT-1162) Adding BallKMeans and StreamingKMeans classes

2013-06-01 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-1162:


Fix Version/s: 0.8

> Adding BallKMeans and StreamingKMeans classes
> -
>
> Key: MAHOUT-1162
> URL: https://issues.apache.org/jira/browse/MAHOUT-1162
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
> Fix For: 0.8
>
> Attachments: MAHOUT_1162_with_test.patch
>
>
> Adding BallKMeans and StreamingKMeans clustering algorithms.
> These both implement Iterable and thus return the resulting 
> centroids after clustering.
> BallKMeans implements:
> - kmeans++ initialization;
> - a normal k-means pass;
> - a trimming threshold so that points that are too far from the cluster they 
> were assigned to are not used in the new centroid computation.
> StreamingKMeans implements 
> [http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf]:
> - an online clustering algorithm that takes each point into account one by one
>   - for each point, it computes the distance to the nearest existing cluster
>   - if the distance is greater than a set distanceCutoff, it will create a 
> new cluster; otherwise the point might be added to the cluster it's closest to 
> (proportional to the value of the distance / distanceCutoff)
>   - if there are too many clusters, the clusters will be *collapsed* (the 
> same method gets called, but the number of clusters is re-adjusted)
> - finally, *about as many* clusters as requested are returned (not precise!); 
> this represents a sketch of the original points.
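
The online step described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the code in the attached patch: a point closer than the cutoff starts a new cluster with probability distance / distanceCutoff and is otherwise merged into its nearest cluster, and a collapse multiplies the cutoff by a hypothetical `growth` factor before re-clustering the centroids. The names `streaming_kmeans` and `_insert` and all parameters are made up for the example.

```python
import math
import random


def _insert(clusters, point, weight, cutoff, rng):
    """Add one weighted point to the cluster list, in place."""
    if not clusters:
        clusters.append([point, weight])
        return
    nearest = min(clusters, key=lambda c: math.dist(c[0], point))
    distance = math.dist(nearest[0], point)
    # Beyond the cutoff a new cluster is always created; closer points start
    # one with probability distance / cutoff (assumed rule).
    if distance > cutoff or rng.random() < distance / cutoff:
        clusters.append([point, weight])
    else:
        # Merge into the nearest cluster by taking the weighted mean.
        total = nearest[1] + weight
        nearest[0] = [
            (a * nearest[1] + b * weight) / total
            for a, b in zip(nearest[0], point)
        ]
        nearest[1] = total


def streaming_kmeans(points, max_clusters, distance_cutoff, seed=0, growth=1.5):
    """One pass over the points; returns a list of (centroid, weight) pairs."""
    if max_clusters < 1 or distance_cutoff <= 0:
        raise ValueError("max_clusters must be >= 1 and distance_cutoff > 0")
    rng = random.Random(seed)
    cutoff = distance_cutoff
    clusters = []  # each entry is [centroid, weight]
    for point in points:
        _insert(clusters, list(point), 1.0, cutoff, rng)
        while len(clusters) > max_clusters:
            # Too many clusters: raise the cutoff and run the same insertion
            # step over the existing centroids to collapse them.
            cutoff *= growth
            old, clusters = clusters, []
            for centroid, weight in old:
                _insert(clusters, centroid, weight, cutoff, rng)
    return [(tuple(c), w) for c, w in clusters]


# Two small, well-separated groups of 2-D points.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
sketch = streaming_kmeans(data, max_clusters=3, distance_cutoff=1.0, seed=42)
```

The number of clusters returned depends on the random draws, so it is only bounded by `max_clusters`; the total weight always equals the number of input points, which is what makes the result usable as a weighted sketch.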



[jira] [Updated] (MAHOUT-1154) Implementing Streaming KMeans

2013-06-01 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-1154:


Fix Version/s: 0.8

> Implementing Streaming KMeans
> -
>
> Key: MAHOUT-1154
> URL: https://issues.apache.org/jira/browse/MAHOUT-1154
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
> Fix For: 0.8
>
>
> An implementation of Streaming KMeans as mentioned in [1] is available here 
> [2].
> [1] http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3ccaowb3goyf9zufrgxhsucpkjxk6cw0nnr8gwg__jsey+kvab...@mail.gmail.com%3E
> [2] https://github.com/dfilimon/mahout
> Since there will be more than one patch, there will be specific JIRA issues 
> that address each one.
> The description of the code being added is:
> The main classes are in o.a.m.clustering.streaming [3], under the
> core/ project. These are subdivided into 2 packages:
> - cluster: contains the BallKMeans and StreamingKMeans classes that
> can be used standalone.
>   BallKMeans is exactly what it sounds like (uses k-means++ for the
> initialization, then does a normal k-means pass, ignoring
> outliers).
>   StreamingKMeans implements the online clustering that doesn't return
> exactly k clusters (it returns an estimate). This is used to
> approximate the data.
> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
> StreamingKMeansMapper and StreamingKMeansReducer classes.
>   CentroidWritable serializes Centroids (sort of like AbstractCluster).
>   StreamingKMeansDriver provides the driver for the job.
>   StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
> sketches of the data for the reducer.
>   StreamingKMeansReducer collects the centroids produced by the
> mappers into one set of weighted points and runs BallKMeans on them
> producing the final results.
> Additionally the searchers are in o.a.m.math.neighborhood
> - neighborhood: various searcher classes that implement nearest-neighbor
> search using different strategies.
>   Searcher, UpdatableSearcher: abstract classes that define how to
> search through collections of vectors.
>   BruteSearch: does a brute search (looks at every point...)
>   ProjectionSearch: uses random projections for searching.
>   FastProjectionSearch: also uses random projections (but not binary
> search trees as in ProjectionSearch).
>   HashedVector, LocalitySensitiveHashSearch: implement locality
> sensitive hash search.
> All the tools that I used are in o.a.m.clustering.streaming [4], under
> the examples/ project.
> There are a bunch of classes here, covering everything from
> vectorizing 20 newsgroups data to various IO utils. The more important
> ones are:
>   utils.ExperimentUtils: convenience methods.
>   tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
> [3] 
> https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
> [4] 
> https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
> The relevant issues are:
> - MAHOUT-1155 (Centroid, WeightedVector)
> - MAHOUT-1156 (searchers)
> - MAHOUT-1162 (clustering, non map-reduce)
> - MAHOUT-1181 (map-reduce, command-line changes, pom.xml)
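
The "normal k-means pass that ignores outliers" mentioned for BallKMeans can be illustrated with a small Python sketch. It assumes a single fixed `trim_distance` and leaves out the k-means++ initialization; the actual class may define its threshold differently, and the function name `trimmed_kmeans_pass` is hypothetical.

```python
import math


def trimmed_kmeans_pass(points, centroids, trim_distance):
    """One k-means pass that skips points farther than trim_distance
    from the centroid they are assigned to."""
    sums = [[0.0] * len(c) for c in centroids]
    counts = [0] * len(centroids)
    for point in points:
        # Assign the point to its nearest centroid.
        index = min(
            range(len(centroids)),
            key=lambda i: math.dist(centroids[i], point),
        )
        if math.dist(centroids[index], point) > trim_distance:
            continue  # treated as an outlier: assigned, but not averaged in
        counts[index] += 1
        sums[index] = [s + x for s, x in zip(sums[index], point)]
    # A centroid that kept no points stays where it was.
    return [
        tuple(s / n for s in total) if n else tuple(old)
        for total, n, old in zip(sums, counts, centroids)
    ]


# One centroid, two nearby points and one far-away outlier.
points = [(0.0, 0.0), (2.0, 0.0), (100.0, 0.0)]
trimmed = trimmed_kmeans_pass(points, [(1.0, 0.0)], trim_distance=5.0)
untrimmed = trimmed_kmeans_pass(points, [(1.0, 0.0)], trim_distance=1000.0)
```

With the small threshold the outlier is left out of the mean and the centroid stays between the two nearby points; with the large threshold the outlier drags the centroid toward it.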



[jira] [Commented] (MAHOUT-1154) Implementing Streaming KMeans

2013-06-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672125#comment-13672125
 ] 

Grant Ingersoll commented on MAHOUT-1154:
-

[~dfilimon] can this be resolved/closed?

> Implementing Streaming KMeans
> -
>
> Key: MAHOUT-1154
> URL: https://issues.apache.org/jira/browse/MAHOUT-1154
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Dan Filimon
>



[jira] [Updated] (MAHOUT-1201) Some Mahout jobs do not pass user supplied Configuration object to sub jobs

2013-06-01 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1201:


Fix Version/s: 0.8

> Some Mahout jobs do not pass user supplied Configuration object to sub jobs
> ---
>
> Key: MAHOUT-1201
> URL: https://issues.apache.org/jira/browse/MAHOUT-1201
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering, Frequent Itemset/Association Rule Mining, 
> Math
>Affects Versions: 0.7
>Reporter: Isabel Drost-Fromm
> Fix For: 0.8
>
> Attachments: MAHOUT-1201-clustering.patch, MAHOUT-1201-entropy.patch, 
> MAHOUT-1201-pfpgrowth.patch, MAHOUT-1201-solver.patch
>
>
> Some (see patch) of our Hadoop jobs do not pass a user supplied configuration 
> object down to sub jobs created. As a result some Hadoop related settings may 
> not be honored.



0.8 and bug squashing on June 1

2013-06-01 Thread Grant Ingersoll
A few of us are at Berlin Buzzwords hanging out and working on Mahout, so if 
you are interested, feel free to jump on IRC (#mahout on freenode) for some 
discussion.  Not all of our conversation will be translated to IRC, but we are 
happy to interact w/ others if interested.

Also, sounds like maybe we are ready for 0.8?  Or at least close?  I 
volunteered to do the release, so I'm going to start going through the 0.8 JIRA 
issues and triaging them.  If you want something in for 0.8, speak now (or 
relatively soon).  I'd like to suggest trying to get an RC out this coming week 
or the following.

-Grant


Grant Ingersoll | @gsingers
http://www.lucidworks.com

