[jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2013-10-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809813#comment-13809813
 ] 

Grant Ingersoll commented on MAHOUT-1030:
-

Andrew, I suppose it depends on what part of it you want to address.  If it is 
the literal part of this bug, Pat has been pretty responsive.  If it is the 
reworking of the properties of vectors, that is probably best handled on the 
mailing list.  The basic gist being we want to more intelligently handle vector 
properties and get rid of NamedVector.  [~tdunning], [~robinanil] and others 
may have some thoughts here as well.

(FWIW, I'd prefer the latter to be tackled.)

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> 
>
> Key: MAHOUT-1030
> URL: https://issues.apache.org/jira/browse/MAHOUT-1030
> Project: Mahout
> Issue Type: Bug
> Components: Clustering, Integration
> Affects Versions: 0.7
> Reporter: Jeff Eastman
> Assignee: Andrew Musselman
> Fix For: 1.0, 0.9
>
> Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on 
> code and tests and I don't know which properties were implemented in the old 
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new 
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix 
> > and I will see if I can make the change before Paritosh cuts the release 
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as 
> >> WeightedVectorWritable where in mahout 0.6 they were 
> >> WeightedPropertyVectorWritable? This means that the distance from the 
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is 
> >> not a welcome change. How is one to order clustered docs by distance from 
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the 
> >> centroid for the cluster id given in the above WeightedVectorWritable, 
> >> which means iterating through all the clusters for each clustered doc. In 
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >
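
A minimal sketch of the ordering problem Pat describes in the quoted thread, written in Python rather than against Mahout's Java API. Every name here (euclidean, order_by_distance, the (cluster_id, vector) layout) is hypothetical. It assumes the centroids fit in memory as a table keyed by cluster id, so recomputing a missing distance costs one lookup per point instead of a scan over all clusters.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def order_by_distance(points, centroids):
    """Sort clustered points by cluster id, then by distance to their centroid.

    points    -- iterable of (cluster_id, vector) pairs (hypothetical layout)
    centroids -- dict mapping cluster_id -> centroid vector, built once
    """
    scored = [(cid, vec, euclidean(vec, centroids[cid])) for cid, vec in points]
    return sorted(scored, key=lambda item: (item[0], item[2]))


# Made-up data: two clusters, three clustered points.
centroids = {0: [0.0, 0.0], 1: [10.0, 10.0]}
points = [(0, [3.0, 4.0]), (1, [10.0, 11.0]), (0, [1.0, 0.0])]
for cid, vec, dist in order_by_distance(points, centroids):
    print(cid, vec, dist)
```

Storing the distance with each clustered point, as the issue title asks, would make the recomputation unnecessary; the sketch only shows that the fallback need not iterate through every cluster.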



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-07-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724029#comment-13724029
 ] 

Grant Ingersoll commented on MAHOUT-627:


Dhruv,

Any chance this can get done?

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Affects Versions: 0.4, 0.5
> Reporter: Dhruv Kumar
> Assignee: Grant Ingersoll
> Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.9
>
> Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: dku...@ecs.umass.edu 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
> because of its superior numerical stability and its ability to guarantee the 
> discovery of a locally maximum Maximum Likelihood Estimator in the 
> presence of incomplete training data. Currently, Apache Mahout has a 
> sequential implementation of Baum-Welch, which cannot be scaled to train 
> over large data sets. This restriction reduces the quality of training and 
> constrains generalization of the learned model when used for prediction. This 
> project proposes to extend Mahout's Baum-Welch to a parallel, distributed 
> version using the Map-Reduce programming framework for enhanced model fitting 
> over large data sets. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
> for applications generating temporal or spatial sequential data. Relative 
> simplicity of implementation, combined with their ability to discover latent 
> domain knowledge have made them very popular in diverse fields such as DNA 
> sequence alignment, gene discovery, handwriting analysis, voice recognition, 
> computer vision, language translation and parts-of-speech tagging. 
> An HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
> unobservable, hidden states emitting symbols from a finite observable 
> vocabulary set O according to a probabilistic model Theta. The parameters of 
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
> transition matrix of the hidden states of size |S| X |S|. The elements 
> a_(i,j) of A specify the probability of transitioning from a state i to state 
> j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
> elements b_(s, o) provide the probability that a symbol o will be emitted 
> from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
> determine the probability that the system starts in the hidden state s. The 
> transitions of hidden states are unobservable and follow the Markov property 
> of memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
> observation sequence, determine the probability that the model generated the 
> observed sequence. This is useful for evaluating the quality of the model and 
> is solved using the so-called Forward algorithm. 
> 2. Decoding: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the hidden state sequence which generated the observed 
> sequence. This can be viewed as an inference problem where the model and 
> observed sequence are used to predict the value of the unobservable random 
> variables. The Viterbi decoding algorithm is used for predicting the hidden 
> state sequence. 
> 3. Training: Given the set of hidden states S, the set of observation 
> vocabulary O and the observation sequence, determine the parameters (A, B, 
> Pi) of the model Theta. This problem can be viewed as a statistical machine 
> learning problem of model fitting to a large set of training data. The 
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
> the Viterbi training algorithm are commonly used for model fitting. 
> In general, the quality of HMM training can be improved by employing large 
> training vectors but currently, Mahout only supports sequential versions of 
> HMM trainers which are incapable of scaling.  Among the Viterbi and the 
> Baum-Welch training methods, the Baum-Welch algorith
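
As a rough illustration of problems 1 and 2 in the proposal above, the following Python sketch evaluates an observation sequence with the Forward algorithm and decodes it with Viterbi. It is not Mahout's implementation: the function names and the two-state example model are made up, and A, B and pi are plain nested lists indexed as in the proposal (A[i][j], B[s][o], pi[s]).

```python
def forward(obs, A, B, pi):
    """Probability of the observation sequence under the model (A, B, pi)."""
    n = len(pi)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def viterbi(obs, A, B, pi):
    """Most likely hidden state sequence for the observation sequence."""
    n = len(pi)
    # delta[s] = probability of the best path so far that ends in state s
    delta = [pi[s] * B[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][o])
        backpointers.append(step)
        delta = new_delta
    # Walk the back-pointers from the best final state to the start.
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]


# Made-up model: 2 hidden states, 2 observable symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
print(forward([0, 1], A, B, pi))
print(viterbi([0, 1], A, B, pi))
```

Both functions multiply raw probabilities, so long sequences will underflow; practical implementations rescale alpha at each step or work in log space.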

[jira] [Updated] (MAHOUT-1284) DummyRecordWriter's bug with reused Writables

2013-07-24 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1284:


Fix Version/s: (was: 0.8)
   (was: 0.7)
   0.9

> DummyRecordWriter's bug with reused Writables
> -
>
> Key: MAHOUT-1284
> URL: https://issues.apache.org/jira/browse/MAHOUT-1284
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.7, 0.8
> Reporter: Maysam Yabandeh
> Priority: Minor
> Labels: test
> Fix For: 0.9
>
> Attachments: MAHOUT-1284.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> It is a recommended practice to reuse Writable objects. 
> DummyRecordWriter, which is used for testing in Mahout, however, keeps the 
> same Writable instance in a map: the next time the user reuses the Writable 
> object, the contents of DummyRecordWriter's internal map change as well. This 
> makes DummyRecordWriter fail when testing MapReduce jobs that reuse 
> Writables.
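
The aliasing problem in the description can be reproduced in a few lines. This is a Python sketch with hypothetical classes (Text, AliasingWriter, CopyingWriter), not Hadoop's Writable or Mahout's DummyRecordWriter, and it stores records in a list rather than a map for brevity. Copying on write is shown as one possible remedy; whether the attached patch takes that approach is not stated here.

```python
import copy


class Text:
    """Hypothetical stand-in for a mutable Hadoop Writable."""

    def __init__(self, value=""):
        self.value = value

    def set(self, value):
        self.value = value


class AliasingWriter:
    """Stores the caller's objects themselves, as the report describes."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class CopyingWriter(AliasingWriter):
    """Stores a snapshot, so later reuse by the caller cannot change it."""

    def write(self, key, value):
        self.records.append((copy.deepcopy(key), copy.deepcopy(value)))


def run_job(writer):
    """Emit two records while reusing one key object and one value object."""
    key, value = Text(), Text()
    for word, count in [("a", "1"), ("b", "2")]:
        key.set(word)
        value.set(count)
        writer.write(key, value)
    return [(k.value, v.value) for k, v in writer.records]


print(run_job(AliasingWriter()))  # both records show the last values written
print(run_job(CopyingWriter()))   # each record keeps its own values
```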

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703155#comment-13703155
 ] 

Grant Ingersoll commented on MAHOUT-1275:
-

[~sslavic]  Yeah, Maven release does create the branch and that is the workflow 
I usually use as well.  The main issue I have is that the Maven release goal 
seems to have to roll things back if for some reason there are issues w/ the 
RC, but perhaps that is just our misunderstanding of how to use the Maven 
release goal.  Please have a look at 
https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Release to see if our 
understanding of that is right.

> Drop some of the Release Artifact File Types
> 
>
> Key: MAHOUT-1275
> URL: https://issues.apache.org/jira/browse/MAHOUT-1275
> Project: Mahout
> Issue Type: Task
> Reporter: Grant Ingersoll
> Assignee: Stevo Slavic
> Priority: Minor
> Fix For: 0.9
>
>
> There really is no reason why we need so many release artifacts for the 
> distribution.  We run on *NIX machines.  Zip and Gzip are standard tools, 
> let's save a few bits, along with Release Manager upload times, and drop the 
> BZ2 format.



[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702458#comment-13702458
 ] 

Grant Ingersoll commented on MAHOUT-1275:
-

[~sslavic] Please revert this.  We are under code freeze right now on trunk.

> Drop some of the Release Artifact File Types
> 
>
> Key: MAHOUT-1275
> URL: https://issues.apache.org/jira/browse/MAHOUT-1275
> Project: Mahout
> Issue Type: Task
> Reporter: Grant Ingersoll
> Assignee: Stevo Slavic
> Priority: Minor
> Fix For: 0.9
>



[jira] [Commented] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702370#comment-13702370
 ] 

Grant Ingersoll commented on MAHOUT-1275:
-

Stevo, just FYI, please don't commit anything right now, as we are under code 
freeze until 0.8 is out (unless you know how to deal w/ this in Maven release 
plugin)

> Drop some of the Release Artifact File Types
> 
>
> Key: MAHOUT-1275
> URL: https://issues.apache.org/jira/browse/MAHOUT-1275
> Project: Mahout
> Issue Type: Task
> Reporter: Grant Ingersoll
> Assignee: Stevo Slavic
> Priority: Minor
> Fix For: 0.9
>



[jira] [Created] (MAHOUT-1275) Drop some of the Release Artifact File Types

2013-07-08 Thread Grant Ingersoll (JIRA)
Grant Ingersoll created MAHOUT-1275:
---

 Summary: Drop some of the Release Artifact File Types
 Key: MAHOUT-1275
 URL: https://issues.apache.org/jira/browse/MAHOUT-1275
 Project: Mahout
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.9


There really is no reason why we need so many release artifacts for the 
distribution.  We run on *NIX machines.  Zip and Gzip are standard tools, let's 
save a few bits, along with Release Manager upload times, and drop the BZ2 
format.



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691954#comment-13691954
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

Hi,

Any progress on this?  It is the last open issue for 0.8.

Thanks,
Grant


> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.7
> Environment: Mahout 0.7
> Reporter: Yiqun Hu
> Assignee: Robin Anil
> Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute it 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1. We have an idea and implementation to select eigenvectors 
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> sometimes a bad initialization will generate a wrong clustering result. In this 
> case, the selected K eigenvectors actually provide a better way to initialize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.
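
The two ideas in the description can be sketched as follows. This is an illustrative Python sketch, not the attached patch: the issue does not spell out the exact selection rule, so the greedy filter, the tolerance tol, and all function names are assumptions. Eigenvectors are plain lists with one component per data point.

```python
import math


def cos_angle(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal(eigenvectors, k, tol=1e-6):
    """Greedily keep up to k vectors that are mutually orthogonal within tol.

    Returns the indices of the kept vectors (assumed selection rule).
    """
    kept = []
    for index, vec in enumerate(eigenvectors):
        if all(abs(cos_angle(vec, eigenvectors[j])) <= tol for j in kept):
            kept.append(index)
        if len(kept) == k:
            break
    return kept


def initial_centroid_rows(selected):
    """For each selected eigenvector, the data point with the largest |component|."""
    return [max(range(len(vec)), key=lambda i: abs(vec[i])) for vec in selected]


# Made-up candidates: the second vector is parallel to the first and is skipped.
candidates = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(select_orthogonal(candidates, 2))
print(initial_centroid_rows([[0.1, -0.9, 0.2], [0.7, 0.1, 0.0]]))
```

The rows returned by initial_centroid_rows would then seed KMeans in place of random initialization, one seed per selected eigenvector.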



[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682325#comment-13682325
 ] 

Grant Ingersoll commented on MAHOUT-944:


[~smarthi], the error only seems to happen when running all the tests and it 
seems to be intermittent.  It almost looks like some type of classpath issue.

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
> Issue Type: New Feature
> Components: Integration
> Affects Versions: 0.5
> Reporter: Frank Scholten
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944-minor.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682108#comment-13682108
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

bq. But @Grant suggest we supply the patch of v0.7 first.

Yes, I was working under the assumption that an old patch is better than no 
patch.  A patch against HEAD is even better.  I think we have a few more days, 
so against HEAD would be great.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.7
> Environment: Mahout 0.7
> Reporter: Yiqun Hu
> Assignee: Robin Anil
> Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, matrix_1, matrix_2
>



[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681744#comment-13681744
 ] 

Grant Ingersoll commented on MAHOUT-944:


Suneel, weird.  I didn't see that before.  We are using the new APIs, AFAICT, 
so not sure what is going on.  So tired of the stupidity of the dual Map/Reduce 
APIs in Hadoop.

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
> Issue Type: New Feature
> Components: Integration
> Affects Versions: 0.5
> Reporter: Frank Scholten
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944-minor.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch
>



[jira] [Commented] (MAHOUT-833) Make conversion to sequence files map-reduce

2013-06-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681206#comment-13681206
 ] 

Grant Ingersoll commented on MAHOUT-833:


The patch seems to be missing the WholeFileRecordReader.

> Make conversion to sequence files map-reduce
> 
>
> Key: MAHOUT-833
> URL: https://issues.apache.org/jira/browse/MAHOUT-833
> Project: Mahout
> Issue Type: Improvement
> Components: Integration
> Affects Versions: 0.7
> Reporter: Grant Ingersoll
> Assignee: Suneel Marthi
> Labels: MAHOUT_INTRO_CONTRIBUTE
> Fix For: 0.8
>
> Attachments: MAHOUT-833-final.patch, MAHOUT-833.patch, 
> MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom.java classes should be 
> able to do their work in parallel.



[jira] [Resolved] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

2013-06-11 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1233.
-

Resolution: Incomplete

Please reopen if you have a repeatable test case, as I am not sure there is an 
issue here.

> Problem in processing datasets as a single chunk vs many chunks in HADOOP 
> mode in mostly all the clustering algos
> -
>
> Key: MAHOUT-1233
> URL: https://issues.apache.org/jira/browse/MAHOUT-1233
> Project: Mahout
> Issue Type: Question
> Components: Clustering
> Affects Versions: 0.7, 0.8
> Reporter: yannis ats
> Assignee: yannis ats
> Priority: Minor
> Fix For: 0.8
>
>
> I am trying to process a dataset, and I do it in two ways.
> First I give it as a single chunk (the whole dataset), and second as many 
> smaller chunks in order to increase the throughput of my machine.
> The problem is that when I perform the single-chunk computation the results 
> are fine, and by fine I mean that if I have 1000 vectors in the input, I get 
> in the output 1000 vectorids with their cluster_ids (I have tried canopy, 
> kmeans and fuzzy kmeans).
> However, when I split the dataset in order to speed up the computations, 
> strange phenomena occur.
> For instance, if the same dataset that contains 1000 vectors is split into, 
> for example, 10 files, then in the output I obtain more vector ids (e.g. 1100 
> vectorids with their corresponding clusterids).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many 
> files?
> While Mahout is performing the computations, I have observed that the screen 
> output says it processed the correct number of vectors.
> Am I missing something?
> I use as input the transformed to mvc weka vectors.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.



[jira] [Updated] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2013-06-11 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1030:


Fix Version/s: 0.9

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> 
>
> Key: MAHOUT-1030
> URL: https://issues.apache.org/jira/browse/MAHOUT-1030
> Project: Mahout
> Issue Type: Bug
> Components: Clustering, Integration
> Affects Versions: 0.7
> Reporter: Jeff Eastman
> Assignee: Suneel Marthi
> Fix For: 1.0, 0.9
>
> Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-11 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680392#comment-13680392
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

Any update on this for applying against trunk/0.8?

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.7
> Environment: Mahout 0.7
> Reporter: Yiqun Hu
> Assignee: Robin Anil
> Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: matrix_1, matrix_2, SpectralKMeans.patch
>



[jira] [Updated] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2013-06-11 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1030:


Fix Version/s: (was: 0.8)
   1.0

I'm going to push this.  I know that for 0.9 we are looking at reworking the 
way we handle vectors and their associated properties (i.e. get rid of 
NamedVector, etc.)

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> 
>
> Key: MAHOUT-1030
> URL: https://issues.apache.org/jira/browse/MAHOUT-1030
> Project: Mahout
> Issue Type: Bug
> Components: Clustering, Integration
> Affects Versions: 0.7
> Reporter: Jeff Eastman
> Assignee: Suneel Marthi
> Fix For: 1.0
>
> Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on 
> code and tests and I don't know which properties were implemented in the old 
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new 
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix 
> > and I will see if I can make the change before Paritosh cuts the release 
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as 
> >> WeightedVectorWritable where in mahout 0.6 they were 
> >> WeightedPropertyVectorWritable? This means that the distance from the 
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is 
> >> not a welcome change. How is one to order clustered docs by distance from 
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the 
> >> centroid for the cluster id given in the above WeightedVectorWritable, 
> >> which means iterating through all the clusters for each clustered doc. In 
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >
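
A minimal sketch of why the stored distance matters, assuming each clustered 
point carries its distance to the centroid as a property. The classes 
ClusteredPointSketch and DistanceOrderingSketch are hypothetical stand-ins, not 
Mahout classes; with the distance stored, ordering a cluster's documents is a 
plain sort and needs no centroid lookup.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one clustered point; not a Mahout class.
class ClusteredPointSketch {
  final int clusterId;
  final String docId;
  final double distance; // assumed: distance to the centroid, stored with the point

  ClusteredPointSketch(int clusterId, String docId, double distance) {
    this.clusterId = clusterId;
    this.docId = docId;
    this.distance = distance;
  }
}

class DistanceOrderingSketch {
  // Returns the doc ids of one cluster, nearest to the centroid first.
  static List<String> nearestFirst(List<ClusteredPointSketch> points, int clusterId) {
    List<ClusteredPointSketch> members = new ArrayList<>();
    for (ClusteredPointSketch p : points) {
      if (p.clusterId == clusterId) {
        members.add(p);
      }
    }
    // Sort by the stored distance; no centroid is consulted here.
    members.sort(Comparator.comparingDouble((ClusteredPointSketch p) -> p.distance));
    List<String> ids = new ArrayList<>();
    for (ClusteredPointSketch p : members) {
      ids.add(p.docId);
    }
    return ids;
  }

  public static void main(String[] args) {
    List<ClusteredPointSketch> points = new ArrayList<>();
    points.add(new ClusteredPointSketch(7, "doc-a", 0.42));
    points.add(new ClusteredPointSketch(7, "doc-b", 0.10));
    points.add(new ClusteredPointSketch(3, "doc-c", 0.05));
    System.out.println(nearestFirst(points, 7)); // prints [doc-b, doc-a]
  }
}
```

Without the stored distance, the same ordering would require recomputing each 
distance from the centroid of the point's cluster, which is the extra work the 
original report objects to.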

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679858#comment-13679858
 ] 

Grant Ingersoll commented on MAHOUT-1147:
-

Do you see:
{code}
echo "Extracting Reuters"
$MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters \
  ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-out
if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
  echo "Copying Reuters data to Hadoop"
  set +e
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -rmr ${WORK_DIR}/reuters-out
  set -e
  $HADOOP dfs -put ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-sgm
  $HADOOP dfs -put ${WORK_DIR}/reuters-out ${WORK_DIR}/reuters-out
fi
{code}

Also, I'm on #mahout on IRC if that helps us resolve this faster.

> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
> matrix
> ---
>
> Key: MAHOUT-1147
> URL: https://issues.apache.org/jira/browse/MAHOUT-1147
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.7
> Environment: Eclipse IDE
>   Java code base
>   CVB0Driver Class
>   setModelPaths(Job job, Path modelPath) - method
> Reporter: Jack Pay
> Assignee: Jake Mannix
> Labels: bug, cvb, fix, suggestion
> Fix For: 0.8
>
> Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are 
> found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> Variety of Job instances call this method. 
> The Job is passed to the method instead of the Configuration object given to 
> the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem as the variable MODEL_PATHS is set on the clone which is 
> then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore 
> returns null.
> The code stipulates that if it cannot find a model, it uses a new random 
> matrix. This happens every time, as MODEL_PATHS is not set for the 
> Configuration instance used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead pass the 
> Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far what little testing I have done appears to solve this problem.
>  
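
A minimal sketch of the copy semantics described in the report, assuming (as 
the reporter believes) that a Job works on its own copy of the Configuration it 
is constructed with. SketchConf, SketchJob and the key name are hypothetical 
stand-ins so that the example runs without Hadoop on the classpath; they are 
not the Hadoop or Mahout classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Configuration: a plain key/value store.
class SketchConf {
  private final Map<String, String> values = new HashMap<>();

  SketchConf() {}

  // Copy constructor: the copy shares no state with the source.
  SketchConf(SketchConf other) {
    values.putAll(other.values);
  }

  void set(String key, String value) {
    values.put(key, value);
  }

  String get(String key) {
    return values.get(key);
  }
}

// Hypothetical stand-in for a Job that copies the Configuration it is given.
class SketchJob {
  private final SketchConf conf;

  SketchJob(SketchConf original) {
    this.conf = new SketchConf(original);
  }

  SketchConf getConfiguration() {
    return conf;
  }
}

class ModelPathsSketch {
  static final String MODEL_PATHS = "sketch.model.paths"; // hypothetical key name

  // Reported behaviour: the value lands on the job's private copy.
  static void setModelPaths(SketchJob job, String modelPath) {
    job.getConfiguration().set(MODEL_PATHS, modelPath);
  }

  // Proposed behaviour: the value lands on the caller's Configuration.
  static void setModelPaths(SketchConf conf, String modelPath) {
    conf.set(MODEL_PATHS, modelPath);
  }

  public static void main(String[] args) {
    SketchConf conf = new SketchConf();
    SketchJob job = new SketchJob(conf);

    setModelPaths(job, "/tmp/model-1");
    System.out.println(conf.get(MODEL_PATHS)); // prints null

    setModelPaths(conf, "/tmp/model-1");
    System.out.println(conf.get(MODEL_PATHS)); // prints /tmp/model-1
  }
}
```

In this sketch a job created later from the same conf would see the model path 
only in the second case, which matches the proposed signature change.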



[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679855#comment-13679855
 ] 

Grant Ingersoll commented on MAHOUT-1147:
-

Hmm, I tested k-means cluster-reuters.sh last night on a single-node Hadoop 
setup and it worked fine.  I added a step to copy reuters-out up to HDFS.  Let 
me make sure I pushed that change (see MAHOUT-1247).

> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
> matrix
> ---
>
> Key: MAHOUT-1147
> URL: https://issues.apache.org/jira/browse/MAHOUT-1147
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.7
> Environment: Eclipse IDE
>   Java code base
>   CVB0Driver Class
>   setModelPaths(Job job, Path modelPath) - method
> Reporter: Jack Pay
> Assignee: Jake Mannix
> Labels: bug, cvb, fix, suggestion
> Fix For: 0.8
>
> Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are 
> found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> Variety of Job instances call this method. 
> The Job is passed to the method instead of the Configuration object given to 
> the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem as the variable MODEL_PATHS is set on the clone which is 
> then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore 
> returns null.
> The code stipulates that if it cannot find a model, it uses a new random 
> matrix. This happens every time, as MODEL_PATHS is not set for the 
> Configuration instance used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead pass the 
> Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far what little testing I have done appears to solve this problem.
>  



[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679817#comment-13679817
 ] 

Grant Ingersoll commented on MAHOUT-1147:
-

Jake, are you up to date?  I fixed a bunch of things related to 
cluster-reuters.  Also, do you have HADOOP_HOME set?  Or MAHOUT_LOCAL?

> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
> matrix
> ---
>
> Key: MAHOUT-1147
> URL: https://issues.apache.org/jira/browse/MAHOUT-1147
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.7
> Environment: Eclipse IDE
>   Java code base
>   CVB0Driver Class
>   setModelPaths(Job job, Path modelPath) - method
> Reporter: Jack Pay
> Assignee: Jake Mannix
> Labels: bug, cvb, fix, suggestion
> Fix For: 0.8
>
> Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are 
> found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> Variety of Job instances call this method. 
> The Job is passed to the method instead of the Configuration object given to 
> the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem as the variable MODEL_PATHS is set on the clone which is 
> then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore 
> returns null.
> The code stipulates that if it cannot find a model, it uses a new random 
> matrix. This happens every time, as MODEL_PATHS is not set for the 
> Configuration instance used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead pass the 
> Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far what little testing I have done appears to solve this problem.
>  



[jira] [Updated] (MAHOUT-1067) SSVD enhancements: +named vector propagation to U, +USigma output

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1067:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

> SSVD enhancements: +named vector propagation to U, +USigma output
> -
>
> Key: MAHOUT-1067
> URL: https://issues.apache.org/jira/browse/MAHOUT-1067
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.7
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Priority: Trivial
> Fix For: 0.8
>
>
> 1) PCA will benefit from outputting U*Sigma product.
> 2) Dimensionality reduction pipelines need NamedVector propagation to U, 
> U*Sigma or U*Sigma^0.5.



[jira] [Updated] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1147:


Attachment: MAHOUT-1147.patch

I can't completely speak to correctness, but here's an updated patch that fixes 
formatting and handles the DistributedCache better.  

Overall, the patch looks good, but I would like to see a test for it.

> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
> matrix
> ---
>
> Key: MAHOUT-1147
> URL: https://issues.apache.org/jira/browse/MAHOUT-1147
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.7
> Environment: Eclipse IDE
>   Java code base
>   CVB0Driver Class
>   setModelPaths(Job job, Path modelPath) - method
> Reporter: Jack Pay
> Assignee: Jake Mannix
> Labels: bug, cvb, fix, suggestion
> Fix For: 0.8
>
> Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are 
> found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> Variety of Job instances call this method. 
> The Job is passed to the method instead of the Configuration object given to 
> the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem as the variable MODEL_PATHS is set on the clone which is 
> then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore 
> returns null.
> The code stipulates that if it cannot find a model, it uses a new random 
> matrix. This happens every time, as MODEL_PATHS is not set for the 
> Configuration instance used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead pass the 
> Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far what little testing I have done appears to solve this problem.
>  



[jira] [Commented] (MAHOUT-1147) CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random matrix

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679174#comment-13679174
 ] 

Grant Ingersoll commented on MAHOUT-1147:
-

[~jp...@sussex.ac.uk] Do you happen to have a test case that verifies this?

> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random 
> matrix
> ---
>
> Key: MAHOUT-1147
> URL: https://issues.apache.org/jira/browse/MAHOUT-1147
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.7
> Environment: Eclipse IDE
>   Java code base
>   CVB0Driver Class
>   setModelPaths(Job job, Path modelPath) - method
> Reporter: Jack Pay
> Assignee: Jake Mannix
> Labels: bug, cvb, fix, suggestion
> Fix For: 0.8
>
> Attachments: MAHOUT-1147.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are 
> found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> Variety of Job instances call this method. 
> The Job is passed to the method instead of the Configuration object given to 
> the Job.
> The configuration is retrieved from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem as the variable MODEL_PATHS is set on the clone which is 
> then discarded when the given Job is complete.
> The original Configuration has no MODEL_PATHS String set and therefore 
> returns null.
> The code stipulates that if it cannot find a model, it uses a new random 
> matrix. This happens every time, as MODEL_PATHS is not set for the 
> Configuration instance used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead pass the 
> Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly (obviously).
> So far what little testing I have done appears to solve this problem.
>  



[jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679170#comment-13679170
 ] 

Grant Ingersoll commented on MAHOUT-1030:
-

Pat, do you have a patch for this that demonstrates what you are suggesting so 
that we can compare?

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not 
> WeightedVectorWritable
> 
>
> Key: MAHOUT-1030
> URL: https://issues.apache.org/jira/browse/MAHOUT-1030
> Project: Mahout
> Issue Type: Bug
> Components: Clustering, Integration
> Affects Versions: 0.7
> Reporter: Jeff Eastman
> Assignee: Suneel Marthi
> Fix For: 0.8
>
> Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on 
> code and tests and I don't know which properties were implemented in the old 
> version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new 
> > ClusterClassificationDriver was introduced. It should be a pretty easy fix 
> > and I will see if I can make the change before Paritosh cuts the release 
> > bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as 
> >> WeightedVectorWritable where in mahout 0.6 they were 
> >> WeightedPropertyVectorWritable? This means that the distance from the 
> >> centroid is no longer stored here? Why? I hope I'm wrong because that is 
> >> not a welcome change. How is one to order clustered docs by distance from 
> >> cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the 
> >> centroid for the cluster id given in the above WeightedVectorWritable, 
> >> which means iterating through all the clusters for each clustered doc. In 
> >> my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
> >>
> >>
> >



[jira] [Commented] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679165#comment-13679165
 ] 

Grant Ingersoll commented on MAHOUT-1233:
-

Also note that we are likely to remove MeanShift.

> Problem in processing datasets as a single chunk vs many chunks in HADOOP 
> mode in mostly all the clustering algos
> -
>
> Key: MAHOUT-1233
> URL: https://issues.apache.org/jira/browse/MAHOUT-1233
> Project: Mahout
> Issue Type: Question
> Components: Clustering
> Affects Versions: 0.7, 0.8
> Reporter: yannis ats
> Assignee: yannis ats
> Priority: Minor
> Fix For: 0.8
>
>
> I am trying to process a dataset in two ways.
> First I give it as a single chunk (the whole dataset), and second as many 
> smaller chunks in order to increase the throughput of my machine.
> When I perform the single-chunk computation the results are fine; by fine I 
> mean that if I have 1000 vectors in the input, I get 1000 vector ids with 
> their cluster ids in the output (I have tried canopy, kmeans and fuzzy 
> kmeans).
> However, when I split the dataset in order to speed up the computations, 
> strange phenomena occur.
> For instance, if the same dataset of 1000 vectors is split into, for example, 
> 10 files, then the output contains more vector ids (e.g. 1100 vector ids 
> with their corresponding cluster ids).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many 
> files?
> While Mahout is performing the computations, the screen output says that it 
> processed the correct number of vectors.
> Am I missing something?
> As input I use the Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
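
One way to narrow down a report like this is to check whether the extra 
vector ids are repeats of ids that already appear. The sketch below is only an 
illustration under that assumption: DuplicateIdSketch is a hypothetical helper, 
and it expects the ids to have already been extracted from the dumped output by 
some means not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper; not part of Mahout.
class DuplicateIdSketch {
  // Returns the ids that occur more than once, in first-seen order.
  static List<String> duplicatedIds(List<String> vectorIds) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String id : vectorIds) {
      counts.merge(id, 1, Integer::sum);
    }
    List<String> duplicates = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > 1) {
        duplicates.add(entry.getKey());
      }
    }
    return duplicates;
  }
}
```

If the list comes back empty while the total is still larger than the input 
size, the extra ids are new ones rather than repeats, which would point at a 
different cause.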



[jira] [Commented] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679164#comment-13679164
 ] 

Grant Ingersoll commented on MAHOUT-1233:
-

Yannis, the empty clusters sound bad.  Can you share your vectors (via Dropbox 
or something) and the commands you ran, assuming the vectors are not 
NamedVectors (i.e. they carry no identifying info)?

> Problem in processing datasets as a single chunk vs many chunks in HADOOP 
> mode in mostly all the clustering algos
> -
>
> Key: MAHOUT-1233
> URL: https://issues.apache.org/jira/browse/MAHOUT-1233
> Project: Mahout
> Issue Type: Question
> Components: Clustering
> Affects Versions: 0.7, 0.8
> Reporter: yannis ats
> Assignee: yannis ats
> Priority: Minor
> Fix For: 0.8
>
>
> I am trying to process a dataset in two ways.
> First I give it as a single chunk (the whole dataset), and second as many 
> smaller chunks in order to increase the throughput of my machine.
> When I perform the single-chunk computation the results are fine; by fine I 
> mean that if I have 1000 vectors in the input, I get 1000 vector ids with 
> their cluster ids in the output (I have tried canopy, kmeans and fuzzy 
> kmeans).
> However, when I split the dataset in order to speed up the computations, 
> strange phenomena occur.
> For instance, if the same dataset of 1000 vectors is split into, for example, 
> 10 files, then the output contains more vector ids (e.g. 1100 vector ids 
> with their corresponding cluster ids).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many 
> files?
> While Mahout is performing the computations, the screen output says that it 
> processed the correct number of vectors.
> Am I missing something?
> As input I use the Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.



[jira] [Updated] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-627:
---

Fix Version/s: 0.9  (was: 0.8)

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Affects Versions: 0.4, 0.5
> Reporter: Dhruv Kumar
> Assignee: Grant Ingersoll
> Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.9
>
> Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: dku...@ecs.umass.edu 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
> because of its superior numerical stability and its ability to guarantee the 
> discovery of a locally maximal Maximum Likelihood Estimate in the 
> presence of incomplete training data. Currently, Apache Mahout has a 
> sequential implementation of the Baum-Welch which cannot be scaled to train 
> over large data sets. This restriction reduces the quality of training and 
> constrains generalization of the learned model when used for prediction. This 
> project proposes to extend Mahout's Baum-Welch to a parallel, distributed 
> version using the Map-Reduce programming framework for enhanced model fitting 
> over large data sets. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
> for applications generating temporal or spatial sequential data. Relative 
> simplicity of implementation, combined with their ability to discover latent 
> domain knowledge have made them very popular in diverse fields such as DNA 
> sequence alignment, gene discovery, handwriting analysis, voice recognition, 
> computer vision, language translation and parts-of-speech tagging. 
> An HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
> unobservable, hidden states emitting symbols from a finite observable 
> vocabulary set O according to a probabilistic model Theta. The parameters of 
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
> transition matrix of the hidden states of size |S| X |S|. The elements 
> a_(i,j) of A specify the probability of transitioning from a state i to state 
> j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
> elements b_(s, o) provide the probability that a symbol o will be emitted 
> from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
> determine the probability that the system starts in the hidden state s. The 
> transitions of hidden states are unobservable and follow the Markov property 
> of memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
> observation sequence, determine the probability that the model generated the 
> observed sequence. This is useful for evaluating the quality of the model and 
> is solved using the so called Forward algorithm. 
> 2. Decoding: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the hidden state sequence which generated the observed 
> sequence. This can be viewed as an inference problem where the model and 
> observed sequence are used to predict the value of the unobservable random 
> variables. The Viterbi decoding algorithm is used for predicting the hidden 
> state sequence. 
> 3. Training: Given the set of hidden states S, the set of observation 
> vocabulary O and the observation sequence, determine the parameters (A, B, 
> Pi) of the model Theta. This problem can be viewed as a statistical machine 
> learning problem of model fitting to a large set of training data. The 
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
> the Viterbi training algorithm are commonly used for model fitting. 
> In general, the quality of HMM training can be improved by employing large 
> training vectors but currently, Mahout only supports sequential versions of 
> HMM trainers which are incapable of scaling.  Among the Viterbi and the 
> Baum-Welch training methods, the Baum-Welch algorithm is superior, accurate, 
> and a better candidate
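
The Evaluation problem listed in the proposal (item 1) can be sketched with the 
Forward recursion over the (A, B, Pi) parameters defined there. This is a 
minimal sequential illustration, not Mahout's HMM code and not the proposed 
Map-Reduce trainer; the class name HmmForwardSketch is hypothetical, states and 
symbols are assumed to be encoded as array indices, and no scaling is applied, 
so long sequences would underflow.

```java
// Hypothetical illustration of the Forward algorithm; not a Mahout class.
class HmmForwardSketch {
  // a[i][j]: probability of moving from hidden state i to hidden state j
  // b[s][o]: probability that hidden state s emits symbol o
  // pi[s]:   probability that the system starts in hidden state s
  // Returns the probability that the model generates the observed sequence.
  static double likelihood(double[][] a, double[][] b, double[] pi, int[] observations) {
    if (observations.length == 0) {
      return 1.0; // the empty sequence is generated with certainty
    }
    int states = pi.length;

    // alpha[s]: probability of the observations so far, ending in state s.
    double[] alpha = new double[states];
    for (int s = 0; s < states; s++) {
      alpha[s] = pi[s] * b[s][observations[0]];
    }

    for (int t = 1; t < observations.length; t++) {
      double[] next = new double[states];
      for (int j = 0; j < states; j++) {
        double sum = 0.0;
        for (int i = 0; i < states; i++) {
          sum += alpha[i] * a[i][j];
        }
        next[j] = sum * b[j][observations[t]];
      }
      alpha = next;
    }

    double total = 0.0;
    for (double value : alpha) {
      total += value;
    }
    return total;
  }
}
```

A quick sanity check for such a routine is that the likelihoods of all possible 
observation sequences of a fixed length sum to 1.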

[jira] [Assigned] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-1211:
---

Assignee: Grant Ingersoll  (was: Dan Filimon)

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
> Issue Type: Improvement
> Reporter: Stevo Slavic
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch
>
>
> The deprecated Guava {{Closables.closeQuietly}} API has to be replaced: its 
> usage is a code smell, and the method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.
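
One common replacement pattern, sketched here as an assumption rather than as 
what the attached patches do, is try-with-resources, which needs Java 7 or 
later. Guava itself offers Closeables.close(closeable, swallowIOException) as 
another option. TrackedResource and CloseSketch below are hypothetical names 
used only for illustration.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical resource that records whether close() was called.
class TrackedResource implements Closeable {
  boolean closed = false;

  String read() throws IOException {
    if (closed) {
      throw new IOException("already closed");
    }
    return "payload";
  }

  @Override
  public void close() throws IOException {
    closed = true;
  }
}

class CloseSketch {
  // Reads from the resource. try-with-resources closes it on every path, and
  // an IOException from read() or close() reaches the caller instead of being
  // silently swallowed.
  static String readAndClose(TrackedResource resource) throws IOException {
    try (TrackedResource r = resource) {
      return r.read();
    }
  }
}
```

The point of the pattern is that a failure while closing is no longer hidden 
from the caller, which is the concern behind deprecating the quiet variant.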



[jira] [Commented] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679155#comment-13679155
 ] 

Grant Ingersoll commented on MAHOUT-992:


I'm marking this as resolved, but it could use a review of my commits to make 
sure I didn't miss something.

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.6
> Reporter: tom pierce
> Assignee: Grant Ingersoll
> Priority: Minor
> Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Resolved] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1211.
-

Resolution: Fixed

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
> Issue Type: Improvement
> Reporter: Stevo Slavic
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch
>
>
> The deprecated Guava {{Closables.closeQuietly}} API has to be replaced: its 
> usage is a code smell, and the method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.



[jira] [Resolved] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1247.
-

Resolution: Fixed

Fixed by MAHOUT-992

> cluster-reuters doesn't work on Hadoop
> --
>
> Key: MAHOUT-1247
> URL: https://issues.apache.org/jira/browse/MAHOUT-1247
> Project: Mahout
> Issue Type: Bug
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> At least two issues:
> 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
> 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Resolved] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-992.


Resolution: Fixed

Went through and audited all uses and fixed handling of cache values using 
makeQualified, etc.

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.6
> Reporter: tom pierce
> Assignee: Grant Ingersoll
> Priority: Minor
> Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Commented] (MAHOUT-975) Bug in Gradient Machine - Computation of the gradient

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679143#comment-13679143
 ] 

Grant Ingersoll commented on MAHOUT-975:


[~tdunning] Any chance this is getting in this week?

> Bug in Gradient Machine  - Computation of the gradient
> --
>
> Key: MAHOUT-975
> URL: https://issues.apache.org/jira/browse/MAHOUT-975
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7
>Reporter: Christian Herta
>Assignee: Ted Dunning
> Fix For: 0.8
>
> Attachments: GradientMachine.patch
>
>
> The initialisation used to compute the gradient descent weight updates for the 
> output units appears to be wrong:
>  
> In the comment: "dy / dw is just w since  y = x' * w + b."
> This is wrong. dy/dw is x (ignoring the indices). The same initialisation is 
> done in the code.
> Check by using neural network terminology:
> The gradient machine is a specialized version of a multi-layer perceptron 
> (MLP).
> In an MLP the gradient for computing the "weight change" for the output units 
> is:
> dE / dw_ij = dE / dz_i * dz_i / dw_ij with z_i = sum_j (w_ij * a_j)
> here: i is the index of the output layer; j is the index of the hidden layer
> (d stands for the partial derivatives)
> here: z_i = a_i (no squashing in the output layer)
> with the special loss (cost function) E = 1 - a_g + a_b = 1 - z_g + z_b
> with
> g: index of the output unit with target value +1 (positive class)
> b: random output unit with target value 0
> =>
> dE / dw_gj = dE/dz_g * dz_g/dw_gj = -1 * a_j (a_j: activity of the hidden 
> unit j)
> dE / dw_bj = dE/dz_b * dz_b/dw_bj = +1 * a_j (a_j: activity of the hidden 
> unit j)
> This matches what the comment would say if it were correct:
> dy / dw = x (x is here the activation of the hidden unit) * (-1) for weights to
> the output unit with target value +1.
> 
> In neural network implementations it is common to compute the gradient
> numerically as a test of the implementation. This can be done by:
> dE/dw_ij = (E(w_ij + epsilon) - E(w_ij - epsilon)) / (2 * epsilon)
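
The numerical check described at the end of the report can be sketched in a few lines. This is only an illustration for the linear unit y = x' * w + b from the quoted comment, not the GradientMachine code; the class name, method names and input values are made up for the example.

```java
import java.util.Arrays;

/** Central-difference check that dy/dw equals x for the linear unit y = x'w + b. */
final class GradientCheck {

    /** Linear output unit: y = x . w + b. */
    static double output(double[] x, double[] w, double b) {
        double y = b;
        for (int i = 0; i < x.length; i++) {
            y += x[i] * w[i];
        }
        return y;
    }

    /** Numerical gradient: dy/dw_i is approximated by (y(w_i + eps) - y(w_i - eps)) / (2 * eps). */
    static double[] numericalGradient(double[] x, double[] w, double b, double eps) {
        double[] grad = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double[] plus = w.clone();
            double[] minus = w.clone();
            plus[i] += eps;
            minus[i] -= eps;
            grad[i] = (output(x, plus, b) - output(x, minus, b)) / (2 * eps);
        }
        return grad;
    }

    public static void main(String[] args) {
        double[] x = {0.5, -2.0, 3.0};   // hidden-unit activations (made-up values)
        double[] w = {1.0, 4.0, -1.5};   // output weights (made-up values)
        double[] grad = numericalGradient(x, w, 0.1, 1e-6);
        // The estimate should agree with x (up to rounding), not with w.
        System.out.println("numerical dy/dw = " + Arrays.toString(grad));
        System.out.println("x               = " + Arrays.toString(x));
    }
}
```

The same pattern applies to the loss E: perturb one weight at a time and compare the estimate with the analytic gradient the code computes.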



[jira] [Comment Edited] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679090#comment-13679090
 ] 

Grant Ingersoll edited comment on MAHOUT-1247 at 6/9/13 3:47 PM:
-

I think I see the issue.  The cache file is "local", the Iterator, however, has 
a Hadoop conf that is expecting an HDFS file, hence it can't find it.

For instance, the logs show:
{quote}11:38:49,638 INFO 
org.apache.mahout.vectorizer.term.TFPartialVectorReducer: Cache Files: 
[/tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/2677051046998143225_1262960862_697707077/localhostdicVec/dictionary.file-0]
2013{quote}

Notice it is missing the scheme.  Going to try explicitly setting the scheme to 
file://
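
A small sketch of the scheme problem, using only java.net.URI rather than the Hadoop Path and FileSystem classes. The helper name qualify is hypothetical and this is not the fix that went into Mahout; it only shows that the logged cache path carries no scheme, so whatever resolves it falls back to a default, and what the path looks like once a scheme is attached.

```java
import java.net.URI;

/** Shows that a bare absolute path has no scheme and how one could be attached. */
final class SchemeCheck {

    /** Returns the URI unchanged if it already names a scheme, else prefixes the given scheme. */
    static URI qualify(String path, String scheme) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri;
        }
        if (!path.startsWith("/")) {
            // Assumption for this sketch: only absolute paths are qualified.
            throw new IllegalArgumentException("expected an absolute path: " + path);
        }
        return URI.create(scheme + "://" + path);
    }

    public static void main(String[] args) {
        String cached = "/tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/"
            + "2677051046998143225_1262960862_697707077/localhostdicVec/dictionary.file-0";
        System.out.println(URI.create(cached).getScheme());   // null: the raw cache path has no scheme
        System.out.println(qualify(cached, "file"));           // same path with a file:// scheme in front
    }
}
```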

  was (Author: gsingers):
I think I see the issue.  The cache file is "local", the Iterator, however, 
has a Hadoop conf that is expecting an HDFS file, hence it can't find it.
  
> cluster-reuters doesn't work on Hadoop
> --
>
> Key: MAHOUT-1247
> URL: https://issues.apache.org/jira/browse/MAHOUT-1247
> Project: Mahout
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> At least two issues:
> 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
> 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679090#comment-13679090
 ] 

Grant Ingersoll commented on MAHOUT-1247:
-

I think I see the issue.  The cache file is "local", the Iterator, however, has 
a Hadoop conf that is expecting an HDFS file, hence it can't find it.

> cluster-reuters doesn't work on Hadoop
> --
>
> Key: MAHOUT-1247
> URL: https://issues.apache.org/jira/browse/MAHOUT-1247
> Project: Mahout
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> At least two issues:
> 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
> 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679076#comment-13679076
 ] 

Grant Ingersoll commented on MAHOUT-1247:
-

After you run cluster-reuters.sh, you can run:
{code}bin/mahout org.apache.mahout.vectorizer.DictionaryVectorizer -i 
/tmp/mahout-work-grantingersoll/reuters-out-seqdir-sparse-kmeans/tokenized-documents
 -o ./dicVec{code}

Make sure you have HADOOP_HOME set and also substitute in the appropriate work 
directory.

> cluster-reuters doesn't work on Hadoop
> --
>
> Key: MAHOUT-1247
> URL: https://issues.apache.org/jira/browse/MAHOUT-1247
> Project: Mahout
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> At least two issues:
> 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
> 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679074#comment-13679074
 ] 

Grant Ingersoll commented on MAHOUT-1247:
-

Here's the first error I'm getting: https://paste.apache.org/cik6
{quote}
java.lang.IllegalStateException: /tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
    at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:146)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    ... 9 more
{quote}

Might be related to MAHOUT-992, but not sure.  I added a main to 
DictionaryVectorizer that allows you to reproduce this off of the prior run of 
cluster-reuters without having to go re-run everything.

> cluster-reuters doesn't work on Hadoop
> --
>
> Key: MAHOUT-1247
> URL: https://issues.apache.org/jira/browse/MAHOUT-1247
> Project: Mahout
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> At least two issues:
> 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
> 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679053#comment-13679053
 ] 

Grant Ingersoll commented on MAHOUT-1211:
-

I committed this.  We can leave the issue open for others to review and tweak, 
but it should be possible to close it before the release.

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Stevo Slavic
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch
>
>
> Deprecated Guava {{Closeables.closeQuietly}} API has to be replaced; its 
> usage is a code smell, and that method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.
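
For illustration only, one common way to avoid closeQuietly on writers is to let close() failures propagate. The sketch below uses Java 7 try-with-resources and only the standard library; it is an assumption for the example, not what the attached patches do, and the class and method names are made up. Guava itself also offers Closeables.close(closeable, swallowIOException) for code that cannot use try-with-resources.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Closing a writer without swallowing exceptions. */
final class CloseDemo {

    /** Writes one line; the writer is closed automatically and any close() failure propagates. */
    static String writeLine(String text) throws IOException {
        StringWriter sink = new StringWriter();
        try (Writer out = new BufferedWriter(sink)) {
            out.write(text);
            out.write('\n');
        }   // out.close() runs here and flushes the buffer into sink
        return sink.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(writeLine("hello"));
    }
}
```

For a writer, a failed close() can mean buffered data never reached its destination, which is why silently ignoring it is treated as a smell.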



[jira] [Updated] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1211:


Attachment: MAHOUT-1211.patch

Updated patch to trunk

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Stevo Slavic
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch, MAHOUT-1211.patch
>
>
> Deprecated Guava {{Closeables.closeQuietly}} API has to be replaced; its 
> usage is a code smell, and that method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.



[jira] [Commented] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679048#comment-13679048
 ] 

Grant Ingersoll commented on MAHOUT-1211:
-

Patch coming shortly, based on Suneel's original patch.  Would appreciate 
some eyeballs before committing.  I went with Sean's approach for readers and 
writers.  I think Dmitriy has a valid point, but perhaps we take it on a 
case-by-case basis to see if any harm comes out of quietly closing readers.

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Stevo Slavic
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch
>
>
> Deprecated Guava {{Closeables.closeQuietly}} API has to be replaced; its 
> usage is a code smell, and that method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.



[jira] [Assigned] (MAHOUT-1211) Replace deprecated Closables.closeQuietly calls

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-1211:
---

Assignee: Grant Ingersoll  (was: Ted Dunning)

> Replace deprecated Closables.closeQuietly calls
> ---
>
> Key: MAHOUT-1211
> URL: https://issues.apache.org/jira/browse/MAHOUT-1211
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Stevo Slavic
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1211.patch
>
>
> Deprecated Guava {{Closeables.closeQuietly}} API has to be replaced; its 
> usage is a code smell, and that method is scheduled to be removed from Guava 
> 16.0.
> See [this 
> discussion|https://code.google.com/p/guava-libraries/issues/detail?id=1118] 
> for more info.



[jira] [Resolved] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1103.
-

Resolution: Fixed

> clusterpp is not writing directories for all clusters
> -
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Matt Molek
>Assignee: Grant Ingersoll
>  Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch, MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
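
The partitioner theory from the mailing list can be checked with plain arithmetic. The sketch below repeats the formula used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, on the two hash codes quoted in the report. The reducer count of 2 is an assumption for the example; the report does not say how many reducers the job ran with.

```java
/** Applies the default hash-partitioning arithmetic to the two reported hash codes. */
final class PartitionDemo {

    /** Same arithmetic as Hadoop's default HashPartitioner.getPartition. */
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;   // assumed reducer count, not stated in the report
        // Hash codes quoted in the report for the two cluster names.
        System.out.println("VL-3742464 -> reducer " + partition(-685560454, reducers));
        System.out.println("VL-3742466 -> reducer " + partition(-685560452, reducers));
    }
}
```

Both masked values are even, so with two reducers both cluster names map to reducer 0 and the other reducer receives nothing, which would be consistent with an empty part-r-* file. Whether that is the actual cause of the missing directories is not established by this arithmetic alone.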



[jira] [Resolved] (MAHOUT-1126) Mac builds won't unjar

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1126.
-

Resolution: Fixed

I think the filter I put in place should (hopefully) fix this going forward.

> Mac builds won't unjar
> --
>
> Key: MAHOUT-1126
> URL: https://issues.apache.org/jira/browse/MAHOUT-1126
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8
> Environment: Builds on the Mac
>Reporter: Pat Ferrel
>Assignee: Grant Ingersoll
>  Labels: build
> Fix For: 0.8
>
>
> On the Mac you have to remove the licenses in the mahout jar or hadoop can't 
> unjar mahout. The Mac has a case insensitive file system and so can't tell 
> the difference between LICENSE and license. This was fixed at one point 
> https://issues.apache.org/jira/browse/MAHOUT-780
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/license/
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/LICENSE/
> It looks like, as mentioned in 
> https://issues.apache.org/jira/browse/MAHOUT-780 
> mv target/maven-shared-archive-resources/META-INF/LICENSE 
> target/maven-shared-archive-resources/META-INF/LICENSES
> works too.
> Can this get a permanent fix?
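
To make the collision concrete, here is a minimal sketch that flags jar entry names which would clash on a case-insensitive file system. It works on a plain list of names; the entry names in main are illustrative, and the class is hypothetical rather than part of the Mahout build or of the filter mentioned in the resolution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Finds archive entry names that collide when case is ignored. */
final class CaseCollisions {

    /** Returns entry names that clash with an earlier entry once case and a trailing slash are ignored. */
    static List<String> collisions(List<String> entryNames) {
        Map<String, String> seen = new HashMap<>();
        List<String> clashes = new ArrayList<>();
        for (String name : entryNames) {
            // Drop a trailing slash so a directory entry and a file entry of the same name compare equal.
            String key = name.endsWith("/") ? name.substring(0, name.length() - 1) : name;
            key = key.toLowerCase(Locale.ROOT);
            String first = seen.putIfAbsent(key, name);
            if (first != null) {
                clashes.add(name);
            }
        }
        return clashes;
    }

    public static void main(String[] args) {
        // Illustrative entry names, not a listing of the real job jar.
        List<String> entries = Arrays.asList(
            "META-INF/MANIFEST.MF", "META-INF/LICENSE", "META-INF/license/");
        System.out.println(collisions(entries));   // prints [META-INF/license/]
    }
}
```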



[jira] [Assigned] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-1247:
---

Assignee: Grant Ingersoll

> cluster-reuters doesn't work on Hadoop
> --
>
> Key: MAHOUT-1247
> URL: https://issues.apache.org/jira/browse/MAHOUT-1247
> Project: Mahout
>  Issue Type: Bug
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> At least two issues:
> 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
> 2. The ExtractReuters data is not being moved to HDFS.



[jira] [Created] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)
Grant Ingersoll created MAHOUT-1247:
---

 Summary: cluster-reuters doesn't work on Hadoop
 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
 Fix For: 0.8


At least two issues:

1. MAHOUT-992 messed up the Distributed Cache stuff somehow
2. The ExtractReuters data is not being moved to HDFS.



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678753#comment-13678753
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

The MapReduce portion of this will never function correctly in Hadoop 
LocalMode.  I've added a printout to the Driver to note this in my next patch.

> clusterpp is not writing directories for all clusters
> -
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Matt Molek
>Assignee: Grant Ingersoll
>  Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch, MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.



[jira] [Work started] (MAHOUT-1126) Mac builds won't unjar

2013-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1126 started by Grant Ingersoll.

> Mac builds won't unjar
> --
>
> Key: MAHOUT-1126
> URL: https://issues.apache.org/jira/browse/MAHOUT-1126
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8
> Environment: Builds on the Mac
>Reporter: Pat Ferrel
>Assignee: Grant Ingersoll
>  Labels: build
> Fix For: 0.8
>
>
> On the Mac you have to remove the licenses in the mahout jar or hadoop can't 
> unjar mahout. The Mac has a case insensitive file system and so can't tell 
> the difference between LICENSE and license. This was fixed at one point 
> https://issues.apache.org/jira/browse/MAHOUT-780
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/license/
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/LICENSE/
> It looks like, as mentioned in 
> https://issues.apache.org/jira/browse/MAHOUT-780 
> mv target/maven-shared-archive-resources/META-INF/LICENSE 
> target/maven-shared-archive-resources/META-INF/LICENSES
> works too.
> Can this get a permanent fix?



[jira] [Work started] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1103 started by Grant Ingersoll.

> clusterpp is not writing directories for all clusters
> -
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Matt Molek
>Assignee: Grant Ingersoll
>  Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch, MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.



[jira] [Commented] (MAHOUT-1126) Mac builds won't unjar

2013-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678751#comment-13678751
 ] 

Grant Ingersoll commented on MAHOUT-1126:
-

OK, I see the issue more clearly now that it has officially bitten me...

> Mac builds won't unjar
> --
>
> Key: MAHOUT-1126
> URL: https://issues.apache.org/jira/browse/MAHOUT-1126
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8
> Environment: Builds on the Mac
>Reporter: Pat Ferrel
>Assignee: Grant Ingersoll
>  Labels: build
> Fix For: 0.8
>
>
> On the Mac you have to remove the licenses in the mahout jar or hadoop can't 
> unjar mahout. The Mac has a case insensitive file system and so can't tell 
> the difference between LICENSE and license. This was fixed at one point 
> https://issues.apache.org/jira/browse/MAHOUT-780
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/license/
> zip -d mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar 
> META-INF/LICENSE/
> It looks like, as mentioned in 
> https://issues.apache.org/jira/browse/MAHOUT-780 
> mv target/maven-shared-archive-resources/META-INF/LICENSE 
> target/maven-shared-archive-resources/META-INF/LICENSES
> works too.
> Can this get a permanent fix?



[jira] [Resolved] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.

2013-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1084.
-

Resolution: Fixed

Thanks liutengfei!

> Kmeans for synthetic control example--there are 12 cluster during iterations.
> -
>
> Key: MAHOUT-1084
> URL: https://issues.apache.org/jira/browse/MAHOUT-1084
> Project: Mahout
>  Issue Type: Bug
>Reporter: liutengfei
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> In the Mahout k-means syntheticcontrol example, the default parameters 
> should produce 6 clusters in the end. So why are there 12 clusters 
> during the k-means iterations? From what I observed, the first 6 clusters 
> and the last 6 clusters are identical before the first iteration; those 6 
> clusters are generated by RandomSeedGenerator.java. The CIMapper then 
> assigns its points to these 12 clusters. Is this a logic error?
> The 12 clusters are created by the "setup" function in CIMapper.java, 
> specifically by the line "classifier.readFromSeqFiles(conf, new 
> Path(priorClustersPath));". Here "priorClustersPath" is the HDFS directory 
> "output/clusters-0/", which contains 8 files: "_policy", 
> "part-randomSeed" (one file recording six clusters), and "part-0" to 
> "part-5" (six files, each recording one cluster). When this directory is 
> read, "_policy" is filtered out, so the program reads "part-0" to 
> "part-5" to create six clusters, then reads "part-randomSeed" to create 
> the other six. This is why there are 12 clusters before the first 
> iteration.
> Solution: delete the code that creates the duplicate clusters in 
> "output/clusters-0/". Here I removed the code that creates the files 
> "part-0" to "part-5" in ClusterClassifier.java:
>   public void writeToSeqFiles(Path path) throws IOException {
>     writePolicy(policy, path);
>     /*
>     Configuration config = new Configuration();
>     FileSystem fs = FileSystem.get(path.toUri(), config);
>     SequenceFile.Writer writer = null;
>     ClusterWritable cw = new ClusterWritable();
>     for (int i = 0; i < models.size(); i++) {
>       try {
>         Cluster cluster = models.get(i);
>         cw.setValue(cluster);
>         writer = new SequenceFile.Writer(fs, config,
>             new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d", i)),
>             IntWritable.class, ClusterWritable.class);
>         Writable key = new IntWritable(i);
>         writer.append(key, cw);
>       } finally {
>         Closeables.closeQuietly(writer);
>       }
>     }
>     */
>   }
> I don't know whether this is still OK for other programs that use this file, 
> but for k-means in the syntheticcontrol example, the program creates 6 
> clusters in every iteration, as I expected.
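
The double counting described in the report can be modelled without Hadoop. The sketch below treats the clusters-0 directory as a map from file name to the number of clusters stored in that file, and skips names starting with an underscore. Both the map and the underscore rule are assumptions made for the illustration; the real filter lives in the Mahout and Hadoop code and is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Models how many prior clusters are loaded from a directory listing. */
final class PriorClusterCount {

    /** Sums the clusters of every file whose name does not start with an underscore. */
    static int countLoaded(Map<String, Integer> clustersPerFile) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : clustersPerFile.entrySet()) {
            if (entry.getKey().startsWith("_")) {
                continue;   // e.g. "_policy" is filtered out
            }
            total += entry.getValue();
        }
        return total;
    }

    /** Directory contents as described in the report: 8 files in output/clusters-0/. */
    static Map<String, Integer> clustersZero() {
        Map<String, Integer> files = new LinkedHashMap<>();
        files.put("_policy", 0);
        files.put("part-randomSeed", 6);   // one file holding all six seed clusters
        for (int i = 0; i <= 5; i++) {
            files.put("part-" + i, 1);     // six files, one cluster each
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, Integer> files = clustersZero();
        System.out.println(countLoaded(files));   // prints 12
        files.remove("part-randomSeed");
        System.out.println(countLoaded(files));   // prints 6
    }
}
```

Removing either source of clusters from the directory brings the count back to six, whether that means not writing the per-cluster part files or not keeping part-randomSeed in clusters-0.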



[jira] [Commented] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.

2013-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678733#comment-13678733
 ] 

Grant Ingersoll commented on MAHOUT-1084:
-

liutengfei is right in that it is being confused by the random seed stuff.  
However, rather than changing the code, I think we should just not write the 
random seed to the clusters-0 directory to begin with and instead write it 
somewhere else.

> Kmeans for synthetic control example--there are 12 cluster during iterations.
> -
>
> Key: MAHOUT-1084
> URL: https://issues.apache.org/jira/browse/MAHOUT-1084
> Project: Mahout
>  Issue Type: Bug
>Reporter: liutengfei
>Assignee: Grant Ingersoll
> Fix For: 0.8
>
>
> In the Mahout k-means syntheticcontrol example, the default parameters 
> should produce 6 clusters in the end. So why are there 12 clusters 
> during the k-means iterations? From what I observed, the first 6 clusters 
> and the last 6 clusters are identical before the first iteration; those 6 
> clusters are generated by RandomSeedGenerator.java. The CIMapper then 
> assigns its points to these 12 clusters. Is this a logic error?
> The 12 clusters are created by the "setup" function in CIMapper.java, 
> specifically by the line "classifier.readFromSeqFiles(conf, new 
> Path(priorClustersPath));". Here "priorClustersPath" is the HDFS directory 
> "output/clusters-0/", which contains 8 files: "_policy", 
> "part-randomSeed" (one file recording six clusters), and "part-0" to 
> "part-5" (six files, each recording one cluster). When this directory is 
> read, "_policy" is filtered out, so the program reads "part-0" to 
> "part-5" to create six clusters, then reads "part-randomSeed" to create 
> the other six. This is why there are 12 clusters before the first 
> iteration.
> Solution: delete the code that creates the duplicate clusters in 
> "output/clusters-0/". Here I removed the code that creates the files 
> "part-0" to "part-5" in ClusterClassifier.java:
>   public void writeToSeqFiles(Path path) throws IOException {
>     writePolicy(policy, path);
>     /*
>     Configuration config = new Configuration();
>     FileSystem fs = FileSystem.get(path.toUri(), config);
>     SequenceFile.Writer writer = null;
>     ClusterWritable cw = new ClusterWritable();
>     for (int i = 0; i < models.size(); i++) {
>       try {
>         Cluster cluster = models.get(i);
>         cw.setValue(cluster);
>         writer = new SequenceFile.Writer(fs, config,
>             new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d", i)),
>             IntWritable.class, ClusterWritable.class);
>         Writable key = new IntWritable(i);
>         writer.append(key, cw);
>       } finally {
>         Closeables.closeQuietly(writer);
>       }
>     }
>     */
>   }
> I don't know whether this is still OK for other programs that use this file, 
> but for k-means in the syntheticcontrol example, the program creates 6 
> clusters in every iteration, as I expected.



[jira] [Commented] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.

2013-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678728#comment-13678728
 ] 

Grant Ingersoll commented on MAHOUT-1084:
-

I confirm there is something wrong here.  We are passing in a k = 6 and we are 
getting back 12 clusters.



[jira] [Work started] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.

2013-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1084 started by Grant Ingersoll.



[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1103:


Attachment: MAHOUT-1103.patch

Matt, can you check this iteration on your patch?  That being said, it doesn't 
work for me when I run the MR job locally on a small data set.  It would 
be nice to get this self-contained somehow in a small unit test.

> clusterpp is not writing directories for all clusters
> -
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Matt Molek
>Assignee: Grant Ingersoll
>  Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch, MAHOUT-1103.patch, MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default Hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
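
The partitioner suspicion raised above can be checked by hand. Hadoop's default HashPartitioner picks a reducer as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the sketch below applies that arithmetic to the two hash values reported in the description. The class and method names are hypothetical, and the reducer count of 2 is an assumption (one reducer per cluster in the k=2 run); nothing here confirms that this is the actual cause of the empty part-r-* files.

```java
// Standalone arithmetic check; does not depend on Hadoop.
final class HashPartitionSketch {

  // Same formula as the default hash partitioner: clear the sign bit,
  // then take the remainder modulo the number of reduce tasks.
  static int partitionFor(int keyHash, int numReduceTasks) {
    return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    // Hash values reported in the issue for VL-3742464 and VL-3742466.
    int[] reportedHashes = {-685560454, -685560452};
    int assumedReducers = 2; // assumption: one reducer per cluster for k=2
    for (int hash : reportedHashes) {
      System.out.println(hash + " -> reducer " + partitionFor(hash, assumedReducers));
    }
  }
}
```

Both hashes are even after the sign bit is cleared, so under these assumptions both cluster names map to reducer 0 and reducer 1 receives no keys, which would be consistent with one empty part-r-* file.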



[jira] [Commented] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

2013-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678726#comment-13678726
 ] 

Grant Ingersoll commented on MAHOUT-1233:
-

Yannis, any chance you have a small self-contained test?  Or, can you reproduce 
this using any of the examples?  Just trying to make it easier to test this.

> Problem in processing datasets as a single chunk vs many chunks in HADOOP 
> mode in mostly all the clustering algos
> -
>
> Key: MAHOUT-1233
> URL: https://issues.apache.org/jira/browse/MAHOUT-1233
> Project: Mahout
>  Issue Type: Question
>  Components: Clustering
>Affects Versions: 0.7, 0.8
>Reporter: yannis ats
>Assignee: yannis ats
>Priority: Minor
> Fix For: 0.8
>
>
> I am trying to process a dataset in two ways: first as a single chunk (the 
> whole dataset), and second as many smaller chunks in order to increase the 
> throughput of my machine.
> When I run the single-chunk computation the results are fine: if the input 
> has 1000 vectors, the output has 1000 vector IDs with their cluster IDs (I 
> have tried this with canopy, k-means and fuzzy k-means).
> However, when I split the dataset in order to speed up the computation, 
> strange results appear.
> For instance, if the same dataset of 1000 vectors is split into, for 
> example, 10 files, the output contains more vector IDs (e.g. 1100 vector IDs 
> with their corresponding cluster IDs).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many 
> files?
> While Mahout is performing the computations, the screen output says that it 
> processed the correct number of vectors.
> Am I missing something?
> As input I use Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.



[jira] [Created] (MAHOUT-1245) Move Website(s) to ASF CMS

2013-06-08 Thread Grant Ingersoll (JIRA)
Grant Ingersoll created MAHOUT-1245:
---

 Summary: Move Website(s) to ASF CMS
 Key: MAHOUT-1245
 URL: https://issues.apache.org/jira/browse/MAHOUT-1245
 Project: Mahout
  Issue Type: Task
Reporter: Grant Ingersoll
 Fix For: 0.9


The ASF CMS makes editing sites a whole lot easier using pub-sub and Markdown.

We should move to it.  We will be much happier.  I'd even propose we move most 
of our wiki to it and let users comment instead of edit.



[jira] [Commented] (MAHOUT-1241) Mailing list archives not available

2013-06-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678725#comment-13678725
 ] 

Grant Ingersoll commented on MAHOUT-1241:
-

I get:
{quote}
patch -p 0 -i ../../patches/MAHOUT-1241.patch --dry-run
patching file .htaccess
patching file developer-resources.html
patching file developer-resources.pdf
patching file index.pdf
patch:  malformed patch at line 1303: Index: linkmap.html
{quote}

Dang, are we still using Forrest?  We should switch to the new ASF CMS.  I 
opened an issue for it.

> Mailing list archives not available
> ---
>
> Key: MAHOUT-1241
> URL: https://issues.apache.org/jira/browse/MAHOUT-1241
> Project: Mahout
>  Issue Type: Bug
>  Components: Website
>Reporter: Zeno Gantner
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1241.patch
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/
> http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/
> give me a 404 error.
> These are the mailing list archives linked from here:
> http://mahout.apache.org/mailinglists.html



[jira] [Updated] (MAHOUT-958) NullPointerException in RepresentativePointsMapper when running cluster-reuters.sh example with kmeans

2013-06-07 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-958:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

I couldn't reproduce this, but I suspect it is due to CDH, where I have seen 
similar behaviors before.  At any rate, the code looked good, so I went ahead 
and committed it.

> NullPointerException in RepresentativePointsMapper when running 
> cluster-reuters.sh example with kmeans
> --
>
> Key: MAHOUT-958
> URL: https://issues.apache.org/jira/browse/MAHOUT-958
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.6
> Environment: {code}
> > uname -a
> Linux 3.2.1-3.fc16.x86_64 #1 SMP Mon Jan 23 15:36:17 UTC 2012 x86_64 x86_64 
> x86_64 GNU/Linux
> {code}
> {code}
> > java -version
> java version "1.7.0_02"
> Java(TM) SE Runtime Environment (build 1.7.0_02-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 22.0-b10, mixed mode)
> {code}
> Hadoop Version: 0.20.203.0, r1099333
>Reporter: Rares Vernica
>Assignee: Dan Filimon
> Fix For: 0.8
>
> Attachments: MAHOUT-958.patch
>
>
> {code}
> > svn info
> Path: .
> URL: http://svn.apache.org/repos/asf/mahout/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 1235544
> Node Kind: directory
> Schedule: normal
> Last Changed Author: tdunning
> Last Changed Rev: 1231800
> Last Changed Date: 2012-01-15 16:01:38 -0800 (Sun, 15 Jan 2012)
> {code}
> {code}
> > ./examples/bin/cluster-reuters.sh
> ...
> 1. kmeans clustering
> ...
> Inter-Cluster Density: NaN
> Intra-Cluster Density: 0.0
> CDbw Inter-Cluster Density: 0.0
> CDbw Intra-Cluster Density: NaN
> CDbw Separation: 0.0
> 12/01/24 16:08:47 INFO clustering.ClusterDumper: Wrote 20 clusters
> 12/01/24 16:08:47 INFO driver.MahoutDriver: Program took 126749 ms (Minutes: 
> 2.11248335)
> {code}
> All five "{{Representative Points Driver}}" jobs fail.
> {code}
> 2012-01-24 16:07:11,555 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded 
> the native-hadoop library
> 2012-01-24 16:07:11,881 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: data buffer = 
> 79691776/99614720
> 2012-01-24 16:07:11,896 INFO org.apache.hadoop.mapred.MapTask: record buffer 
> = 262144/327680
> 2012-01-24 16:07:11,956 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2012-01-24 16:07:11,979 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName vernica for UID 1000 from the native implementation
> 2012-01-24 16:07:11,981 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.lang.NullPointerException
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.mapPoint(RepresentativePointsMapper.java:73)
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:60)
>   at 
> org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.map(RepresentativePointsMapper.java:40)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>   at org.apache.hadoop.mapred.Child.main(Child.java:253)
> {code}



[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678293#comment-13678293
 ] 

Grant Ingersoll commented on MAHOUT-944:


Saw that.  Fixing.  Not a show stopper, but needs to be fixed.

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944-minor.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can then be fed into seq2sparse, and from there 
> you can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Assigned] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-1084:
---

Assignee: Grant Ingersoll  (was: Robin Anil)



[jira] [Resolved] (MAHOUT-1244) Upgrade Mahout to Lucene 4.3

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-1244.
-

Resolution: Fixed

> Upgrade Mahout to Lucene 4.3
> 
>
> Key: MAHOUT-1244
> URL: https://issues.apache.org/jira/browse/MAHOUT-1244
> Project: Mahout
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Minor
> Fix For: 0.8
>
>




[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677788#comment-13677788
 ] 

Grant Ingersoll commented on MAHOUT-944:


Added LuceneSeqFileHelper.  Need to switch back to a pure SVN workflow, I 
guess, as I seem to be getting the git one wrong.

As for the Version thing, I will try to get to it today.



[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677242#comment-13677242
 ] 

Grant Ingersoll commented on MAHOUT-944:


That patch should apply from trunk, but I'm curious now to know what happened, 
so I want to give it a bit.




[jira] [Updated] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-944:
---

Attachment: MAHOUT-944-minor.patch

Here's the diff to trunk at the moment compared with what I have committed on 
my local branch.  Either dcommit hasn't finished applying all the commits or it 
broke.



[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677237#comment-13677237
 ] 

Grant Ingersoll commented on MAHOUT-944:


Hmm, I wonder if I should have squashed my local commits:
{quote}
Committed r1490329
W: 0a28b0f322ffe888553b9e2adf0b6f098b679f16 and refs/remotes/origin/trunk 
differ, using rebase:
:040000 040000 779e2a48da78d2f59f994c83eb1cb91a42b04d41 
6e8221954eecd7ee27788976dc7b2665985cd7e6 M  integration
:100644 100644 492aa3aacbee4e33fb70a2e361d772a9d881ae04 
09c5ae712a035af3eef2c3c56db708b8fa75e1b3 M  pom.xml
:040000 040000 39350289431946a74a7bd15fbf72947261055536 
c7274b40f5de032b1668ed9d6f2d1fa24ff0a124 M  src
Current branch MAHOUT-944 is up to date.
# of revisions changed  
before:
 d668ddf606dbb0d046f0fe8e3eb97e06fcd4c406
9eafd07120a1810d778dfeb4502ba36b5b3eacfe
253a58c30d0a22150234975f782720248b51a8cb 

after:
 0a28b0f322ffe888553b9e2adf0b6f098b679f16
d668ddf606dbb0d046f0fe8e3eb97e06fcd4c406
9eafd07120a1810d778dfeb4502ba36b5b3eacfe
253a58c30d0a22150234975f782720248b51a8cb 
 If you are attempting to commit  merges, try running:
 git rebase --interactive --preserve-merges  refs/remotes/origin/trunk 
Before dcommitting
{quote}

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677223#comment-13677223
 ] 

Grant Ingersoll commented on MAHOUT-944:


uh oh.  Should have been 4.3.  Must have messed up Git.  WTF.  The whole thing 
is messed up.

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Commented] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677216#comment-13677216
 ] 

Grant Ingersoll commented on MAHOUT-992:


Just venting, but could the DistributedCache methods be any more poorly named?  
For getting, "getLocalCacheFiles" is the "user" API and "getCacheFiles" is the 
internal API.  For setting, there is "setLocalFiles" which is internal and 
"setCacheFiles" which is user facing.

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Comment Edited] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677185#comment-13677185
 ] 

Grant Ingersoll edited comment on MAHOUT-992 at 6/6/13 4:17 PM:


[~ssc] or [~robin.a...@gmail.com] 

I see this in several places:
{code}
Path[] files = DistributedCache.getLocalCacheFiles(conf);
if (files == null) {
  throw new IOException("Cannot read Frequency list from Distributed Cache");
}
if (files.length != 1) {
  throw new IOException("Cannot read Frequency list from Distributed Cache (" + files.length + ')');
}
FileSystem fs = FileSystem.getLocal(conf);
Path fListLocalPath = fs.makeQualified(files[0]);
// Fallback if we are running locally.
if (!fs.exists(fListLocalPath)) {
  URI[] filesURIs = DistributedCache.getCacheFiles(conf);
  if (filesURIs == null) {
    throw new IOException("Cannot read Frequency list from Distributed Cache");
  }
  if (filesURIs.length != 1) {
    throw new IOException("Cannot read Frequency list from Distributed Cache (" + files.length + ')');
  }
  fListLocalPath = new Path(filesURIs[0].getPath());
}
{code}

I don't really follow the "Fallback if running locally" comment.  The first 
part of the code is looking in the local file system.  Doesn't (or shouldn't?) 
Hadoop handle this seamlessly?  Seems odd to me that we would get something 
that is in the local file system and then it not be there.
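For reference, the lookup-then-fallback control flow in the snippet quoted above can be isolated into one helper. What follows is only a hypothetical, JDK-only sketch: the two arrays that DistributedCache.getLocalCacheFiles(conf) and DistributedCache.getCacheFiles(conf) would return are passed in as plain parameters, java.nio stands in for Hadoop's FileSystem and Path, and the class and method names are made up for illustration. It does not settle whether the fallback is needed; it only puts the logic in one place.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; the names are illustrative and not part of Mahout or Hadoop.
final class CacheFileLookupSketch {

  private CacheFileLookupSketch() {
  }

  // localFiles stands in for DistributedCache.getLocalCacheFiles(conf),
  // cacheUris for DistributedCache.getCacheFiles(conf).
  static Path resolveSingleCacheFile(Path[] localFiles, URI[] cacheUris) throws IOException {
    requireExactlyOne(localFiles == null ? -1 : localFiles.length);
    Path local = localFiles[0].toAbsolutePath();
    if (Files.exists(local)) {
      return local;
    }
    // Fallback branch: the localized copy is missing, so use the path of the original URI.
    requireExactlyOne(cacheUris == null ? -1 : cacheUris.length);
    return Paths.get(cacheUris[0].getPath());
  }

  // count is -1 when the corresponding array was null.
  private static void requireExactlyOne(int count) throws IOException {
    if (count < 0) {
      throw new IOException("Cannot read Frequency list from Distributed Cache");
    }
    if (count != 1) {
      throw new IOException("Cannot read Frequency list from Distributed Cache (" + count + ')');
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("freq", ".list");
    System.out.println(resolveSingleCacheFile(new Path[] {tmp}, null));
    Files.delete(tmp);
  }
}
```

Passing the count into the check means each error message reports the length of the array that was actually inspected.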


  was (Author: gsingers):
[~ssc] or [~robin.a...@gmail.com] 

I see this in several places:
{code}
Path[] files = DistributedCache.getLocalCacheFiles(conf);
if (files == null) {
  throw new IOException("Cannot read Frequency list from Distributed Cache");
}
if (files.length != 1) {
  throw new IOException("Cannot read Frequency list from Distributed Cache (" + files.length + ')');
}
FileSystem fs = FileSystem.getLocal(conf);
Path fListLocalPath = fs.makeQualified(files[0]);
// Fallback if we are running locally.
if (!fs.exists(fListLocalPath)) {
  URI[] filesURIs = DistributedCache.getCacheFiles(conf);
  if (filesURIs == null) {
    throw new IOException("Cannot read Frequency list from Distributed Cache");
  }
  if (filesURIs.length != 1) {
    throw new IOException("Cannot read Frequency list from Distributed Cache (" + files.length + ')');
  }
  fListLocalPath = new Path(filesURIs[0].getPath());
}
{code}

I don't really follow the "Fallback if running locally" comment.  The first 
part of the code is looking in the local file system.  Doesn't (or shouldn't?) 
Hadoop handle this seamlessly?

  
> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Commented] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677185#comment-13677185
 ] 

Grant Ingersoll commented on MAHOUT-992:


[~ssc] or [~robin.a...@gmail.com] 

I see this in several places:
{code}
Path[] files = DistributedCache.getLocalCacheFiles(conf);
if (files == null) {
  throw new IOException("Cannot read Frequency list from Distributed Cache");
}
if (files.length != 1) {
  throw new IOException("Cannot read Frequency list from Distributed Cache (" + files.length + ')');
}
FileSystem fs = FileSystem.getLocal(conf);
Path fListLocalPath = fs.makeQualified(files[0]);
// Fallback if we are running locally.
if (!fs.exists(fListLocalPath)) {
  URI[] filesURIs = DistributedCache.getCacheFiles(conf);
  if (filesURIs == null) {
    throw new IOException("Cannot read Frequency list from Distributed Cache");
  }
  if (filesURIs.length != 1) {
    throw new IOException("Cannot read Frequency list from Distributed Cache (" + files.length + ')');
  }
  fListLocalPath = new Path(filesURIs[0].getPath());
}
{code}

I don't really follow the "Fallback if running locally" comment.  The first 
part of the code is looking in the local file system.  Doesn't (or shouldn't?) 
Hadoop handle this seamlessly?


> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Assigned] (MAHOUT-992) Audit DistributedCache use to support EMR

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-992:
--

Assignee: Grant Ingersoll  (was: Matteo Riondato)

> Audit DistributedCache use to support EMR
> -
>
> Key: MAHOUT-992
> URL: https://issues.apache.org/jira/browse/MAHOUT-992
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.6
>Reporter: tom pierce
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: newbie
> Fix For: 0.8
>
>
> Apparently some of our DistributedCache use is not EMR-safe.  It would be 
> great if someone could audit our uses of DC, and fix up this problem where it 
> exists.
> For an example of problematic usage (and the fix), see MAHOUT-980.  



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677163#comment-13677163
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

[~mmolek] Any luck on the patch?  I'd like to close this out before 0.8 if 
possible.

> clusterpp is not writing directories for all clusters
> -
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Matt Molek
>Assignee: Grant Ingersoll
>  Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
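To make the partitioner hypothesis from the description concrete, here is a small sketch of the arithmetic involved. It assumes the default partitioner picks a reducer as (hash & Integer.MAX_VALUE) % numReduceTasks, and it assumes a 31-multiplier byte hash starting from 1 for the key, which reproduces the two hash values quoted in the report. The class and method names are hypothetical, and this only illustrates the hypothesis; it is not a confirmed diagnosis of the bug.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not Mahout or Hadoop code.
final class PartitionSketch {

  private PartitionSketch() {
  }

  // 31-multiplier hash over the key's bytes, starting from 1.
  // Assumption: this mirrors how the Text keys in the report were hashed.
  static int byteHash(String key) {
    int hash = 1;
    for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
      hash = 31 * hash + b;
    }
    return hash;
  }

  // Clear the sign bit, then take the remainder by the number of reducers.
  static int partition(int hash, int numReducers) {
    return (hash & Integer.MAX_VALUE) % numReducers;
  }

  public static void main(String[] args) {
    for (String key : new String[] {"VL-3742464", "VL-3742466"}) {
      int h = byteHash(key);
      System.out.println(key + " -> " + h + " -> reducer " + partition(h, 2));
    }
  }
}
```

Under these assumptions both keys map to reducer 0 when there are two reducers, because the two hashes differ by 2 and so have the same parity. That would leave the other reducer with no input, which is at least consistent with the empty part-r-* file mentioned in the report.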



[jira] [Updated] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-944:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Went ahead and committed, as I believe it is functional.  Extra eyeballs to 
review would be good.

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Issue Comment Deleted] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-944:
---

Comment: was deleted

(was: I'll let it sit for a day or two and then commit.)

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Updated] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-944:
---

Attachment: MAHOUT-944.patch

I think this is ready to go.  Some other eyeballs would be appreciated.

Changes from last patch:
# changelog addition
# Cleaned up and standardized a lot of the tests
# Added tests for multiple commit points and multiple directories
# Cleaned up and simplified a number of areas
# Added license headers where missing
# The sequential and M/R version are now consistent in their handling of empty 
id fields and values
# Added some counters to the M/R job


> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677058#comment-13677058
 ] 

Grant Ingersoll commented on MAHOUT-944:


I'll let it sit for a day or two and then commit.

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Updated] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.

2013-06-06 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-961:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thank you, Ikumasa!

> Modify the Tree/Forest Visualizer on DF.
> 
>
> Key: MAHOUT-961
> URL: https://issues.apache.org/jira/browse/MAHOUT-961
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ikumasa Mukai
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: RandomForest
> Fix For: 0.8
>
> Attachments: MAHOUT-961.patch, MAHOUT-961.patch, MAHOUT-961.patch, 
> MAHOUT-961.patch
>
>
> The Tree/Forest visualizer (MAHOUT-926) has problems.
> 1) an un-complemented stem which has no leaf or node is shown.
> 2) all stems are not shown when the data doesn't have all categories.



[jira] [Updated] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

2013-06-05 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-944:
---

Attachment: MAHOUT-944.patch

Reworked some of the collector stuff for the sequential case.  Tests pass, but 
haven't reviewed the thoroughness of the tests yet.  Still needs another run 
through and review of the M/R code, as I haven't looked at that in depth yet.

All that being said, this is getting really close.

> LuceneIndexToSequenceFiles (lucene2seq) utility
> ---
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
>  Issue Type: New Feature
>  Components: Integration
>Affects Versions: 0.5
>Reporter: Frank Scholten
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
> MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files 
> based on the stored fields of a lucene index.
> The output from this tool can be then fed into seq2sparse and from there you 
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this 
> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
> overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675924#comment-13675924
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

bq. But to generate the patch, I need to base it on the svn trunk version.

If you have a patch for 0.7, post it, perhaps we can help.  A half done patch 
is a better start than no patch.

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>  Labels: clustering, improvement
> Fix For: Backlog
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> sometimes a bad initialization will generate a wrong clustering result. In this 
> case, the selected K eigenvectors actually provide a better way to initialize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.
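The two ideas in the description can be sketched in a few lines. This is a minimal illustration, not the reporter's patch: it assumes the selected eigenvectors are available as plain double arrays with one component per data point, and the class and method names are hypothetical. cosAngle is one possible reading of the "cosAngle/orthogonality" check (a value near 0 means two eigenvectors are close to orthogonal), and seedIndices picks, for each eigenvector, the data point whose component has the largest absolute value.

```java
// Hypothetical sketch; not taken from Mahout or from the proposed patch.
final class SpectralSeedSketch {

  private SpectralSeedSketch() {
  }

  // Cosine of the angle between two vectors of equal length.
  static double cosAngle(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / Math.sqrt(normA * normB);
  }

  // For each eigenvector, return the index of the data point whose
  // component has the largest absolute value.
  static int[] seedIndices(double[][] eigenvectors) {
    int[] seeds = new int[eigenvectors.length];
    for (int k = 0; k < eigenvectors.length; k++) {
      double[] v = eigenvectors[k];
      int best = 0;
      for (int i = 1; i < v.length; i++) {
        if (Math.abs(v[i]) > Math.abs(v[best])) {
          best = i;
        }
      }
      seeds[k] = best;
    }
    return seeds;
  }

  public static void main(String[] args) {
    double[][] vectors = {{0.1, -0.9, 0.2}, {0.7, 0.1, 0.0}};
    System.out.println(java.util.Arrays.toString(seedIndices(vectors)));
    System.out.println(cosAngle(vectors[0], vectors[1]));
  }
}
```

The sketch does not handle the case where two eigenvectors select the same data point; a real implementation would need a rule for that.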



[jira] [Commented] (MAHOUT-1214) Improve the accuracy of the Spectral KMeans Method

2013-06-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675783#comment-13675783
 ] 

Grant Ingersoll commented on MAHOUT-1214:
-

Should this be in 0.8?

> Improve the accuracy of the Spectral KMeans Method
> --
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.7
> Environment: Mahout 0.7
>Reporter: Yiqun Hu
>  Labels: clustering, improvement
> Fix For: Backlog
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et al., 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and 
> sometimes a bad initialization will generate a wrong clustering result. In this 
> case, the selected K eigenvectors actually provide a better way to initialize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on a synthetic dataset, and it shows 
> that the improved version gets the optimal clustering result while the current 
> 0.7 version obtains the wrong result.



[jira] [Commented] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.

2013-06-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675771#comment-13675771
 ] 

Grant Ingersoll commented on MAHOUT-961:


Ikumasa, thank you!  Applying now.

> Modify the Tree/Forest Visualizer on DF.
> 
>
> Key: MAHOUT-961
> URL: https://issues.apache.org/jira/browse/MAHOUT-961
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ikumasa Mukai
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: RandomForest
> Fix For: 0.8
>
> Attachments: MAHOUT-961.patch, MAHOUT-961.patch, MAHOUT-961.patch, 
> MAHOUT-961.patch
>
>
> The Tree/Forest visualizer (MAHOUT-926) has problems.
> 1) an un-complemented stem which has no leaf or node is shown.
> 2) all stems are not shown when the data doesn't have all categories.



[jira] [Commented] (MAHOUT-916) Make Mahout's tests run in parallel

2013-06-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675746#comment-13675746
 ] 

Grant Ingersoll commented on MAHOUT-916:


Sounds right to me.

> Make Mahout's tests run in parallel
> ---
>
> Key: MAHOUT-916
> URL: https://issues.apache.org/jira/browse/MAHOUT-916
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Reporter: Grant Ingersoll
>Assignee: Isabel Drost-Fromm
>Priority: Minor
>  Labels: MAHOUT_INTRO_CONTRIBUTE
> Fix For: 0.8
>
> Attachments: MAHOUT-916.patch, MAHOUT-916.patch, MAHOUT-916.patch, 
> MAHOUT-916.patch
>
>
> Maven now supports parallel execution of tests.  We should hook this in to 
> Mahout.



[jira] [Commented] (MAHOUT-916) Make Mahout's tests run in parallel

2013-06-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675672#comment-13675672
 ] 

Grant Ingersoll commented on MAHOUT-916:


I don't have an SSD.  Let me re-run things and see how it goes.  Maybe I'm just 
used to Lucene's tests, which usually do peg my CPU (thanks, Dawid, for the 
insight on tuning them!)

> Make Mahout's tests run in parallel
> ---
>
> Key: MAHOUT-916
> URL: https://issues.apache.org/jira/browse/MAHOUT-916
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Reporter: Grant Ingersoll
>Assignee: Isabel Drost-Fromm
>Priority: Minor
>  Labels: MAHOUT_INTRO_CONTRIBUTE
> Fix For: 0.8
>
> Attachments: MAHOUT-916.patch, MAHOUT-916.patch, MAHOUT-916.patch, 
> MAHOUT-916.patch
>
>
> Maven now supports parallel execution of tests.  We should hook this in to 
> Mahout.



[jira] [Commented] (MAHOUT-916) Make Mahout's tests run in parallel

2013-06-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675658#comment-13675658
 ] 

Grant Ingersoll commented on MAHOUT-916:


They are a lot faster for me and my computer is still usable. I'm on: MBP, 2.2, 
i7 w/ 16 GB of RAM.  So, slower machine, more memory, relatively speaking.



[jira] [Commented] (MAHOUT-916) Make Mahout's tests run in parallel

2013-06-04 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675640#comment-13675640
 ] 

Grant Ingersoll commented on MAHOUT-916:


Can we parameterize it?



[jira] [Updated] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.

2013-06-04 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-961:
---

Priority: Minor  (was: Major)

> Modify the Tree/Forest Visualizer on DF.
> 
>
> Key: MAHOUT-961
> URL: https://issues.apache.org/jira/browse/MAHOUT-961
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ikumasa Mukai
>Assignee: Grant Ingersoll
>Priority: Minor
>  Labels: RandomForest
> Fix For: 0.8
>
> Attachments: MAHOUT-961.patch, MAHOUT-961.patch, MAHOUT-961.patch, 
> MAHOUT-961.patch
>
>
> The Tree/Forest visualizer (MAHOUT-926) has problems.
> 1) an un-complemented stem which has no leaf or node is shown.
> 2) all stems are not shown when the data doesn't have all categories.



[jira] [Assigned] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.

2013-06-04 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-961:
--

Assignee: Grant Ingersoll  (was: Sebastian Schelter)



[jira] [Commented] (MAHOUT-961) Modify the Tree/Forest Visualizer on DF.

2013-06-04 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674222#comment-13674222
 ] 

Grant Ingersoll commented on MAHOUT-961:


I've started updating this.



[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673689#comment-13673689
 ] 

Grant Ingersoll commented on MAHOUT-627:


Hi Dhruv,

Thanks for the response.  We are trying to get 0.8 out in the next week or two.  
Any help on a short example as well as updating the code to trunk would be 
awesome.

Thanks,
Grant

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.8
>
> Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: dku...@ecs.umass.edu 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
> because of its superior numerical stability and its ability to guarantee the 
> discovery of a locally maximal Maximum Likelihood Estimate in the 
> presence of incomplete training data. Currently, Apache Mahout has a 
> sequential implementation of the Baum-Welch which cannot be scaled to train 
> over large data sets. This restriction reduces the quality of training and 
> constrains generalization of the learned model when used for prediction. This 
> project proposes to extend Mahout's Baum-Welch to a parallel, distributed 
> version using the Map-Reduce programming framework for enhanced model fitting 
> over large data sets. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
> for applications generating temporal or spatial sequential data. Relative 
> simplicity of implementation, combined with their ability to discover latent 
> domain knowledge have made them very popular in diverse fields such as DNA 
> sequence alignment, gene discovery, handwriting analysis, voice recognition, 
> computer vision, language translation and parts-of-speech tagging. 
> An HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
> unobservable, hidden states emitting symbols from a finite observable 
> vocabulary set O according to a probabilistic model Theta. The parameters of 
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
> transition matrix of the hidden states of size |S| X |S|. The elements 
> a_(i,j) of A specify the probability of transitioning from a state i to state 
> j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
> elements b_(s, o) provide the probability that a symbol o will be emitted 
> from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
> determine the probability that the system starts in the hidden state s. The 
> transitions of hidden states are unobservable and follow the Markov property 
> of memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the 
> observation sequence, determine the probability that the model generated the 
> observed sequence. This is useful for evaluating the quality of the model and 
> is solved using the so called Forward algorithm. 
> 2. Decoding: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the hidden state sequence which generated the observed 
> sequence. This can be viewed as an inference problem where the model and 
> observed sequence are used to predict the value of the unobservable random 
> variables. The Viterbi decoding algorithm is used for predicting the 
> hidden state sequence. 
> 3. Training: Given the set of hidden states S, the set of observation 
> vocabulary O and the observation sequence, determine the parameters (A, B, 
> Pi) of the model Theta. This problem can be viewed as a statistical machine 
> learning problem of model fitting to a large set of training data. The 
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
> the Viterbi training algorithm are commonly used for model fitting. 
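For the evaluation problem, here is a minimal sequential sketch of the Forward algorithm using the (A, B, Pi) notation above. It is an illustration only, not Mahout's implementation: the class and method names are assumptions, and it omits the scaling or log-space arithmetic that a real implementation would need to avoid underflow on long sequences.

```java
// Illustrative sketch of the Forward algorithm; names are hypothetical, not Mahout's API.
final class ForwardAlgorithmSketch {

  // Returns P(obs | model) for transition matrix a (|S| x |S|),
  // emission matrix b (|S| x |O|) and initial distribution pi (length |S|).
  // Assumes obs contains at least one symbol index.
  static double likelihood(double[][] a, double[][] b, double[] pi, int[] obs) {
    int n = pi.length;
    double[] alpha = new double[n];
    // Initialisation: probability of starting in state s and emitting obs[0].
    for (int s = 0; s < n; s++) {
      alpha[s] = pi[s] * b[s][obs[0]];
    }
    // Induction: extend every partial path by one transition and one emission.
    for (int t = 1; t < obs.length; t++) {
      double[] next = new double[n];
      for (int j = 0; j < n; j++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
          sum += alpha[i] * a[i][j];
        }
        next[j] = sum * b[j][obs[t]];
      }
      alpha = next;
    }
    // Termination: sum over all possible final states.
    double total = 0.0;
    for (double v : alpha) {
      total += v;
    }
    return total;
  }

  public static void main(String[] args) {
    double[][] a = {{0.7, 0.3}, {0.4, 0.6}};  // made-up transition matrix
    double[][] b = {{0.9, 0.1}, {0.2, 0.8}};  // made-up emission matrix
    double[] pi = {0.6, 0.4};                 // made-up initial distribution
    System.out.println(likelihood(a, b, pi, new int[] {0, 1, 0}));
  }
}
```

Each observation sequence is processed independently here, which is the property a map-side computation over many training sequences could exploit.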
> In general, the quality of HMM training can be improved by employing large 
> training vectors but currently, Mahout only supports se

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673143#comment-13673143
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

bq. I can submit a patch in the next couple of days once I find a little free 
time

Awesome.  Please do.  I've got momentum on this issue and towards 0.8, so if 
you can submit it soon, that would be great.  Don't worry about code style; I 
can take care of that and any other pieces. If you have the gist of it working, 
then you'll save me a bit of time.

bq. different earlier patch

Ah, my confusion...  Sorry about that.

> clusterpp is not writing directories for all clusters
> -
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Matt Molek
>Assignee: Grant Ingersoll
>  Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default Hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
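As a rough check of the partitioner theory above, the following sketch applies the formula used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, to the two hash codes reported in the description. It assumes clusterpp keys its reduce input on the cluster name as a Text and relies on the default partitioner, which the description only suspects. The class and method names are made up for illustration and the sketch does not depend on Hadoop.

```java
// Sketch only: reproduces the arithmetic of Hadoop's default HashPartitioner
// without requiring Hadoop on the classpath. Names here are hypothetical.
final class HashPartitionSketch {

  // Same formula as the default hash partitioner: clear the sign bit, then take the remainder.
  static int partitionFor(int keyHash, int numReduceTasks) {
    return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    int hashA = -685560454; // reported hash of VL-3742464
    int hashB = -685560452; // reported hash of VL-3742466
    int reducers = 2;       // k = 2, one reducer per cluster
    System.out.println("VL-3742464 -> reducer " + partitionFor(hashA, reducers));
    System.out.println("VL-3742466 -> reducer " + partitionFor(hashB, reducers));
  }
}
```

Under those assumptions both keys map to reducer 0 when k = 2, which would leave the other reducer with an empty part-r-* file, consistent with the symptom reported above.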



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673071#comment-13673071
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

Matt, out of curiosity, what's your use case for the clusterpp?  [~robinanil] 
and I are both looking at this code and wondering why it is useful to separate 
out the clusters into their own directory.  MAHOUT-843 doesn't shed any light 
on it for us either.

Also, I don't think the current patch partitions correctly.  For instance, try 
a numPartitions of 2 and cluster ids of 153 and 53.  Then 10^1 means that 
153 % 10 and 53 % 10 both equal 3, and you have a collision.  So I think I'm 
back to my original thought: in the mappers and reducers, we need to load up 
the cluster ids and map them there.
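The collision is easy to check in isolation. The sketch below is not the patch's code. It only assumes, as the comment above does, a scheme that reduces a cluster id modulo a power of ten, and the helper name is hypothetical.

```java
// Sketch of the collision described above; lowDigits is a hypothetical stand-in
// for any partitioning scheme that keeps only the low decimal digits of the id.
final class ModuloCollisionSketch {

  static int lowDigits(int clusterId, int powerOfTen) {
    return clusterId % powerOfTen;
  }

  public static void main(String[] args) {
    int modulus = 10; // 10^1, as in the example above
    System.out.println("153 -> " + lowDigits(153, modulus));
    System.out.println(" 53 -> " + lowDigits(53, modulus));
  }
}
```

Any two ids that share their trailing digits collide this way, regardless of how many reducers are configured.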



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672981#comment-13672981
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

OK, I read up on partitioners and I'd agree, Matt, this is effectively hadoop's 
way of doing what I proposed and doesn't pollute the M/R code, so I'm going to 
go forward w/ your patch.



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672836#comment-13672836
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

bq. Well yes, it is a bug. I've reproduced it on a real cluster (that's what 
led me to originally post this jira)

:-)  Yeah, just confirming it.  We get a lot of non-bugs reported.  I wonder if 
we used to just sequentially dole out cluster ids and that changed w/ the 
clustering refactoring.

{quote}That would only happen in the situation where the clusters are numbered 
1 to k or some other convenient numbering. That is rarely, if ever, the case.
The only way I could think to get this working is to temporarily remap the 
cluster ids to a more convenient numbering that would play well with the hash 
partitioner{quote}

I don't know a lot about partitioners just yet, which makes me think they 
might be heavy-handed here. It occurs to me that we can take advantage of the 
fact that the number of clusters is small: during setup, simply load up the 
cluster id map and create the "convenient numbering" 0 to n-1 (where n is the 
number of clusters) for writing during the reduce phase.

Then, in {code}movePartFilesToRespectiveDirectories{code}, the part files 
should get renamed appropriately.

Would that work?
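A minimal sketch of that "convenient numbering", assuming the cluster ids can be collected up front as integers. The class and method names are hypothetical and are not part of Mahout. In the real job the map would presumably be built during mapper/reducer setup.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper: maps arbitrary cluster ids onto the dense range 0..n-1.
final class ClusterIdRemapSketch {

  // Sorts the distinct ids so that every task derives the same numbering.
  static Map<Integer, Integer> denseIndex(Iterable<Integer> clusterIds) {
    TreeSet<Integer> sorted = new TreeSet<>();
    for (Integer id : clusterIds) {
      sorted.add(id);
    }
    Map<Integer, Integer> index = new LinkedHashMap<>();
    for (Integer id : sorted) {
      index.put(id, index.size()); // next free slot: 0, 1, 2, ...
    }
    return index;
  }

  public static void main(String[] args) {
    // Cluster ids taken from the examples in this issue.
    Map<Integer, Integer> index = denseIndex(Arrays.asList(3742466, 153, 3742464, 53));
    for (Map.Entry<Integer, Integer> e : index.entrySet()) {
      System.out.println("cluster " + e.getKey() + " -> slot " + e.getValue());
    }
  }
}
```

With n reducers and the slot used directly as the partition number, each cluster would get its own part file, so no two ids can collide.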




[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672718#comment-13672718
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

The code assumes that each cluster id ends up in a different part file: the 
number of reducers is set to the number of clusters, which is supposed to mean 
one output part file per reducer (i.e. per cluster id). That isn't happening, 
at least in the simple testing I'm doing in pseudo M/R mode with the commands 
below.  Can someone test this on a real Hadoop cluster, as I don't have access 
to one right at the moment?  At least in the non-cluster env, the workaround is 
to run in sequential mode.


{quote}
bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job -x 25 
-cd 5 -t1 50 -t2 10 -dm 
org.apache.mahout.common.distance.EuclideanDistanceMeasure -i 
/path/content/synthetic_control.data  -ow -o output -cl
{quote}
and
{quote}
... 
org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver
 -i output -o output/postMR
{quote}



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672669#comment-13672669
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

Hmm, so this works in Sequential mode, but not in MapReduce mode.



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672658#comment-13672658
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

This seems like a flat out bug in the ClusterPP, since it says it is supposed 
to write separate directories, so it doesn't seem to me like we need to add new 
classes here, but instead should fix the bug.  Still looking.



[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672657#comment-13672657
 ] 

Grant Ingersoll commented on MAHOUT-1103:
-

I can reproduce this.



[jira] [Assigned] (MAHOUT-1103) clusterpp is not writing directories for all clusters

2013-06-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned MAHOUT-1103:
---

Assignee: Grant Ingersoll  (was: Paritosh Ranjan)

> clusterpp is not writing directories for all clusters
> -
>
> Key: MAHOUT-1103
> URL: https://issues.apache.org/jira/browse/MAHOUT-1103
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Matt Molek
>Assignee: Grant Ingersoll
>  Labels: clusterpp
> Fix For: 0.8
>
> Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to 
> populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that 
> fails to produce directories there is an empty part-r-* file in the output 
> directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 
> 2clusters/pca-clusters -dm 
> org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 
> -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 
> 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 
> containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by 
> the default Hadoop hash partitioner. The hashes of these two clusters aren't 
> identical, but they are close. Putting both cluster names into a Text and 
> calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.



[jira] [Resolved] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor

2013-06-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved MAHOUT-966.


Resolution: Not A Problem

This is actually behaving correctly.  Here's what I did:

# bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job -x 25 
-cd 5 -t1 50 -t2 10 -dm 
org.apache.mahout.common.distance.EuclideanDistanceMeasure -i 
/path/to/synthetic_control.data  -ow -o output -cl
# Independently, do:
## bin/mahout clusterdump -i output/clusters-5-final/ -p output/clusteredPoints 
-o /tmp/clusterdump.txt
## For clusterPP
### bin/mahout clusterpp -i output -o output/post
### bin/mahout seqdumper -i output/post/0/part-r-0 --facets

Both report 5 clusters total.  
For clusterpp, seqdumper reports the following number of points per cluster:
{quote}
-Facets---
Key Count
0   145
101 31
104 25
200 199
300 200
{quote}

For clusterdump, I see:
{quote}
MSV-0{n=145 
MSV-101{n=31
MSV-104{n=25
MSV-200{n=199
MSV-300{n=200
{quote}
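
The comparison above was done by eye; a minimal sketch of the same check in code follows. It is not part of Mahout: the class ClusterCountCheck and its methods are hypothetical, and the input lines are simply the clusterdump and seqdumper figures quoted in this comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClusterCountCheck {
    // Matches clusterdump lines such as "MSV-101{n=31" or "MSV-287{ n=90 c=[...".
    private static final Pattern CLUSTER_LINE =
        Pattern.compile("MSV-(\\d+)\\{\\s*n=(\\d+)");

    // Returns a map of cluster id -> point count parsed from clusterdump lines.
    static Map<Integer, Integer> parseClusterDump(String[] lines) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = CLUSTER_LINE.matcher(line);
            if (m.find()) {
                counts.put(Integer.parseInt(m.group(1)),
                           Integer.parseInt(m.group(2)));
            }
        }
        return counts;
    }

    // Facet counts copied from the seqdumper output quoted above.
    static Map<Integer, Integer> facetCounts() {
        Map<Integer, Integer> facets = new LinkedHashMap<>();
        facets.put(0, 145);
        facets.put(101, 31);
        facets.put(104, 25);
        facets.put(200, 199);
        facets.put(300, 200);
        return facets;
    }

    public static void main(String[] args) {
        String[] dump = {
            "MSV-0{n=145 ", "MSV-101{n=31", "MSV-104{n=25",
            "MSV-200{n=199", "MSV-300{n=200"
        };
        boolean same = parseClusterDump(dump).equals(facetCounts());
        System.out.println("counts match: " + same);
    }
}
```

With the figures quoted here, the per-cluster counts from the two tools agree, which is the basis for the "Not A Problem" resolution.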

> Mismatch in the number of points given by the clusterDumper and 
> ClusterOutputPostProcessor
> --
>
> Key: MAHOUT-966
> URL: https://issues.apache.org/jira/browse/MAHOUT-966
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.6
> Environment: hadoop 0.20.2 mahout 0.6 
>Reporter: Gaurav Redkar
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: cluster-dumper-output.txt, clusterpp-output.txt, 
> mtestdata.txt, points100dCCNorm.txt
>
>
>  After running the post processor the number of points that each cluster 
> contains is not matching the number of points each cluster should contain as 
> stated by clusterdumper.
>  
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n reported in clusters-n-final for each cluster differs from 
> the number of points actually contained in the directory for that cluster. Any 
> idea why this is happening?



[jira] [Work started] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor

2013-06-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-966 started by Grant Ingersoll.

> Mismatch in the number of points given by the clusterDumper and 
> ClusterOutputPostProcessor
> --
>
> Key: MAHOUT-966
> URL: https://issues.apache.org/jira/browse/MAHOUT-966
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.6
> Environment: hadoop 0.20.2 mahout 0.6 
>Reporter: Gaurav Redkar
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: cluster-dumper-output.txt, clusterpp-output.txt, 
> mtestdata.txt, points100dCCNorm.txt
>
>
>  After running the post processor the number of points that each cluster 
> contains is not matching the number of points each cluster should contain as 
> stated by clusterdumper.
>  
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n reported in clusters-n-final for each cluster differs from 
> the number of points actually contained in the directory for that cluster. Any 
> idea why this is happening?



[jira] [Updated] (MAHOUT-1108) cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true

2013-06-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-1108:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed.  Reopen if it doesn't work on Hadoop.  KDD at UCI is down at the 
moment.

> cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true
> ---
>
> Key: MAHOUT-1108
> URL: https://issues.apache.org/jira/browse/MAHOUT-1108
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
>Reporter: Elmer Garduno
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1108.patch, MAHOUT-1108.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Got the following exception when running the command with HADOOP_CONF and  
> HADOOP_CONF_DIR
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/hadoop/util/ProgramDriver
>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.util.ProgramDriver
>   at java.net.URLClassLoader$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(Unknown Source)
>   at java.lang.ClassLoader.loadClass(Unknown Source)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>   at java.lang.ClassLoader.loadClass(Unknown Source)
>   ... 1 more



[jira] [Commented] (MAHOUT-1108) cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true

2013-06-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672605#comment-13672605
 ] 

Grant Ingersoll commented on MAHOUT-1108:
-

I'm going to commit.  [~ssc], can you test on Hadoop when you get a chance 
before the release?

> cluster-reuters.sh executes seqdirectory with MAHOUT_LOCAL=true
> ---
>
> Key: MAHOUT-1108
> URL: https://issues.apache.org/jira/browse/MAHOUT-1108
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.7
>Reporter: Elmer Garduno
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.8
>
> Attachments: MAHOUT-1108.patch, MAHOUT-1108.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Got the following exception when running the command with HADOOP_CONF and  
> HADOOP_CONF_DIR
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/hadoop/util/ProgramDriver
>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.util.ProgramDriver
>   at java.net.URLClassLoader$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(Unknown Source)
>   at java.lang.ClassLoader.loadClass(Unknown Source)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>   at java.lang.ClassLoader.loadClass(Unknown Source)
>   ... 1 more


