Re: [jira] Commented: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-05-04 Thread Jeff Eastman
+1. I read over most of the patch and see three distinct patterns: - replacing ad-hoc string arguments that name file paths with Path objects - replacing ad-hoc temp file allocation, deallocation with a uniform mechanism - whitespace formatting differences between your and my formatters Kudos

Re: Canopy Clustering not scaling

2010-05-02 Thread Jeff Eastman
You could try using more, smaller input splits, but large datasets and too-small distance thresholds will choke up the mappers with the number of canopies approaching the number of points seen by the mapper. Also the single reducer will choke unless the thresholds allow condensing the mapper

Wiki Access

2010-05-02 Thread Jeff Eastman
I can't seem to log into the wiki any more and two password reset attempts have failed to produce the promised password email (I checked my spam filter too). Does anybody have enough karma to help me out? Jeff

Re: Wiki Access

2010-05-02 Thread Jeff Eastman
I saw that email too, but confluence appears to be working. I've sent a request to infrastructure... On 5/2/10 9:12 AM, Robin Anil wrote: I believe they are upgrading confluence. I got an email about it yesterday On Sun, May 2, 2010 at 9:40 PM, Jeff Eastman jeast...@windwardsolutions.com

Re: Quickstart for kMeans

2010-05-02 Thread Jeff Eastman
Indeed, the wiki is pretty out of date in some areas and the actual APIs have changed (since 2008!). For users wishing to launch clustering jobs using trunk I suggest checking out utils TestCDbwEvaluator and TestClusterDumper which employ the latest versions. These do not use the command-line

Re: Canopy Clustering not scaling

2010-05-02 Thread Jeff Eastman
These sorts of optimizations could delay the growth of canopy clusters in situations where the clustering thresholds are set too low for the dataset. At some point the mapper would still OME with enough points if all become clusters. That decision rests with the T2 threshold which determines
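As a rough illustration of the T2 point above: the sketch below is not Mahout's CanopyClusterer. It is a minimal stand-in, assuming 1-d points, absolute difference as the distance measure, and the rule that a point starts a new canopy only when it is not within T2 of any existing canopy center. The class and method names (CanopySketch, countCanopies) are hypothetical, and T1 is left out because it does not affect how many canopies are created.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Mahout code: counts how many canopies a mapper
// would end up holding for a given T2 under the assumptions stated above.
class CanopySketch {

    // A point becomes a new canopy center unless it lies within t2
    // of a center that already exists.
    static int countCanopies(double[] points, double t2) {
        List<Double> centers = new ArrayList<>();
        for (double p : points) {
            boolean stronglyBound = false;
            for (double c : centers) {
                if (Math.abs(p - c) < t2) {
                    stronglyBound = true;
                    break;
                }
            }
            if (!stronglyBound) {
                centers.add(p);
            }
        }
        return centers.size();
    }

    public static void main(String[] args) {
        double[] points = {0.0, 0.1, 0.2, 5.0, 5.1, 5.2};
        // A T2 that suits the data condenses the points into a few canopies.
        System.out.println("T2=1.0  canopies: " + countCanopies(points, 1.0));
        // A T2 that is too small turns every point into its own canopy.
        System.out.println("T2=0.05 canopies: " + countCanopies(points, 0.05));
    }
}
```

With a too-small T2 the list of centers grows with the number of points seen, which is the memory pressure described in the message.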

Intermittent Test Failure: testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)

2010-04-29 Thread Jeff Eastman
The surefire report seems to indicate this might be a timing problem with hdfs being lazy. Sometimes it passes and sometimes it fails, but of course, right at the end of the 15 min core tests which makes it especially annoying. Any resolution possible?

Similarity Tests Failing since 939074?

2010-04-28 Thread Jeff Eastman
Failed tests: testSimple(org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarityTest) testSimpleItem(org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarityTest) testNoCorrelation1(org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarityTest)

Re: NamedVector Run Amok?

2010-04-27 Thread Jeff Eastman
27, 2010 at 10:54 PM, Jeff Eastman j...@windwardsolutions.com wrote: Hi Sean, I was under the impression that the recently refactored NamedVectors would be just another kind of Vector and that they would not need to show up in method signatures unless there really was a requirement

[jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

2010-04-26 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860981#action_12860981 ] Jeff Eastman commented on MAHOUT-236: Ok, the above patch was committed on the 21st

[jira] Commented: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-04-26 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861194#action_12861194 ] Jeff Eastman commented on MAHOUT-297: I don't understand why the constructors

[jira] Issue Comment Edited: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-04-26 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861194#action_12861194 ] Jeff Eastman edited comment on MAHOUT-297 at 4/26/10 9:15 PM

Re: Mahout In Action

2010-04-23 Thread Jeff Eastman
version somewhere that I could get working again on trunk? On 4/23/10 9:10 AM, Sean Owen wrote: Good eye, this was fixed in the manuscript a while ago. I will ping Manning to re-publish Chapters 1-6 since a lot of small updates have happened since then. On Fri, Apr 23, 2010 at 4:53 PM, Jeff

Re: TLP Status

2010-04-21 Thread Jeff Eastman
Yeay team! On 4/21/10 1:09 PM, Grant Ingersoll wrote: The Board has approved Mahout, Tika, and Nutch moving to be top level status. Congrats! Now begins the fun part of changing mailing lists, domains, etc. -Grant

Re: [Idea] Support Facebook Opengraph JSON format as an input

2010-04-21 Thread Jeff Eastman
Mahout Vectors and Clusters currently support JSON encodings for input and output. What else is needed? Jeff On 4/21/10 4:18 PM, Robin Anil wrote: The details are not clear at the moment. But, I am sure this will help adoption of the mahout quickly. Things to do. Parse JSON and make the

[jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

2010-04-20 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859027#action_12859027 ] Jeff Eastman commented on MAHOUT-236: I'm running into a challenge integrating Fuzzy

[jira] Updated: (MAHOUT-236) Cluster Evaluation Tools

2010-04-20 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-236: Attachment: MAHOUT-236.patch Added a mean shift clustering job and now it works for CDbw too

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
CanopyClusterer.emitPointToExistingCanopies emits clusterId :: VectorWritable On 4/18/10 10:07 AM, Jake Mannix wrote: In code we already have? -jake On Apr 18, 2010 9:53 AM, Jeff Eastman j...@windwardsolutions.com wrote: I can think of situations where I need to use a clusterId as the

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
Also mean shift clustering relies on vector identities and tbd emitting its clustered points for CDbw would need to retain them. On 4/18/10 10:22 AM, Jeff Eastman wrote: CanopyClusterer.emitPointToExistingCanopies emits clusterId :: VectorWritable On 4/18/10 10:07 AM, Jake Mannix wrote

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
Sure, maybe just initialize names to "" instead of null? private String name = ""; On 4/18/10 10:45 AM, Jake Mannix wrote: Ok this is a good concrete example, I like concrete. :) I'm still very wary of having to have some mapper or reducer classes deal with LabeledVector some deal with just

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
+1 NamedVector seems a lot like VectorView. I'm comfortable enough with this proposal for Sean to go forward with it grin. I agree with separating the naming/identifying/labeling into a separate wrapper class so that vectors themselves can be pure mathematical entities. Unifying as many as

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Jeff Eastman
Looking at the KMeansClusterer.outputPointWithClusterInfo it seems this code will have to change in the patch but I haven't yet looked: String name = point.getName(); String key = (name != null) && (name.length() != 0) ? name : point.asFormatString(); output.collect(new Text(key),
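The key selection quoted above can be read as a small fallback rule: use the vector's name when it is present and non-empty, otherwise fall back to its formatted string. The sketch below isolates just that rule; it is only an illustration, the class PointKeySketch and the method keyFor are hypothetical, and the formatString argument stands in for whatever point.asFormatString() would return.

```java
// Hypothetical sketch of the name-or-format-string fallback; not Mahout code.
class PointKeySketch {

    // Returns the name when it is non-null and non-empty, otherwise the
    // supplied formatted representation of the point.
    static String keyFor(String name, String formatString) {
        return (name != null && name.length() != 0) ? name : formatString;
    }

    public static void main(String[] args) {
        String formatted = "[1.0, 2.0]"; // placeholder for asFormatString() output
        System.out.println(keyFor("doc42", formatted)); // named point
        System.out.println(keyFor("", formatted));      // empty name
        System.out.println(keyFor(null, formatted));    // no name at all
    }
}
```

If names were removed from the base vectors, as discussed in this thread, only the first branch of this rule would need a new source for the name.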

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Jeff Eastman
Are you thinking of replacing our Writable or Json (asFormatString) encodings? Certainly, using Avro as an I/O format for clustering would improve their utility for other languages. Seems like a major rewrite to replace Writable within our MR jobs. On 4/17/10 9:10 AM, Ted Dunning wrote: IF

Re: mahout/solr integration

2010-04-16 Thread Jeff Eastman
On 4/16/10 10:05 AM, Sean Owen wrote: Clojure isn't my cup of tea but that's not important. It's an interesting question, how much belongs under the Mahout tent? There's a tradeoff between excluding useful extensions to the project on the one hand, and becoming a spare parts bin of code of

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Jeff Eastman
Ted Dunning wrote: On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen sro...@gmail.com wrote: I would actually prefer ripping names out of the base vectors entirely. They should decorate the mathematical vector, but as their use is decidedly non-mathematical and application specific

Re: VOTE: take 2: mahout-collections-1.0

2010-04-13 Thread Jeff Eastman
Benson Margulies wrote: https://repository.apache.org/content/repositories/orgapachemahout-015/ contains (this time for sure) all the artifacts for release 1.0 of the mahout-collections component. This is the first independent release of collections from the rest of mahout; it differs from the

Re: VOTE: release mahout-collections-codegen 1.0

2010-04-07 Thread Jeff Eastman
Benson Margulies wrote: In order to decouple the mahout-collections library from the rest of Mahout, to allow more frequent releases and other good things, we propose to release the code generator for the collections library as a separate Maven artifact. (Followed, in short order, by the

[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-04-07 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854742#action_12854742 ] Jeff Eastman commented on MAHOUT-270: r931372 renames Printable to Cluster and adds

[jira] Commented: (MAHOUT-339) Class Cast Exception Running Synthetic Control MeanShift Clustering Job

2010-04-02 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852948#action_12852948 ] Jeff Eastman commented on MAHOUT-339: Problem with example was introduced by a recent

Re: [DISCUSS] Mahout TLP Board Resolution

2010-03-17 Thread Jeff Eastman
of the Apache Mahout Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Mahout Project: • Isabel Drost (isa...@...) • Ted Dunning (tdunn...@...) • Jeff Eastman (jeast

Re: Significance of name in AbstractVector

2010-03-17 Thread Jeff Eastman
Jake Mannix wrote: On Wed, Mar 17, 2010 at 6:14 AM, Jeff Eastman j...@windwardsolutions.com wrote: Pallavi Palleti wrote: Hi, Could someone kindly let me know the significance of instance variable name in AbstractVector? It is causing problems, when I write a vector to file and read

Working With Maven in Eclipse

2010-03-17 Thread Jeff Eastman
When I run mvn eclipse:eclipse it generates .classpath and .project files in each of the mahout module directories. The last time I did this I manually merged the .classpath library declarations from each module into the main project's .classpath and that made Eclipse happy. This was really

Re: Working With Maven in Eclipse

2010-03-17 Thread Jeff Eastman
Drew Farris wrote: On Wed, Mar 17, 2010 at 3:07 PM, Jeff Eastman j...@windwardsolutions.com wrote: Are any of you using this IDE in a more automatic way? I use eclipse Galileo and m2eclipse 0.9 and the 'import maven projects' feature. I check out the mahout sources into my workspace

[jira] Created: (MAHOUT-339) Class Cast Exception Running Synthetic Control MeanShift Clustering Job

2010-03-17 Thread Jeff Eastman (JIRA)
: Bug Components: Clustering Affects Versions: 0.3 Reporter: Jeff Eastman Priority: Critical Fix For: 0.4 Mar 17, 2010 2:15:00 PM org.apache.hadoop.mapred.LocalJobRunner$Job run WARNING: job_local_0002 java.lang.ClassCastException

Re: [DISCUSS] Mahout TLP Board Resolution

2010-03-17 Thread Jeff Eastman
I'm going to be on holiday in Mexico with unknown Internet connectivity for the next week. Please record my +1 vote on this resolution. Jeff Grant Ingersoll wrote: On Mar 17, 2010, at 9:51 AM, Jeff Eastman wrote: Hi Grant, This version still has the old copy/paste problem

Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Jeff Eastman
Grant Ingersoll wrote: It usually takes 24 hours. Just follow the release dirs and we'll be good. Tomorrow is a great day for a Mahout announcement! Maybe we can change the logo to be green for tomorrow. I've got corned beef n' cabbage on the boil so a bit o' green works for me. Maybe

Re: [DISCUSS] Mahout TLP Board Resolution

2010-03-15 Thread Jeff Eastman
committers: • Isabel Drost (isa...@...) • Ted Dunning (tdunn...@...) • Jeff Eastman (jeast...@...) • Drew Farris (d...@...) • Otis Gospodnetic (o...@...) • Grant Ingersoll (gsing...@...) • Sean Owen (sro...@...) • Karl Wettin (ka

Re: A mahout logo Revamp

2010-03-13 Thread Jeff Eastman
Robin Anil wrote: This one is with a blue elephant :P https://issues.apache.org/jira/secure/attachment/12438704/mahout-blueE-200.png +1 to this one. I like the yellow in the mahout(s) as it stands out more

Re: A mahout logo Revamp

2010-03-13 Thread Jeff Eastman
I can't see any difference between #3 and #4 but I do like the mahout with hair and arms. The avatar blue person color is growing on me too, especially when I put my 3d glasses on grin. The brownish elephant is nice and so is the yellow one. Have you tried a gray elephant? Sorry I'm not

Re: getLengthSquared() method in AbstractVector

2010-03-12 Thread Jeff Eastman
I think if you mark the instVar as transient then Json won't include its state in the JsonString. Sean Owen wrote: It seems like the length squared should not be part of the string representation. Does anyone know how to control this Gson output formatter to ignore this field? The

Build OME?

2010-03-11 Thread Jeff Eastman
I'm getting a consistent compiler heap overflow during mvn clean install on one of two machines with the last commit. Ironically, my MacBook Pro compiles and my Mac Pro does not. Both compiled before the commit.

Re: Build OME?

2010-03-11 Thread Jeff Eastman
Benson Margulies wrote: Did you get the number of that commit? The very last commit was my release arranging, and it's pretty hard to see how it could have that effect. On Thu, Mar 11, 2010 at 7:54 PM, Jeff Eastman j...@windwardsolutions.com wrote: I'm getting a consistent compiler heap

Re: [jira] Created: (MAHOUT-315) VectorDumper should also do printing to simple {index : value, index : value, ... } output, if no dictionary is specified.

2010-03-01 Thread Jeff Eastman
And check the asFormatString(bindings) implementation in ClusterBase. It does this I think, though it has not yet been wired into ClusterDumper.printClusters. I wanted to give the ClusterDumper users a chance to critique my formatting but it is like the below. Jeff Jake Mannix (JIRA)

Re: [jira] Created: (MAHOUT-315) VectorDumper should also do printing to simple {index : value, index : value, ... } output, if no dictionary is specified.

2010-03-01 Thread Jeff Eastman
wrote: It already does this, i think. But floats can be formatted better On Tue, Mar 2, 2010 at 2:55 AM, Jeff Eastman j...@windwardsolutions.com wrote: And check the asFormatString(bindings) implementation in ClusterBase. It does this I think, though it has not yet been wired

Re: [jira] Created: (MAHOUT-315) VectorDumper should also do printing to simple {index : value, index : value, ... } output, if no dictionary is specified.

2010-03-01 Thread Jeff Eastman
they should all be Printable too (the latter two are already). That would let us refactor VectorDumper into AbstractVector and clean up another code duplication. On Tue, Mar 2, 2010 at 3:16 AM, Jeff Eastman j...@windwardsolutions.com wrote: The loop still needs to be closed in order to unify

Re: [jira] Created: (MAHOUT-315) VectorDumper should also do printing to simple {index : value, index : value, ... } output, if no dictionary is specified.

2010-03-01 Thread Jeff Eastman
. -jake On Mon, Mar 1, 2010 at 1:36 PM, Robin Anil robin.a...@gmail.com wrote: It already does this, i think. But floats can be formatted better On Tue, Mar 2, 2010 at 2:55 AM, Jeff Eastman j...@windwardsolutions.com wrote: And check the asFormatString(bindings) implementation

Re: 0.3 release issues

2010-02-26 Thread Jeff Eastman
I'm +1 on getting these changes in asap, but +0 on whether to do them during code freeze. I'm pretty confident Robin can pull it off, but it is code freeze. Jeff Robin Anil wrote: Hi guys, I have some patches ready, this cleans up our clustering code, gets the examples running,

[Fwd: Re: About Display Code]

2010-02-24 Thread Jeff Eastman
AbstractVector.minus has a bug in the first if clause. Don't know if my fix or this one would do what is intended by the optimization: if (x instanceof RandomAccessSparseVector || x instanceof DenseVector) { // TODO: if both are RandomAccess check the numNonDefault to determine which

Re: [Fwd: Re: About Display Code]

2010-02-24 Thread Jeff Eastman
Jake Mannix wrote: why is this not showing up in the unit tests? On Wed, Feb 24, 2010 at 6:36 PM, Jeff Eastman jeast...@windwardsolutions.com wrote: AbstractVector.minus has a bug in the first if clause. Don't know if my fix or this one would do what is intended by the optimization

Re: 0.3 release issues

2010-02-23 Thread Jeff Eastman
+1 from me too Ted Dunning wrote: +1 to code freeze, waiting for hadoop release and testing the RC On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote: On Tue Grant Ingersoll gsing...@apache.org wrote: On Feb 23, 2010, at 9:18 AM, Sean Owen wrote: It does

Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Jeff Eastman
If the Vector-MSCanopy pre-job outputs all of its canopies then each of those canopies would contain the generated canopyId and its canopy center would contain the original vector with its docId. Seems like one could use that data set to get the membership information in a separate

Re: [jira] Updated: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-22 Thread Jeff Eastman
Robin Anil wrote: after the List<Vector> -> List<canopyId> optimization. I did that in the patch. Take a look :) +1 Simply marvelous

Re: [jira] Created: (MAHOUT-304) MeanShift doesn't read from VectorWritable

2010-02-21 Thread Jeff Eastman
+1. This will then enable a small step forwards towards reducing the memory footprint of MeanShiftCanopy.boundPoints by allowing the List<Vector> to be replaced by List<Integer>. The boundPoints don't need to be accumulated at all if one is only interested in the resulting cluster centers, but

Re: Profiling SequentialAccessSparseVector

2010-02-20 Thread Jeff Eastman
+1 to upgrade, addTo did not exist when clustering was written. Should be pretty easy to upgrade it though. Robin Anil wrote: ah! It's not being used anywhere :). Should we make that a big task before 0.3 ? Sweep through code (mainly clustering) and change all these things. Robin On Fri, Feb

Re: Profiling SequentialAccessSparseVector

2010-02-20 Thread Jeff Eastman
/browse/MAHOUT-297 Robin On Sat, Feb 20, 2010 at 5:44 PM, Jeff Eastman j...@windwardsolutions.com wrote: +1 to upgrade, addTo did not exist when clustering was written. Should be pretty easy to upgrade it though. Robin Anil wrote: ah! It's not being used anywhere :). Should we make

Re: Fuzzy K Means

2010-02-18 Thread Jeff Eastman
Very similar, especially when you consider that k-means only adds the whole point value to the single, closest cluster (i.e. weightedPointTotal += 1), whereas fuzzy adds it partially to all. I don't think the other clustering routines require/expect numPoints to be an integer and the instvar
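To make the comparison above concrete, here is a minimal sketch of the two weighting schemes. It is not taken from the thread or from Mahout's sources: the fuzzy weights use the standard fuzzy c-means membership formula u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), and the names MembershipSketch, hardWeights and fuzzyWeights are hypothetical. It assumes every distance is strictly positive and m > 1.

```java
// Hypothetical sketch comparing hard (k-means) and fuzzy membership weights.
class MembershipSketch {

    // k-means style: the whole point (weight 1) goes to the closest cluster.
    static double[] hardWeights(double[] distances) {
        int closest = 0;
        for (int i = 1; i < distances.length; i++) {
            if (distances[i] < distances[closest]) {
                closest = i;
            }
        }
        double[] weights = new double[distances.length];
        weights[closest] = 1.0;
        return weights;
    }

    // Fuzzy style: the point is shared among all clusters; the weights sum to 1.
    // Assumes all distances > 0 and fuzziness m > 1.
    static double[] fuzzyWeights(double[] distances, double m) {
        double[] weights = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double sum = 0.0;
            for (double dk : distances) {
                sum += Math.pow(distances[i] / dk, 2.0 / (m - 1.0));
            }
            weights[i] = 1.0 / sum;
        }
        return weights;
    }

    public static void main(String[] args) {
        double[] distances = {1.0, 2.0, 4.0}; // distances from one point to three centers
        System.out.println(java.util.Arrays.toString(hardWeights(distances)));
        System.out.println(java.util.Arrays.toString(fuzzyWeights(distances, 2.0)));
    }
}
```

Because the fuzzy weights are fractional, any per-cluster point total accumulated from them is naturally a double rather than an integer count.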

Re: Fuzzy K Means

2010-02-18 Thread Jeff Eastman
please take a look Robin On Wed, Feb 17, 2010 at 3:35 PM, Jeff Eastman j...@windwardsolutions.com wrote: Robin Anil wrote: Hadoop reuses the *same* instance whenever it uses readFields and I've been bitten more than once by assuming otherwise. Yep! That's our bug

Re: Fuzzy K Means

2010-02-17 Thread Jeff Eastman
Robin Anil wrote: Hadoop reuses the *same* instance whenever it uses readFields and I've been bitten more than once by assuming otherwise. Yep! That's our bug. Always assume mutability in Hadoop :) . I will see where the writable is causing the error. Best is if we could have some
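The pitfall described above can be reproduced without Hadoop. The sketch below uses no Hadoop classes: Holder is a hypothetical stand-in for a Writable whose readFields overwrites its state in place, and collect mimics a framework that refills one instance for every record. Keeping the reference, rather than a copy, leaves every stored entry showing the last record read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java illustration of instance reuse; no Hadoop API is used.
class ReusedInstanceSketch {

    // Stand-in for a Writable: one mutable object refilled for each record.
    static final class Holder {
        double value;

        void readFields(double v) {
            value = v; // overwrites the previous record's state in place
        }

        Holder copy() {
            Holder h = new Holder();
            h.value = value;
            return h;
        }
    }

    // Reads all records through a single reused Holder, storing either the
    // shared reference (copy == false) or a defensive copy (copy == true).
    static List<Holder> collect(double[] records, boolean copy) {
        Holder reused = new Holder();
        List<Holder> seen = new ArrayList<>();
        for (double r : records) {
            reused.readFields(r);
            seen.add(copy ? reused.copy() : reused);
        }
        return seen;
    }

    public static void main(String[] args) {
        double[] records = {1.0, 2.0, 3.0};
        System.out.println("kept reference, first value: " + collect(records, false).get(0).value);
        System.out.println("kept copy, first value:      " + collect(records, true).get(0).value);
    }
}
```

The defensive copy is the usual remedy whenever a value has to outlive the call in which it was delivered.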

Re: Fuzzy K Means

2010-02-16 Thread Jeff Eastman
Looks to me like the unit tests are the only calls to recomputeCenter, which is where the center is set. The clusterer seems to be calling computeCentroid, which sets the centroid, instead. I'm not sure why it needs both instance variables, as the pointProbSum and weightedPointTotal variables

Re: Fuzzy K Means

2010-02-16 Thread Jeff Eastman
to be identical (and especially if they are not all zeros). Jeff Robin Anil wrote: On Tue, Feb 16, 2010 at 10:25 PM, Jeff Eastman j...@windwardsolutions.com wrote: Looks to me like the unit tests are the only calls to recomputeCenter, which is where the center is set. The clusterer seems to be calling

Re: Fuzzy K Means

2010-02-16 Thread Jeff Eastman
I went to run the syntheticcontrol example (which uses MR) but there is no fuzzy version. It might be easier to debug if one was created from the kmeans job. It really feels to me like the problem lies somewhere in Writable handling and not in the ClusterBase refactoring. Jeff Eastman wrote

Re: Mahout as TLP

2010-02-15 Thread Jeff Eastman
+1 on Isabel's comments. Isabel Drost wrote: On Sat Grant Ingersoll gsing...@apache.org wrote: I don't see any harm in getting 0.3 out first if that makes folks more comfortable. Yeah, this feels better to me the more I think about it. +1 from me as well: I really like the

Re: VectorWritable bug et al

2010-02-12 Thread Jeff Eastman
+1 to all of Drew's comments Drew Farris wrote: +1 to eliminating the statics, they are indeed evil. The type to read should be stored in the thing doing/facilitating the reading not the vector itself and definitely not in a static field. Pretty sure vector shouldn't be facilitating the reading

Re: Some more dependencies

2010-02-10 Thread Jeff Eastman
Robin Anil wrote: any more +1s ? +1 keep Mahout as unentangled as possible

[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-02-09 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831678#action_12831678 ] Jeff Eastman commented on MAHOUT-270: r908235 commits the Printable interface

[jira] Issue Comment Edited: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-02-09 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831678#action_12831678 ] Jeff Eastman edited comment on MAHOUT-270 at 2/9/10 9:39 PM

Re: Math compile errors in Eclipse

2010-02-08 Thread Jeff Eastman
in IntelliJ too. I assume it's an artifact of reloading the pom.xml file and resetting some things. On Tue, Feb 9, 2010 at 12:06 AM, Jeff Eastman j...@windwardsolutions.com wrote: I'm getting a lot of compile errors in Eclipse after my most recent svn update today. The errors begin in the math module

[jira] Created: (MAHOUT-276) Alpha_0 mixture parameter is not implemented correctly in Dirichlet

2010-02-07 Thread Jeff Eastman (JIRA)
Components: Clustering Affects Versions: 0.2 Reporter: Jeff Eastman Assignee: Jeff Eastman I looked over the R reference code and alpha_0 is used in two places, not one as in the current implementation: - in state initialization beta = rbeta(K, 1, alpha_0) [K

[jira] Commented: (MAHOUT-276) Alpha_0 mixture parameter is not implemented correctly in Dirichlet

2010-02-07 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830740#action_12830740 ] Jeff Eastman commented on MAHOUT-276: The fix involves adding alpha_0 as an argument

[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-02-07 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830741#action_12830741 ] Jeff Eastman commented on MAHOUT-270: I'd like to deprecate the asFormatString

Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-05 Thread Jeff Eastman
Jeff Eastman wrote: Jeff Eastman wrote: Jeff Eastman wrote: Ted Dunning wrote: This could also be caused if the prior is very diffuse. This makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha

Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-04 Thread Jeff Eastman
Jeff Eastman wrote: Ted Dunning wrote: This could also be caused if the prior is very diffuse. This makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha. Could you elaborate more on the function

Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-03 Thread Jeff Eastman
Ted Dunning wrote: This could also be caused if the prior is very diffuse. This makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha. Could you elaborate more on the function of alpha in the algorithm?

[Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-02 Thread Jeff Eastman
Just noticed this didn't go to the list. ---BeginMessage--- Hi Jerry, I'm not sure why Dirichlet is doing that with this dataset and have not been able to get better results than you. I have gotten excellent results using it with other models on other datasets, so I'm pretty confident in the

[jira] Commented: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-01-31 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806873#action_12806873 ] Jeff Eastman commented on MAHOUT-270: In the beginning, vectors, canopies and clusters

[jira] Created: (MAHOUT-270) Make ClusterDumper dump Dirichlet clusters too

2010-01-27 Thread Jeff Eastman (JIRA)
Affects Versions: 0.2 Reporter: Jeff Eastman Assignee: Jeff Eastman Given the binary representation of models/clusters in Dirichlet, extend the ClusterDumper utility to dump out a printable representation of them too. -- This message is automatically generated by JIRA

Re: Build issue with last dirichlet change

2010-01-20 Thread Jeff Eastman
Sean Owen wrote: That last commit concerning the dirichlet code and models seems to cause the build to fail -- or else I'm the victim of another environment-specific issue. I note it only because the fix raises a question. It causes core/ to depend utils/, and I had thought that was not the

Re: Build issue with last dirichlet change

2010-01-20 Thread Jeff Eastman
20, 2010 at 4:56 PM, Jeff Eastman j...@windwardsolutions.com wrote: Sean Owen wrote: That last commit concerning the dirichlet code and models seems to cause the build to fail -- or else I'm the victim of another environment-specific issue. I note it only because the fix raises

Re: Build issue with last dirichlet change

2010-01-20 Thread Jeff Eastman
I will run the build before I commit. I will run the build before I commit. ... I will run the build before I commit. my bad Ted Dunning wrote: Our modules aren't working out as well as expected. On Wed, Jan 20, 2010 at 4:56 PM, Jeff Eastman j...@windwardsolutions.com wrote: Sean Owen

Re: Build issue with last dirichlet change

2010-01-20 Thread Jeff Eastman
The build compiles but org.apache.mahout.math.TestVectorWritable fails for some reason and it does not get to my test. Jeff Eastman wrote: I will run the build before I commit. I will run the build before I commit. ... I will run the build before I commit. my bad Ted Dunning wrote: Our

Re: Unit test lag?

2010-01-18 Thread Jeff Eastman
I'm planning on attending Jeff Grant Ingersoll wrote: On Jan 17, 2010, at 8:35 PM, Ted Dunning wrote: We should have a beer some time anyway and the beers we owe you for cleaning up Colt more than cancel any potential beer on this issue so I will be happy to buy (Sean, you are included

[jira] Resolved: (MAHOUT-251) Generalize Dirichlet models and model distributions to handle n-d and sparse vectors

2010-01-18 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman resolved MAHOUT-251. - Resolution: Fixed r900519 wrapped up loose ends in the patch, adding new command line arguments

Eclipse and Maven Don't Agree

2010-01-17 Thread Jeff Eastman
I've made some changes for MAHOUT-251 and all the tests run in Eclipse, but two of them fail when run from Maven. How can I poke mvn to give me more diagnostics?

Re: Eclipse and Maven Don't Agree

2010-01-17 Thread Jeff Eastman
Sean Owen wrote: Could be. I took an indirect stab at mitigating possible sources of this issue by increasing encapsulation in the tests -- I still believe fields should never be non-private. This may start to surface the behind-the-scenes dependencies and side effects that shouldn't be there.

Re: build failure

2010-01-17 Thread Jeff Eastman
I just did a successful mvn install on trunk without seeing any problems. My checkout is a couple of days old and there have been a few other commits in addition to mine since. Drew Farris wrote: Yes, I'm seeing this too. Deneche encountered it back when working with:

[jira] Created: (MAHOUT-251) Generalize Dirichlet models and model distributions to handle n-d and sparse vectors

2010-01-15 Thread Jeff Eastman (JIRA)
: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.2 Reporter: Jeff Eastman Assignee: Jeff Eastman Users attempting to use Dirichlet Process Clustering on real life problems cannot use any of the existing models or model

[jira] Updated: (MAHOUT-251) Generalize Dirichlet models and model distributions to handle n-d and sparse vectors

2010-01-15 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-251: Attachment: MAHOUT-251.patch This patch generalizes the 2-d dense models by introducing a new

Re: [jira] Updated: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API

2009-12-06 Thread Jeff Eastman
Issue Type: Improvement Components: Classification, Clustering, Collaborative Filtering, Frequent Itemset/Association Rule Mining, Genetic Algorithms, Matrix Affects Versions: 0.1 Reporter: Jeff Eastman Assignee: Jeff Eastman Fix For: 0.4

[jira] Updated: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API

2009-10-21 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-167: Attachment: MAHOUT-167.patch Work in progress patch which compiles most Canopy changes needed

Re: 0.2

2009-10-12 Thread Jeff Eastman
I'm inclined towards Sean's perspective. Making the kinds of significant changes to the vector implementation that 165 entails strikes me as non-trivial and likely to delay 0.2 significantly. I vote to not include it in this point release so that the functionality which is ready to go public

[jira] Commented: (MAHOUT-136) Change Canopy MR Implementation to use Vector Writable

2009-09-28 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760462#action_12760462 ] Jeff Eastman commented on MAHOUT-136: - I think this issue has been completed and should

Re: Unit Tests pretty slow?

2009-09-18 Thread Jeff Eastman
Some of the clustering unit tests that test the Hadoop jobs take a while to run through their iterations. This is on the order of a minute or two in some cases. I think testing the jobs should still be done in the pre-commit batch, since the commits really need them to pass successfully. Jeff

Re: 0.2 planning

2009-09-12 Thread Jeff Eastman
I propose leaving MAHOUT-167 out of 0.2 for the reasons which Sean mentioned previously. MAHOUT-136 is, afaict, done and can probably be closed. Grant, you had some comments in the issue; have they been resolved? Grant Ingersoll wrote: Here's the list of unresolved issues for 0.2:

Re: Yourkit License for all of you

2009-09-03 Thread Jeff Eastman
Robin Anil wrote: Dear Mahout Devs, Yourkit sales rep gave me my opensource license. If anyone would like to get one, I can aggregate and send all the requests to him. If you would like to have an opensource license of Yourkit Profiler, reply on this thread within 24 hours of this

Re: [jira] Commented: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API

2009-08-28 Thread Jeff Eastman
-167 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.1 Reporter: Jeff Eastman Assignee: Jeff Eastman Fix For: 0.2 We need to update the clustering implementations to remove the deprecated Hadoop

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-08 Thread Jeff Eastman
Grant Ingersoll (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740573#action_12740573 ] Grant Ingersoll commented on MAHOUT-121: bq. Please,

Re: Code quality enforcement?

2009-07-10 Thread Jeff Eastman
+1 Looks reasonable to me. I had to change my Lucene formatter profile to make lines and comments wrap at 120 vs. 80 in order for Eclipse to curtail most of its reformatting changes. A new line or two is still added/removed in the files I've checked but otherwise we are in synch. Sean Owen

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Jeff Eastman
Grant Ingersoll wrote: Isn't the KMeansJob pretty much redundant, assuming we add a parameter to KMeansDriver to take in the number of reduce tasks? The purpose of the clustering jobs, in general, was to simplify computing the clusters and then clustering the data. It has been applied - and

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Jeff Eastman
Ingersoll wrote: Check out the patch I just put up on M-138 On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote: Grant Ingersoll wrote: Isn't the KMeansJob pretty much redundant, assuming we add a parameter to KMeansDriver to take in the number of reduce tasks? The purpose of the clustering jobs

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Jeff Eastman
then more. How about you do Canopy and KMeans and I do the others, since those seem to be in your critical path at the current time. Jeff Grant Ingersoll wrote: On Jun 26, 2009, at 3:04 PM, Jeff Eastman wrote: That looks reasonable, just reading the patch. You might also want to put the clusters-x
