[jira] [Commented] (MAHOUT-1183) remove duplicate (masked) unused field

2013-03-31 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618271#comment-13618271 ] Ted Dunning commented on MAHOUT-1183: - Dave, Thanks and all for picking out

factorization machines as new project

2013-03-31 Thread Ted Dunning
Relative to Dan's recent mention of SOM as a possible new project, here are slides from KDD Cup 2012 in which Steffen Rendle describes how he did using a very straightforward implementation of Factorization Machines [1,2]. FMs are interesting in the context of Mahout because they can be used in a w

[jira] [Updated] (MAHOUT-1182) remove useless append

2013-03-30 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1182: Resolution: Fixed Status: Resolved (was: Patch Available) Committed in r1462882 (this

Re: Interest in Self Organizing Maps?

2013-03-30 Thread Ted Dunning
SOM doesn't have to be constrained to two dimensions. That said, there are bunches of non-linear embedding methods that are more current than SOM's. SOM's were part of the neural plausibility movement of the late 80's which more recently can be seen as an approach toward modern formulations of st

Re: MODERATE for dev@mahout.apache.org

2013-03-30 Thread Ted Dunning
aimnphidmmapikej...@mahout.apache.org > > To reject: > >dev-reject-1364573050.63309.haimnphidmmapikej...@mahout.apache.org > > To give a reason to reject: > > %%% Start comment > > %%% End comment > > > > > > > > -- Forwarded message

Re: Cloudera ML: New Open Source Libraries and Tools for Data Scientists

2013-03-29 Thread Ted Dunning
Pity that they don't bother to contribute back to Mahout itself. On Fri, Mar 29, 2013 at 11:28 AM, Sean Owen wrote: > Not sure if people saw this from Josh at Cloudera: > http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/ > https://github.com/cloudera/ml > > This is a nice sho

Re: streaming k-means vs minibatch k-means

2013-03-28 Thread Ted Dunning
I think (casually, informally) that the guarantees are preserved by simple ordering arguments. The argument goes that the threshold on partitions will grow less than if the partitions were handled sequentially. Thus the partial sketches should have at least as much fidelity as if the same segme

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

2013-03-28 Thread Ted Dunning
um clusters: 20; maxDistance: 1.029701 > > > > > > On Thu, Mar 28, 2013 at 6:45 PM, Dan Filimon < > dangeorge.fili...@gmail.com>wrote: > > > >> You know what's even more odd? When I used Mahout's KMeans, everything > >> was assigned to one sin

Re: Fwd: Neural Network and Restricted Boltzman Machine in Mahout

2013-03-28 Thread Ted Dunning
Cool! We need the help. On Thu, Mar 28, 2013 at 8:08 PM, Ray wrote: > I'll focus for now on contributing to documentation, possibly some > patches. See how the contribution process works, gain a little confidence > there first. (I do have a background in neural networks.) >

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-28 Thread Ted Dunning
It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be p

Re: Clustering 20newsgroups with StreamingKMeans [was How to improve clustering?]

2013-03-28 Thread Ted Dunning
r 19 [2]: 98.733778 > Num clusters: 20; maxDistance: 762.326896 > > On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning > wrote: > > I will have to think on this a bit. > > > > It should be possible to dump the sketches coming from each mapper and > look > > at t

Re: Design Issue: ConcurrentModificationException in FastProjectionSearch

2013-03-27 Thread Ted Dunning
e distances array that needs fixing. > In fact, an entire new copy is needed if we're to be able to safely > iterate and reindex. > > I'm shelving this for now. That ok? > > On Wed, Mar 27, 2013 at 4:35 AM, Ted Dunning > wrote: > > Another option is to make th

[jira] [Commented] (MAHOUT-1025) Update documentation for LDA before the release.

2013-03-27 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615384#comment-13615384 ] Ted Dunning commented on MAHOUT-1025: - Saikat, Your outline is just

Re: Mahout Suggestions - Refactoring Effort

2013-03-27 Thread Ted Dunning
browse/MAHOUT-1164<https://issues.apache.org/jira/browse/MAHOUT-1164> > > > > > On 03/27/2013 12:37 AM, Ted Dunning wrote: > >> Can you post a list of those patches? >> >> I haven't been tracking carefully and unless I have a moment when the >> email

[jira] [Commented] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-03-27 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615019#comment-13615019 ] Ted Dunning commented on MAHOUT-1164: - Marty, If you like this, I can commit

[jira] [Updated] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-03-27 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1164: Attachment: MAHOUT-1164.patch Another round of revision. Cleaned up unused variables

[jira] [Updated] (MAHOUT-1164) Make ARFF integration generate meta-data in JSON format

2013-03-27 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1164: Attachment: MAHOUT-1164.patch I factored large string constants out into resource files. More

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Ted Dunning
stian Schelter wrote: > > > Totally agree on that. The impact of making Mahout more usable is much > > higher than that of adding a new algorithm. > > > > On 27.03.2013 05:41, Ted Dunning wrote: > >> It is critically important. > >> > >

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Ted Dunning
Here are some ideas: - reform and simplify the clustering API's. All of our main-line clustering systems should work identically and have good and simple diagnostics. - simplify the connection to Lucene for clustering and classification. On Tue, Mar 26, 2013 at 8:44 PM, Dan Filimon wrote: > O

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Ted Dunning
I can help peripherally, but my travel schedule is heinous and would prevent full on mentoring. On Tue, Mar 26, 2013 at 8:44 PM, Dan Filimon wrote: > Okay, we just need to add JIRA issues that have the tags "gsoc2013" > and "mentors" and we're good. > > The deadline for the ideas is March 29 and

[jira] [Commented] (MAHOUT-1176) Introduce an Changelog file to raise contributors attribution

2013-03-26 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614969#comment-13614969 ] Ted Dunning commented on MAHOUT-1176: - Let's keep it more terse than eve

Re: Introduction of a changelog file to raise attribution

2013-03-26 Thread Ted Dunning
I meant that the committer should add the line. On Wed, Mar 27, 2013 at 7:14 AM, Mike Percy wrote: > On Tue, Mar 26, 2013 at 9:54 PM, Ted Dunning > wrote: > > > We should try to make sure to have the update of the CHANGELOG part of > the > > patch. > > >

Re: Introduction of a changelog file to raise attribution

2013-03-26 Thread Ted Dunning
Fine idea. We should try to make sure to have the update of the CHANGELOG part of the patch. On Wed, Mar 27, 2013 at 1:38 AM, Mike Percy wrote: > If others on the list agree then I think this is a fine idea. > > Regards, > Mike > > > On Tue, Mar 26, 2013 at 3:04 PM, Sebastian Schelter > wrote:

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Ted Dunning
g on consistent data format and command line option support. It's not > glamorous but it's important. > > > On 3/26/2013 8:26 PM, Ted Dunning wrote: > >> Gokhan, >> >> I think that the general drift of your recommendation is an excellent >> sugg

Re: Design Issue: ConcurrentModificationException in FastProjectionSearch

2013-03-26 Thread Ted Dunning
Another option is to make the iterator take a reference to the array as it exists and then during merging always create a new array. A second option is to just let the iterator get a bit confused (don't like the smell there). On Tue, Mar 26, 2013 at 10:59 PM, Dan Filimon wrote: > Ted, everyone,
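The first option in the message above (the iterator holds a reference to the array as it exists, and merging always builds a new array) can be sketched roughly as follows. This is a minimal illustration in Python rather than Mahout's Java, and it is not the FastProjectionSearch code; `SnapshotList` and `merge` are hypothetical names chosen for the example.

```python
class SnapshotList:
    """Hypothetical container: readers iterate a snapshot, merges swap in a new array."""

    def __init__(self, items=()):
        self._items = sorted(items)

    def __iter__(self):
        # Capture a reference to the array as it exists right now.
        # Later merges replace self._items but never touch this object.
        snapshot = self._items
        return iter(snapshot)

    def merge(self, pending):
        # Always create a new array instead of mutating the old one in place,
        # so iterators that are already running keep a consistent view.
        self._items = sorted(self._items + list(pending))


values = SnapshotList([1, 3, 5])

seen = []
for v in values:
    seen.append(v)
    if v == 1:
        values.merge([2, 4])  # merge happens in the middle of the iteration

after = list(values)  # a fresh iterator sees the merged array
```

The running iterator finishes over the old array and only later iterators see the merged data. The cost is one extra array copy per merge, which is the trade-off against letting the iterator "get a bit confused".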

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Ted Dunning
Gokhan, I think that the general drift of your recommendation is an excellent suggestion and it is something that we have wrestled with a lot over time. The recommendations side of the house has more coherence in this matter than other parts largely because there was a clear flow early on. Now,

Re: Call to action – Mahout needs your help

2013-03-26 Thread Ted Dunning
Yowza these are really good comments. Where have you guys been? To answer just one of the questions, my own criterion for voting up a committer is that they a) will work with other contributors positively b) won't break things too often (breaking the build sometimes is probably a good sign)

[jira] [Resolved] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1174. - Resolution: Fixed Checked in updated links. > Lanczos code and javadocs sho

[jira] [Commented] (MAHOUT-1173) Reactivate checkstyle

2013-03-25 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613289#comment-13613289 ] Ted Dunning commented on MAHOUT-1173: - For grins and general knowledge, Jen

[jira] [Reopened] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reopened MAHOUT-1174: - Assignee: Ted Dunning Fixing this now to point to the correct link. Thanks for spotting that

[jira] [Commented] (MAHOUT-1173) Reactivate checkstyle

2013-03-25 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613274#comment-13613274 ] Ted Dunning commented on MAHOUT-1173: - The problem before was that it produced

Re: Call to action – Mahout needs your help

2013-03-25 Thread Ted Dunning
I am still a fan of GSOC, but there is no chance I have enough time to help (although my working with Dan recently is a bit of a counter example) On Mon, Mar 25, 2013 at 11:12 PM, Grant Ingersoll wrote: > > On Mar 25, 2013, at 4:24 PM, Isabel Drost-Fromm wrote: > > > Also, do we have any voluntee

[jira] [Resolved] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1174. - Resolution: Fixed Committed javadoc fixes. Also added Preconditions check + assert for null at

[jira] [Created] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1174: --- Summary: Lanczos code and javadocs should refer users to the SSVD stuff Key: MAHOUT-1174 URL: https://issues.apache.org/jira/browse/MAHOUT-1174 Project: Mahout

Re: changes without JIRA's

2013-03-25 Thread Ted Dunning
his refers to the cleanups I've done in the last days. In the > future, I will create a Jira for each and attach a patch. > > On 25.03.2013 16:31, Ted Dunning wrote: > > I would like it if all changes to the code be accompanied by a JIRA that > > describes the problem bei

changes without JIRA's

2013-03-25 Thread Ted Dunning
I would like all changes to the code to be accompanied by a JIRA that describes the problem being solved, and the commit messages associated with the fix to reference that JIRA.

Re: Call to action – Mahout needs your help

2013-03-25 Thread Ted Dunning
Switching to apache git would make this easier. On Mon, Mar 25, 2013 at 1:08 PM, Isabel Drost wrote: > > As non-committer I'd contribute more to Mahout, had github be primary > > source. Now, when I contribute a pull request, it gets merged to Apache > git > > server by committer, and I don't ge

checkstyle disabled

2013-03-24 Thread Ted Dunning
Sebastian, I was the one who turned off checkstyle. Here is my (minimal) rationale. http://mail-archives.apache.org/mod_mbox/mahout-dev/201212.mbox/%3CCAJwFCa3y6m8xiRW%3DcqTG29fOH5TWeTO%2Bi1YYFzv%3DQSMSk0SinQ%40mail.gmail.com%3E

Re: Checkstyle

2013-03-24 Thread Ted Dunning
http://mail-archives.apache.org/mod_mbox/mahout-dev/201206.mbox/browser There seems to be a discussion of issues with Jenkins in there. Unfortunately mail-archives seems very flaky at the moment and I can't see the actual messages. On Sun, Mar 24, 2013 at 10:37 PM, Ted Dunning wrote:

Re: Checkstyle

2013-03-24 Thread Ted Dunning
I believe that it was removed because it was making the build unstable. Probably worth trolling back through the email archives. On Sun, Mar 24, 2013 at 5:47 PM, Sebastian Schelter wrote: > Why is checkstyle removed from our pom? Is there a particular reason for > that? > > I would suggest to reint

Re: Call to action – Mahout needs your help

2013-03-24 Thread Ted Dunning
Saikat, This sounds fairly interesting. Are you talking about a non-commercial or commercial interest in doing this? I ask because a non-commercial interest would probably mean that you would be willing to donate more of your code but would have less time to spare. A commercial interest would p

[jira] [Resolved] (MAHOUT-1171) PMD regression

2013-03-24 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1171. - Resolution: Fixed PMD is slightly better than before, at 194 medium warnings versus 199. Build

Re: increase in PMD warnings

2013-03-24 Thread Ted Dunning
I created MAHOUT-1171 to track fixes. I just committed a hundred or so changes that should get most of these. On Sun, Mar 24, 2013 at 12:23 PM, Sebastian Schelter < ssc.o...@googlemail.com> wrote: > Guess I'm responsible for those warnings, let me have a look. > > On 2

[jira] [Commented] (MAHOUT-1171) PMD regression

2013-03-24 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612079#comment-13612079 ] Ted Dunning commented on MAHOUT-1171: - Committed r1460328 with many simple

[jira] [Assigned] (MAHOUT-1171) PMD regression

2013-03-24 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reassigned MAHOUT-1171: --- Assignee: Ted Dunning > PMD regression > -- > >

[jira] [Commented] (MAHOUT-1171) PMD regression

2013-03-24 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612066#comment-13612066 ] Ted Dunning commented on MAHOUT-1171: - Build #1920 [1] showed a sharply incre

[jira] [Created] (MAHOUT-1171) PMD regression

2013-03-24 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1171: --- Summary: PMD regression Key: MAHOUT-1171 URL: https://issues.apache.org/jira/browse/MAHOUT-1171 Project: Mahout Issue Type: Bug Reporter: Ted

increase in PMD warnings

2013-03-24 Thread Ted Dunning
Build #1920 [1] showed a sharply increased number of PMD warnings recently. The report that shows the new warnings [2] indicates that the new warnings seem to be primarily unused imports and other simple issues that should be fixable by using IDE inspections. IntelliJ, for instance, would bitch

Re: [jira] [Created] (MAHOUT-1170) missing java files from mahout-distribution-0.7

2013-03-23 Thread Ted Dunning
Do mvn compile These missing files will be created. The eclipse maven plugins aren't smart enough to understand the source file creation step that is included in the compilation. That means that you can mostly trust it, but not always. The first time is a good example of what "not always"

Re: BallKMeans clustering issues detailed

2013-03-22 Thread Ted Dunning
Indeed. Dan and I have discussed this. The space that he starts in is TF-IDF weighted and the projections is random so it should preserve much of the metric in the original. Based on the experience that we had with SSVD, using a properly learned projection would definitely give modest improveme

Re: https://issues.apache.org/jira/browse/MAHOUT-1168

2013-03-20 Thread Ted Dunning
Looks like a great idea. We are very weak RTC. Some things are pretty obviously good ideas and low risk so we wind up doing something like CTR. On Wed, Mar 20, 2013 at 8:47 PM, Benson Margulies wrote: > Anyone have any objections? > > Are we still formally RTC? >

Re: lucene 4.2.0?

2013-03-20 Thread Ted Dunning
Grant was going to do the commit, I think. On Wed, Mar 20, 2013 at 10:32 AM, Benson Margulies wrote: > I'm a bit rusty. People want a patch and a jira, or just the trivial commit > :-) > > > On Wed, Mar 20, 2013 at 6:52 AM, Ted Dunning > wrote: > > > Sounds good

Re: lucene 4.2.0?

2013-03-19 Thread Ted Dunning
AM > Subject: Re: lucene 4.2.0? > > I wouldn't think so. Go for it. > > On Mar 19, 2013, at 6:35 AM, Ted Dunning wrote: > > > Shouldn't affect compatibility with >= 4.0, should it? > > > > On Tue, Mar 19, 2013 at 3:16 AM, Benson Margulies >wrote: > > > >> Any objection? > >> >

Re: lucene 4.2.0?

2013-03-19 Thread Ted Dunning
Shouldn't affect compatibility with >= 4.0, should it? On Tue, Mar 19, 2013 at 3:16 AM, Benson Margulies wrote: > Any objection? >

Re: Fwd: Neural Network and Restricted Boltzman Machine in Mahout

2013-03-16 Thread Ted Dunning
t > variables. > > > On Fri, Mar 15, 2013 at 3:04 AM, Ted Dunning > wrote: > > > Is this a dense dataset or sparse? What is the average sparsity if so. > > > > On Thu, Mar 14, 2013 at 11:26 AM, Ying Liao wrote: > > > > > I have a training set with 9

[jira] [Commented] (MAHOUT-1166) Multithreaded version of distributed ALS

2013-03-16 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604477#comment-13604477 ] Ted Dunning commented on MAHOUT-1166: - There is a FileBasedMatrix implementa

Re: Fwd: Neural Network and Restricted Boltzman Machine in Mahout

2013-03-15 Thread Ted Dunning
Is this a dense dataset or sparse? What is the average sparsity if so. On Thu, Mar 14, 2013 at 11:26 AM, Ying Liao wrote: > I have a training set with 9M records and 1M independent variables. Any > other tool can process the dataset? > > > On Thu, Mar 14, 2013 at 11:33 AM, Danny Busch wrote: >

Re: Adding a new command-line program

2013-03-14 Thread Ted Dunning
I was unable to answer this off the cuff in direct email. Anybody else remember the answer? On Wed, Mar 13, 2013 at 12:44 PM, Dan Filimon wrote: > I'm trying to add the new StreamingKMeans job as > (o.a.m.clustering.streaming.mapreduce.StreamingKMeansDriver [1; not > yet a JIRA issue). > > I've

Re: Fwd: Neural Network and Restricted Boltzman Machine in Mahout

2013-03-14 Thread Ted Dunning
> http://openreview.net/iclr2013 > > table 2, compared to the result you mentioned > > http://arxiv.org/pdf/1301.4171v1.pdf > > > On Thu, Mar 14, 2013 at 11:44 AM, Ted Dunning > wrote: > > > Yeah we have had little pull on these techniques beyond the simplest &

Re: Fwd: Neural Network and Restricted Boltzman Machine in Mahout

2013-03-14 Thread Ted Dunning
Yeah we have had little pull on these techniques beyond the simplest case of logistic regression. Would you guys be willing to sign up for maintaining the code that might result? The thing that might move the needle would be a replication of this architecture: http://deeplearning.net/2012/12

Re: Discussion Of ML environment/MR, Mahout

2013-03-13 Thread Ted Dunning
Stick around! We would love to see the fruits of this. On Wed, Mar 13, 2013 at 1:01 AM, Nick Pentreath wrote: > The main point of interest in this context is that I intend to build a > minimal first-cut machine learning library for Spark. This is likely to > involve porting / using parts of Mah

Re: Discussion Of ML environment/MR, Mahout

2013-03-12 Thread Ted Dunning
On Tue, Mar 12, 2013 at 3:05 PM, Josh Wills wrote: > > First, I wanted to say that I think that there are lots of problems that > can be handled well in MapReduce (the recent k-means streaming stuff being > a prime example), even if they could be performed even faster using an > in-memory mo

[jira] [Commented] (MAHOUT-668) Adding knn support to Mahout classifiers

2013-03-12 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600518#comment-13600518 ] Ted Dunning commented on MAHOUT-668: I don't think so. The searcher interf

Re: mahout collections updates

2013-03-12 Thread Ted Dunning
Indeed. We have considered switching in the past, but the momentum never developed. On Tue, Mar 12, 2013 at 8:59 AM, Andy Schlaikjer < andrew.schlaik...@gmail.com> wrote: > For sake of argument, fastutil has fast entry sets / iterators for most > specializations of maps: > > > http://fastutil.di

Re: Discussion Of ML environment/MR, Mahout

2013-03-12 Thread Ted Dunning
pread unevenly enough. > > > On Mon, Mar 11, 2013 at 11:59 PM, Ted Dunning wrote: > >> Yarn by itself won't fix this problem. Yarn + Spark would fix it. But, >> then again, so would Mesos + Spark or AmigaOS + Spark. >> >> Should we open several additional mo

Re: Discussion Of ML environment/MR, Mahout

2013-03-11 Thread Ted Dunning
Yay for PIG. I am still hoping that Drill does well and that the PIG folk build a syntax facade for it so that I can write PIG programs that run really fast. On Mon, Mar 11, 2013 at 5:46 PM, Jake Mannix wrote: > On Mon, Mar 11, 2013 at 4:59 PM, Ted Dunning > wrote: > > > Yarn

Re: mahout collections updates

2013-03-11 Thread Ted Dunning
On Mon, Mar 11, 2013 at 5:58 PM, Jake Mannix wrote: > On Mon, Mar 11, 2013 at 5:44 PM, Jake Mannix > wrote: > > > > > > > > > On Mon, Mar 11, 2013 at 5:14 PM, Ted Dunning >wrote: > > > >> [mvn compile|test|package] will do the trick. > &g

Re: mahout collections updates

2013-03-11 Thread Ted Dunning
On Mon, Mar 11, 2013 at 5:44 PM, Jake Mannix wrote: > On Mon, Mar 11, 2013 at 5:14 PM, Ted Dunning > wrote: > > > [mvn compile|test|package] will do the trick. > >... > > Not that it matters much since the compile is so fast. > > > > Ok, I'll

Re: mahout collections updates

2013-03-11 Thread Ted Dunning
r 11, 2013 at 4:42 PM, Jake Mannix wrote: > On Mon, Mar 11, 2013 at 4:21 PM, Ted Dunning > wrote: > > > It is part of math now since we had zero pull for it separate from math. > > > > I see the code templates living in math, yes, but how to build it? > > > &

Re: Discussion Of ML environment/MR, Mahout

2013-03-11 Thread Ted Dunning
pis in Yarn etc. hadoop native stuff, but isn't really > what would solve iterative structured and interconnected stuff? > > > > > > On 11.03.2013 21:16, Ted Dunning wrote: > > > Kinda sorta.. > > > > > > You can defeat most of the sort if you want

[jira] [Commented] (MAHOUT-1130) Wrong logic in org.apache.mahout.clustering.kmeans.RandomSeedGenerator

2013-03-11 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599512#comment-13599512 ] Ted Dunning commented on MAHOUT-1130: - Sebastian, I see no patch here. Is it

Re: Discussion Of ML environment/MR, Mahout

2013-03-11 Thread Ted Dunning
Why not (b) if (b) implies Giraph (which seems to have some momentum) or Spark (which has its own momentum and was originally designed to support machine learning anyway)? Also, why not (b) if we agree now that it is an experiment that we will cut away if it leads to a mess. On Mon, Mar 11, 201

Re: mahout collections updates

2013-03-11 Thread Ted Dunning
It is part of math now since we had zero pull for it separate from math. What did you need? On Mon, Mar 11, 2013 at 1:43 PM, Jake Mannix wrote: > Question which I ought to know the answer to, but don't: if we want to make > changes to mahout-collections, what's the build process / maven target

Re: Discussion Of ML environment/MR, Mahout

2013-03-11 Thread Ted Dunning
Kinda sorta.. You can defeat most of the sort if you want to just hash things to buckets. On Mon, Mar 11, 2013 at 12:01 PM, Dmitriy Lyubimov wrote: > Sort component adds log to > the asymptotic complexity, whereas it is clear that any streaming merge > algorithm just wouldn't need to do sort and
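The point about hashing to buckets can be illustrated with a small sketch: if values only need to be grouped by key, not ordered, each record can be routed to a bucket by hash in a single pass, which avoids the log factor a sort adds. The code below is plain Python and says nothing about how Hadoop's shuffle is actually configured; `group_by_hash`, `totals`, and `num_buckets` are hypothetical names for this example.

```python
from collections import defaultdict


def group_by_hash(pairs, num_buckets=4):
    """Route each (key, value) pair to a bucket by hash; no ordering is imposed."""
    buckets = [defaultdict(list) for _ in range(num_buckets)]
    for key, value in pairs:
        # One pass over the input: O(n) expected time, no comparison sort.
        buckets[hash(key) % num_buckets][key].append(value)
    return buckets


def totals(pairs, num_buckets=4):
    """Sum the values per key, reading the buckets independently of each other."""
    result = {}
    for bucket in group_by_hash(pairs, num_buckets):
        for key, values in bucket.items():
            result[key] = sum(values)
    return result


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
sums = totals(pairs)
```

All values for a given key land in the same bucket, so each bucket can be reduced on its own. What is lost is any ordering between keys, which is why this only helps when the downstream step does not need sorted input.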

Re: [Draft] Board Report

2013-03-11 Thread Ted Dunning
Nice! On Mon, Mar 11, 2013 at 10:48 AM, Jake Mannix wrote: > is currently on the wiki, please look it over, as the board meeting is this > wednesday, I believe, so I need to send it over soon (it was due on > saturday). > > https://cwiki.apache.org/confluence/display/MAHOUT/Monthly+Progress > >

Re: Out-of-core random forest implementation

2013-03-08 Thread Ted Dunning
erent point on a tradeoff curve, optimizing for a different type of > problem. > Yes. Yes. For instance, stochastic svd and streaming k-means both radically change the game when it comes to map-reduce. But the real issue has to do with whether scaling is truly linear or not. > >

Re: Out-of-core random forest implementation

2013-03-08 Thread Ted Dunning
The big cost in map-reduce iteration isn't just startup. It is that the input has to be read from disk and the output written to same. Were it to stay in memory, things would be vastly faster. Also, startup costs are still pretty significant. Even on MapR, one of the major problems in setting t

Re: [jira] [Updated] (MAHOUT-1157) AbstractCluster.formatVector iteration bug.

2013-03-08 Thread Ted Dunning
Thanks for the patch. Sent from my iPhone On Mar 8, 2013, at 1:22 AM, "Adam Bozanich (JIRA)" wrote: > > [ > https://issues.apache.org/jira/browse/MAHOUT-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > ] > > Adam Bozanich updated MAHOUT-1157: > -

[jira] [Commented] (MAHOUT-1155) Make MatrixSlice a Vector and fix Centroid cloning

2013-03-07 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596800#comment-13596800 ] Ted Dunning commented on MAHOUT-1155: - In this patch fragment: + @Over

Re: Out-of-core random forest implementation

2013-03-07 Thread Ted Dunning
See Giraph. On Thu, Mar 7, 2013 at 6:01 PM, Andy Twigg wrote: > That sounds like a horrid amount of work to do something simple. Is there a > hadoop implementation of a master-workers problem you can point me to? > On Mar 7, 2013 9:57 PM, "Ted Dunning" wrote: > > >

Re: Out-of-core random forest implementation

2013-03-07 Thread Ted Dunning
On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg wrote: > ... Right now what we have is a > single-machine procedure for scanning through some data, building a > set of histograms, combining histograms and then expanding the tree. > The next step is to decide the best way to distribute this. I'm not an

[jira] [Commented] (MAHOUT-1151) Object reuse in distributed ALS

2013-03-06 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594848#comment-13594848 ] Ted Dunning commented on MAHOUT-1151: - Regarding the patch, it is mostly quite

Re: [streaming k-means] Code review

2013-03-04 Thread Ted Dunning
I think it might be worth committing in steps. The standalone clustering and utility code has almost no impact on existing Mahout code (what small impacts there were on Vector and friends were committed some time ago). These can be committed sooner. Integration with the map-reduce and command li

Re: ARFF file support for random forest classifiers

2013-02-28 Thread Ted Dunning
making this consistent would be very helpful. On Thu, Feb 28, 2013 at 9:33 AM, Marty Kube wrote: > Hey, > > I've been looking at consuming ARFF files for random forest classification. > > If you look at the partial implementation example page one is asked to > download an ARFF file, edit the ARFF

Re: Out-of-core random forest implementation

2013-02-21 Thread Ted Dunning
ect.org > >>>>> > >>>>> I think this framework is a nice fit for the problem. > >>>>> If the input data fits into the "total cluster memory" you benefit > from > >>>>> the caching of the RDD's. > >>>>

Re: Out-of-core random forest implementation

2013-02-19 Thread Ted Dunning
If non-MR means map-only job with communicating mappers and a state store, I am down with that. What did you mean? On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: > > Right now I'd lean towards the planet model, or maybe a non-MR > implementation. Anyon

[jira] [Commented] (MAHOUT-1148) QR Decomposition is too slow

2013-02-15 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13579708#comment-13579708 ] Ted Dunning commented on MAHOUT-1148: - The difference was just a matter of rewri

Re: Out-of-core random forest implementation

2013-02-15 Thread Ted Dunning
M, deneche abdelhakim wrote: > >> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube < >> martykube@**beavercreekconsulting.com> >> wrote: >> >> On 01/28/2013 02:33 PM, Ted Dunning wrote: >>> >>> I think I was suggesting something weaker. >>

Re: Out-of-core random forest implementation

2013-02-15 Thread Ted Dunning
Exactly. On Fri, Feb 15, 2013 at 5:37 PM, Marty Kube wrote: > Even if you are not doing map reduce exactly, hadoop does give you a nice > infrastructure for running jobs across a lot of host. > > On 02/15/2013 04:00 PM, Ted Dunning wrote: > >> Remember that Hadoop != map-r

Re: Out-of-core random forest implementation

2013-02-15 Thread Ted Dunning
Remember that Hadoop != map-reduce. If there is another style that we need to use, that isn't such a bad thing. On Fri, Feb 15, 2013 at 7:42 AM, Andy Twigg wrote: > I am having a hard time convincing myself that doing it on hadoop is > the best idea (and like I said, it's not like there are ot

Re: Accessing the local filesystem from AbstractJob

2013-02-14 Thread Ted Dunning
I think that file: is the right way to access the local file system. On Wed, Feb 13, 2013 at 4:14 AM, Sean Owen wrote: > Hmm I think it will work if you use "file:///..." URIs? I haven't tried in > a long time though. > > > On Wed, Feb 13, 2013 at 12:12 PM, Dan Filimon > wrote: > > > I see. Well
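As a rough illustration of why a `file:` prefix selects the local file system, the sketch below splits a URI into scheme and path with Python's standard library. It does not call any Hadoop or Mahout API. The fallback to a cluster default for scheme-less paths is an assumption, and `split_fs_uri` and `default_scheme` are hypothetical names.

```python
from urllib.parse import urlparse


def split_fs_uri(uri, default_scheme="hdfs"):
    """Return (scheme, path) for a URI; scheme-less paths get the assumed default."""
    parsed = urlparse(uri)
    scheme = parsed.scheme or default_scheme
    return scheme, parsed.path


local = split_fs_uri("file:///tmp/input/points.csv")
bare = split_fs_uri("/tmp/input/points.csv")
```

Here `local` carries the `file` scheme explicitly, while `bare` falls back to whatever the default is, which is the situation the thread describes when a job reads from the wrong place.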

Re: Compiling Mahout without generating math sources

2013-02-12 Thread Ted Dunning
The way to fix this would be to make the fancy source generation plugin be more clever. It hasn't been worthwhile to date. On Tue, Feb 12, 2013 at 6:57 AM, Dan Filimon wrote: > Is there any way of building Mahout without regenerating the math sources? > It seems that every time I run 'mvn compil

[jira] [Commented] (MAHOUT-1148) QR Decomposition is too slow

2013-02-05 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571796#comment-13571796 ] Ted Dunning commented on MAHOUT-1148: - Well, Jake predicted a speedup. But thi

Re:

2013-02-05 Thread Ted Dunning
Your email has been hacked, it appears. On Tue, Feb 5, 2013 at 1:04 AM, Elena Smirnova wrote: > http://www.flysteve.it/k5ulzw.php?s=ot > >

[jira] [Comment Edited] (MAHOUT-1148) QR Decomposition is too slow

2013-02-05 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571563#comment-13571563 ] Ted Dunning edited comment on MAHOUT-1148 at 2/5/13 6:4

[jira] [Resolved] (MAHOUT-1148) QR Decomposition is too slow

2013-02-05 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1148. - Resolution: Fixed OK. That was a silly last minute change that went with the wrong polarity

[jira] [Commented] (MAHOUT-1148) QR Decomposition is too slow

2013-02-05 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571542#comment-13571542 ] Ted Dunning commented on MAHOUT-1148: - Suneel, Jenkins agrees with you. I

[jira] [Reopened] (MAHOUT-1148) QR Decomposition is too slow

2013-02-05 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reopened MAHOUT-1148: - Test failing > QR Decomposition is too s

[jira] [Resolved] (MAHOUT-1148) QR Decomposition is too slow

2013-02-04 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1148. - Resolution: Fixed I committed the new QR decomposition. This is a bit bold to do, but I think

Re: 0.8?

2013-02-03 Thread Ted Dunning
; On Sun, Feb 3, 2013 at 9:45 PM, Ted Dunning wrote: > > I think that getting it into the existing API would be very nice to have, > > but not absolutely critical. > > > > If extending the release by, say, 2-3 weeks would solve the problem I > would > > recommend e
