Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Marty Kube
Hey Ted, Here are the JIRA tickets... https://issues.apache.org/jira/browse/MAHOUT-1163 https://issues.apache.org/jira/browse/MAHOUT-1164 On 03/27/2013 12:37 AM, Ted Dunning wrote: Can you post a list of those patches? I haven't been tracking carefully and unless I have a moment when the em

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Ted Dunning
Can you post a list of those patches? I haven't been tracking carefully and unless I have a moment when the email comes through (<10% chance lately) then I lose track. On Wed, Mar 27, 2013 at 7:30 AM, Marty Kube wrote: > So I'd like to continue to improve the RF classifier code. I've been > post

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Marty Kube
So I'd like to continue to improve the RF classifier code. I've been posting patches along the lines of the refactoring discussed here. The patches are not being looked at. Someone should be considering patches in this area. Maybe I could handle that :-) Sent from my iPhone On Mar 27, 2013,

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Ted Dunning
Here are some ideas: - reform and simplify the clustering API's. All of our main-line clustering systems should work identically and have good and simple diagnostics. - simplify the connection to Lucene for clustering and classification. On Tue, Mar 26, 2013 at 8:44 PM, Dan Filimon wrote: > O

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Ted Dunning
I can help peripherally, but my travel schedule is heinous and would prevent full on mentoring. On Tue, Mar 26, 2013 at 8:44 PM, Dan Filimon wrote: > Okay, we just need to add JIRA issues that have the tags "gsoc2013" > and "mentors" and we're good. > > The deadline for the ideas is March 29 and

[jira] [Commented] (MAHOUT-1176) Introduce an Changelog file to raise contributors attribution

2013-03-26 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614969#comment-13614969 ] Ted Dunning commented on MAHOUT-1176: - Let's keep it more terse than every commit. H

Re: Introduction of a changelog file to raise attribution

2013-03-26 Thread Ted Dunning
I meant that the committer should add the line. On Wed, Mar 27, 2013 at 7:14 AM, Mike Percy wrote: > On Tue, Mar 26, 2013 at 9:54 PM, Ted Dunning > wrote: > > > We should try to make sure to have the update of the CHANGELOG part of > the > > patch. > > > > Hard to do, right? Assumes the submitt

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Sebastian Schelter
Totally agree on that. The impact of making Mahout more usable is much higher than that of adding a new algorithm. On 27.03.2013 05:41, Ted Dunning wrote: > It is critically important. > > On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube < > martyk...@beavercreekconsulting.com> wrote: > >> IMHO usabi

Re: Introduction of a changelog file to raise attribution

2013-03-26 Thread Mike Percy
On Tue, Mar 26, 2013 at 9:54 PM, Ted Dunning wrote: > We should try to make sure to have the update of the CHANGELOG part of the > patch. > Hard to do, right? Assumes the submitter knows who will commit their patch. Regards, Mike

[jira] [Resolved] (MAHOUT-1176) Introduce an Changelog file to raise contributors attribution

2013-03-26 Thread Sebastian Schelter (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1176. Resolution: Fixed Fix Version/s: 0.8 Assignee: Sebastian Schelter

[jira] [Created] (MAHOUT-1176) Introduce an Changelog file to raise contributors attribution

2013-03-26 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1176: -- Summary: Introduce an Changelog file to raise contributors attribution Key: MAHOUT-1176 URL: https://issues.apache.org/jira/browse/MAHOUT-1176 Project: Ma

Re: Introduction of a changelog file to raise attribution

2013-03-26 Thread Ted Dunning
Fine idea. We should try to make sure to have the update of the CHANGELOG part of the patch. On Wed, Mar 27, 2013 at 1:38 AM, Mike Percy wrote: > If others on the list agree then I think this is a fine idea. > > Regards, > Mike > > > On Tue, Mar 26, 2013 at 3:04 PM, Sebastian Schelter > wrote:

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Ted Dunning
It is critically important. On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: > IMHO usability is really important. I've posted a couple of patches > recently around making the RF classifiers easier to use. I found myself > working on consistent data fo

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Marty Kube
IMHO usability is really important. I've posted a couple of patches recently around making the RF classifiers easier to use. I found myself working on consistent data format and command line option support. It's not glamorous but it's important. On 3/26/2013 8:26 PM, Ted Dunning wrote: Go

Re: Design Issue: ConcurrentModificationException in FastProjectionSearch

2013-03-26 Thread Ted Dunning
Another option is to make the iterator take a reference to the array as it exists and then during merging always create a new array. A second option is to just let the iterator get a bit confused (don't like the smell there). On Tue, Mar 26, 2013 at 10:59 PM, Dan Filimon wrote: > Ted, everyone,
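A minimal Java sketch of the first option above, assuming a simplified searcher: the iterator keeps a reference to the array as it existed when iteration began, and merging always allocates a new array instead of mutating the old one. The class SnapshotSearcher and its methods add and merge are hypothetical names made up for illustration; this is not the actual FastProjectionSearch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration only; not Mahout's FastProjectionSearch. */
final class SnapshotSearcher implements Iterable<double[]> {
  // Main storage. It is replaced wholesale on merge and never mutated in place.
  private double[][] vectors = new double[0][];
  // Vectors added since the last merge.
  private final List<double[]> pending = new ArrayList<>();

  void add(double[] vector) {
    pending.add(vector);
  }

  /** Folds the pending vectors into a freshly allocated array. */
  void merge() {
    double[][] merged = Arrays.copyOf(vectors, vectors.length + pending.size());
    for (int i = 0; i < pending.size(); i++) {
      merged[vectors.length + i] = pending.get(i);
    }
    vectors = merged; // iterators created earlier still hold the old array
    pending.clear();
  }

  /** Iterates over the merged vectors as they exist right now; pending ones are skipped. */
  @Override
  public Iterator<double[]> iterator() {
    final double[][] snapshot = vectors; // reference to the array as it exists
    return new Iterator<double[]>() {
      private int next = 0;

      @Override
      public boolean hasNext() {
        return next < snapshot.length;
      }

      @Override
      public double[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return snapshot[next++];
      }
    };
  }
}
```

With this layout, an iterator obtained before a merge keeps walking the old array and never sees a half-updated one, so no ConcurrentModificationException arises; the cost is one array copy per merge. The sketch is single-threaded and says nothing about thread safety.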

Re: Introduction of a changelog file to raise attribution

2013-03-26 Thread Mike Percy
If others on the list agree then I think this is a fine idea. Regards, Mike On Tue, Mar 26, 2013 at 3:04 PM, Sebastian Schelter wrote: > In response to the current discussion about raising attribution for > contributors, I suggest we introduce a CHANGELOG file similar to the one > used in Gira

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Ted Dunning
Gokhan, I think that the general drift of your recommendation is an excellent suggestion and it is something that we have wrestled with a lot over time. The recommendations side of the house has more coherence in this matter than other parts largely because there was a clear flow early on. Now,

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Gokhan Capan
You're right, I skipped that, sorry. On Tue, Mar 26, 2013 at 11:30 PM, Suneel Marthi wrote: > Gokhan, > > Thinking out loud here, I have not tried running LDA so I could be wrong. > > As a precursor to Step 2 below, did you try running the RowIdJob that should > create pairs? > It also creates a 'd

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Dan Filimon
I totally agree. Additionally, let me suggest a starting point. :) I'm working with Ted on StreamingKMeans. I'll be uploading patches for the MapReduce version and command line program soon. I could really use feedback on the parameters, documentation, comments, package structure, naming... On W

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Gokhan Capan
What I actually believe is that we should define a set of design principles for Mahout. These may differ per kind of algorithm: perhaps one set for unsupervised algorithms, one for supervised ones, and one for dyadic prediction algorithms (recommenders); I am not sure yet. I consider myself a 'power user' of Mahout, and a Mah

Re: Introduction of a changelog file to raise attribution

2013-03-26 Thread Dan Filimon
Sounds good. It makes contributors more visible and doesn't take much time. On Wed, Mar 27, 2013 at 12:04 AM, Sebastian Schelter wrote: > In response to the current discussion about raising attribution for > contributors, I suggest we introduce a CHANGELOG file similar to the one > used in Giraph

Introduction of a changelog file to raise attribution

2013-03-26 Thread Sebastian Schelter
In response to the current discussion about raising attribution for contributors, I suggest we introduce a CHANGELOG file similar to the one used in Giraph [1]. For every commit, we add a single line with the id and name of the jira, the name of the committer and potentially the name of the contrib

Design Issue: ConcurrentModificationException in FastProjectionSearch

2013-03-26 Thread Dan Filimon
Ted, everyone, I added the multiple pass BallKMeans at the end of the reducer and have uncovered a series of issues with the searchers. FastProjectionSearch stores "pending" vectors in a separate list from the main projections, the idea being that use arrays throughout rather than a tree multiset

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Sebastian Schelter
Hi Gokhan, I like the idea, but I'm not sure whether it's completely feasible for all parts of Mahout. A lot of jobs need a little more than a matrix, for example an additional dictionary for text-based stuff. In the collaborative filtering code, we already have a common input format: All recommend

Re: Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Suneel Marthi
Gokhan, Thinking out loud here, I have not tried running LDA so I could be wrong. As a precursor to Step 2 below, did you try running the RowIdJob that should create pairs? It also creates a 'docIndex', which serves to map the Ints back to the original Text. Suneel ___

Re: Call to action – Mahout needs your help

2013-03-26 Thread Mike Percy
Thank you also to Ted, Grant & Isabel for your responses, they're very helpful. Regards, Mike On Tue, Mar 26, 2013 at 6:59 AM, Isabel Drost wrote: > On Tue, Mar 26, 2013 at 4:13 AM, Mike Percy wrote: > > > However the reason I mention it is that up until now, > > others I've spoken to within t

Re: Call to action – Mahout needs your help

2013-03-26 Thread Mike Percy
Hi Sebastian, I suppose whatever people feel is not too much overhead seems reasonable. In Flume, we auto-generate the release changelog from JIRA. Regards, Mike On Tue, Mar 26, 2013 at 1:35 AM, Sebastian Schelter wrote: > Hi Mike, > > > Regarding attribution, I saw it mentioned elsewhere in

Mahout Suggestions - Refactoring Effort

2013-03-26 Thread Gokhan Capan
I am moving my email that I wrote to Call to Action upon request. I'll start with an example that I experience when I use Mahout, and list my humble suggestions. When I try to run Latent Dirichlet Allocation for topic discovery, here are the steps to follow: 1- First I use seq2sparse to convert

Re: Call to action – Mahout needs your help

2013-03-26 Thread Gokhan Capan
Sure. On Tue, Mar 26, 2013 at 7:14 PM, Dan Filimon wrote: > Gokhan, I totally agree that we need all of that. Would you mind > starting a new thread about this? > This thread is great for listing ideas, but it's already become pretty > long and it's getting hard to keep track. > > On Tue, Mar 26

Re: Mahout in GSOC 2013

2013-03-26 Thread Ulrich Stärk
You should ask the PMC to forward my original mail to your dev list then. Uli On 26.03.2013 20:43, Dan Filimon wrote: > Thanks! We have a JIRA. > I'm not on the private list. Have yet to become a committer. :) > I'll let everyone know so we can get the ideas going. > > On Tue, Mar 26, 2013 at 9:

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Dan Filimon
Okay, we just need to add JIRA issues that have the tags "gsoc2013" and "mentors" and we're good. The deadline for the ideas is March 29 and the mentor selection comes after. On Tue, Mar 26, 2013 at 7:26 PM, Dan Filimon wrote: > Makes a lot of sense. Maybe you can give some advice to this year's

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Dan Filimon
Makes a lot of sense. Maybe you can give some advice to this year's mentors though? I contacted the ComDev list to ask for what we need to do. But, any volunteers? :D On Tue, Mar 26, 2013 at 6:32 PM, Isabel Drost wrote: > On Tue, Mar 26, 2013 at 5:17 PM, Dan Filimon > wrote: > >> So, Isabel, do

Re: Call to action – Mahout needs your help

2013-03-26 Thread Dan Filimon
Gokhan, I totally agree that we need all of that. Would you mind starting a new thread about this? This thread is great for listing ideas, but it's already become pretty long and it's getting hard to keep track. On Tue, Mar 26, 2013 at 6:38 PM, Gokhan Capan wrote: > Hi, > > Would you consider to

Re: Call to action – Mahout needs your help

2013-03-26 Thread Gokhan Capan
Hi, Would you consider refactoring Mahout so that the project follows a clear, layered structure for all algorithms, and documenting it, such as: - All algorithms take Mahout matrices as input, and output matrices as the learned model - All preprocessing tools should be generic enough, so

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Isabel Drost
On Tue, Mar 26, 2013 at 5:17 PM, Dan Filimon wrote: > So, Isabel, do you want to be a mentor? > As much as I'd love to and as much as I've enjoyed it in previous years I know that I cannot put in as much time as I would like to. I've once made the mistake of committing to too much between March a

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Dan Filimon
So, Isabel, do you want to be a mentor? On Tue, Mar 26, 2013 at 6:12 PM, Isabel Drost wrote: > On Tue, Mar 26, 2013 at 4:24 PM, Dan Filimon > wrote: >> >> [1] >> http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2013/help_page#1._How_does_a_mentoring_organization >> [2] ht

Re: GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Isabel Drost
On Tue, Mar 26, 2013 at 4:24 PM, Dan Filimon wrote: > > [1] > http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2013/help_page#1._How_does_a_mentoring_organization > [2] http://en.flossmanuals.net/GSoCMentoring/what-makes-a-good-mentor/ > [3] http://en.flossmanuals.net/GSoCMe

Re: Call to action – Mahout needs your help

2013-03-26 Thread Isabel Drost
On Tue, Mar 26, 2013 at 3:59 PM, Grant Ingersoll wrote: > I believe the GSOC proposal for Mentors is due soon, so if someone is > doing it, they better hop on comdev ASAP and submit. > For more information also check - in particular the "for mentors" bit of

Re: Call to action – Mahout needs your help

2013-03-26 Thread Dan Filimon
You're right Grant, forked the conversation to another thread. Let's talk about GSOC there. On Tue, Mar 26, 2013 at 4:59 PM, Grant Ingersoll wrote: > I believe the GSOC proposal for Mentors is due soon, so if someone is doing > it, they better hop on comdev ASAP and submit. > > On Mar 26, 2013,

GSOC proposals and mentors [was Call to action – Mahout needs your help]

2013-03-26 Thread Dan Filimon
GSOC was mentioned in the other thread a couple of times. And it turns out the deadline for mentoring organizations is coming up this Friday _March 29_! Isabel and Ted say they've had good experiences in the past and Shannon and I are interested in helping this year. __Do the rest of you want to

Re: Call to action – Mahout needs your help

2013-03-26 Thread Grant Ingersoll
I believe the GSOC proposal for Mentors is due soon, so if someone is doing it, they better hop on comdev ASAP and submit. On Mar 26, 2013, at 10:06 AM, Isabel Drost wrote: > On Tue, Mar 26, 2013 at 12:12 PM, Dan Filimon > wrote: > >> If you guys decide to participate in GSOC this year, I'd be

Build failed in Jenkins: Mahout-Examples-Cluster-Reuters #280

2013-03-26 Thread Apache Jenkins Server
See Changes: [ssc] MAHOUT-1173 Reactivate checkstyle -- [...truncated 9516 lines...] Mar 26, 2013 2:48:34 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush INFO: Starting f

[jira] [Commented] (MAHOUT-1173) Reactivate checkstyle

2013-03-26 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614065#comment-13614065 ] Hudson commented on MAHOUT-1173: Integrated in Mahout-Quality #1935 (See [https://builds

[jira] [Commented] (MAHOUT-1025) Update documentation for LDA before the release.

2013-03-26 Thread Isabel Drost (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614059#comment-13614059 ] Isabel Drost commented on MAHOUT-1025: -- First of all that's not a naive question. Yo

Re: Call to action – Mahout needs your help

2013-03-26 Thread Shannon Quinn
I would love to help in any way I can. I'm fairly busy with my PhD studies until early May when I shift to an internship for the summer, so if I could have some help setting up tickets on JIRA for things we'd like to see done, I could take over the legwork once the summer hits. I'd be happy to

Re: Call to action – Mahout needs your help

2013-03-26 Thread Isabel Drost
On Tue, Mar 26, 2013 at 12:12 PM, Dan Filimon wrote: > If you guys decide to participate in GSOC this year, I'd be happy to > spread the word and maybe even have a presentation about Mahout at > school. Also, since I'm squarely on the student side (doing my senior > project with Ted on Mahout) I t

Re: Call to action – Mahout needs your help

2013-03-26 Thread Isabel Drost
On Tue, Mar 26, 2013 at 4:13 AM, Mike Percy wrote: > However the reason I mention it is that up until now, others I've spoken to within the Hadoop community have felt that large new > algorithm contributions are basically what will earn someone committership > on Mahout. Based on this thread, co

Re: Call to action – Mahout needs your help

2013-03-26 Thread Isabel Drost
On Tue, Mar 26, 2013 at 3:12 AM, Daniel Longest wrote: > I've been a lurker on this list for a few months and trying to figure > out a way to contribute. Welcome! > I'm very interested in ML but am not a > professional in it. I am a fulltime .NET developer by trade, but have > used Java aca

Re: Call to action – Mahout needs your help

2013-03-26 Thread Isabel Drost
On Mon, Mar 25, 2013 at 11:16 PM, Ted Dunning wrote: > I am still a fan of GSOC, Same here - though thinking back I may just have been extremely lucky with the students I had gotten. > but there is no chance I have enough time to help > Again same here. Note: This doesn't mean that GSoC is

[jira] [Resolved] (MAHOUT-1173) Reactivate checkstyle

2013-03-26 Thread Sebastian Schelter (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Schelter resolved MAHOUT-1173. Resolution: Fixed Reworked the sources and committed the changes, only 3 justified

Re: Call to action – Mahout needs your help

2013-03-26 Thread Dan Filimon
Regarding GSOC, my school has always had *lots* of students accepted (we were second in 2012 and over the past 5 years [1]). I'm involved in an open-source organization at school and we can get in touch with lots of students. If you guys decide to participate in GSOC this year, I'd be happy to spr

Re: Call to action – Mahout needs your help

2013-03-26 Thread Grant Ingersoll
On Mar 25, 2013, at 11:13 PM, Mike Percy wrote: > Something that the Mahout PMC might want to do is share the (rough) > criteria for becoming a Mahout committer. https://cwiki.apache.org/MAHOUT/how-to-contribute.html https://cwiki.apache.org/MAHOUT/how-to-become-a-committer.html ---

Re: Call to action – Mahout needs your help

2013-03-26 Thread Sebastian Schelter
Hi Mike, > Regarding attribution, I saw it mentioned elsewhere in this thread and I > noticed it myself so I thought I'd throw in my 2 cents. While it seems like > a small thing, I wonder whether instituting the Hadoopish "Contributed by > so-and-so" in commit messages to assign credit for patches

Re: Call to action – Mahout needs your help

2013-03-26 Thread Ted Dunning
Yowza these are really good comments. Where have you guys been? To answer just one of the questions, my own criterion for voting up a committer is that they a) will work with other contributors positively b) won't break things too often (breaking the build sometimes is probably a good sign)