[jira] [Commented] (MAHOUT-1173) Reactivate checkstyle

2013-03-25 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613535#comment-13613535
 ] 

Sebastian Schelter commented on MAHOUT-1173:


Agreed. I would commit the change and fix all issues with that commit, so that 
we are back at a clean checkstyle level. If people find some rules to be 
annoying/useless afterwards, we can simply remove them from the config.

> Reactivate checkstyle 
> --
>
> Key: MAHOUT-1173
> URL: https://issues.apache.org/jira/browse/MAHOUT-1173
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Attachments: mahout-checkstyle.xml
>
>
> I would like to reactivate checkstyle in our build. IMHO we should not make 
> it fail on checkstyle errors at the moment (anyone disagree on this?).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1175) IllegalStateException and FileNotFoundException occures when running mahout inbuilt mapreduce implementation of frequent pattern mining.

2013-03-25 Thread Afsal Thaj (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613470#comment-13613470
 ] 

Afsal Thaj edited comment on MAHOUT-1175 at 3/26/13 4:27 AM:
-

Already started working on it. Methods are changed to accept a Configuration 
object along with the Parameter object.

  was (Author: afsal thaj):
Already started working on it.
  
> IllegalStateException and FileNotFoundException occures when running mahout 
> inbuilt mapreduce implementation of frequent pattern mining.
> 
>
> Key: MAHOUT-1175
> URL: https://issues.apache.org/jira/browse/MAHOUT-1175
> Project: Mahout
>  Issue Type: Improvement
>  Components: Frequent Itemset/Association Rule Mining
>Affects Versions: 0.6
>Reporter: Afsal Thaj
>Priority: Minor
>
> We cannot integrate the code for parallel frequent pattern mining into a 
> project that is supposed to run on an external server that connects to the 
> cluster. The program works fine only inside the cluster (from the command 
> line, to be specific). IllegalStateException and FileNotFoundException can 
> occur otherwise.



[jira] [Updated] (MAHOUT-1175) IllegalStateException and FileNotFoundException occures when running mahout inbuilt mapreduce implementation of frequent pattern mining.

2013-03-25 Thread Afsal Thaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Afsal Thaj updated MAHOUT-1175:
---

Component/s: Frequent Itemset/Association Rule Mining

> IllegalStateException and FileNotFoundException occures when running mahout 
> inbuilt mapreduce implementation of frequent pattern mining.
> 
>
> Key: MAHOUT-1175
> URL: https://issues.apache.org/jira/browse/MAHOUT-1175
> Project: Mahout
>  Issue Type: Improvement
>  Components: Frequent Itemset/Association Rule Mining
>Affects Versions: 0.6
>Reporter: Afsal Thaj
>Priority: Minor
>
> We cannot integrate the code for parallel frequent pattern mining into a 
> project that is supposed to run on an external server that connects to the 
> cluster. The program works fine only inside the cluster (from the command 
> line, to be specific). IllegalStateException and FileNotFoundException can 
> occur otherwise.



[jira] [Updated] (MAHOUT-1175) IllegalStateException and FileNotFoundException occures when running mahout inbuilt mapreduce implementation of frequent pattern mining.

2013-03-25 Thread Afsal Thaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Afsal Thaj updated MAHOUT-1175:
---


Already started working on it.

> IllegalStateException and FileNotFoundException occures when running mahout 
> inbuilt mapreduce implementation of frequent pattern mining.
> 
>
> Key: MAHOUT-1175
> URL: https://issues.apache.org/jira/browse/MAHOUT-1175
> Project: Mahout
>  Issue Type: Improvement
>  Components: Frequent Itemset/Association Rule Mining
>Affects Versions: 0.6
>Reporter: Afsal Thaj
>Priority: Minor
>
> We cannot integrate the code for parallel frequent pattern mining into a 
> project that is supposed to run on an external server that connects to the 
> cluster. The program works fine only inside the cluster (from the command 
> line, to be specific). IllegalStateException and FileNotFoundException can 
> occur otherwise.



Re: Call to action – Mahout needs your help

2013-03-25 Thread Suneel Marthi
I have been occasionally contributing patches for JIRA tickets when time 
permits. I have always wanted to make a major contribution to Mahout but was 
not sure what the vision of the project was or what was expected by way of 
contributions.

I would be more than willing to make a major contribution.



 From: Saikat Kanjilal 
To: dev@mahout.apache.org 
Sent: Monday, March 25, 2013 11:38 PM
Subject: RE: Call to action – Mahout needs your help
 

Hey Daniel, I am in the same boat as you. I have decided to try my hand at 
documentation first: I went into JIRA and will try to help update some current 
wiki descriptions of one of the algorithms.  I figure this is as good a first 
step as any to get familiar with some of the algorithms before devoting more 
time to creating a large code patch.  I agree that for a newcomer it's a bit 
daunting to figure out which parts of the code/docs need the most attention.
Regards

> From: mpe...@apache.org
> Date: Mon, 25 Mar 2013 20:13:18 -0700
> Subject: Re: Call to action – Mahout needs your help
> To: dev@mahout.apache.org
> 
> Something that the Mahout PMC might want to do is share the (rough)
> criteria for becoming a Mahout committer. In many projects, this is quite
> vague and leaves a lot of leeway up to the PMC, which is desirable for a
> variety of reasons. However the reason I mention it is that up until now,
> others I've spoken to within the Hadoop community have felt that large new
> algorithm contributions are basically what will earn someone committership
> on Mahout. Based on this thread, consensus seems to be forming that that is
> *not* what is desired. So what's your rough ideal committer at this point
> in the life of Mahout if they are not contributing new algorithms? I guess
> it's things like code reviews, correctness fixes, perf improvements, and
> refactorings / enhancements?
> 
> Regarding attribution, I saw it mentioned elsewhere in this thread and I
> noticed it myself so I thought I'd throw in my 2 cents. While it seems like
> a small thing, I wonder whether instituting the Hadoopish "Contributed by
> so-and-so" in commit messages to assign credit for patches by
> non-committers would help make contributors feel more appreciated for
> their work. Especially if you want to encourage people to contribute lots
> of small patches on their way to committership. Alternatively, putting
> "(Joe Newbie via Jim Veteran)" into every commit also acknowledges the
> committer/reviewer, which is not an easy job and can help people feel
> appreciated for that work as well.
> 
> Finally, if there are places where the current committers know Mahout needs
> work, or has holes, have those been articulated in any specific way? If not
> I think that would be awesome. I know that in general, several of the docs
> are out of date on the wiki. I suppose that's one. I wonder what else tops
> the to-do list. Is there something other than just the open JIRA list <
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
> >?
> 
> Hope this helps.
> 
> Regards,
> Mike
> 
> 
> 
> On Mon, Mar 25, 2013 at 7:12 PM, Daniel Longest  wrote:
> 
> > I've been a lurker on this list for a few months and trying to figure
> > out a way to contribute.  I'm very interested in ML but am not a
> > professional in it.  I am a fulltime .NET developer by trade, but have
> > used Java academically (undergrad and grad school).  I would love the
> > opportunity to contribute in a testing or optimization capacity if
> > someone could help point me in the right direction.
> >
> > Regards,
> > Daniel
> >
> >
> > >
> > > As a side note on GSoC: At least at German universities the general
> > > concept of GSoC isn't particularly well known which makes me think that
> > > reaching out to students could be helpful. I'm aware of two PhD. students
> > > on this list who probably know students with good coding skills - it
> > > might be worth the effort reaching out to those directly for testing and
> > > optimisation tasks.
> > >
> >

[jira] [Created] (MAHOUT-1175) IllegalStateException and FileNotFoundException occures when running mahout inbuilt mapreduce implementation of frequent pattern mining.

2013-03-25 Thread Afsal Thaj (JIRA)
Afsal Thaj created MAHOUT-1175:
--

 Summary: IllegalStateException and FileNotFoundException occures 
when running mahout inbuilt mapreduce implementation of frequent pattern mining.
 Key: MAHOUT-1175
 URL: https://issues.apache.org/jira/browse/MAHOUT-1175
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.6
Reporter: Afsal Thaj
Priority: Minor


We cannot integrate the code for parallel frequent pattern mining into a project 
that is supposed to run on an external server that connects to the cluster. 
The program works fine only inside the cluster (from the command line, to be 
specific). IllegalStateException and FileNotFoundException can occur otherwise.




RE: Call to action – Mahout needs your help

2013-03-25 Thread Saikat Kanjilal

Hey Daniel, I am in the same boat as you. I have decided to try my hand at 
documentation first: I went into JIRA and will try to help update some current 
wiki descriptions of one of the algorithms.  I figure this is as good a first 
step as any to get familiar with some of the algorithms before devoting more 
time to creating a large code patch.  I agree that for a newcomer it's a bit 
daunting to figure out which parts of the code/docs need the most attention.
Regards

> From: mpe...@apache.org
> Date: Mon, 25 Mar 2013 20:13:18 -0700
> Subject: Re: Call to action – Mahout needs your help
> To: dev@mahout.apache.org
> 
> Something that the Mahout PMC might want to do is share the (rough)
> criteria for becoming a Mahout committer. In many projects, this is quite
> vague and leaves a lot of leeway up to the PMC, which is desirable for a
> variety of reasons. However the reason I mention it is that up until now,
> others I've spoken to within the Hadoop community have felt that large new
> algorithm contributions are basically what will earn someone committership
> on Mahout. Based on this thread, consensus seems to be forming that that is
> *not* what is desired. So what's your rough ideal committer at this point
> in the life of Mahout if they are not contributing new algorithms? I guess
> it's things like code reviews, correctness fixes, perf improvements, and
> refactorings / enhancements?
> 
> Regarding attribution, I saw it mentioned elsewhere in this thread and I
> noticed it myself so I thought I'd throw in my 2 cents. While it seems like
> a small thing, I wonder whether instituting the Hadoopish "Contributed by
> so-and-so" in commit messages to assign credit for patches by
> non-committers would help make contributors feel more appreciated for
> their work. Especially if you want to encourage people to contribute lots
> of small patches on their way to committership. Alternatively, putting
> "(Joe Newbie via Jim Veteran)" into every commit also acknowledges the
> committer/reviewer, which is not an easy job and can help people feel
> appreciated for that work as well.
> 
> Finally, if there are places where the current committers know Mahout needs
> work, or has holes, have those been articulated in any specific way? If not
> I think that would be awesome. I know that in general, several of the docs
> are out of date on the wiki. I suppose that's one. I wonder what else tops
> the to-do list. Is there something other than just the open JIRA list <
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
> >?
> 
> Hope this helps.
> 
> Regards,
> Mike
> 
> 
> 
> On Mon, Mar 25, 2013 at 7:12 PM, Daniel Longest  wrote:
> 
> > I've been a lurker on this list for a few months and trying to figure
> > out a way to contribute.  I'm very interested in ML but am not a
> > professional in it.  I am a fulltime .NET developer by trade, but have
> > used Java academically (undergrad and grad school).  I would love the
> > opportunity to contribute in a testing or optimization capacity if
> > someone could help point me in the right direction.
> >
> > Regards,
> > Daniel
> >
> >
> > >
> > > As a side note on GSoC: At least at German universities the general
> > > concept of GSoC isn't particularly well known which makes me think that
> > > reaching out to students could be helpful. I'm aware of two PhD. students
> > > on this list who probably know students with good coding skills - it
> > > might be worth the effort reaching out to those directly for testing and
> > > optimisation tasks.
> > >
> >
  

Re: Call to action – Mahout needs your help

2013-03-25 Thread Mike Percy
Something that the Mahout PMC might want to do is share the (rough)
criteria for becoming a Mahout committer. In many projects, this is quite
vague and leaves a lot of leeway up to the PMC, which is desirable for a
variety of reasons. However the reason I mention it is that up until now,
others I've spoken to within the Hadoop community have felt that large new
algorithm contributions are basically what will earn someone committership
on Mahout. Based on this thread, consensus seems to be forming that that is
*not* what is desired. So what's your rough ideal committer at this point
in the life of Mahout if they are not contributing new algorithms? I guess
it's things like code reviews, correctness fixes, perf improvements, and
refactorings / enhancements?

Regarding attribution, I saw it mentioned elsewhere in this thread and I
noticed it myself so I thought I'd throw in my 2 cents. While it seems like
a small thing, I wonder whether instituting the Hadoopish "Contributed by
so-and-so" in commit messages to assign credit for patches by
non-committers would help make contributors feel more appreciated for
their work. Especially if you want to encourage people to contribute lots
of small patches on their way to committership. Alternatively, putting
"(Joe Newbie via Jim Veteran)" into every commit also acknowledges the
committer/reviewer, which is not an easy job and can help people feel
appreciated for that work as well.

Finally, if there are places where the current committers know Mahout needs
work, or has holes, have those been articulated in any specific way? If not
I think that would be awesome. I know that in general, several of the docs
are out of date on the wiki. I suppose that's one. I wonder what else tops
the to-do list. Is there something other than just the open JIRA list <
https://issues.apache.org/jira/issues/?jql=project%20%3D%20MAHOUT%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
>?

Hope this helps.

Regards,
Mike



On Mon, Mar 25, 2013 at 7:12 PM, Daniel Longest  wrote:

> I've been a lurker on this list for a few months and trying to figure
> out a way to contribute.  I'm very interested in ML but am not a
> professional in it.  I am a fulltime .NET developer by trade, but have
> used Java academically (undergrad and grad school).  I would love the
> opportunity to contribute in a testing or optimization capacity if
> someone could help point me in the right direction.
>
> Regards,
> Daniel
>
>
> >
> > As a side note on GSoC: At least at German universities the general
> > concept of GSoC isn't particularly well known which makes me think that
> > reaching out to students could be helpful. I'm aware of two PhD. students
> > on this list who probably know students with good coding skills - it
> > might be worth the effort reaching out to those directly for testing and
> > optimisation tasks.
> >
>


[jira] [Commented] (MAHOUT-1025) Update documentation for LDA before the release.

2013-03-25 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613426#comment-13613426
 ] 

Saikat Kanjilal commented on MAHOUT-1025:
-

OK, next naive question: do I need any special permissions to update this 
Confluence page? I tried to log in with my ASF JIRA creds and was not able to 
do so.  Thanks for your help.

> Update documentation for LDA before the release.
> 
>
> Key: MAHOUT-1025
> URL: https://issues.apache.org/jira/browse/MAHOUT-1025
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.7
>Reporter: Robin Anil
>Assignee: Jake Mannix
> Fix For: 0.8
>
>




Re: Call to action – Mahout needs your help

2013-03-25 Thread Daniel Longest
I've been a lurker on this list for a few months and trying to figure
out a way to contribute.  I'm very interested in ML but am not a
professional in it.  I am a fulltime .NET developer by trade, but have
used Java academically (undergrad and grad school).  I would love the
opportunity to contribute in a testing or optimization capacity if
someone could help point me in the right direction.

Regards,
Daniel


>
> As a side note on GSoC: At least at German universities the general concept of
> GSoC isn't particularly well known which makes me think that reaching out to
> students could be helpful. I'm aware of two PhD. students on this list who
> probably know students with good coding skills - it might be worth the effort
> reaching out to those directly for testing and optimisation tasks.
>


Jenkins build is back to normal : Mahout-Examples-Cluster-Reuters #279

2013-03-25 Thread Apache Jenkins Server
See 



[jira] [Resolved] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1174.
-

Resolution: Fixed

Checked in updated links.

> Lanczos code and javadocs should refer users to the SSVD stuff
> --
>
> Key: MAHOUT-1174
> URL: https://issues.apache.org/jira/browse/MAHOUT-1174
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
>Assignee: Ted Dunning
>




Jenkins build is back to normal : mahout-nightly #1188

2013-03-25 Thread Apache Jenkins Server
See 



Jenkins build is back to normal : Mahout-Quality #1934

2013-03-25 Thread Apache Jenkins Server
See 



Build failed in Jenkins: Mahout-Examples-Cluster-Reuters #278

2013-03-25 Thread Apache Jenkins Server
See 

Changes:

[tdunning] MAHOUT-1174 - Make Lanczos code point to the preferred SSVD code 
(and removed one code warning)

--
[...truncated 5626 lines...]
INFO: Reduce shuffle bytes=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=937951232
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=150
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=0
Mar 25, 2013 11:17:37 PM org.apache.hadoop.mapred.Counters log
INFO: Map output records=0
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat 
listStatus
INFO: Total input paths to process : 1
Mar 25, 2013 11:17:38 PM 
org.apache.hadoop.filecache.TrackerDistributedCacheManager downloadCacheObject
INFO: Creating frequency.file-0 in 
/tmp/hadoop-hudson/mapred/local/archive/2416076284993857644_1334525619_601341155/file/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-kmeans-work--3790651960783613026
 with rwxr-xr-x
Mar 25, 2013 11:17:38 PM 
org.apache.hadoop.filecache.TrackerDistributedCacheManager downloadCacheObject
INFO: Cached 
/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-kmeans/frequency.file-0 as 
/tmp/hadoop-hudson/mapred/local/archive/2416076284993857644_1334525619_601341155/file/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-kmeans/frequency.file-0
Mar 25, 2013 11:17:38 PM 
org.apache.hadoop.filecache.TrackerDistributedCacheManager 
localizePublicCacheObject
INFO: Cached 
/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-kmeans/frequency.file-0 as 
/tmp/hadoop-hudson/mapred/local/archive/2416076284993857644_1334525619_601341155/file/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-kmeans/frequency.file-0
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0006
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
INFO: io.sort.mb = 100
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
INFO: data buffer = 79691776/99614720
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
INFO: record buffer = 262144/327680
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0006_m_00_0 is done. And is in the process of 
commiting
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.LocalJobRunner$Job 
statusUpdate
INFO: 
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0006_m_00_0' done.
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.LocalJobRunner$Job 
statusUpdate
INFO: 
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.LocalJobRunner$Job 
statusUpdate
INFO: 
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0006_r_00_0 is done. And is in the process of 
commiting
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.LocalJobRunner$Job 
statusUpdate
INFO: 
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0006_r_00_0 is allowed to commit now
Mar 25, 2013 11:17:38 PM 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0006_r_00_0' to 
/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-kmeans/partial-vectors-0
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.LocalJobRunner$Job 
statusUpdate
INFO: reduce > reduce
Mar 25, 2013 11:17:38 PM org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0006_r_00_0' done.
Mar 25, 2013 11:17:39 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 100%
Mar 25, 2013 11:17:39 PM org.apache.hadoop.mapred.JobClient monito

[jira] [Commented] (MAHOUT-1173) Reactivate checkstyle

2013-03-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613289#comment-13613289
 ] 

Ted Dunning commented on MAHOUT-1173:
-

For grins and general knowledge, Jenkins provides a history of job 
configuration.

What you have done is better than just reactivating the old build config, btw.  
This has a chance of making things work better, while reactivating the old build 
config just perpetuates dysfunction.


> Reactivate checkstyle 
> --
>
> Key: MAHOUT-1173
> URL: https://issues.apache.org/jira/browse/MAHOUT-1173
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Attachments: mahout-checkstyle.xml
>
>
> I would like to reactivate checkstyle in our build. IMHO we should not make 
> it fail on checkstyle errors at the moment (anyone disagree on this?).



[jira] [Reopened] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning reopened MAHOUT-1174:
-

  Assignee: Ted Dunning

Fixing this now to point to the correct link.

Thanks for spotting that!

(some days I hate some aspects of Confluence)

> Lanczos code and javadocs should refer users to the SSVD stuff
> --
>
> Key: MAHOUT-1174
> URL: https://issues.apache.org/jira/browse/MAHOUT-1174
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
>Assignee: Ted Dunning
>




[jira] [Commented] (MAHOUT-1173) Reactivate checkstyle

2013-03-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613274#comment-13613274
 ] 

Ted Dunning commented on MAHOUT-1173:
-

The problem before was that it produced vats of very low-value warnings.  For 
instance, I really couldn't care less about trailing white space (who really 
does?).

My guess is that the best way to review this is to commit the change and have 
people review the output.  If it is too full of garbage, we can trim the config 
based on that feedback.

> Reactivate checkstyle 
> --
>
> Key: MAHOUT-1173
> URL: https://issues.apache.org/jira/browse/MAHOUT-1173
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Attachments: mahout-checkstyle.xml
>
>
> I would like to reactivate checkstyle in our build. IMHO we should not make 
> it fail on checkstyle errors at the moment (anyone disagree on this?).



[jira] [Commented] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613265#comment-13613265
 ] 

Dmitriy Lyubimov commented on MAHOUT-1174:
--

Ted, 

I am not sure why, but this reference in the javadoc 
https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.html

brings back an outdated version of the page that I normally access and update here 

https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition

Perhaps there was a migration of that wiki content somewhere and I missed it 
and kept updating the wrong location?

> Lanczos code and javadocs should refer users to the SSVD stuff
> --
>
> Key: MAHOUT-1174
> URL: https://issues.apache.org/jira/browse/MAHOUT-1174
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
>




Re: Call to action – Mahout needs your help

2013-03-25 Thread Ted Dunning
I am still a fan of GSOC, but there is no chance I have enough time to help
(although my working with Dan recently is a bit of a counterexample).

On Mon, Mar 25, 2013 at 11:12 PM, Grant Ingersoll wrote:

>
> On Mar 25, 2013, at 4:24 PM, Isabel Drost-Fromm wrote:
>
> > Also, do we have any volunteers to drive a GSoC at Mahout initiative?
>
> I gave up on GSOC.  I think our success rate as a project was pretty low
> and it wasn't worth it to me to continue.   Others are certainly welcome to
> try though.
>
>
>


Re: Call to action – Mahout needs your help

2013-03-25 Thread Grant Ingersoll

On Mar 25, 2013, at 4:24 PM, Isabel Drost-Fromm wrote:

> Also, do we have any volunteers to drive a GSoC at Mahout initiative?

I gave up on GSOC.  I think our success rate as a project was pretty low and it 
wasn't worth it to me to continue.   Others are certainly welcome to try though.




[jira] [Resolved] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning resolved MAHOUT-1174.
-

Resolution: Fixed

Committed javadoc fixes.  Also added Preconditions check + assert for null at 
one point to kill a warning.
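
For readers unfamiliar with the pattern mentioned above, here is a minimal 
sketch of a precondition check followed by an assert for null. It is not the 
committed change: the class and method names are hypothetical, and 
java.util.Objects.requireNonNull is used as a standard-library stand-in on the 
assumption that "Preconditions" refers to Guava's Preconditions.checkNotNull.

```java
import java.util.Objects;

// Hypothetical illustration only; not code from the Mahout Lanczos classes.
final class NullGuard {

    // Returns the length of the vector, failing fast on a null argument.
    static int length(double[] vector) {
        // Precondition check: throws NullPointerException with a clear message.
        // (Stand-in for Guava's Preconditions.checkNotNull, which is an assumption.)
        Objects.requireNonNull(vector, "vector must not be null");
        // The assert documents the invariant for readers and static analysis;
        // it is only evaluated when the JVM runs with -ea.
        assert vector != null;
        return vector.length;
    }

    public static void main(String[] args) {
        System.out.println(length(new double[] {1.0, 2.0, 3.0}));
    }
}
```

The explicit check fails at the call boundary instead of at a later 
dereference, which is typically what silences a possible-null warning.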

> Lanczos code and javadocs should refer users to the SSVD stuff
> --
>
> Key: MAHOUT-1174
> URL: https://issues.apache.org/jira/browse/MAHOUT-1174
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
>




[jira] [Created] (MAHOUT-1174) Lanczos code and javadocs should refer users to the SSVD stuff

2013-03-25 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1174:
---

 Summary: Lanczos code and javadocs should refer users to the SSVD 
stuff
 Key: MAHOUT-1174
 URL: https://issues.apache.org/jira/browse/MAHOUT-1174
 Project: Mahout
  Issue Type: Bug
Reporter: Ted Dunning






[jira] [Updated] (MAHOUT-1173) Reactivate checkstyle

2013-03-25 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1173:
---

Attachment: mahout-checkstyle.xml

I recreated the checkstyle file. I started with the default file for Sun's code 
conventions and modified it to fit our codebase.

I added custom things like allowing variable names to start with uppercase 
letters (which is sometimes necessary for keeping mathematical notations found 
in papers) and tried to fit the rules to our codebase.
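
As a rough illustration of the naming relaxation described above, the sketch below compares two regular expressions. It is written in Python only for brevity. The first pattern is, to my understanding, the default variable-name format in Checkstyle's Sun conventions file; the second is a hypothetical relaxed variant. Neither is copied from the attached mahout-checkstyle.xml, and the helper name `is_allowed` is made up for this example.

```python
import re

# Assumed default local-variable format from the Sun conventions config.
SUN_STYLE = re.compile(r"^[a-z][a-zA-Z0-9]*$")
# Hypothetical relaxed format that also accepts a leading uppercase letter,
# e.g. for a matrix named "A" as written in a paper.
RELAXED = re.compile(r"^[a-zA-Z][a-zA-Z0-9]*$")


def is_allowed(name, pattern):
    """Return True if the variable name matches the given naming pattern."""
    return pattern.search(name) is not None


# Print how a few sample names fare under each pattern.
for name in ("userVector", "A", "Sigma", "bad_name"):
    print(name, is_allowed(name, SUN_STYLE), is_allowed(name, RELAXED))
```

Names such as `A` or `Sigma` pass only the relaxed pattern, while names containing underscores are rejected by both.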

Checkstyle gives a lot of warnings, but most of them are due to easy-to-fix 
things like trailing whitespace in files or indentation issues.

It would be great if committers could have a look at the rules and give their ok. 
I tried to make sure that we get a cleanup but don't change our coding style.

If there is agreement about the rules, I would suggest fixing all issues in one 
big commit here.
 


> Reactivate checkstyle 
> --
>
> Key: MAHOUT-1173
> URL: https://issues.apache.org/jira/browse/MAHOUT-1173
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
> Attachments: mahout-checkstyle.xml
>
>
> I would like to reactivate checkstyle in our build. IMHO we should not make 
> it fail on checkstyle errors at the moment (anyone disagree on this?).



[jira] [Created] (MAHOUT-1173) Reactivate checkstyle

2013-03-25 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1173:
--

 Summary: Reactivate checkstyle 
 Key: MAHOUT-1173
 URL: https://issues.apache.org/jira/browse/MAHOUT-1173
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Sebastian Schelter
Assignee: Sebastian Schelter


I would like to reactivate checkstyle in our build. IMHO we should not make it 
fail on checkstyle errors at the moment (anyone disagree on this?).





Re: Call to action – Mahout needs your help

2013-03-25 Thread Open Data Geek
+1

On Mar 25, 2013, at 4:43 AM, Manuel Blechschmidt  
wrote:

> Hello,
> 
> On 25.03.2013, at 09:10, Sebastian Schelter wrote:
> 
>> Hi,
>> 
>> throwing in my 2 cents here:
>> 
>> I don't agree that we simply lack manpower but have a clear vision. I
>> actually think it's the other way round. I think Mahout is kind of stuck,
>> because it does not have a clear vision.
> 
> I fully agree. So I think Mahout needs a vision. The big problem about ML is 
> that you can do everything with it but to make a difference you have to focus.
> 
> I am using Mahout for solving business problems e.g.:
> 
> - Online fraud
> - eCommerce recommendations
> - Demand forecasting
> 
> One big piece that is missing for all the algorithms is a complete bundled 
> data set that is solving a real business problem and with bundled I mean that 
> it is in the Mahout source tree. If no real data is available generated data 
> could be used.
> 
> I tried to fill this gap for recommendations with my github project:
> 
> https://github.com/ManuelB/facebook-recommender-demo
> 
> This project seems to be used by the community. You can get it, compile it 
> and start it with 4 commands.
> 
>> ...
>> 
>> It is also my personal experience (= I heard it over and over again from
>> our users) that it is extremely hard to get started with Mahout using
>> the available documentation. MiA is the exception to this, but people
>> have to buy it first and it lacks a lot of the latest developments. It
>> would be awesome to have a reworked wiki that is qualitatively
>> comparable to MiA.
> 
> So this is the nature of a framework. If you really want people to get 
> started easily you have to provide a full blown example where you can just 
> replace the example data with your data.
> 
> I don't think that enough manpower can be acquired to create a visual GUI for 
> Mahout. Further I don't think that this would help. There are already 
> excellent GUIs for ML e.g. Weka (http://www.cs.waikato.ac.nz/ml/weka/) and 
> RStudio (http://www.rstudio.com/)
> 
> 
>> 
>> Best,
>> Sebastian
> 
> Hope this helps
>Manuel
> 
>> 
>> On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
>>> 
>>> 
>>> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
>>>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
>>>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
>>>>>> What about an experiment: If you (reading this mail) were to write a two
>>>>>> sentence vision statement for Mahout as you see it - what would that be?
>>>>> 
>>>>> Produce open source, scalable machine learning code using a community
>>>>> development model.
>>>> 
>>>> So taking that apart:
>>>> 
>>>> - Hadoop is not necessarily part of the equation. All that we promise are
>>>> implementations that are reasonably scalable.
>>> 
>>> - We play well with small-ish (fits in memory) and large (fits only in 
>>> memory of 
>>> many machines) or huge (fits only on disk) datasets.
>>> 
>>>> - There is no restriction in there wrt. supporting only specific use cases -
>>>> in particular no restriction to be recommendations only.
>>>> 
>>>> - There is no restriction to "only batch" or "only online" learning.
>>>> 
>>>> If we want to be that broad we definitely lack lots of people, I think.
>>>> 
>>>> The other question that I cannot answer today: Do we want to be a Java
>>>> Library that people link with their project, a standalone program that
>>>> people interact with via the command line, a basis that people can easily
>>>> integrate into their Pig/Hive/Cascalog/Scalding/Cascading/what-ever-else
>>>> workflows or all of these?
> 
> -- 
> Manuel Blechschmidt
> M.Sc. IT Systems Engineering
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
> 


Re: Call to action – Mahout needs your help

2013-03-25 Thread Isabel Drost-Fromm
On Monday, March 25, 2013 01:27:50 PM Shannon Quinn wrote:
> On that note: GSoC is coming up, and I think it's a great opportunity to
> build some momentum in this direction. I know that when students see
> "scalable machine learning" their first thought isn't improving testing
> and documentation, but if we pushed hard in those areas specifically, in
> addition to making a broad effort on JIRA to elucidate exactly what
> needs work, we could likely pick up several quality students that could
> make lasting contributions.

As a side note on GSoC: At least at German universities the general concept of 
GSoC isn't particularly well known, which makes me think that reaching out to 
students could be helpful. I'm aware of two PhD students on this list who 
probably know students with good coding skills - it might be worth the effort 
to reach out to those directly for testing and optimisation tasks. 

Also, do we have any volunteers to drive a GSoC at Mahout initiative?


Isabel


Re: Call to action – Mahout needs your help

2013-03-25 Thread Shannon Quinn





I think that you mentioned a very good point with stating that it is not
clear whether Mahout is a library, a standalone program to interact with
via the command line. IMO, it's first and foremost a library (similar to
Lucene), and this should also be reflected in the codebase.

That is my view as well and I think we have been moderately successful at it.


+1


As for the complexity issue, I don't know that we ever solve it, we just need 
to identify contributors in those areas quickly, mentor them, and make them 
committers as soon as they are ready.


On that note: GSoC is coming up, and I think it's a great opportunity to 
build some momentum in this direction. I know that when students see 
"scalable machine learning" their first thought isn't improving testing 
and documentation, but if we pushed hard in those areas specifically, in 
addition to making a broad effort on JIRA to elucidate exactly what 
needs work, we could likely pick up several quality students that could 
make lasting contributions.





I think that Mahout is and should always be more than recommenders, but
that we should be more courageous in throwing out things that are not
used very much or not maintained very much or don't meet the quality
standards which we would like to see.


+1 . On my end of things, while I do think some sort of canonical 
spectral clustering algorithm would be very useful to have, e.g. 
spectral k-means, the Eigencuts algorithm is one example of something 
that is so specialized that it could probably be jettisoned.


Re: Call to action – Mahout needs your help

2013-03-25 Thread Dmitriy Lyubimov
On Mar 25, 2013 8:36 AM, "Grant Ingersoll"  wrote:
>
>
> On Mar 25, 2013, at 4:10 AM, Sebastian Schelter wrote:
>
> > Hi,
> >
> > throwing in my 2 cents here:
> >
> > I think that you mentioned a very good point with stating that it is not
> > clear whether Mahout is a library, a standalone program to interact with
> > via the command line. IMO, it's first and foremost a library (similar to
> > Lucene), and this should also be reflected in the codebase.
>
> That is my view as well and I think we have been moderately successful at
it.
>
> >
> > I don't agree that we simply lack manpower but have a clear vision. I
> > actually think it's the other way round. I think Mahout is kind of stuck,
> > because it does not have a clear vision. I think we faced and still face
> > very hard challenges, as we have to provide answers for the following
> > questions:
> >
> > * for which problems and algorithms does it really make sense to use
> > MapReduce?
>
> My test is simply whether someone has implemented it or not.  I don't
think we have to have a line in the sand.

It is in fact very easy to test (IMO). Most of the complaints revolve
around highly iterative methods. It is sufficient to estimate the startup
and inter-step persistence costs for the required number of iterations, and
that gives overhead no. 1. E.g. popular PageRank-style stationary
distribution methods fall into this category, as do iterative
bootstrap-style search techniques such as the search for an optimum fit in
regularized ALS.

The slightly more subtle overhead no. 2, in my experience, stems from the
forced sort required for grouping anything (especially, I think, in things
such as matrix-matrix multiplication) and, perhaps to a much lesser degree,
from what people mentioned: the lack of a scatter operator.
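
A back-of-the-envelope version of the first estimate might look like the sketch below. The function name and all the numbers are hypothetical, chosen only to show the arithmetic: per-iteration job startup plus inter-step persistence is counted as overhead and compared with the total runtime.

```python
def iterative_overhead(iterations, startup_s, persist_s, compute_s):
    """Estimate fixed per-iteration overhead for an iterative job.

    Returns (overhead seconds, overhead as a fraction of total runtime).
    All inputs are rough per-iteration estimates in seconds.
    """
    overhead = iterations * (startup_s + persist_s)
    total = overhead + iterations * compute_s
    if total == 0:
        return 0.0, 0.0
    return overhead, overhead / total


# Hypothetical figures: 20 iterations, 30 s job startup, 10 s to persist
# intermediate state between steps, 40 s of useful computation per step.
overhead, fraction = iterative_overhead(20, 30.0, 10.0, 40.0)
print(overhead, fraction)
```

With these made-up inputs half of the runtime is overhead; the more iterations a method needs relative to the useful work per step, the worse this ratio becomes.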



> A working, tested, demonstrable implementation beats the one that isn't,
> regardless of which approach it uses, so I don't think we have to decide up
> front but instead look at it on a case by case basis.  At the end of the
> day, those who do the work get to decide.
>
> >
> > * how broad can the spectrum of things that we offer be without a
> > decline in quality?
> >
> > * how do we deal with the fact that our codebase is split up into a
> > collection of algorithms with very few people being able to work on all
> > of them, due to the required theoretical background and the complexity
> > of efficient code
> >
> > * how do we provide solutions that allow users to scale very fine
> > grained, e.g. from online to precomputed on a single machine to
> > precomputed via Hadoop in the recommender stuff.
>
> I don't see these as vision issues, I see them as implementation issues.
 Regardless, it doesn't matter which category they fall under, as they are
the important issues we face.
>
> As for the complexity issue, I don't know that we ever solve it, we just
need to identify contributors in those areas quickly, mentor them, and make
them committers as soon as they are ready.
>
>
>
> >
> > I think that Mahout is and should always be more than recommenders, but
> > that we should be more courageous in throwing out things that are not
> > used very much or not maintained very much or don't meet the quality
> > standards which we would like to see.
>
> +1.  I think we have gotten a lot better at this, thanks to Sean, you and
others.
>
> >
> > It is also my personal experience (= I heard it over and over again from
> > our users) that it is extremely hard to get started with Mahout using
> > the available documentation. MiA is the exception to this, but people
> > have to buy it first and it lacks a lot of the latest developments. It
> > would be awesome to have a reworked wiki that is qualitatively
> > comparable to MiA.
> >
>
> Good docs are always hard.  Whatever reduces barriers, the better.  Going
w/ the Github model, there's a lot to be said for Javadocs and/or Markdown
right in the code base, but neither solves the developer inertia of
actually writing them.
>
>
> > Best,
> > Sebastian
> >
> > On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
> >>
> >>
> >> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
> >>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
> >>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
> >>>>> What about an experiment: If you (reading this mail) were to write a two
> >>>>> sentence vision statement for Mahout as you see it - what would that be?
> >>>>
> >>>> Produce open source, scalable machine learning code using a community
> >>>> development model.
> >>>
> >>> So taking that apart:
> >>>
> >>> - Hadoop is not necessarily part of the equation. All that we promise are
> >>> implementations that are reasonably scalable.
> >>
> >> - We play well with small-ish (fits in memory) and large (fits only in
memory of
> >> many machines) or huge (fits only on disk) datasets.
> >>
> >>> - There is no restriction in there wrt. supporting only specific use
cases -
> >>> in particular no restriction to be recommendations only.
>

Re: changes without JIRA's

2013-03-25 Thread Isabel Drost
On Mon, Mar 25, 2013 at 4:34 PM, Sebastian Schelter  wrote:

> I guess this refers to the cleanups I've done in the last days. In the
> future, I will create a Jira for each and attach a patch.
>

I think what is more important than attaching a patch to the JIRA issue is
to mention the JIRA issue in the commit message.

The reason I think this is important is that only that way can you track
which commit fixes which issue. In addition, Apache JIRA and SVN should be
set up such that, when a JIRA issue is mentioned in the commit, you can
click through to the commit from the JIRA issue.
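
To make the convention concrete, here is a small sketch of how one could check that a commit message mentions an issue key. The regular expression and the function name are assumptions for illustration, not part of the Apache JIRA/SVN integration itself, and the sample messages are invented.

```python
import re

# Assumed shape of an issue key: an uppercase project name, a dash, a number,
# e.g. MAHOUT-1173.
JIRA_KEY = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")


def jira_keys(commit_message):
    """Return all JIRA-style issue keys mentioned in a commit message."""
    return JIRA_KEY.findall(commit_message)


# Invented commit messages: the first references an issue, the second does not.
print(jira_keys("MAHOUT-1173: reactivate checkstyle in the build"))
print(jira_keys("minor cleanup"))
```

A commit hook or review script could reject messages for which the returned list is empty.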

Isabel


Re: changes without JIRA's

2013-03-25 Thread Ted Dunning
No need to attach the patch.  Just go ahead with non-controversial commits.
 Jenkins will put a comment on the JIRA with the SVN revision number.

All I want is a historical record so we can tell what was done and why.

On Mon, Mar 25, 2013 at 4:34 PM, Sebastian Schelter  wrote:

> I guess this refers to the cleanups I've done in the last days. In the
> future, I will create a Jira for each and attach a patch.
>
> On 25.03.2013 16:31, Ted Dunning wrote:
> > I would like it if all changes to the code were accompanied by a JIRA that
> > describes the problem being solved and that the commit messages
> associated
> > with the fix reference the JIRA.
> >
>
>


Re: Call to action – Mahout needs your help

2013-03-25 Thread Grant Ingersoll

On Mar 25, 2013, at 4:10 AM, Sebastian Schelter wrote:

> Hi,
> 
> throwing in my 2 cents here:
> 
> I think that you mentioned a very good point with stating that it is not
> clear whether Mahout is a library, a standalone program to interact with
> via the command line. IMO, it's first and foremost a library (similar to
> Lucene), and this should also be reflected in the codebase.

That is my view as well and I think we have been moderately successful at it.

> 
> I don't agree that we simply lack manpower but have a clear vision. I
> actually think it's the other way round. I think Mahout is kind of stuck,
> because it does not have a clear vision. I think we faced and still face
> very hard challenges, as we have to provide answers for the following
> questions:
> 
> * for which problems and algorithms does it really make sense to use
> MapReduce?

My test is simply whether someone has implemented it or not.  I don't think we 
have to have a line in the sand.  A working, tested, demonstrable 
implementation beats the one that isn't, regardless of which approach it uses, 
so I don't think we have to decide up front but instead look at it on a case by 
case basis.  At the end of the day, those who do the work get to decide.

> 
> * how broad can the spectrum of things that we offer be without a
> decline in quality?
> 
> * how do we deal with the fact that our codebase is split up into a
> collection of algorithms with very few people being able to work on all
> of them, due to the required theoretical background and the complexity
> of efficient code
> 
> * how do we provide solutions that allow users to scale very fine
> grained, e.g. from online to precomputed on a single machine to
> precomputed via Hadoop in the recommender stuff.

I don't see these as vision issues, I see them as implementation issues.  
Regardless, it doesn't matter which category they fall under, as they are the 
important issues we face.

As for the complexity issue, I don't know that we ever solve it, we just need 
to identify contributors in those areas quickly, mentor them, and make them 
committers as soon as they are ready.



> 
> I think that Mahout is and should always be more than recommenders, but
> that we should be more courageous in throwing out things that are not
> used very much or not maintained very much or don't meet the quality
> standards which we would like to see.

+1.  I think we have gotten a lot better at this, thanks to Sean, you and 
others.

> 
> It is also my personal experience (= I heard it over and over again from
> our users) that it is extremely hard to get started with Mahout using
> the available documentation. MiA is the exception to this, but people
> have to buy it first and it lacks a lot of the latest developments. It
> would be awesome to have a reworked wiki that is qualitatively
> comparable to MiA.
> 

Good docs are always hard.  Whatever reduces barriers, the better.  Going w/ 
the Github model, there's a lot to be said for Javadocs and/or Markdown right 
in the code base, but neither solves the developer inertia of actually writing 
them.


> Best,
> Sebastian
> 
> On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
>> 
>> 
>> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
>>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
>>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
>>>>> What about an experiment: If you (reading this mail) were to write a two
>>>>> sentence vision statement for Mahout as you see it - what would that be?
>>>> 
>>>> Produce open source, scalable machine learning code using a community
>>>> development model.
>>> 
>>> So taking that apart:
>>> 
>>> - Hadoop is not necessarily part of the equation. All that we promise are
>>> implementations that are reasonably scalable.
>> 
>> - We play well with small-ish (fits in memory) and large (fits only in 
>> memory of 
>> many machines) or huge (fits only on disk) datasets.
>> 
>>> - There is no restriction in there wrt. supporting only specific use cases -
>>> in particular no restriction to be recommendations only.
>>> 
>>> - There is no restriction to "only batch" or "only online" learning.
>>> 
>>> If we want to be that broad we definitely lack lots of people, I think.
>>> 
>>> The other question that I cannot answer today: Do we want to be a Java
>>> Library that people link with their project, a standalone program that
>>> people interact with via the command line, a basis that people can easily
>>> integrate into their Pig/Hive/Cascalog/Scalding/Cascading/what-ever-else
>>> workflows or all of these?
>> 
>> 
> 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Re: changes without JIRA's

2013-03-25 Thread Sebastian Schelter
I guess this refers to the cleanups I've done in the last few days. In the
future, I will create a Jira for each and attach a patch.

On 25.03.2013 16:31, Ted Dunning wrote:
> I would like it if all changes to the code were accompanied by a JIRA that
> describes the problem being solved and that the commit messages associated
> with the fix reference the JIRA.
> 



changes without JIRA's

2013-03-25 Thread Ted Dunning
I would like it if all changes to the code were accompanied by a JIRA that
describes the problem being solved and that the commit messages associated
with the fix reference the JIRA.


Re: Call to action – Mahout needs your help

2013-03-25 Thread Ted Dunning
Switching to apache git would make this easier.

On Mon, Mar 25, 2013 at 1:08 PM, Isabel Drost  wrote:

> > As non-committer I'd contribute more to Mahout, had github be primary
> > source. Now, when I contribute a pull request, it gets merged to Apache
> git
> > server by committer, and I don't get recorded as contributor on github.
> > Maybe just workflow can be changed to improve this.
> >
>
> Valuable input indeed. Though given that we are still using svn as
> canonical version control system that sounds like a bigger project to me.
>


Re: Call to action – Mahout needs your help

2013-03-25 Thread Manuel Blechschmidt
Hi,

On 25.03.2013, at 10:37, Isabel Drost wrote:

> Hi,
> 
>> 
>> I tried to fill this gap for recommendations with my github project:
>> 
>> https://github.com/ManuelB/facebook-recommender-demo
> 
> Hmm - I would actually like to list such "complementary" projects
> prominently from the Mahout page somewhere. What do you think?

This is ok with me.

I have to admit that the primary goal of this project is to show off and acquire 
potential customers. We at Apaxo have a lot of know-how around recommendations. 
We are not interested in supporting the solution for free or in building a 
rock-solid basis for scalable recommendation services. Therefore users' 
expectations should be quite different from those for an Apache project. 

> Isabel

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B



Re: Call to action – Mahout needs your help

2013-03-25 Thread Stevo Slavić
Thanks for the heads-up!

I meant maybe re-implementing Mahout's Java Hadoop code using scalding and
algebird. For me it would be a great way to learn all of these technologies
(scala, mahout, hadoop, cascading, scalding, algebird). The expected/desired
improvement for Mahout committers and users would hopefully be less code in
the Mahout project itself, making it easier to maintain and learn existing
algorithms and to implement new ones.

Kind regards,
Stevo Slavic.


On Mon, Mar 25, 2013 at 1:08 PM, Isabel Drost  wrote:

> Hi,
>
> On Mon, Mar 25, 2013 at 11:15 AM, Stevo Slavić  wrote:
>
> > Please consider shipping Mahout 0.8 organized as it is now, and come back
> > to ideas for the future after release.
> >
>
> Personally I see the current discussion as a means to find out what people
> want Mahout to be in the long term and how to get back to increased
> activity.
>
>
>
> >
> > I agree, having more freely accessible data sets would help, not only
> > Mahout. Maybe create a subproject or separate Apache project for that.
> >
>
> There is one already:  - though
> currently there's not too much activity.
>
>
>
> >
> > As non-committer I'd contribute more to Mahout, had github be primary
> > source. Now, when I contribute a pull request, it gets merged to Apache
> git
> > server by committer, and I don't get recorded as contributor on github.
> > Maybe just workflow can be changed to improve this.
> >
>
> Valuable input indeed. Though given that we are still using svn as
> canonical version control system that sounds like a bigger project to me.
>
>
> >
> > Discussing about ideas for the future, have Mahout committers considered
> > using scalding and/or algebird instead of or along with Java Hadoop API?
> >
> >
> Just to clarify: Do you mean re-implementing what is available in these
> languages or making what is implemented available to these languages?
>
>
> Isabel
>


Re: Call to action – Mahout needs your help

2013-03-25 Thread Isabel Drost
Hi,

On Mon, Mar 25, 2013 at 11:15 AM, Stevo Slavić  wrote:

> Please consider shipping Mahout 0.8 organized as it is now, and come back
> to ideas for the future after release.
>

Personally I see the current discussion as a means to find out what people
want Mahout to be in the long term and how to get back to increased
activity.



>
> I agree, having more freely accessible data sets would help, not only
> Mahout. Maybe create a subproject or separate Apache project for that.
>

There is one already:  - though
currently there's not too much activity.



>
> As non-committer I'd contribute more to Mahout, had github be primary
> source. Now, when I contribute a pull request, it gets merged to Apache git
> server by committer, and I don't get recorded as contributor on github.
> Maybe just workflow can be changed to improve this.
>

Valuable input indeed. Though given that we are still using svn as
canonical version control system that sounds like a bigger project to me.


>
> Discussing about ideas for the future, have Mahout committers considered
> using scalding and/or algebird instead of or along with Java Hadoop API?
>
>
Just to clarify: Do you mean re-implementing what is available in these
languages or making what is implemented available to these languages?


Isabel


Re: Call to action – Mahout needs your help

2013-03-25 Thread Stevo Slavić
Hello Mahout devs,

Please consider shipping Mahout 0.8 organized as it is now, and come back
to ideas for the future after release.

Personally, I'll consider Mahout only for problems that need to scale
horizontally, use a cluster, and use widely adopted platforms like Hadoop.
It's good to have a library like Mahout focused on being a container for just
a bunch of algorithms, and I'd like it to stay that way, since that fosters a
community of other, more specialized projects.

Btw, I agree the wiki/docs need to be improved. It would help to have a better
definition of done - no undocumented commits/changes/new algorithms. Also, the
Confluence instance powering the wiki is outdated - doesn't Atlassian provide
Apache projects with free upgrades as well?
Because of infra issues, it may be better to limit use of the wiki and extend
the project with reference documentation.

I agree, having more freely accessible data sets would help, not only
Mahout. Maybe create a subproject or separate Apache project for that.

As a non-committer I'd contribute more to Mahout if GitHub were the primary
source. Now, when I contribute a pull request, it gets merged to the Apache git
server by a committer, and I don't get recorded as a contributor on GitHub.
Maybe just the workflow can be changed to improve this.

Discussing ideas for the future: have Mahout committers considered
using scalding and/or algebird instead of or along with the Java Hadoop API?

Kind regards,
Stevo Slavic.


On Mon, Mar 25, 2013 at 9:43 AM, Manuel Blechschmidt <
manuel.blechschm...@gmx.de> wrote:

> Hello,
>
> On 25.03.2013, at 09:10, Sebastian Schelter wrote:
>
> > Hi,
> >
> > throwing in my 2 cents here:
> >
> > I don't agree that we simply lack manpower but have a clear vision. I
> > actually think it's the other way round. I think Mahout is kind of stuck,
> > because it does not have a clear vision.
>
> I fully agree. So I think Mahout needs a vision. The big problem about ML
> is that you can do everything with it but to make a difference you have to
> focus.
>
> I am using Mahout for solving business problems e.g.:
>
> - Online fraud
> - eCommerce recommendations
> - Demand forecasting
>
> One big piece that is missing for all the algorithms is a complete bundled
> data set that is solving a real business problem and with bundled I mean
> that it is in the Mahout source tree. If no real data is available
> generated data could be used.
>
> I tried to fill this gap for recommendations with my github project:
>
> https://github.com/ManuelB/facebook-recommender-demo
>
> This project seems to be used by the community. You can get it, compile
> it and start it with 4 commands.
>
> > ...
> >
> > It is also my personal experience (= I heard it over and over again from
> > our users) that it is extremely hard to get started with Mahout using
> > the available documentation. MiA is the exception to this, but people
> > have to buy it first and it lacks a lot of the latest developments. It
> > would be awesome to have a reworked wiki that is qualitatively
> > comparable to MiA.
>
> So this is the nature of a framework. If you really want people to get
> started easily you have to provide a full blown example where you can just
> replace the example data with your data.
>
> I don't think that enough manpower can be acquired to create a visual GUI
> for Mahout. Further I don't think that this would help. There are already
> excellent GUIs for ML e.g. Weka (http://www.cs.waikato.ac.nz/ml/weka/)
> and RStudio (http://www.rstudio.com/)
>
>
> >
> > Best,
> > Sebastian
>
> Hope this helps
> Manuel
>
> >
> > On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
> >>
> >>
> >> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
> >>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
> >>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
> >>>>> What about an experiment: If you (reading this mail) were to write a two
> >>>>> sentence vision statement for Mahout as you see it - what would that be?
> >>>>
> >>>> Produce open source, scalable machine learning code using a community
> >>>> development model.
> >>>
> >>> So taking that apart:
> >>>
> >>> - Hadoop is not necessarily part of the equation. All that we promise are
> >>> implementations that are reasonably scalable.
> >>
> >> - We play well with small-ish (fits in memory) and large (fits only in
> memory of
> >> many machines) or huge (fits only on disk) datasets.
> >>
> >>> - There is no restriction in there wrt. supporting only specific use
> cases -
> >>> in particular no restriction to be recommendations only.
> >>>
> >>> - There is no restriction to "only batch" or "only online" learning.
> >>>
> >>> If we want to be that broad we definitely lack lots of people, I think.
> >>>
> >>> The other question that I cannot answer today: Do we want to be a Java
> >>> Library that people link with their project, a standalone program that
> >>> people interact with via the command line, a basis that people can
> easily
> >>> integrate into their
> Pig/

Re: Call to action – Mahout needs your help

2013-03-25 Thread Isabel Drost
Hi,

On Mon, Mar 25, 2013 at 9:43 AM, Manuel Blechschmidt <
manuel.blechschm...@gmx.de> wrote:

> One big piece that is missing for all the algorithms is a complete bundled
> data set that is solving a real business problem and with bundled I mean
> that it is in the Mahout source tree. If no real data is available
> generated data could be used.
>

Good point. There are a few examples in the "examples" module - either
relying on generated data or on easy-to-download data.

One problem with bringing data into the project is the licensing of that
data. There's not too much I'm aware of that can easily be re-distributed
under an Apache license.



>
> I tried to fill this gap for recommendations with my github project:
>
> https://github.com/ManuelB/facebook-recommender-demo



Hmm - I would actually like to list such "complementary" projects
prominently from the Mahout page somewhere. What do you think?



> So this is the nature of a framework. If you really want people to get
> started easily you have to provide a full blown example where you can just
> replace the example data with your data.
>
> I don't think that enough manpower can be acquired to create a visual GUI
> for Mahout. Further I don't think that this would help. There are already
> excellent GUIs for ML e.g. Weka (http://www.cs.waikato.ac.nz/ml/weka/)
> and RStudio (http://www.rstudio.com/)
>

+1

In addition, to my knowledge Mahout itself has already been integrated with a
nice graphical ML tool:
<http://rapid-i.com/component/option,com_myblog/show,Big-data-analytics-made-easy-Radoop.html/Itemid,172/>


Isabel


Re: Call to action – Mahout needs your help

2013-03-25 Thread Isabel Drost
Hi,

On Mon, Mar 25, 2013 at 9:10 AM, Sebastian Schelter wrote:

> throwing in my 2 cents here:
>
> IMO, it's first and foremost a library (similar to Lucene), and this should
> also be reflected in the codebase.
>

This would be my view as well. It should be easy for people who speak Java
to take the implementations and plug them into their own projects. For
those dealing with text it should be trivial to combine Lucene and its
analyzers for data pre-processing and feed resulting vectors into Mahout
algorithms.



>
> I don't agree that we simply lack manpower but have a clear vision. I
> actually think it's the other way round. I think Mahout is kind of stuck,
> because it does not have a clear vision. I think we faced and still face
> very hard challenges, as we have to provide answers for the following
> questions:
>
> * for which problems and algorithms does it really make sense to use
> MapReduce?
>

Being a notorious optimist I'm confident that we should be in a good
position to provide answers for that question now.



>
> * how broad can the spectrum of things that we offer be without a
> decline in quality?
>
> * how do we deal with the fact that our codebase is split up into a
> collection of algorithms with very few people being able to work on all
> of them, due to the required theoretical background and the complexity
> of efficient code
>

One thing that has always been on my mind is to focus on a handful of core
use cases, defined broadly: "classification", for example, is a use case on
its own. For each use case there should be a limited number of algorithm
implementations. If being parallel is still on our agenda, then for each
use case we should have at least a single-machine story and a going-parallel
story, with a clear path for users to scale their application from single to
multiple machines without too many adjustments in code (if that is at all
possible) or in conceptual client-side architecture.



>
> * how do we provide solutions that allow users to scale very fine
> grained, e.g. from online to precomputed on a single machine to
> precomputed via Hadoop in the recommender stuff.
>

+1


>
> I think that Mahout is and should always be more than recommenders, but
> that we should be more courageous in throwing out things that are not
> used very much or not maintained very much or don't meet the quality
> standards which we would like to see.
>

Do we have an equivalent of the "attach clothes-pegs to your trousers in
January and throw out anything that still has the peg by end of December" -
that is, can we reliably identify what has not been used by each release?



>
> It is also my personal experience (= I heard it over and over again from
> our users) that it is extremely hard to get started with Mahout using
> the available documentation. MiA is the exception to this, but people
> have to buy it first and it lacks a lot of the latest developments. It
> would be awesome to have a reworked wiki that is qualitatively
> comparable to MiA.
>
>
Strange idea:  What do people think of moving some core documentation out
of the wiki and into the distribution (both as JavaDoc and as a few high
level HTML pages)? Advantage: Documentation is available offline after
downloading the artifact, contributions to the documentation get very
visible which would make active documenters committers, documentation gets
versioned along with the code. (Not sure if moving to Apache CMS could
already help here.)


Isabel


[jira] [Commented] (MAHOUT-1172) Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue

2013-03-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612519#comment-13612519
 ] 

Hudson commented on MAHOUT-1172:


Integrated in Mahout-Quality #1930 (See 
[https://builds.apache.org/job/Mahout-Quality/1930/])
MAHOUT-1172 Replace org.apache.mahout.cf.taste.common.TopK with Lucene's 
PriorityQueue (Revision 1460541)

 Result = SUCCESS
ssc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1460541
Files : 
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/common/FixedSizePriorityQueue.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/common/MinK.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/common/TopK.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/MutableRecommendedItem.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/TopItemsQueue.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/MutableRecommendedItem.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/PredictionMapper.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/TopItemQueue.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/AggregateAndRecommendReducer.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorSplitterMapper.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/similarity/item/TopSimilarItemsQueue.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/similarity/precompute/SimilarItem.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/MutableElement.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/RowSimilarityJob.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/TopElementsQueue.java
* 
/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/Vectors.java
* 
/mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/common/TopKMinKTest.java
* 
/mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/hadoop/TopItemsQueueTest.java
* 
/mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/hadoop/als/TopItemQueueTest.java
* 
/mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJobTest.java


> Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue
> --
>
> Key: MAHOUT-1172
> URL: https://issues.apache.org/jira/browse/MAHOUT-1172
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
> Fix For: 0.8
>
> Attachments: MAHOUT-1172.patch
>
>
> Using Lucene's PriorityQueue allows for faster and more memory-efficient 
> top-k selection.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Jenkins build is back to normal : Mahout-Examples-Classify-20News #164

2013-03-25 Thread Apache Jenkins Server
See 



Re: Call to action – Mahout needs your help

2013-03-25 Thread Manuel Blechschmidt
Hello,

On 25.03.2013, at 09:10, Sebastian Schelter wrote:

> Hi,
> 
> throwing in my 2 cents here:
> 
> I don't agree that we simply lack manpower but have a clear vision. I
> actually think it's the other way round. I think Mahout is kind of stuck,
> because it does not have a clear vision.

I fully agree. So I think Mahout needs a vision. The big problem with ML is
that you can do everything with it, but to make a difference you have to focus.

I am using Mahout for solving business problems e.g.:

- Online fraud
- eCommerce recommendations
- Demand forecasting

One big piece that is missing for all the algorithms is a complete bundled data
set that solves a real business problem, and by bundled I mean that it is
in the Mahout source tree. If no real data is available, generated data could be
used.

I tried to fill this gap for recommendations with my github project:

https://github.com/ManuelB/facebook-recommender-demo

This project seems to be used by the community. You can get it, compile it and
start it with 4 commands.

> ...
> 
> It is also my personal experience (= I heard it over and over again from
> our users) that it is extremely hard to get started with Mahout using
> the available documentation. MiA is the exception to this, but people
> have to buy it first and it lacks a lot of the latest developments. It
> would be awesome to have a reworked wiki that is qualitatively
> comparable to MiA.

So this is the nature of a framework. If you really want people to get started 
easily you have to provide a full blown example where you can just replace the 
example data with your data.

I don't think that enough manpower can be acquired to create a visual GUI for 
Mahout. Further I don't think that this would help. There are already excellent 
GUIs for ML e.g. Weka (http://www.cs.waikato.ac.nz/ml/weka/) and RStudio 
(http://www.rstudio.com/)


> 
> Best,
> Sebastian

Hope this helps
Manuel

> 
> On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
>> 
>> 
>> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
>>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
>>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
>>>>> What about an experiment: If you (reading this mail) were to write a two
>>>>> sentence vision statement for Mahout as you see it - what would that be?
>>>>
>>>> Produce open source, scalable machine learning code using a community
>>>> development model.
>>> 
>>> So taking that apart:
>>> 
>>> - Hadoop is not necessarily part of the equation. All that we promise are
>>> implementations that are reasonably scalable.
>> 
>> - We play well with small-ish (fits in memory) and large (fits only in
>> memory of many machines) or huge (fits only on disk) datasets.
>> 
>>> - There is no restriction in there wrt. supporting only specific use cases -
>>> in particular no restriction to be recommendations only.
>>> 
>>> - There is no restriction to "only batch" or "only online" learning.
>>> 
>>> If we want to be that broad we definitely lack lots of people, I think.
>>> 
>>> The other question that I cannot answer today: Do we want to be a Java
>>> Library that people link with their project, a standalone program that
>>> people interact with via the command line, a basis that people can easily
>>> integrate into their Pig/Hive/Cascalog/Scalding/Cascading/what-ever-else
>>> workflows or all of these?
>> 
>> 
> 

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B



[jira] [Updated] (MAHOUT-1172) Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue

2013-03-25 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1172:
---

Issue Type: Improvement  (was: Bug)

> Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue
> --
>
> Key: MAHOUT-1172
> URL: https://issues.apache.org/jira/browse/MAHOUT-1172
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
> Fix For: 0.8
>
> Attachments: MAHOUT-1172.patch
>
>
> Using Lucene's PriorityQueue allows for faster and more memory-efficient 
> top-k selection.



[jira] [Updated] (MAHOUT-1172) Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue

2013-03-25 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1172:
---

Fix Version/s: 0.8

> Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue
> --
>
> Key: MAHOUT-1172
> URL: https://issues.apache.org/jira/browse/MAHOUT-1172
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
> Fix For: 0.8
>
> Attachments: MAHOUT-1172.patch
>
>
> Using Lucene's PriorityQueue allows for faster and more memory-efficient 
> top-k selection.



[jira] [Resolved] (MAHOUT-1172) Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue

2013-03-25 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter resolved MAHOUT-1172.


Resolution: Fixed

> Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue
> --
>
> Key: MAHOUT-1172
> URL: https://issues.apache.org/jira/browse/MAHOUT-1172
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
> Attachments: MAHOUT-1172.patch
>
>
> Using Lucene's PriorityQueue allows for faster and more memory-efficient 
> top-k selection.



[jira] [Commented] (MAHOUT-1025) Update documentation for LDA before the release.

2013-03-25 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612455#comment-13612455
 ] 

Sebastian Schelter commented on MAHOUT-1025:


The issue here is that 
https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation 
still refers to an outdated version of LDA, which got replaced. The 
documentation should be updated to use the new implementation.

> Update documentation for LDA before the release.
> 
>
> Key: MAHOUT-1025
> URL: https://issues.apache.org/jira/browse/MAHOUT-1025
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.7
>Reporter: Robin Anil
>Assignee: Jake Mannix
> Fix For: 0.8
>
>




[jira] [Updated] (MAHOUT-1172) Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue

2013-03-25 Thread Sebastian Schelter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Schelter updated MAHOUT-1172:
---

Attachment: MAHOUT-1172.patch

> Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue
> --
>
> Key: MAHOUT-1172
> URL: https://issues.apache.org/jira/browse/MAHOUT-1172
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Sebastian Schelter
> Attachments: MAHOUT-1172.patch
>
>
> Using Lucene's PriorityQueue allows for faster and more memory-efficient 
> top-k selection.



[jira] [Created] (MAHOUT-1172) Replace org.apache.mahout.cf.taste.common.TopK with Lucene's PriorityQueue

2013-03-25 Thread Sebastian Schelter (JIRA)
Sebastian Schelter created MAHOUT-1172:
--

 Summary: Replace org.apache.mahout.cf.taste.common.TopK with 
Lucene's PriorityQueue
 Key: MAHOUT-1172
 URL: https://issues.apache.org/jira/browse/MAHOUT-1172
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Sebastian Schelter


Using Lucene's PriorityQueue allows for faster and more memory-efficient top-k 
selection.
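
To illustrate why a bounded priority queue helps with top-k selection, here
is a minimal sketch. It is not the code from the attached patch and it uses
the JDK's java.util.PriorityQueue rather than Lucene's class; the class name
TopKSketch and the method topK are made up for this example. The idea is to
keep at most k elements in a min-heap, so memory stays at O(k) and each new
candidate only has to be compared against the smallest element retained so
far.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of Mahout or Lucene.
class TopKSketch {

    // Returns the k largest scores in descending order.
    public static List<Double> topK(double[] scores, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the smallest retained score sits at the head.
        PriorityQueue<Double> heap = new PriorityQueue<>(k);
        for (double score : scores) {
            if (heap.size() < k) {
                // Heap not full yet: keep every candidate.
                heap.add(score);
            } else if (score > heap.peek()) {
                // Candidate beats the current minimum: replace it.
                heap.poll();
                heap.add(score);
            }
            // Otherwise the candidate cannot be in the top k; drop it.
        }
        List<Double> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.3, 0.9, 0.1, 0.7, 0.5};
        System.out.println(topK(scores, 3)); // prints [0.9, 0.7, 0.5]
    }
}
```

Compared with sorting all candidates and cutting off after k, this needs
O(n log k) time and never holds more than k elements at once, which is the
kind of saving the issue description refers to.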



Re: Call to action – Mahout needs your help

2013-03-25 Thread Sebastian Schelter
Hi,

throwing in my 2 cents here:

I think that you mentioned a very good point with stating that it is not
clear whether Mahout is a library or a standalone program to interact with
via the command line. IMO, it's first and foremost a library (similar to
Lucene), and this should also be reflected in the codebase.

I don't agree that we simply lack manpower but have a clear vision. I
actually think it's the other way round. I think Mahout is kind of stuck,
because it does not have a clear vision. I think we faced and still face
very hard challenges, as we have to provide answers for the following
questions:

* for which problems and algorithms does it really make sense to use
MapReduce?

* how broad can the spectrum of things that we offer be without a
decline in quality?

* how do we deal with the fact that our codebase is split up into a
collection of algorithms with very few people being able to work on all
of them, due to the required theoretical background and the complexity
of efficient code

* how do we provide solutions that allow users to scale very fine
grained, e.g. from online to precomputed on a single machine to
precomputed via Hadoop in the recommender stuff.

I think that Mahout is and should always be more than recommenders, but
that we should be more courageous in throwing out things that are not
used very much or not maintained very much or don't meet the quality
standards which we would like to see.

It is also my personal experience (= I heard it over and over again from
our users) that it is extremely hard to get started with Mahout using
the available documentation. MiA is the exception to this, but people
have to buy it first and it lacks a lot of the latest developments. It
would be awesome to have a reworked wiki that is qualitatively
comparable to MiA.

Best,
Sebastian

On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
> 
> 
> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
>>>> What about an experiment: If you (reading this mail) were to write a two
>>>> sentence vision statement for Mahout as you see it - what would that be?
>>>
>>> Produce open source, scalable machine learning code using a community
>>> development model.
>>
>> So taking that apart:
>>
>> - Hadoop is not necessarily part of the equation. All that we promise are
>> implementations that are reasonably scalable.
> 
> - We play well with small-ish (fits in memory) and large (fits only in
> memory of many machines) or huge (fits only on disk) datasets.
>  
>> - There is no restriction in there wrt. supporting only specific use cases -
>> in particular no restriction to be recommendations only.
>>
>> - There is no restriction to "only batch" or "only online" learning.
>>
>> If we want to be that broad we definitely lack lots of people, I think.
>>
>> The other question that I cannot answer today: Do we want to be a Java
>> Library that people link with their project, a standalone program that
>> people interact with via the command line, a basis that people can easily
>> integrate into their Pig/Hive/Cascalog/Scalding/Cascading/what-ever-else
>> workflows or all of these?
> 
>