Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-16 Thread Shannon Quinn
+1

iPhone'd

> On May 16, 2014, at 14:46, Andrew Musselman  
> wrote:
> 
> +1
> 
> 
> On Fri, May 16, 2014 at 11:02 AM, Dmitriy Lyubimov wrote:
> 
>> Hi,
>> 
>> I would like to initiate a procedural vote moving to git as our primary
>> commit system, and using github PRs as described in Jake Farrel's email to
>> @dev [1]
>> 
>> [1]
>> 
>> https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
>> 
>> If voting succeeds, I will file a ticket with infra to commence the necessary
>> changes and to move our project to git-wp as the primary source for commits,
>> as well as to add the github integration features [1]. (I assume pure git
>> commits will be required after that's done, with no svn commits allowed.)
>> 
>> The motivation is to engage git and github PR features as described, and to
>> avoid the git mirror history messes we've seen associated with authors.txt
>> file fluctuations.
>> 
>> PMC and committers have binding votes, so please vote. Lazy consensus with a
>> minimum of 3 +1 votes. The vote will conclude in 96 hours to allow some extra
>> time for the weekend (i.e., Tuesday afternoon PST).
>> 
>> here is my +1
>> 
>> -d
>> 


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-16 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000613#comment-14000613
 ] 

Anand Avati commented on MAHOUT-1490:
-------------------------------------

[~dlyubimov] Many of the comments (over a week old) only reached me today on the 
mailing list, possibly because of the outage. I am reading through the history 
to understand this task better.

> Data frame R-like bindings
> --------------------------
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-16 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1529:
--------------------------------------

Description: 
We have a few situations when algorithm-facing API has Spark dependencies 
creeping in. 

In particular, we know of the following cases:
-(1) checkpoint() accepts Spark constant StorageLevel directly;-
(2) certain things in CheckpointedDRM;
(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 

(5) drmBroadcast returns a Spark-specific Broadcast object

*Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
*Pull requests are welcome*.

  was:
We have a few situations when algorithm-facing API has Spark dependencies 
creeping in. 

In particular, we know of the following cases:
(1) checkpoint() accepts Spark constant StorageLevel directly;
(2) certain things in CheckpointedDRM;
(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 

-(5) drmBroadcast returns a Spark-specific Broadcast object-

*Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
*Pull requests are welcome*.


> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
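[Editorial aside: the abstraction MAHOUT-1529 describes for case (1), keeping Spark's StorageLevel out of the algorithm-facing checkpoint() API, can be sketched roughly as below. All names here (CacheHint, toSparkLevel, CacheHintDemo) are hypothetical illustrations, not Mahout's actual API.]

```java
// Sketch: a backend-neutral cache hint so the logical-plan layer never
// references a Spark class. Only the Spark-specific backend translates
// the hint into a StorageLevel. Names are illustrative only.
public class CacheHintDemo {

    // Backend-agnostic hint that an abstract checkpoint() could accept.
    enum CacheHint { NONE, MEMORY_ONLY, MEMORY_AND_DISK }

    // Spark backend translation; returned as a string here so the sketch
    // compiles without a Spark dependency, which is exactly the point.
    static String toSparkLevel(CacheHint hint) {
        switch (hint) {
            case MEMORY_ONLY:     return "StorageLevel.MEMORY_ONLY";
            case MEMORY_AND_DISK: return "StorageLevel.MEMORY_AND_DISK";
            default:              return "StorageLevel.NONE";
        }
    }

    public static void main(String[] args) {
        // Algorithm code deals only in CacheHint values.
        System.out.println(toSparkLevel(CacheHint.MEMORY_ONLY));
    }
}
```

The same indirection would cover drmBroadcast (case 5): a backend-neutral broadcast handle type, with the Spark Broadcast object hidden inside the Spark bindings.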


Build failed in Jenkins: Mahout-Examples-Cluster-Reuters-II #833

2014-05-16 Thread Apache Jenkins Server
See 

--
[...truncated 3068 lines...]
14/05/15 18:32:48 INFO mapred.JobClient: Map input records=0
14/05/15 18:32:48 INFO mapred.JobClient: Reduce shuffle bytes=0
14/05/15 18:32:48 INFO mapred.JobClient: Spilled Records=0
14/05/15 18:32:48 INFO mapred.JobClient: Map output bytes=0
14/05/15 18:32:48 INFO mapred.JobClient: Total committed heap usage 
(bytes)=736624640
14/05/15 18:32:48 INFO mapred.JobClient: Combine input records=0
14/05/15 18:32:48 INFO mapred.JobClient: SPLIT_RAW_BYTES=166
14/05/15 18:32:48 INFO mapred.JobClient: Reduce input records=0
14/05/15 18:32:48 INFO mapred.JobClient: Reduce input groups=0
14/05/15 18:32:48 INFO mapred.JobClient: Combine output records=0
14/05/15 18:32:48 INFO mapred.JobClient: Reduce output records=0
14/05/15 18:32:48 INFO mapred.JobClient: Map output records=0
14/05/15 18:32:48 INFO common.HadoopUtil: Deleting 
/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-streamingkmeans/partial-vectors-0
14/05/15 18:32:48 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating 
IDF
14/05/15 18:32:49 INFO input.FileInputFormat: Total input paths to process : 1
14/05/15 18:32:49 INFO mapred.JobClient: Running job: job_local1584740539_0005
14/05/15 18:32:49 INFO mapred.LocalJobRunner: Waiting for map tasks
14/05/15 18:32:49 INFO mapred.LocalJobRunner: Starting task: 
attempt_local1584740539_0005_m_00_0
14/05/15 18:32:49 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
14/05/15 18:32:49 INFO mapred.MapTask: Processing split: 
file:/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-streamingkmeans/tf-vectors-toprune/part-r-0:0+90
14/05/15 18:32:49 INFO mapred.MapTask: io.sort.mb = 100
14/05/15 18:32:49 INFO mapred.MapTask: data buffer = 79691776/99614720
14/05/15 18:32:49 INFO mapred.MapTask: record buffer = 262144/327680
14/05/15 18:32:49 INFO mapred.MapTask: Starting flush of map output
14/05/15 18:32:49 INFO mapred.Task: 
Task:attempt_local1584740539_0005_m_00_0 is done. And is in the process of 
commiting
14/05/15 18:32:49 INFO mapred.LocalJobRunner: 
14/05/15 18:32:49 INFO mapred.Task: Task 
'attempt_local1584740539_0005_m_00_0' done.
14/05/15 18:32:49 INFO mapred.LocalJobRunner: Finishing task: 
attempt_local1584740539_0005_m_00_0
14/05/15 18:32:49 INFO mapred.LocalJobRunner: Map task executor complete.
14/05/15 18:32:49 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
14/05/15 18:32:49 INFO mapred.LocalJobRunner: 
14/05/15 18:32:49 INFO mapred.Merger: Merging 1 sorted segments
14/05/15 18:32:49 INFO mapred.Merger: Down to the last merge-pass, with 0 
segments left of total size: 0 bytes
14/05/15 18:32:49 INFO mapred.LocalJobRunner: 
14/05/15 18:32:49 INFO mapred.Task: 
Task:attempt_local1584740539_0005_r_00_0 is done. And is in the process of 
commiting
14/05/15 18:32:49 INFO mapred.LocalJobRunner: 
14/05/15 18:32:49 INFO mapred.Task: Task 
attempt_local1584740539_0005_r_00_0 is allowed to commit now
14/05/15 18:32:49 INFO output.FileOutputCommitter: Saved output of task 
'attempt_local1584740539_0005_r_00_0' to 
/tmp/mahout-work-hudson/reuters-out-seqdir-sparse-streamingkmeans/df-count
14/05/15 18:32:49 INFO mapred.LocalJobRunner: reduce > reduce
14/05/15 18:32:49 INFO mapred.Task: Task 
'attempt_local1584740539_0005_r_00_0' done.
14/05/15 18:32:50 INFO mapred.JobClient:  map 0% reduce 100%
14/05/15 18:32:50 INFO mapred.JobClient: Job complete: job_local1584740539_0005
14/05/15 18:32:50 INFO mapred.JobClient: Counters: 17
14/05/15 18:32:50 INFO mapred.JobClient:   File Output Format Counters 
14/05/15 18:32:50 INFO mapred.JobClient: Bytes Written=105
14/05/15 18:32:50 INFO mapred.JobClient:   File Input Format Counters 
14/05/15 18:32:50 INFO mapred.JobClient: Bytes Read=102
14/05/15 18:32:50 INFO mapred.JobClient:   FileSystemCounters
14/05/15 18:32:50 INFO mapred.JobClient: FILE_BYTES_READ=252065916
14/05/15 18:32:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=254568597
14/05/15 18:32:50 INFO mapred.JobClient:   Map-Reduce Framework
14/05/15 18:32:50 INFO mapred.JobClient: Map output materialized bytes=6
14/05/15 18:32:50 INFO mapred.JobClient: Map input records=0
14/05/15 18:32:50 INFO mapred.JobClient: Reduce shuffle bytes=0
14/05/15 18:32:50 INFO mapred.JobClient: Spilled Records=0
14/05/15 18:32:50 INFO mapred.JobClient: Map output bytes=0
14/05/15 18:32:50 INFO mapred.JobClient: Total committed heap usage 
(bytes)=937164800
14/05/15 18:32:50 INFO mapred.JobClient: Combine input records=0
14/05/15 18:32:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=167
14/05/15 18:32:50 INFO mapred.JobClient: Reduce input records=0
14/05/15 18:32:50 INFO mapred.JobClient: Reduce input groups=0
14/05/15 18:32:50 INFO mapred.JobClient: Combine output records=0
14/05/15 18:32:50 INFO mapred.JobClient: Reduce output re

[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-16 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000600#comment-14000600
 ] 

Anand Avati commented on MAHOUT-1490:
-------------------------------------

[~dlyubimov], unsafe access is surely a factor in the performance. Compression 
is implemented in 
https://github.com/0xdata/h2o/blob/master/src/main/java/water/fvec/NewChunk.java#L379.
 The performance boost really comes from coding the access (the various 
at8_impl and atd_impl methods) so that inflation happens entirely in registers 
and only compressed data is transferred over the memory bus. This seems to work 
very effectively in practice.

AFAIK all of that code is public in that github repo.

> Data frame R-like bindings
> --------------------------
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)
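[Editorial aside: the decompress-on-access idea described above can be illustrated with a toy sketch. This is not h2o's actual NewChunk code; it is an assumed minimal example of the pattern, where a column is stored as small deltas and each element is inflated in registers at read time.]

```java
// Toy sketch of a compressed column: values stored as 1-byte deltas
// from a base. at8() inflates each element on access, so only the
// compressed byte array ever crosses the memory bus; no long[] of the
// full column is materialized.
public class CompressedColDemo {
    static final long BASE = 1000L;
    static final byte[] DELTAS = {0, 3, 7, 42};  // the compressed column

    // decompress-on-access, analogous in spirit to at8_impl
    static long at8(int i) {
        return BASE + DELTAS[i];
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < DELTAS.length; i++) {
            sum += at8(i);  // inflation happens here, in registers
        }
        System.out.println(sum);  // prints 4052
    }
}
```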


Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Ted Dunning
Well, the system is becoming unmaintainable due to accumulated technical
debt.  But infra has David Nalley in charge and he is leading a very
significant effort to pay down that debt and rebuild key services in a more
maintainable form.  I have spoken at length with David and he really
impresses me as the right guy to make this all work better.

That said, he has a big task ahead and it won't happen in a moment.  It
probably won't even be completely done in the next 24 months.

Regarding the suggestion to use out-sourced tools like Github, there is
established policy at Apache that critical information, notably the
authoritative email record and the source code, has to be maintained on
Apache resources.  Period.  End of discussion.

That doesn't mean that additional toolsets can be used to drive
productivity, but if it doesn't get decided on the mailing list, it didn't
get decided.  This is critical to the community and longevity mission.  The
recent problems are significant, but they will be resolved.






On Fri, May 16, 2014 at 4:35 PM, Pat Ferrel  wrote:

> Well since it’s come up (I guess I’m bringing it up)
>
> Maybe Apache is getting too big to manage all its own list servers (down
> for several groups last week), repos, bug trackers, and wikis (cwiki looks like
> it’s down)
>
> Have we thought about Github (repo, bug tracking, Jenkins)  + Google or
> Yahoo Groups + Stackoverflow?
> Replacing, er, most Apache infra? I’m super grateful for the guys who work
> on all this, but is it becoming unmaintainable? With every new thing we ask
> for (git + svn integration? plus running git repos) they can probably do it,
> but it adds more moving parts until some day...
>
>
> On May 16, 2014, at 3:37 PM, Ted Dunning  wrote:
>
> On Fri, May 16, 2014 at 10:16 AM, Pat Ferrel  wrote:
>
> > Agree 100%, but there seems to be an extra level of indirection here.
> >
> > 1) do your work in your personal github repo, probably using a branch
> >
>
> Exactly.
>
>
> > 2) issue a pull request to ??? the github Mahout repo or Apache Mahout
> git
> > repo?
> >
>
> To the github Mahout mirror.  Otherwise, yes.
>
>
> > 3) respond to the request somewhere to make sure the merge goes well.
> >
>
> exactly.
>
>
> > It would be nice if we only had to deal with github imo. They will
> trigger
> > a Jenkins build on repo changes.
> >
>
> Well... yes.  It would be nice to use the pretty green button on the github
> site, but it isn't much more work to do the merges manually.  For merges that
> are clean, there are 3 or 4 commands required to merge the PR, and github
> provides them.  Doing the JIRA bookkeeping is additional effort.
>
>
>
>
>
> >
> > On May 15, 2014, at 3:33 PM, Ted Dunning  wrote:
> >
> > On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel 
> wrote:
> >
> >> But pull requests would just be for contributors, right? Committers
> > should
> >> be able to push to the master directly, right?
> >>
> >
> > Not necessarily.  Pull requests are lovely to work with no matter
> > who you are.
> >
> > I can easily imagine using them myself for a long-lived change.
> >
> >
>
>


[jira] [Updated] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-16 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1529:
--------------------------------------

Description: 
We have a few situations when algorithm-facing API has Spark dependencies 
creeping in. 

In particular, we know of the following cases:
(1) checkpoint() accepts Spark constant StorageLevel directly;
(2) certain things in CheckpointedDRM;
(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 

-(5) drmBroadcast returns a Spark-specific Broadcast object-

*Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
*Pull requests are welcome*.

  was:
We have a few situations when algorithm-facing API has Spark dependencies 
creeping in. 

In particular, we know of the following cases:
(1) checkpoint() accepts Spark constant StorageLevel directly;
(2) certain things in CheckpointedDRM;
(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 

+(5) drmBroadcast returns a Spark-specific Broadcast object+

*Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
*Pull requests are welcome*.


> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-16 Thread Ted Dunning
+1

This is a good move.  Several of the projects I have been involved in
(Drill and Spark, for instance) use git, and the results are very positive.
The benefits are partly technical (git works better for me than SVN), but
mostly social: using git as primary with github integration is a big deal,
and average developer familiarity with git is higher.




On Fri, May 16, 2014 at 11:02 AM, Dmitriy Lyubimov wrote:

> Hi,
>
> I would like to initiate a procedural vote moving to git as our primary
> commit system, and using github PRs as described in Jake Farrel's email to
> @dev [1]
>
> [1]
>
> https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
>
> If voting succeeds, I will file a ticket with infra to commence the necessary
> changes and to move our project to git-wp as the primary source for commits,
> as well as to add the github integration features [1]. (I assume pure git
> commits will be required after that's done, with no svn commits allowed.)
>
> The motivation is to engage git and github PR features as described, and to
> avoid the git mirror history messes we've seen associated with authors.txt
> file fluctuations.
>
> PMC and committers have binding votes, so please vote. Lazy consensus with a
> minimum of 3 +1 votes. The vote will conclude in 96 hours to allow some extra
> time for the weekend (i.e., Tuesday afternoon PST).
>
> here is my +1
>
> -d
>


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-16 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000488#comment-14000488
 ] 

Anand Avati commented on MAHOUT-1529:
-------------------------------------

I see drmBroadcast() has already been listed (somehow I did not find it the 
last time I looked).

> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: consensus statement?

2014-05-16 Thread Dmitriy Lyubimov
Pat, it can be as high-level or as detailed as you like, I don't care,
as long as it doesn't contain misstatements. It could simply state "we adhere
to Apache's 'power of doing' principle and accept new contributions".
That would be ok with me. But, as offered, it does try to enumerate strategic
directions, and in doing so its wording is either vague, or incomplete, or
just wrong.


For example, it says "it is clear that what the committers are working on
is Spark". This is less than accurate.

First, if I interpret it literally, it is wrong, as our committers for the
most part are not working on Spark, and even if some do, to whatever negligible
degree that exists, why would Mahout care?

Second, if it is meant to say "we develop algorithms for Spark", this is
also wrong, because whatever algorithms we have added to date have zero Spark
dependencies.

Third, if it is meant to say that the majority of what we are working on is
Spark bindings, this is still incorrect. Headcount-wise, the Mahout-math
tweaks and the Scala enablement were at least as big an effort, and the
Hadoop 2.0 work was at least as big. Documentation and tutorial work has been
the absolute leader headcount-wise to date.

The problem I am trying to explain here is that we obviously know internally
what we are doing; but this statement is for external consumption, so we have
to be careful to avoid miscommunication. It is easy for us to let less than
accurate wording pass exactly because we already know what we are doing, so
our brains happily jump to conclusions and fill in the missing connections
between what is stated and what is implied. But to an outsider it would sound
vague, or lead to the wrong connections.



On Wed, May 7, 2014 at 9:54 AM, Pat Ferrel  wrote:

> This doesn’t seem to be a vision statement. I was +1 to a simple consensus
> statement.
>
> The vision is up to you.
>
> We have an interactive shell that scales to huge datasets without
> resorting to massive subsampling. One that allows you to deal with the
> exact data your black box algos work on. Every data tool has an interactive
> mode except Mahout—now it does.  Virtually every complex transform as well
> as basic linear algebra works on massive datasets. The interactivity will
> allow people to do things with Mahout they could never do before.
>
> We also have the building blocks to make the fastest most flexible cutting
> edge collaborative filtering+metadata recommenders in the world. Honestly I
> don’t see anything like this elsewhere. We will also be able to fit into
> virtually any workflow and directly consume data produced in those systems
> with no intermediate scrubbing. This has never happened before in Mahout
> and I don’t see it in MLlib either. Even the interactive shell will benefit
> from this.
>
> Other feature champions will be able to add to this list.
>
> Seems like the vision comes from feature champions. I may not use Mahout
> in the same way you do but I rely on your code. Maybe I serve a different
> user type than you. I don’t see a problem with that, do you?
>
> On May 6, 2014, at 2:32 PM, Dmitriy Lyubimov  wrote:
>
> Pat et. al,
>
> The whole problem with the originally suggested consensus statement is that
> it reads as "we are building MLLib for Spark (oh wait, there's already such a
> thing)", and then "we are building MLLib for 0xdata", and then perhaps for
> something else. That could not be farther from the true philosophy of what
> has been done. Failing that, it at best reads as "we don't know what it is we
> are building, but we are including some Spark dependencies now". So it is
> either misleading or sufficiently vague, and I am not sure which is worse.
>
> If a collection of backend-specific, separated MLLibs is the new consensus,
> I can't say I share it. In fact, the only motivation for me to do
> anything within this project was to fix everything that (per my perhaps
> lopsided perception) is less than ideal about the approach of building ML
> projects as backend-specific collections of black-box trainers and solvers,
> and to bring an ideology similar to Julia and R to jvm-based big data ML.
>
> If users are to love us, somehow I think it will not be because we ported
> yet another flavor of K-means to Spark.
>
> At this point it seems a little premature to talk about an existing
> consensus.
>
> On Tue, May 6, 2014 at 12:41 PM, Pat Ferrel  wrote:
>
> > +1
> >
> > I personally won’t spend a lot of time generalizing right now.
> > Contributors can help with that if they want or make suggestions.
> >
> > On May 6, 2014, at 9:23 AM, Ted Dunning  wrote:
> >
> > As a bit of commentary, it is clear that what the committers are working
> on
> > is Spark
> >
>
> Mahout committers, with very rare exceptions, are not working on Spark.
> Spark committers and contributors are working on Spark.
>
>


[jira] [Updated] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-16 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1529:
--------------------------------------

Description: 
We have a few situations when algorithm-facing API has Spark dependencies 
creeping in. 

In particular, we know of the following cases:
(1) checkpoint() accepts Spark constant StorageLevel directly;
(2) certain things in CheckpointedDRM;
(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 

+(5) drmBroadcast returns a Spark-specific Broadcast object+

*Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
*Pull requests are welcome*.

  was:
We have a few situations when algorithm-facing API has Spark dependencies 
creeping in. 

In particular, we know of the following cases:
(1) checkpoint() accepts Spark constant StorageLevel directly;
(2) certain things in CheckpointedDRM;
(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 

(5) drmBroadcast returns a Spark-specific Broadcast object

*Current tracker:* https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
*Pull requests are welcome*.


> Finalize abstraction of distributed logical plans from backend operations
> -------------------------------------------------------------------------
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> +(5) drmBroadcast returns a Spark-specific Broadcast object+
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Jenkins build is back to normal : mahout-nightly #1569

2014-05-16 Thread Apache Jenkins Server
See 



[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-16 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000492#comment-14000492
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
------------------------------------------

Yes, ok. Using unsafe is at least 15% faster for reading 64-bit values out of 
byte arrays. So, with that, uncompressed memory bandwidth saturation will 
likely start occurring at about 5-6 cores.

> Data frame R-like bindings
> --------------------------
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)
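[Editorial aside: the operation being benchmarked above, reading 64-bit values out of byte arrays, is shown below using the portable, bounds-checked ByteBuffer path. The ~15% win Dmitriy cites comes from replacing this with sun.misc.Unsafe.getLong(array, offset), which skips the per-read bounds check; that variant is omitted here because Unsafe requires reflective access. The class and method names are illustrative only.]

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Portable baseline for pulling a 64-bit value out of a byte array.
// An Unsafe-based reader would perform the same logical read without
// bounds checks, which is where the cited speedup comes from.
public class LongReadDemo {

    // Bounds-checked read of a little-endian long at the given offset.
    static long readLongLE(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN)
                         .getLong(offset);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).putLong(8, 42L);
        System.out.println(readLongLE(buf, 8));  // prints 42
    }
}
```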


Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Ted Dunning
On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel  wrote:

> But pull requests would just be for contributors, right? Committers should
> be able to push to the master directly, right?
>

Not necessarily.  Pull requests are lovely to work with no matter
who you are.

I can easily imagine using them myself for a long-lived change.


[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-16 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000347#comment-14000347
 ] 

Andrew Musselman commented on MAHOUT-1505:
------------------------------------------

Not yet; I've been overbooked at work, but I am reviewing the patch and will 
post something this weekend.

> structure of clusterdump's JSON output
> --------------------------------------
>
> Key: MAHOUT-1505
> URL: https://issues.apache.org/jira/browse/MAHOUT-1505
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Terry Blankers
>Assignee: Andrew Musselman
>  Labels: json
> Fix For: 1.0
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output 
> using '-of = JSON'. While digging into the structure of the representation of 
> the data I've noticed something that seems a little odd to me.
> In order to access the data for a particular cluster, the 'cluster', 'n', 'c' 
> & 'r' values are all in one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223, 
> administration:0.011 r=[action:0.446, adherence:1.501, 
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904, 
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data 
> in these items is by string parsing. For JSON deserialization I would have 
> expected to see something along the lines of:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> [
> {"action":0.023},
> {"adherence":0.223},
> {"administration":0.011}
> ],
> "r":
> [
> {"action":0.446},
> {"adherence":1.501},
> {"administration":0.306}
> ]
> }
> {noformat}
> and:
> {noformat}
> {
> "point": {
> "body": 6.904,
> "harm": 10.101
> },
> "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
> "weight": 1.0
> } 
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> {
> "action":0.023,
> "adherence":0.223,
> "administration":0.011
> },
> "r":
> {
>"action":0.446,
>"adherence":1.501,
>"administration":0.306
> }
> }
> {noformat}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-16 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000510#comment-14000510
 ] 

Ted Dunning commented on MAHOUT-1490:
-------------------------------------

This column orientation sounds like a good plan.


> Data frame R-like bindings
> --------------------------
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-16 Thread Andrew Musselman
+1


On Fri, May 16, 2014 at 11:02 AM, Dmitriy Lyubimov wrote:

> Hi,
>
> I would like to initiate a procedural vote moving to git as our primary
> commit system, and using github PRs as described in Jake Farrel's email to
> @dev [1]
>
> [1]
>
> https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
>
> If voting succeeds, I will file a ticket with infra to commence the necessary
> changes and to move our project to git-wp as the primary source for commits,
> as well as to add the github integration features [1]. (I assume pure git
> commits will be required after that's done, with no svn commits allowed.)
>
> The motivation is to engage git and github PR features as described, and to
> avoid the git mirror history messes we've seen associated with authors.txt
> file fluctuations.
>
> PMC and committers have binding votes, so please vote. Lazy consensus with a
> minimum of 3 +1 votes. The vote will conclude in 96 hours to allow some extra
> time for the weekend (i.e., Tuesday afternoon PST).
>
> here is my +1
>
> -d
>


Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-16 Thread Sebastian Schelter
+1 (binding)

-sebastian
Am 16.05.2014 20:38 schrieb "Dmitriy Lyubimov" :

> Hi,
>
> I would like to initiate a procedural vote moving to git as our primary
> commit system, and using github PRs as described in Jake Farrel's email to
> @dev [1]
>
> [1]
>
> https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
>
> If voting succeeds, i will file a ticket with infra to commence necessary
> changes and to move our project to git-wp as primary source for commits as
> well as add github integration features [1]. (I assume pure git commits
> will be required after that's done, with no svn commits allowed).
>
> The motivation is to engage GIT and github PR features as described, and
> avoid git mirror history messes like we've seen associated with authors.txt
> file fluctuations.
>
> PMC and committers have binding votes, so please vote. Lazy consensus with
> minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
> for the weekend (i.e. Tuesday afternoon PST).
>
> here is my +1
>
> -d
>


Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Pat Ferrel
Well since it’s come up (I guess I’m bringing it up)

Maybe Apache is getting too big to manage all its own list servers (down for 
several groups last week), repos, bug trackers, wikis (cwiki looks like it’s 
down).

Have we thought about Github (repo, bug tracking, Jenkins) + Google or Yahoo 
Groups + Stackoverflow?
Replacing, er, most Apache infra? I’m super grateful for the guys who work on 
all this but is it becoming unmaintainable? With every new thing we ask for 
(git + svn integration? plus running git repos) they can probably do it but it 
adds more moving parts until some day...


On May 16, 2014, at 3:37 PM, Ted Dunning  wrote:

On Fri, May 16, 2014 at 10:16 AM, Pat Ferrel  wrote:

> Agree 100% but there seems to be an extra level of indirection here.
> 
> 1) do your work in your personal github repo, probably using a branch
> 

Exactly.


> 2) issue a pull request to ??? the github Mahout repo or Apache Mahout git
> repo?
> 

To the github Mahout mirror.  Otherwise, yes.


> 3) respond to the request somewhere to make sure the merge goes well.
> 

exactly.


> It would be nice if we only had to deal with github imo. They will trigger
> a Jenkins build on repo changes.
> 

Well... yes.  It would be nice to use the pretty green button on the github
site, but it isn't much more work to do the merges manually.  For merges that
are clean, there are 3 or 4 commands required to merge the PR and github
provides them.  Doing the JIRA is additional effort.





> 
> On May 15, 2014, at 3:33 PM, Ted Dunning  wrote:
> 
> On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel  wrote:
> 
>> But pull requests would just be for contributors, right? Committers
> should
>> be able to push to the master directly, right?
>> 
> 
> Not necessarily.  Pull requests are so lovely to work with no matter
> who you are.
> 
> I can easily imagine using them myself for a long-lived change.
> 
> 



[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example

2014-05-16 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000384#comment-14000384
 ] 

Andrew Palumbo commented on MAHOUT-1527:


Thanks.  The script is pretty messy - it works fine, but has several trivial 
things like typos etc.  I was going to attach it separately but put it in with 
the patch at the last minute.  I'll clean it up if it's going to be used. 

> Fix wikipedia classifier example
> 
>
> Key: MAHOUT-1527
> URL: https://issues.apache.org/jira/browse/MAHOUT-1527
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Documentation, Examples
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for predicting the labels 
> of wikipedia pages. Unfortunately, the example is totally broken:
> It relies on the old NB implementation which has been removed, suggests to 
> use the whole wikipedia as input, which will not work well on a single 
> machine and the documentation uses commands that have long been removed from 
> bin/mahout. 
> The example needs to be updated to use the current naive bayes implementation 
> and documentation on the website needs to be written.





[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-05-16 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000445#comment-14000445
 ] 

Anand Avati commented on MAHOUT-1529:
-

Another thing I notice is that drmBroadcast() returns a raw 
org.apache.spark.Broadcast variable. I'm thinking a simple wrapper around it to 
create an abstraction for various backends would be nice. Thoughts?
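One possible shape for such a wrapper, sketched below with purely hypothetical names (this is not Mahout's actual API): a tiny interface that hides the backend's broadcast type behind a single value() accessor, so algorithm code never imports a Spark class.

```java
// Hypothetical sketch of a backend-neutral broadcast wrapper; the interface
// and class names here are illustrative only, not Mahout's actual API.
public class BroadcastSketch {

  // Algorithm-facing view of a broadcast value: only value() is exposed.
  interface BCast<T> {
    T value();
  }

  // A Spark-backed implementation would wrap
  // org.apache.spark.broadcast.Broadcast and delegate value(); this
  // in-memory stand-in shows the idea without a Spark dependency.
  static final class InMemoryBCast<T> implements BCast<T> {
    private final T payload;
    InMemoryBCast(T payload) { this.payload = payload; }
    @Override public T value() { return payload; }
  }

  // drmBroadcast would then return BCast<T> instead of the raw Spark type.
  static <T> BCast<T> drmBroadcastSketch(T v) {
    return new InMemoryBCast<>(v);
  }
}
```

Algorithms written against such a BCast<T> would run unchanged on any backend that supplies its own implementation.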

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> (1) checkpoint() accepts Spark constant StorageLevel directly;
> (2) certain things in CheckpointedDRM;
> (3) drmParallelize etc. routines in the "drm" and "sparkbindings" package. 
> (5) drmBroadcast returns a Spark-specific Broadcast object
> *Current tracker:* 
> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1529.
> *Pull requests are welcome*.





Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Dmitriy Lyubimov
On Fri, May 16, 2014 at 10:16 AM, Pat Ferrel  wrote:

> Agree 100% but there seems to be an extra level of indirection here.
>
> 1) do your work in your personal github repo, probably using a branch
>
personal github branch, yes.


> 2) issue a pull request to ??? the github Mahout repo or Apache Mahout git
> repo?
>

I assume it will be apache/mahout on github. At least it is for Spark.


> 3) respond to the request somewhere to make sure the merge goes well.
>

A merge of a PR is handled by github itself, at least if the merge is possible
by the recursive strategy. Actually, there is an interesting detail here. I am
not sure I understand 100% how it works exactly, but I am sure we will figure
it out. Most likely, like Jake said, we will have to merge it locally and push
it to git-wp.a.o with a message that would close the PR in github. If I
understand it correctly, that's one of the things they've implemented in the
github integration.

>
> It would be nice if we only had to deal with github imo. They will trigger
> a Jenkins build on repo changes.
>
> On May 15, 2014, at 3:33 PM, Ted Dunning  wrote:
>
> On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel  wrote:
>
> > But pull requests would just be for contributors, right? Committers
> should
> > be able to push to the master directly, right?
> >
>
> Not necessarily.  Pull requests are so lovely to work with no matter
> who you are.
>
> I can easily imagine using them myself for a long-lived change.
>
>


Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Ted Dunning
On Fri, May 16, 2014 at 10:16 AM, Pat Ferrel  wrote:

> Agree 100% but there seems to be an extra level of indirection here.
>
> 1) do your work in your personal github repo, probably using a branch
>

Exactly.


> 2) issue a pull request to ??? the github Mahout repo or Apache Mahout git
> repo?
>

To the github Mahout mirror.  Otherwise, yes.


> 3) respond to the request somewhere to make sure the merge goes well.
>

exactly.


> It would be nice if we only had to deal with github imo. They will trigger
> a Jenkins build on repo changes.
>

Well... yes.  It would be nice to use the pretty green button on the github
site, but it isn't much more work to do the merges manually.  For merges that
are clean, there are 3 or 4 commands required to merge the PR and github
provides them.  Doing the JIRA is additional effort.





>
> On May 15, 2014, at 3:33 PM, Ted Dunning  wrote:
>
> On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel  wrote:
>
> > But pull requests would just be for contributors, right? Committers
> should
> > be able to push to the master directly, right?
> >
>
> Not necessarily.  Pull requests are so lovely to work with no matter
> who you are.
>
> I can easily imagine using them myself for a long-lived change.
>
>


Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Dmitriy Lyubimov
Yes, and no.

Yes, committers should be able to push, or, rather, merge pull requests.
But in case committers are also acting as contributors, i'd expect a pull
request to be filed, since it is more than just this. It is also a review
board. I've been struggling with integration issues in apache review board,
and it seems, not just me alone. Github reviews are nicely integrated, it
is a nice way to collaborate. Instead of cluttering jira, we could just
discuss details right on the PR and be very specific about points of code
we are talking about.


On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel  wrote:

> Yes! I mentioned this a while back. Pull requests are just patches under
> the covers, just so easy to create.
>
> But pull requests would just be for contributors, right? Committers should
> be able to push to the master directly, right?
>
> On May 6, 2014, at 6:39 PM, Dmitriy Lyubimov  wrote:
>
> Hi,
>
> We are trying to explore an option of moving Mahout to GIT as its main
> repository. What are the options to do this within the current infrastructure?
>
> How will we be able to do github pull requests similar to how Apache Spark
> does it?
>
> Thank you.
> -Dmitriy
>
>


Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread tuxdna
+1 for Git repositories
+1 for GitHub pull requests

Regarding "github pull", on reading the wiki [1], below is what I
would do, unless there is some automation involved.

As a committer, I would first merge the pull request on the GitHub repo and
then push it onto Apache's Git repository.

The infrastructure team can provide better ways :-)

Regards,
Saleem

[1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark


Build failed in Jenkins: mahout-nightly #1568

2014-05-16 Thread Apache Jenkins Server
See 

--
Failed to access build log

java.io.IOException: remote file operation failed: 
/home/jenkins/jenkins-slave/workspace/mahout-nightly at 
hudson.remoting.Channel@1bd13118:ubuntu6
at hudson.FilePath.act(FilePath.java:916)
at hudson.FilePath.act(FilePath.java:893)
at hudson.FilePath.toURI(FilePath.java:1042)
at hudson.tasks.MailSender.createFailureMail(MailSender.java:278)
at hudson.tasks.MailSender.getMail(MailSender.java:153)
at hudson.tasks.MailSender.execute(MailSender.java:101)
at 
hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.cleanUp(MavenModuleSetBuild.java:1058)
at hudson.model.Run.execute(Run.java:1752)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:529)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:231)
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
at hudson.remoting.Channel.send(Channel.java:541)
at hudson.remoting.Request.call(Request.java:129)
at hudson.remoting.Channel.call(Channel.java:739)
at hudson.FilePath.act(FilePath.java:909)
... 10 more
Caused by: java.io.IOException
at hudson.remoting.Channel.close(Channel.java:1027)
at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
at hudson.remoting.PingThread.ping(PingThread.java:120)
at hudson.remoting.PingThread.run(PingThread.java:81)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1400194937838 
hasn't completed at 1400195177839
... 2 more


Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Pat Ferrel
Agree 100% but there seems to be an extra level of indirection here.

1) do your work in your personal github repo, probably using a branch
2) issue a pull request to ??? the github Mahout repo or Apache Mahout git repo?
3) respond to the request somewhere to make sure the merge goes well.

It would be nice if we only had to deal with github imo. They will trigger a 
Jenkins build on repo changes.

On May 15, 2014, at 3:33 PM, Ted Dunning  wrote:

On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel  wrote:

> But pull requests would just be for contributors, right? Committers should
> be able to push to the master directly, right?
> 

Not necessarily.  Pull requests are so lovely to work with no matter
who you are.

I can easily imagine using them myself for a long-lived change.



[jira] [Commented] (MAHOUT-1505) structure of clusterdump's JSON output

2014-05-16 Thread Terry Blankers (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000242#comment-14000242
 ] 

Terry Blankers commented on MAHOUT-1505:


Did that patch ever get posted?

> structure of clusterdump's JSON output
> --
>
> Key: MAHOUT-1505
> URL: https://issues.apache.org/jira/browse/MAHOUT-1505
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.9
>Reporter: Terry Blankers
>Assignee: Andrew Musselman
>  Labels: json
> Fix For: 1.0
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output 
> using '-of = JSON'. While digging into the structure of the representation of 
> the data I've noticed something that seems a little odd to me.
> In order to access the data for a particular cluster, the 'cluster', 'n', 'c' 
> & 'r' values are all in one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223, 
> administration:0.011] r=[action:0.446, adherence:1.501, 
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904, 
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data 
> in these items is by string parsing. For JSON deserialization I would have 
> expected to see something along the lines of:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> [
> {"action":0.023},
> {"adherence":0.223},
> {"administration":0.011}
> ],
> "r":
> [
> {"action":0.446},
> {"adherence":1.501},
> {"administration":0.306}
> ]
> }
> {noformat}
> and:
> {noformat}
> {
> "point": {
> "body": 6.904,
> "harm": 10.101
> },
> "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
> "weight": 1.0
> } 
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
> "cluster":"VL-10515",
> "n":5924,
> "c":
> {
> "action":0.023,
> "adherence":0.223,
> "administration":0.011
> },
> "r":
> {
>"action":0.446,
>"adherence":1.501,
>"administration":0.306
> }
> }
> {noformat}
> {quote}
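As the report notes, the flattened format forces consumers into string parsing. A rough sketch of what that parsing ends up looking like (the helper class and method names are hypothetical, for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the string parsing the current flattened output
// forces on consumers: extract the term:weight pairs of one section
// (e.g. "c" or "r") from a clusterdump "cluster" string.
public class ClusterStringParser {

  private static final Pattern PAIR = Pattern.compile("(\\w+):(-?[\\d.]+)");

  public static Map<String, Double> parseSection(String cluster, String section) {
    int start = cluster.indexOf(section + "=[");  // locate e.g. "c=["
    int end = cluster.indexOf(']', start);        // the section's close bracket
    Map<String, Double> out = new LinkedHashMap<>();
    Matcher m = PAIR.matcher(cluster.substring(start, end));
    while (m.find()) {
      out.put(m.group(1), Double.parseDouble(m.group(2)));
    }
    return out;
  }
}
```

With a properly structured JSON output as proposed above, this whole helper would disappear in favor of ordinary JSON deserialization.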





Re: VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-16 Thread Pat Ferrel
+1

On May 16, 2014, at 11:02 AM, Dmitriy Lyubimov  wrote:

Hi,

I would like to initiate a procedural vote moving to git as our primary
commit system, and using github PRs as described in Jake Farrel's email to
@dev [1]

[1]
https://blogs.apache.org/infra/entry/improved_integration_between_apache_and

If voting succeeds, i will file a ticket with infra to commence necessary
changes and to move our project to git-wp as primary source for commits as
well as add github integration features [1]. (I assume pure git commits
will be required after that's done, with no svn commits allowed).

The motivation is to engage GIT and github PR features as described, and
avoid git mirror history messes like we've seen associated with authors.txt
file fluctuations.

PMC and committers have binding votes, so please vote. Lazy consensus with
minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
for the weekend (i.e. Tuesday afternoon PST).

here is my +1

-d



Re: consensus statement?

2014-05-16 Thread Ted Dunning
On Wed, May 7, 2014 at 9:54 AM, Pat Ferrel  wrote:

> Seems like the vision comes from feature champions. I may not use Mahout
> in the same way you do but I rely on your code. Maybe I serve a different
> user type than you. I don’t see a problem with that, do you?
>

Sounds good to me.


[jira] [Updated] (MAHOUT-1527) Fix wikipedia classifier example

2014-05-16 Thread Andrew Palumbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1527:
---

Attachment: MAHOUT-1527.patch

Patch fixes WikipediaToSequenceFile.java.  Also changes the output of 
WikipediaMapper.java to output Key: /Category/document_name. 

I'm not sure if this is the way to go but it could be the convention for now. 
The new Naive Bayes expects input like this, so it works well for documents 
classified by their respective directories and keeps it easy to use 
seqdirectory.   

I basically modified classify-20Newsgroups.sh to run the wikipedia CBayes 
example on a medium-sized wikipedia XML dump.  I'm not sure if we're looking for 
new example scripts, but I included it in the patch anyway.

I've set it up to only look at 10 countries for a couple of reasons:
  1. The confusion matrix fits on the screen.
  2. When using all countries, the split will almost always put a label into the 
test set that was not encountered in the training set, and will thus crash 
testnb at the confusion matrix.

This script will occasionally run into the same problem of having a split that 
crashes the confusion matrix, but rarely with these settings.

I have another script here that gives the option to use all of the countries 
and a constant-size test set (same number of docs in the test set for each 
category), but it needs a little work.  For now I've included the simple one.  

If this script is going to be added, let me know and I'll do some work on it.
  

> Fix wikipedia classifier example
> 
>
> Key: MAHOUT-1527
> URL: https://issues.apache.org/jira/browse/MAHOUT-1527
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Documentation, Examples
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for predicting the labels 
> of wikipedia pages. Unfortunately, the example is totally broken:
> It relies on the old NB implementation which has been removed, suggests to 
> use the whole wikipedia as input, which will not work well on a single 
> machine and the documentation uses commands that have long been removed from 
> bin/mahout. 
> The example needs to be updated to use the current naive bayes implementation 
> and documentation on the website needs to be written.





Re: Exploring moving Mahout to git as main repo

2014-05-16 Thread Andrew Musselman
Yes, this is a very important part of the workflow that we're having
trouble with/ignoring via reviewboard and svn.


On Thu, May 15, 2014 at 12:40 PM, Dmitriy Lyubimov wrote:

> Yes, and no.
>
> Yes, committers should be able to push, or, rather, merge pull requests.
> But in case committers are also acting as contributors, i'd expect a pull
> request to be filed, since it is more than just this. It is also a review
> board. I've been struggling with integration issues in apache review board,
> and it seems, not just me alone. Github reviews are nicely integrated, it
> is a nice way to collaborate. Instead of cluttering jira, we could just
> discuss details right on the PR and be very specific about points of code
> we are talking about.
>
>
> On Fri, May 9, 2014 at 4:17 PM, Pat Ferrel  wrote:
>
> > Yes! I mentioned this a while back. Pull requests are just patches under
> > the covers, just so easy to create.
> >
> > But pull requests would just be for contributors, right? Committers
> should
> > be able to push to the master directly, right?
> >
> > On May 6, 2014, at 6:39 PM, Dmitriy Lyubimov  wrote:
> >
> > Hi,
> >
> > We are trying to explore an option of moving Mahout to GIT as its main
> > repository. What are the options to do this within the current infrastructure?
> >
> > How will we be able to do github pull requests similar to how Apache
> Spark
> > does it?
> >
> > Thank you.
> > -Dmitriy
> >
> >
>


[jira] [Commented] (MAHOUT-1446) Create an intro for matrix factorization

2014-05-16 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998685#comment-13998685
 ] 

Sebastian Schelter commented on MAHOUT-1446:


I'll have a look at this on the weekend.




> Create an intro for matrix factorization
> 
>
> Key: MAHOUT-1446
> URL: https://issues.apache.org/jira/browse/MAHOUT-1446
> Project: Mahout
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Maciej Mazur
> Fix For: 1.0
>
> Attachments: matrix-factorization.patch
>
>






[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-16 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1480#comment-1480
 ] 

Dmitriy Lyubimov commented on MAHOUT-1490:
--

Cool, thanks for doing this. 

Here is what I've been thinking. 

Let's try and bring in columnar vector representations (a DataFrameVectorLike 
trait), which at this point will just extend Iterable[T], where T can be one of 
Long, Double, Int, String or Byte[]. 

Let's start with simple column representations of non-variable length, e.g. 
LongDataFrameVector. While doing so, let's try to engage the Unsafe class as in 
H2O. (Actually, I talked to various people and they tell me that surprisingly 
many jvm vendors adhere to the Unsafe api.) Obviously such a vector must not 
use actual objects (such as boxed Doubles) but rather use Unsafe to pick values 
from/to a backing byte array. 

At this point we will ignore any compression (but we may want to start using 
some ideas along the lines of VLQ and perhaps prefix tries for unordered 
collections).

Let's also assume that vector length is constant for now (i.e. we can read and 
update a random element but we can't change the number of elements in it).
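As a rough illustration of the fixed-length, non-boxing column idea described above (names are hypothetical; java.nio.ByteBuffer stands in for sun.misc.Unsafe to keep the sketch portable):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a fixed-length columnar long vector backed by a
// byte array. ByteBuffer is used here as a portable stand-in for
// sun.misc.Unsafe; the point is that get/set move primitive longs in and
// out of the backing bytes without creating a boxed Long per element.
public class LongColumnSketch {

  private final ByteBuffer backing;
  private final int length;

  public LongColumnSketch(int length) {
    this.length = length;
    this.backing = ByteBuffer.allocate(length * Long.BYTES);
  }

  public int length() { return length; }

  // Random read of element i straight from the backing bytes.
  public long get(int i) { return backing.getLong(i * Long.BYTES); }

  // Random update of element i; the number of elements never changes.
  public void set(int i, long v) { backing.putLong(i * Long.BYTES, v); }
}
```

A real Unsafe-based version would replace the ByteBuffer calls with getLong/putLong against a raw byte[] plus the array base offset, but the access pattern and the fixed-length constraint stay the same.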



> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark





VOTE: moving commits to git-wp.o.a & github PR features.

2014-05-16 Thread Dmitriy Lyubimov
Hi,

I would like to initiate a procedural vote moving to git as our primary
commit system, and using github PRs as described in Jake Farrel's email to
@dev [1]

[1]
https://blogs.apache.org/infra/entry/improved_integration_between_apache_and

If voting succeeds, i will file a ticket with infra to commence necessary
changes and to move our project to git-wp as primary source for commits as
well as add github integration features [1]. (I assume pure git commits
will be required after that's done, with no svn commits allowed).

The motivation is to engage GIT and github PR features as described, and
avoid git mirror history messes like we've seen associated with authors.txt
file fluctuations.

PMC and committers have binding votes, so please vote. Lazy consensus with
minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
for the weekend (i.e. Tuesday afternoon PST).

here is my +1

-d


[jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example

2014-05-16 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998664#comment-13998664
 ] 

Sebastian Schelter commented on MAHOUT-1527:


I'll have a look at this on the weekend.




> Fix wikipedia classifier example
> 
>
> Key: MAHOUT-1527
> URL: https://issues.apache.org/jira/browse/MAHOUT-1527
> Project: Mahout
>  Issue Type: Task
>  Components: Classification, Documentation, Examples
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for predicting the labels 
> of wikipedia pages. Unfortunately, the example is totally broken:
> It relies on the old NB implementation which has been removed, suggests to 
> use the whole wikipedia as input, which will not work well on a single 
> machine and the documentation uses commands that have long been removed from 
> bin/mahout. 
> The example needs to be updated to use the current naive bayes implementation 
> and documentation on the website needs to be written.





[jira] [Commented] (MAHOUT-1490) Data frame R-like bindings

2014-05-16 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999514#comment-13999514
 ] 

Saikat Kanjilal commented on MAHOUT-1490:
-

Dmitriy,
Should I keep working on the proposal to narrow down the APIs, or help with 
code? I can add some detail around a few of the APIs. Let me know your thoughts 
on next steps; I am pretty open :) and willing to help.

Thanks

> Data frame R-like bindings
> --
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Saikat Kanjilal
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>   Original Estimate: 20h
>  Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark





RE: [jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example

2014-05-16 Thread Andrew Palumbo
This comment is 5 days old and just came through to the dev list. I noticed 
that an email I sent to the dev list earlier this week took about 8 hours to go 
out, and there was a 2 or 3 hour lag on the user list as well.  Should I start 
a JIRA for this?

> Date: Sat, 10 May 2014 22:14:29 +
> From: j...@apache.org
> To: dev@mahout.apache.org
> Subject: [jira] [Commented] (MAHOUT-1527) Fix wikipedia classifier example
> 
> 
> [ 
> https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993257#comment-13993257
>  ] 
> 
> Andrew Palumbo commented on MAHOUT-1527:
> 
> 
> I've almost got this working with NB.  Will put a patch up soon.
> 
> > Fix wikipedia classifier example
> > 
> >
> > Key: MAHOUT-1527
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1527
> > Project: Mahout
> >  Issue Type: Task
> >  Components: Classification, Documentation, Examples
> >Reporter: Sebastian Schelter
> > Fix For: 1.0
> >
> >
> > The examples package has a classification showcase for predicting the 
> > labels of wikipedia pages. Unfortunately, the example is totally broken:
> > It relies on the old NB implementation which has been removed, suggests to 
> > use the whole wikipedia as input, which will not work well on a single 
> > machine and the documentation uses commands that have long been removed 
> > from bin/mahout. 
> > The example needs to be updated to use the current naive bayes 
> > implementation and documentation on the website needs to be written.
> 
> 
> 