[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906729#comment-13906729
 ] 

Dmitriy Lyubimov commented on MAHOUT-1365:
--

Oh, and the implicit-feedback paper doesn't generalize the search for confidence 
parameters, of course. I ignore that formulation here completely, but eventually 
there should be an outer procedure that searches for the optimum. My particular 
problem involved multiple event types with generally unknown confidence weights, 
unlike the original implicit feedback work.
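
For reference, here is a sketch of the merged objective I have in mind -- my 
own notation, assuming the Hu-Koren-Volinsky confidence form extended to K 
event types with unknown per-event weights w_k; this is a reading of the 
attached pdf, not a quote from it:

{code}
\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - \mathbf{x}_u^\top \mathbf{y}_i \right)^2
  + \lambda \left( \sum_u n_u \| \mathbf{x}_u \|^2 + \sum_i m_i \| \mathbf{y}_i \|^2 \right),
\qquad c_{ui} = 1 + \sum_{k=1}^{K} w_k \, e_{ui}^{(k)}
{code}

Here p_ui and c_ui are the entries of the P and C matrices, n_u and m_i are the 
interaction counts that the weighted-regularization (ALS-WR) paper uses, 
e_ui^(k) counts events of type k, and the outer procedure would search over 
the w_k.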

> Weighted ALS-WR iterator for Spark
> --
>
> Key: MAHOUT-1365
> URL: https://issues.apache.org/jira/browse/MAHOUT-1365
> Project: Mahout
>  Issue Type: Task
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Attachments: distributed-als-with-confidence.pdf
>
>
> Given preference P and confidence C distributed sparse matrices, compute the 
> ALS-WR solution for implicit feedback (Spark Bagel version).
> Following the Hu-Koren-Volinsky method (stripping off any concrete methodology 
> to build the C matrix), with a parameterized test for convergence.
> The computational scheme follows the ALS-WR method (which should be slightly 
> more efficient for sparser inputs). 
> The best performance will be achieved if non-sparse anomalies are prefiltered 
> (eliminated) (such as an anomalously active user that doesn't represent a 
> typical user anyway).
> The work is going on here: 
> https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
> porting away our (A1) implementation so there are a few issues associated 
> with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906725#comment-13906725
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1365 at 2/20/14 7:54 AM:
---

Quite possibly it could be. The only thing I do differently here is the merge 
of the approaches of the implicit feedback and weighted regularization papers, 
but that's minor. See the pdf.


was (Author: dlyubimov):
Quite possibly it could be. The only thing I do differently here is the merge 
of the approaches of the implicit feedback and weighted regularization papers, 
but that's minor. 

> Weighted ALS-WR iterator for Spark
> --
>
> Key: MAHOUT-1365
> URL: https://issues.apache.org/jira/browse/MAHOUT-1365
> Project: Mahout
>  Issue Type: Task
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Attachments: distributed-als-with-confidence.pdf
>
>
> Given preference P and confidence C distributed sparse matrices, compute the 
> ALS-WR solution for implicit feedback (Spark Bagel version).
> Following the Hu-Koren-Volinsky method (stripping off any concrete methodology 
> to build the C matrix), with a parameterized test for convergence.
> The computational scheme follows the ALS-WR method (which should be slightly 
> more efficient for sparser inputs). 
> The best performance will be achieved if non-sparse anomalies are prefiltered 
> (eliminated) (such as an anomalously active user that doesn't represent a 
> typical user anyway).
> The work is going on here: 
> https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
> porting away our (A1) implementation so there are a few issues associated 
> with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906725#comment-13906725
 ] 

Dmitriy Lyubimov commented on MAHOUT-1365:
--

Quite possibly it could be. The only thing I do differently here is the merge 
of the approaches of the implicit feedback and weighted regularization papers, 
but that's minor. 

> Weighted ALS-WR iterator for Spark
> --
>
> Key: MAHOUT-1365
> URL: https://issues.apache.org/jira/browse/MAHOUT-1365
> Project: Mahout
>  Issue Type: Task
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Attachments: distributed-als-with-confidence.pdf
>
>
> Given preference P and confidence C distributed sparse matrices, compute the 
> ALS-WR solution for implicit feedback (Spark Bagel version).
> Following the Hu-Koren-Volinsky method (stripping off any concrete methodology 
> to build the C matrix), with a parameterized test for convergence.
> The computational scheme follows the ALS-WR method (which should be slightly 
> more efficient for sparser inputs). 
> The best performance will be achieved if non-sparse anomalies are prefiltered 
> (eliminated) (such as an anomalously active user that doesn't represent a 
> typical user anyway).
> The work is going on here: 
> https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
> porting away our (A1) implementation so there are a few issues associated 
> with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906724#comment-13906724
 ] 

Sean Owen commented on MAHOUT-1365:
---

Dmitriy, isn't this exactly what is already implemented in MLlib? Same paper, 
for sure.

http://spark.incubator.apache.org/docs/latest/mllib-guide.html#collaborative-filtering-1

> Weighted ALS-WR iterator for Spark
> --
>
> Key: MAHOUT-1365
> URL: https://issues.apache.org/jira/browse/MAHOUT-1365
> Project: Mahout
>  Issue Type: Task
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Attachments: distributed-als-with-confidence.pdf
>
>
> Given preference P and confidence C distributed sparse matrices, compute the 
> ALS-WR solution for implicit feedback (Spark Bagel version).
> Following the Hu-Koren-Volinsky method (stripping off any concrete methodology 
> to build the C matrix), with a parameterized test for convergence.
> The computational scheme follows the ALS-WR method (which should be slightly 
> more efficient for sparser inputs). 
> The best performance will be achieved if non-sparse anomalies are prefiltered 
> (eliminated) (such as an anomalously active user that doesn't represent a 
> typical user anyway).
> The work is going on here: 
> https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
> porting away our (A1) implementation so there are a few issues associated 
> with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1365:
-

Fix Version/s: (was: Backlog)
   1.0

> Weighted ALS-WR iterator for Spark
> --
>
> Key: MAHOUT-1365
> URL: https://issues.apache.org/jira/browse/MAHOUT-1365
> Project: Mahout
>  Issue Type: Task
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Attachments: distributed-als-with-confidence.pdf
>
>
> Given preference P and confidence C distributed sparse matrices, compute the 
> ALS-WR solution for implicit feedback (Spark Bagel version).
> Following the Hu-Koren-Volinsky method (stripping off any concrete methodology 
> to build the C matrix), with a parameterized test for convergence.
> The computational scheme follows the ALS-WR method (which should be slightly 
> more efficient for sparser inputs). 
> The best performance will be achieved if non-sparse anomalies are prefiltered 
> (eliminated) (such as an anomalously active user that doesn't represent a 
> typical user anyway).
> The work is going on here: 
> https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
> porting away our (A1) implementation so there are a few issues associated 
> with that.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906710#comment-13906710
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

A few obvious optimizer rules: 

A.t %*% A is obviously detected as a family of unary algorithms rather than a 
binary multiplication algorithm.

Geometry and the non-zero element estimate play a role in selecting the type of 
algorithm. 

The biggest multiplication, via group-by, will obviously have to deal with the 
cartesian operator, and will apply to (A * B').

Obvious rewrites (sketched as tree rewrites below): 
A' * B' = (B * A)' (transposition push-up, including elementwise operators too)
(A')' = A (transposition merge)
cost-based grouping: (A * B) * C versus A * (B * C)
special distributed algorithm versions for in-core operands and diagonal 
matrices
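
To make these concrete, here is a minimal sketch of how such rewrites might 
look as pattern matches over a logical-operator tree. The names (DrmSource, 
OpAt, OpTimes, OpAtA) are illustrative only, not the actual spark-bindings 
classes:

{code}
// Hypothetical logical-operator tree, for illustration only.
sealed trait LogicalOp
case class DrmSource(name: String) extends LogicalOp
case class OpAt(a: LogicalOp) extends LogicalOp                   // A.t
case class OpTimes(a: LogicalOp, b: LogicalOp) extends LogicalOp  // A %*% B
case class OpAtA(a: LogicalOp) extends LogicalOp                  // unary A'A family

def rewrite(op: LogicalOp): LogicalOp = op match {
  // (A')' = A -- transposition merge.
  case OpAt(OpAt(a)) => rewrite(a)
  // A' %*% A -- detect as a unary A'A algorithm, not a binary multiplication.
  case OpTimes(OpAt(a), b) if a == b => OpAtA(rewrite(a))
  // A' %*% B' = (B %*% A)' -- transposition push-up.
  case OpTimes(OpAt(a), OpAt(b)) => OpAt(OpTimes(rewrite(b), rewrite(a)))
  case OpTimes(a, b) => OpTimes(rewrite(a), rewrite(b))
  case OpAt(a) => OpAt(rewrite(a))
  case leaf => leaf
}

// rewrite(OpTimes(OpAt(DrmSource("A")), DrmSource("A"))) == OpAtA(DrmSource("A"))
{code}

A cost-based pass would then pick the physical algorithm for each node from 
the geometry and the non-zero estimates.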



> Spark Bindings (DRM)
> 
>
> Key: MAHOUT-1346
> URL: https://issues.apache.org/jira/browse/MAHOUT-1346
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> Spark bindings for Mahout DRM. 
> DRM DSL. 
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM in a Spark RDD with support for some basic 
> functionality, perhaps some humble beginnings of a cost-based optimizer: 
> (0) Spark serialization support for Vector, Matrix 
> (1) Bagel transposition 
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Mahout on Spark?

2014-02-19 Thread Nick Pentreath
MLlib may be less production tested than Mahout that is true, but I would
say Spark is heavily production tested and getting close to a true 1.0
release. Why do you favour Hadoop for "sturdiness"? Spark uses HDFS as an
input source (or any Hadoop InputFormat) so benefits from the same fault
tolerance wrt input sources. Spark's fault tolerance model for tasks / jobs
is if anything superior to Hadoop M/R.

For a Downpour SGD-like implementation on Spark see:
https://github.com/apache/incubator-spark/pull/407. Assuming the framework
for Spark SGD / gradients etc is flexible enough, one should be able to
implement neural net / perceptron on top of this. Would be interested to
hear if it can be done easily with the current code framework.


On Wed, Feb 19, 2014 at 11:55 PM, peng  wrote:

> It was suggested that I switch to MLlib for its performance, but I doubt
> that it is production ready; even if it is, I would still favour Hadoop's
> sturdiness and self-healing.
> But maybe Mahout can include contribs for things M/R is not fit for, like
> downpour SGD or graph-based algorithms?
>
>
> On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:
>
>> To set expectations appropriately, I think it's important to point out
>> this is completely infeasible short of a total rewrite, and I can't
>> imagine that will happen. It may not be obvious if you haven't looked
>> at the code how completely dependent on M/R it is.
>>
>> You can swap out M/R for Spark if you write in terms of something like
>> Crunch, but that is not at all the case here.
>>
>> On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas  wrote:
>>
>>> +100 for this, different execution engines, like the direction pig and
>>> crunch take
>>>
>>> Sent from my iPhone
>>>
>>>  On Feb 19, 2014, at 5:19 AM, Gokhan Capan  wrote:

 I imagine Mahout offering users an option to select from different
 execution engines (just like we currently do by giving M/R or sequential
 options), starting with Spark. I am not sure what changes are needed in
 the codebase, though. Maybe following MLI (or alike) and implementing some
 more stuff, such as common interfaces for iterating over data (the M/R way
 and the Spark way).

 IMO, another effort might be porting pre-online machine learning (such as
 transforming text into vectors based on the dictionary generated by
 seq2sparse before), machine learning based on mini-batches, and streaming
 summarization stuff in Mahout to Spark-Streaming.

 Best,
 Gokhan

 On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov wrote:

> PS I am moving along a cost optimizer for spark-backed DRMs on some
> multiplicative pipelines that is capable of figuring out different
> cost-based rewrites, and an R-like DSL that mixes in-core and distributed
> matrix representations and blocks, but it is painfully slow; I am really
> only doing it a couple of nights a month. It does not look like I will be
> doing it on company time any time soon (and even if I did, the company
> doesn't seem to be inclined to contribute anything new I do on their
> time). It is all painfully slow; there's no direct funding for it anywhere
> with no strings attached. That will probably be the primary reason why
> Mahout would not be able to get much traction compared to university-based
> contributions.
>
> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov wrote:
>
>> Unfortunately, methinks the prospects of something like a Mahout/MLlib
>> merge seem very unlikely due to vastly diverged approaches to the basics
>> of linear algebra (and other things). Just like one cannot grow a single
>> tree out of two trunks -- not easily, anyway.
>>
>> It is fairly easy to port (and subsequently beat) MLlib at this point
>> from a collection-of-algorithms point of view. But IMO the goal should be
>> more MLI-like first, and a port second. And be very careful with
>> concepts. Something that I so far don't see happening with MLlib. MLlib
>> seems to be an old-style Mahout-like rush to become a collection of basic
>> algorithms rather than a coherent foundation. Admittedly, I haven't
>> looked very closely.
>>
>>
>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter wrote:
>>
>>> I'm also convinced that Spark is a superior platform for executing
>>> distributed ML algorithms. We've had a discussion about a change from
>>> Hadoop to another platform some time ago, but at that point in time it
>>> was not clear which of the upcoming dataflow processing systems (Spark,
>>> Hyracks, Stratosphere) would establish itself amongst the users. To me
>>> it seems pretty obvious that Spark made the race.
>>>
>>> I concur with Ted, i

[jira] [Commented] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906699#comment-13906699
 ] 

Dmitriy Lyubimov commented on MAHOUT-1346:
--

This is now tracked here: 
https://github.com/dlyubimov/mahout-commits/tree/dev-1.0-spark 
(new module: spark). 

I have been rewriting certain things from scratch. 

Concepts: 
(a) Logical operators (including DRM sources) are expressed as the DRMLike trait.
(b) Taking a note from the Spark book, DRM operators (such as %*% or t) form an 
operator lineage. The operator lineage does not get optimized into an RDD until 
an "action" is applied (Spark terminology). 

(c) Unlike in Spark, an "action" doesn't really cause any execution, but rather 
(1) forms the optimized RDD sequence and (2) produces a "checkpointed" DRM. 
Consequently, a "checkpointed" DRM has an RDD lineage attached to it, which is 
also marked for caching. Subsequently, additional lineages starting out of a 
checkpointed DRM will not be able to optimize beyond this checkpoint.

(d) There's a "super action" on a checkpointed DRM -- such as collection or 
persistence to HDFS -- that triggers, if necessary, the optimization checkpoint 
and the Spark action. 

E.g.: 

{code}
val A = drmParallelize(...)

// Doesn't do anything; gives the operator lineage an opportunity to grow
// further before being optimized.
val squaredA = A.t %*% A

// We may trigger optimizer and RDD lineage generation and caching
// explicitly by:
squaredA.checkpoint()

// Or, we can call a "super action" directly. This will trigger checkpoint()
// implicitly if not yet done.
val inCoreSquaredA = squaredA.collect()
{code}

Generally, I support very few things -- I actually dropped all previously 
implemented Bagel algorithms. So in fact I have less support now than in the 
0.9 branch. 

I have Kryo support for Mahout vectors and matrix blocks. 
I have HDFS read/write of Mahout's DRM into the DRMLike trait. 

I have some DSL defined, such as: 
A %*% B 
A %*% inCoreB
inCoreA %*%: B

A.t
inCoreA = A.collect

A.blockify (coalesces split records into an RDD of vertical blocks -- a 
paradigm similar to MLI's MatrixSubmatrix, except I implemented it before MLI 
was first announced :) so in fact no MLI influence here -- see the sketch below)
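
For what it's worth, here is the blockify idea as a minimal sketch, assuming 
rows arrive as an RDD of (rowKey, rowVector) pairs; the signature and the 
dense-block choice are mine, not the committed code:

{code}
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{DenseMatrix, Matrix, Vector}

// Coalesce the row records of each partition into one vertical block:
// (array of row keys, block matrix with one row per key).
def blockify(drmRdd: RDD[(Int, Vector)], ncol: Int): RDD[(Array[Int], Matrix)] =
  drmRdd.mapPartitions { it =>
    val rows = it.toArray
    if (rows.isEmpty) Iterator.empty
    else {
      val keys = rows.map(_._1)
      val block: Matrix = new DenseMatrix(rows.length, ncol)
      for (((_, vec), i) <- rows.iterator.zipWithIndex) block.assignRow(i, vec)
      Iterator((keys, block))
    }
  }
{code}

A real version would of course pick a sparse block type when the non-zero 
estimate warrants it.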

So now I need to reimplement what Bagel used to be doing, plus optimizer rules 
for choosing a distributed algorithm based on cost rules.

In fact, I came to the conclusion that there was zero benefit in using Bagel in 
the first place, since it just maps all its primitives onto shuffle-and-hash 
group-by RDD operations, so there is no actual operational benefit to using it.

I probably will reconstitute the algorithms in the first iteration using 
regular Spark primitives (groupBy and cartesian for multiplication blocks); a 
rough sketch of the slim A'A case follows.
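
For example, here is a minimal sketch of the slim A'A path using plain Spark 
primitives, assuming rows come as an RDD of (rowKey, rowVector) pairs and 
relying on the Kryo serialization for Mahout matrices mentioned above; the 
function name and signature are mine:

{code}
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{DenseMatrix, Matrix, Vector}

// "Slim" A'A: every row a_u contributes the outer product a_u * a_u';
// partial sums are built per partition, then reduced. Assumes the
// ncol x ncol result is small enough to fit in memory.
def slimAtA(drmRdd: RDD[(Int, Vector)], ncol: Int): Matrix =
  drmRdd
    .mapPartitions { it =>
      var acc: Matrix = new DenseMatrix(ncol, ncol)
      for ((_, row) <- it) acc = acc.plus(row.cross(row))
      Iterator(acc)
    }
    .reduce(_.plus(_))
{code}

The not-so-slim X'X and the A %*% B' case are where the group-by and cartesian 
primitives would come in.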

Once I plug in the missing pieces (e.g. slim matrix multiplication), I bet I 
will be able to fit a distributed SSVD version in 40 lines, just like the 
in-core one :)

Weighted ALS will still look less elegant because of some features lacking in 
the linear algebra. For example, it seems to need sparse block support (i.e. a 
bunch of sparse row or column vectors hanging off a very small hash map instead 
of a full-size array, as in SparseRow(Column)Matrix today), but it is still 
mostly R-like scripted as far as working with matrix blocks and decompositions 
goes.

So at this point I'd be willing to hear input on these ideas and direction, and 
perhaps some suggestions. Thanks.


> Spark Bindings (DRM)
> 
>
> Key: MAHOUT-1346
> URL: https://issues.apache.org/jira/browse/MAHOUT-1346
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> Spark bindings for Mahout DRM. 
> DRM DSL. 
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM in a Spark RDD with support for some basic 
> functionality, perhaps some humble beginnings of a cost-based optimizer: 
> (0) Spark serialization support for Vector, Matrix 
> (1) Bagel transposition 
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1346) Spark Bindings (DRM)

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1346:
-

Fix Version/s: (was: Backlog)
   1.0

> Spark Bindings (DRM)
> 
>
> Key: MAHOUT-1346
> URL: https://issues.apache.org/jira/browse/MAHOUT-1346
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> Spark bindings for Mahout DRM. 
> DRM DSL. 
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM in a Spark RDD with support for some basic 
> functionality, perhaps some humble beginnings of a cost-based optimizer: 
> (0) Spark serialization support for Vector, Matrix 
> (1) Bagel transposition 
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (MAHOUT-1408) Distributed cache file matching bug while running SSVD in broadcast mode

2014-02-19 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1408.
--

Resolution: Won't Fix

Don't see a reason to do anything.

> Distributed cache file matching bug while running SSVD in broadcast mode
> 
>
> Key: MAHOUT-1408
> URL: https://issues.apache.org/jira/browse/MAHOUT-1408
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.8
>Reporter: Angad Singh
>Assignee: Dmitriy Lyubimov
>Priority: Minor
> Attachments: BtJob.java.patch
>
>
> The error is:
> java.lang.IllegalArgumentException: Unexpected file name, unable to deduce 
> partition 
> #:file:/data/d1/mapred/local/taskTracker/distcache/434503979705629827_-1822139941_1047712745/nn.red.ua2.inmobi.com/user/rmcuser/oozie-oozi/0034272-140120102756143-oozie-oozi-W/inmobi-ssvd_mahout--java/java-launcher.jar
>   at 
> org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:154)
>   at 
> org.apache.mahout.math.hadoop.stochasticsvd.SSVDHelper$1.compare(SSVDHelper.java:1)
>   at java.util.Arrays.mergeSort(Arrays.java:1270)
>   at java.util.Arrays.mergeSort(Arrays.java:1281)
>   at java.util.Arrays.mergeSort(Arrays.java:1281)
>   at java.util.Arrays.sort(Arrays.java:1210)
>   at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.init(SequenceFileDirValueIterator.java:112)
>   at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.(SequenceFileDirValueIterator.java:94)
>   at 
> org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:220)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
>   at org.apache.hadoop.mapred.Child.main(Child.java:260)
> The bug is @ 
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java,
>  near line 220.
> and  @ 
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
>  near line 144.
> SSVDHelper's PARTITION_COMPARATOR assumes all files in the distributed cache 
> will have a particular pattern, whereas we have jar files in our distributed 
> cache, which causes the above exception.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906449#comment-13906449
 ] 

Hudson commented on MAHOUT-1329:


SUCCESS: Integrated in Mahout-Quality #2484 (See 
[https://builds.apache.org/job/Mahout-Quality/2484/])
MAHOUT-1329: reverting back the last change (smarthi: rev 1570023)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/core/pom.xml
* /mahout/trunk/examples/pom.xml
* /mahout/trunk/integration/pom.xml
* /mahout/trunk/math/pom.xml
* /mahout/trunk/pom.xml


> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Jenkins build is back to normal : Mahout-Quality #2484

2014-02-19 Thread Apache Jenkins Server
See 



Jenkins build is back to normal : Mahout-Examples-Classify-20News #432

2014-02-19 Thread Apache Jenkins Server
See 



Jenkins build is back to normal : Mahout-Examples-Cluster-Reuters #543

2014-02-19 Thread Apache Jenkins Server
See 



Re: Mahout on Spark?

2014-02-19 Thread Suneel Marthi





On Wednesday, February 19, 2014 7:22 PM, Ted Dunning  wrote:
 
On Wed, Feb 19, 2014 at 1:55 PM, peng  wrote:


> But maybe Mahout can include contribs for things M/R is not fit for, like
> downpour SGD or graph-based algorithms?
>

Yes.  Absolutely.

Downpour SGD is #1 on my list of features for 1.0; I will start working on that 
once the MultiLayer Perceptron is functional and integrated into the Mahout 
processing pipeline (should be by next week).

Re: Mahout on Spark?

2014-02-19 Thread Ted Dunning
On Wed, Feb 19, 2014 at 1:55 PM, peng  wrote:

> But maybe Mahout can include contribs for things M/R is not fit for, like
> downpour SGD or graph-based algorithms?
>

Yes.  Absolutely.


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906333#comment-13906333
 ] 

Suneel Marthi commented on MAHOUT-1329:
---

Gokhan, I remember now the conversation we had a few months ago about trying to 
avoid adding dependencies, and I agree with you on that. I completely forgot 
about that conversation, sorry. In light of the Hudson failure due to this 
patch, I am reverting this patch.

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Reopened] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reopened MAHOUT-1329:
---


> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Build failed in Jenkins: mahout-nightly #1502

2014-02-19 Thread Apache Jenkins Server
See 

Changes:

[smarthi] MAHOUT-1329: Mahout for Hadoop 2.x

--
Started by timer
Building remotely on ubuntu6 in workspace 

Updating http://svn.apache.org/repos/asf/mahout/trunk at revision 
'2014-02-19T23:12:24.928 +'
U examples/pom.xml
U integration/pom.xml
U CHANGELOG
U 
core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1MapperTest.java
U 
core/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1Mapper.java
U 
core/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialBuilder.java
U core/src/main/java/org/apache/mahout/classifier/df/DecisionForest.java
U core/pom.xml
U math/src/test/java/org/apache/mahout/math/TestSparseMatrix.java
U math/src/main/java/org/apache/mahout/math/SparseMatrix.java
U math/pom.xml
U pom.xml
At revision 1569962
Parsing POMs
Modules changed, recalculating dependency graph
maven3-agent.jar already up to date
maven3-interceptor.jar already up to date
maven3-interceptor-commons.jar already up to date
[trunk] $ /home/hudson/tools/java/latest1.6/bin/java -cp 
/home/jenkins/jenkins-slave/maven3-agent.jar:/home/hudson/tools/maven/apache-maven-3.0.4/boot/plexus-classworlds-2.4.jar
 org.jvnet.hudson.maven3.agent.Maven3Main 
/home/hudson/tools/maven/apache-maven-3.0.4 
/home/jenkins/jenkins-slave/slave.jar 
/home/jenkins/jenkins-slave/maven3-interceptor.jar 
/home/jenkins/jenkins-slave/maven3-interceptor-commons.jar 44023
<===[JENKINS REMOTING CAPACITY]===>channel started
   log4j:WARN No appenders could be found for logger 
(org.apache.commons.beanutils.converters.BooleanConverter).
log4j:WARN Please initialize the log4j system properly.
Executing Maven:  -B -f 
 
-Dmaven.repo.local=/home/jenkins/jenkins-slave/maven-repositories/0 clean 
install deploy -DskiptTests -Dmahout.skip.distribution=false
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]   
[ERROR]   The project org.apache.mahout:mahout-integration:1.0-SNAPSHOT 
( 
has 1 error
[ERROR] 'dependencies.dependency.version' for 
org.apache.hbase:hbase-client:jar must be a valid version but is 
'${hbase.version}'. @ line 143, column 16
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
channel stopped


Build failed in Jenkins: Mahout-Quality #2483

2014-02-19 Thread Apache Jenkins Server
See 

--
[...truncated 1633 lines...]
A 
math/src/test/java/org/apache/mahout/math/random/ChineseRestaurantTest.java
A math/src/test/java/org/apache/mahout/math/random/NormalTest.java
A math/src/test/java/org/apache/mahout/math/random/MultinomialTest.java
A math/src/test/java/org/apache/mahout/math/random/IndianBuffetTest.java
A 
math/src/test/java/org/apache/mahout/math/random/PoissonSamplerTest.java
A math/src/test/java/org/apache/mahout/math/PermutedVectorViewTest.java
A math/src/test/java/org/apache/mahout/math/DiagonalMatrixTest.java
A 
math/src/test/java/org/apache/mahout/math/TestRandomAccessSparseVector.java
A math/src/test/java/org/apache/mahout/math/FileBasedMatrixTest.java
A 
math/src/test/java/org/apache/mahout/math/VectorBinaryAssignCostTest.java
A math/src/test/resources
A math/src/test/resources/negative-binomial-test-data.csv
A math/src/test/resources/beta-test-data.csv
A math/src/test/resources/words.txt
A math/src/test/resources/hanging-svd.tsv
A math/src/test/java-templates
A math/src/test/java-templates/org
A math/src/test/java-templates/org/apache
A math/src/test/java-templates/org/apache/mahout
A math/src/test/java-templates/org/apache/mahout/math
A math/src/test/java-templates/org/apache/mahout/math/list
A 
math/src/test/java-templates/org/apache/mahout/math/list/ValueTypeArrayListTest.java.t
A math/src/test/java-templates/org/apache/mahout/math/set
A 
math/src/test/java-templates/org/apache/mahout/math/set/OpenKeyTypeHashSetTest.java.t
A math/src/test/java-templates/org/apache/mahout/math/map
A 
math/src/test/java-templates/org/apache/mahout/math/map/OpenKeyTypeObjectHashMapTest.java.t
A 
math/src/test/java-templates/org/apache/mahout/math/map/OpenObjectValueTypeHashMapTest.java.t
A 
math/src/test/java-templates/org/apache/mahout/math/map/OpenKeyTypeValueTypeHashMapTest.java.t
A math/src/main
A math/src/main/java-templates
A math/src/main/java-templates/org
A math/src/main/java-templates/org/apache
A math/src/main/java-templates/org/apache/mahout
A math/src/main/java-templates/org/apache/mahout/math
A math/src/main/java-templates/org/apache/mahout/math/map
A 
math/src/main/java-templates/org/apache/mahout/math/map/OpenObjectValueTypeHashMap.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/map/AbstractKeyTypeValueTypeMap.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/map/OpenKeyTypeValueTypeHashMap.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/map/AbstractKeyTypeObjectMap.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/map/OpenKeyTypeObjectHashMap.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/map/AbstractObjectValueTypeMap.java.t
A math/src/main/java-templates/org/apache/mahout/math/function
A 
math/src/main/java-templates/org/apache/mahout/math/function/KeyTypeProcedure.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/function/ValueTypeComparator.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/function/KeyTypeObjectProcedure.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/function/ObjectValueTypeProcedure.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/function/KeyTypeValueTypeProcedure.java.t
A math/src/main/java-templates/org/apache/mahout/math/buffer
A 
math/src/main/java-templates/org/apache/mahout/math/buffer/ValueTypeBufferConsumer.java.t
A math/src/main/java-templates/org/apache/mahout/math/list
A 
math/src/main/java-templates/org/apache/mahout/math/list/ValueTypeArrayList.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/list/AbstractValueTypeList.java.t
A math/src/main/java-templates/org/apache/mahout/math/set
A 
math/src/main/java-templates/org/apache/mahout/math/set/OpenKeyTypeHashSet.java.t
A 
math/src/main/java-templates/org/apache/mahout/math/set/AbstractKeyTypeSet.java.t
A math/src/main/java
A math/src/main/java/org
A math/src/main/java/org/apache
A math/src/main/java/org/apache/mahout
A math/src/main/java/org/apache/mahout/common
A math/src/main/java/org/apache/mahout/common/RandomUtils.java
A math/src/main/java/org/apache/mahout/common/RandomWrapper.java
A math/src/main/java/org/apache/mahout/math
A math/src/main/java/org/apache/mahout/math/MatrixTimesOps.java
A 
math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
A math/src/main/java/org/apache/mahout/math/PivotedMatrix.java
A math/src/m

Re: Mahout on Spark?

2014-02-19 Thread peng
It was suggested that I switch to MLlib for its performance, but I doubt 
that it is production ready; even if it is, I would still favour Hadoop's 
sturdiness and self-healing.
But maybe Mahout can include contribs for things M/R is not fit for, like 
downpour SGD or graph-based algorithms?


On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R for Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas  wrote:

+100 for this, different execution engines, like the direction pig and crunch 
take

Sent from my iPhone


On Feb 19, 2014, at 5:19 AM, Gokhan Capan  wrote:

I imagine Mahout offering users an option to select from different execution
engines (just like we currently do by giving M/R or sequential options),
starting with Spark. I am not sure what changes are needed in the codebase,
though. Maybe following MLI (or alike) and implementing some more stuff, such
as common interfaces for iterating over data (the M/R way and the Spark way).

IMO, another effort might be porting pre-online machine learning (such as
transforming text into vectors based on the dictionary generated by
seq2sparse before), machine learning based on mini-batches, and streaming
summarization stuff in Mahout to Spark-Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov wrote:


PS I am moving along a cost optimizer for spark-backed DRMs on some
multiplicative pipelines that is capable of figuring out different cost-based
rewrites, and an R-like DSL that mixes in-core and distributed matrix
representations and blocks, but it is painfully slow; I am really only doing it
a couple of nights a month. It does not look like I will be doing it on company
time any time soon (and even if I did, the company doesn't seem to be inclined
to contribute anything new I do on their time). It is all painfully slow;
there's no direct funding for it anywhere with no strings attached. That will
probably be the primary reason why Mahout would not be able to get much
traction compared to university-based contributions.


On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov wrote:



Unfortunately, methinks the prospects of something like a Mahout/MLlib merge
seem very unlikely due to vastly diverged approaches to the basics of linear
algebra (and other things). Just like one cannot grow a single tree out of
two trunks -- not easily, anyway.

It is fairly easy to port (and subsequently beat) MLlib at this point from
a collection-of-algorithms point of view. But IMO the goal should be more
MLI-like first, and a port second. And be very careful with concepts.
Something that I so far don't see happening with MLlib. MLlib seems to be
an old-style Mahout-like rush to become a collection of basic algorithms
rather than a coherent foundation. Admittedly, I haven't looked very closely.



On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter wrote:

I'm also convinced that Spark is a superior platform for executing
distributed ML algorithms. We've had a discussion about a change from
Hadoop to another platform some time ago, but at that point in time it was
not clear which of the upcoming dataflow processing systems (Spark,
Hyracks, Stratosphere) would establish itself amongst the users. To me it
seems pretty obvious that Spark made the race.

I concur with Ted, it would be great to have the communities work
together. I know that at least 4 Mahout committers (including me) are
already following Spark's mailing list and actively participating in the
discussions.

What are the ideas for how a fruitful cooperation could look?

Best,
Sebastian

PS:

I ported LLR-based cooccurrence analysis (aka item-based recommendation)
to Spark some time ago, but I haven't had time to test my code on a large
dataset yet. I'd be happy to see someone help with that.







On 02/19/2014 08:04 AM, Nick Pentreath wrote:

I know the Spark/MLlib devs can occasionally be quite set in ways of
doing certain things, but we'd welcome as many Mahout devs as possible to
work together.

It may be too late, but perhaps a GSoC project to look at a port of some
stuff like the co-occurrence recommender and streaming k-means?




N
--
Sent from Mailbox for iPhone

On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning wrote:

On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

My (admittedly heavily biased) view is Spark is a superior platform overall
for ML. If the two communities can work together to leverage the strengths
of Spark, and the large amount of good stuff in Mahout (as well as the
fantastic depth of experience of Mahout devs), I think a lot can be
achieved!

It makes a lot of sense that Spark would be better than Hadoop f

Build failed in Jenkins: Mahout-Examples-Classify-20News #431

2014-02-19 Thread Apache Jenkins Server
See 

Changes:

[smarthi] MAHOUT-1329: Mahout for Hadoop 2.x

--
[...truncated 1627 lines...]
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/ReloadFromJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/SQL92JDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/GenericJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/AbstractBooleanPrefJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/PostgreSQLBooleanPrefJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/MySQLBooleanPrefJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/AbstractJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/PostgreSQLJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/MySQLJDBCDataModel.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/jdbc/ConnectionPoolDataSource.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/model/mongodb/MongoDBDataModel.java
A integration/src/main/java/org/apache/mahout/cf/taste/impl/recommender
A 
integration/src/main/java/org/apache/mahout/cf/taste/impl/recommender/slopeone
A integration/src/main/java/org/apache/mahout/cf/taste/web
A 
integration/src/main/java/org/apache/mahout/cf/taste/web/RecommenderWrapper.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/web/RecommenderSingleton.java
A 
integration/src/main/java/org/apache/mahout/cf/taste/web/RecommenderServlet.java
A integration/src/main/java/org/apache/mahout/benchmark
A 
integration/src/main/java/org/apache/mahout/benchmark/PlusBenchmark.java
A 
integration/src/main/java/org/apache/mahout/benchmark/ClosestCentroidBenchmark.java
A 
integration/src/main/java/org/apache/mahout/benchmark/VectorBenchmarks.java
A 
integration/src/main/java/org/apache/mahout/benchmark/DotBenchmark.java
A 
integration/src/main/java/org/apache/mahout/benchmark/BenchmarkRunner.java
A 
integration/src/main/java/org/apache/mahout/benchmark/DistanceBenchmark.java
A 
integration/src/main/java/org/apache/mahout/benchmark/MinusBenchmark.java
A 
integration/src/main/java/org/apache/mahout/benchmark/SerializationBenchmark.java
A 
integration/src/main/java/org/apache/mahout/benchmark/CloneBenchmark.java
A 
integration/src/main/java/org/apache/mahout/benchmark/TimesBenchmark.java
A integration/src/main/java/org/apache/mahout/clustering
A integration/src/main/java/org/apache/mahout/clustering/evaluation
A 
integration/src/main/java/org/apache/mahout/clustering/evaluation/RepresentativePointsReducer.java
A 
integration/src/main/java/org/apache/mahout/clustering/evaluation/RepresentativePointsDriver.java
A 
integration/src/main/java/org/apache/mahout/clustering/evaluation/RepresentativePointsMapper.java
A 
integration/src/main/java/org/apache/mahout/clustering/evaluation/ClusterEvaluator.java
A integration/src/main/java/org/apache/mahout/clustering/cdbw
A 
integration/src/main/java/org/apache/mahout/clustering/cdbw/CDbwEvaluator.java
A integration/src/main/java/org/apache/mahout/clustering/lda
AU
integration/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
A integration/src/main/java/org/apache/mahout/clustering/conversion
A 
integration/src/main/java/org/apache/mahout/clustering/conversion/InputDriver.java
A 
integration/src/main/java/org/apache/mahout/clustering/conversion/InputMapper.java
A integration/src/main/java/org/apache/mahout/utils
A integration/src/main/java/org/apache/mahout/utils/clustering
A 
integration/src/main/java/org/apache/mahout/utils/clustering/AbstractClusterWriter.java
AU
integration/src/main/java/org/apache/mahout/utils/clustering/JsonClusterWriter.java
A 
integration/src/main/java/org/apache/mahout/utils/clustering/GraphMLClusterWriter.java
A 
integration/src/main/java/org/apache/mahout/utils/clustering/CSVClusterWriter.java
A 
integration/src/main/java/org/apache/mahout/utils/clustering/ClusterDumperWriter.java
A 
integration/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java
A 
integration/src/main/java/org/apache/mahout/utils/clustering/ClusterWriter.java
A integration/src/main/java/org/apache/mahout/utils/MatrixDumper.java
A 
integration/sr

Build failed in Jenkins: Mahout-Examples-Cluster-Reuters #542

2014-02-19 Thread Apache Jenkins Server
See 

Changes:

[smarthi] MAHOUT-1329: Mahout for Hadoop 2.x

--
Started by an SCM change
Building remotely on ubuntu1 in workspace 

Updating https://svn.apache.org/repos/asf/mahout/trunk at revision 
'2014-02-19T21:37:11.461 +'
U examples/pom.xml
U integration/pom.xml
U CHANGELOG
U 
core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1MapperTest.java
U core/pom.xml
U math/pom.xml
U pom.xml
At revision 1569928
No emails were triggered.
[trunk] $ /home/hudson/tools/maven/apache-maven-3.0.4/bin/mvn -DskipTests=true 
-U clean install
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]   
[ERROR]   The project org.apache.mahout:mahout-integration:1.0-SNAPSHOT 
(
 has 1 error
[ERROR] 'dependencies.dependency.version' for 
org.apache.hbase:hbase-client:jar must be a valid version but is 
'${hbase.version}'. @ line 143, column 16
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
Build step 'Invoke top-level Maven targets' marked build as failure


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906068#comment-13906068
 ] 

Hudson commented on MAHOUT-1329:


FAILURE: Integrated in Mahout-Quality #2482 (See 
[https://builds.apache.org/job/Mahout-Quality/2482/])
MAHOUT-1329: Mahout for Hadoop 2.x (smarthi: rev 1569900)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/core/pom.xml
* /mahout/trunk/examples/pom.xml
* /mahout/trunk/integration/pom.xml
* /mahout/trunk/math/pom.xml
* /mahout/trunk/pom.xml


> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Build failed in Jenkins: Mahout-Quality #2482

2014-02-19 Thread Apache Jenkins Server
See 

Changes:

[smarthi] MAHOUT-1329: Mahout for Hadoop 2.x

--
[...truncated 1633 lines...]
A math/src/test/java/org/apache/mahout/math/jet
A math/src/test/java/org/apache/mahout/math/jet/stat
A 
math/src/test/java/org/apache/mahout/math/jet/stat/ProbabilityTest.java
A math/src/test/java/org/apache/mahout/math/jet/stat/GammaTest.java
A math/src/test/java/org/apache/mahout/math/jet/random
A 
math/src/test/java/org/apache/mahout/math/jet/random/DistributionChecks.java
A math/src/test/java/org/apache/mahout/math/jet/random/GammaTest.java
A math/src/test/java/org/apache/mahout/math/jet/random/engine
A 
math/src/test/java/org/apache/mahout/math/jet/random/engine/MersenneTwisterTest.java
A 
math/src/test/java/org/apache/mahout/math/jet/random/ExponentialTest.java
A math/src/test/java/org/apache/mahout/math/jet/random/NormalTest.java
A 
math/src/test/java/org/apache/mahout/math/jet/random/NegativeBinomialTest.java
A math/src/test/java/org/apache/mahout/math/FileBasedMatrixTest.java
AUmath/src/test/java/org/apache/mahout/math/MatrixTest.java
A math/src/test/java/org/apache/mahout/math/MatricesTest.java
A math/src/test/java/org/apache/mahout/math/set
A math/src/test/java/org/apache/mahout/math/set/HashUtilsTest.java
A math/src/test/java/org/apache/mahout/math/stats
A math/src/test/java/org/apache/mahout/math/stats/GroupTreeTest.java
A 
math/src/test/java/org/apache/mahout/math/stats/OnlineSummarizerTest.java
A math/src/test/java/org/apache/mahout/math/stats/TDigestTest.java
AUmath/src/test/java/org/apache/mahout/math/stats/LogLikelihoodTest.java
A 
math/src/test/java/org/apache/mahout/math/stats/OnlineExponentialAverageTest.java
A math/src/test/java/org/apache/mahout/math/OldQRDecompositionTest.java
A math/src/test/java/org/apache/mahout/math/decomposer
A math/src/test/java/org/apache/mahout/math/decomposer/hebbian
AU
math/src/test/java/org/apache/mahout/math/decomposer/hebbian/TestHebbianSolver.java
A math/src/test/java/org/apache/mahout/math/decomposer/lanczos
AU
math/src/test/java/org/apache/mahout/math/decomposer/lanczos/TestLanczosSolver.java
AUmath/src/test/java/org/apache/mahout/math/decomposer/SolverTest.java
A math/src/main
A math/src/main/java
A math/src/main/java/org
A math/src/main/java/org/apache
A math/src/main/java/org/apache/mahout
A math/src/main/java/org/apache/mahout/common
A math/src/main/java/org/apache/mahout/common/RandomUtils.java
A math/src/main/java/org/apache/mahout/common/RandomWrapper.java
A math/src/main/java/org/apache/mahout/math
A math/src/main/java/org/apache/mahout/math/solver
A math/src/main/java/org/apache/mahout/math/solver/LSMR.java
A 
math/src/main/java/org/apache/mahout/math/solver/EigenDecomposition.java
A math/src/main/java/org/apache/mahout/math/solver/Preconditioner.java
A 
math/src/main/java/org/apache/mahout/math/solver/JacobiConditioner.java
A 
math/src/main/java/org/apache/mahout/math/solver/ConjugateGradientSolver.java
A 
math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
A math/src/main/java/org/apache/mahout/math/map
A math/src/main/java/org/apache/mahout/math/map/PrimeFinder.java
A math/src/main/java/org/apache/mahout/math/map/package-info.java
A 
math/src/main/java/org/apache/mahout/math/map/QuickOpenIntIntHashMap.java
A math/src/main/java/org/apache/mahout/math/map/HashFunctions.java
A math/src/main/java/org/apache/mahout/math/map/OpenHashMap.java
AUmath/src/main/java/org/apache/mahout/math/SparseColumnMatrix.java
AUmath/src/main/java/org/apache/mahout/math/Vector.java
A math/src/main/java/org/apache/mahout/math/FileBasedMatrix.java
A math/src/main/java/org/apache/mahout/math/NamedVector.java
A math/src/main/java/org/apache/mahout/math/Matrices.java
A math/src/main/java/org/apache/mahout/math/VectorIterable.java
A math/src/main/java/org/apache/mahout/math/set
A math/src/main/java/org/apache/mahout/math/set/AbstractSet.java
A math/src/main/java/org/apache/mahout/math/set/OpenHashSet.java
A math/src/main/java/org/apache/mahout/math/set/HashUtils.java
A math/src/main/java/org/apache/mahout/math/RandomTrinaryMatrix.java
A math/src/main/java/org/apache/mahout/math/OldQRDecomposition.java
A math/src/main/java/org/apache/mahout/math/decomposer
AU
math/src/main/java/org/apache/mahout/math/decomposer/AsyncEigenVerifier.java
AU
math/src/main/java/org/apache/mahout/math/decomposer/SingularVectorVerifier.java
A math

[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906062#comment-13906062
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Is it OK to add hadoop dependencies to the project root, and to the math module 
(actually to all modules, even though they already depend on the core module)?

I remember that's what we wanted to avoid

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1329:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to trunk

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (MAHOUT-1420) Add solr-recommender to examples

2014-02-19 Thread Andrew Musselman (JIRA)
Andrew Musselman created MAHOUT-1420:


 Summary: Add solr-recommender to examples
 Key: MAHOUT-1420
 URL: https://issues.apache.org/jira/browse/MAHOUT-1420
 Project: Mahout
  Issue Type: New Feature
  Components: Examples
Affects Versions: 0.9
Reporter: Andrew Musselman
Assignee: Andrew Musselman
Priority: Minor
 Fix For: 1.0


Write a new example that builds a solr-recommender based on Pat's code at 
https://github.com/pferrel/solr-recommender and which has the glue scripts 
needed to pipe all the way from start (raw data) to finish (running web service 
and UI page).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1329:
-

Assignee: Suneel Marthi

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-19 Thread Isabel Drost-Fromm (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905643#comment-13905643
 ] 

Isabel Drost-Fromm commented on MAHOUT-1418:


[~smarthi] No worries - I was just confused (with JIRA having been re-indexed 
this morning and all) :)

> Removal of write access to anything but CMS for username isabel
> ---
>
> Key: MAHOUT-1418
> URL: https://issues.apache.org/jira/browse/MAHOUT-1418
> Project: Mahout
>  Issue Type: Task
>Reporter: Isabel Drost-Fromm
>Assignee: Isabel Drost-Fromm
>Priority: Trivial
>
> Hi,
> Please remove write access for the username "isabel" - effective mid-March. For 
> background, check the Mahout board report of October last year*.
> Don't worry - I'm not planning to go completely silent and offline by then. 
> However, I know from several years of doing Berlin Buzzwords that being 
> completely sleep deprived is not a good state in which to commit to subversion - 
> except that when sleep deprived I usually don't remember this insight. So 
> this is my safety net, forcing me to go through the regular "submit patch in 
> JIRA, get it reviewed and committed" cycle (except for documentation changes).
> * 
> 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Hadoop 2 support

2014-02-19 Thread Suneel Marthi
Yes 





On Wednesday, February 19, 2014 10:43 AM, Sergey Svinarchuk  wrote:
 
Thanks!
This patch will be added in mahout 1.0?


On Wed, Feb 19, 2014 at 5:39 PM, Suneel Marthi wrote:

> Thanks for the patch Sergey. I tested this with Hadoop 1 and 2 and can
> confirm that all unit tests pass and the examples work.
>
>
>
>
>
>
> On Wednesday, February 19, 2014 9:39 AM, Sean Owen 
> wrote:
>
> Hmm I thought there was already a profile for this, but on second
> look, I only see a settable hadoop.version. It has both hadoop-core
> and hadoop-common dependencies which isn't right. I bet this patch
> clarifies the difference properly, and that's got to be good.
>
> I think I am thinking of how the CDH packaging does something like
> this. Thanks, I thought this was already in HEAD in some form. It does
> need to be, since the code does in fact run OK on Hadoop 2, modulo the
> odd bug due to behavior differences.
>
>
> On Wed, Feb 19, 2014 at 2:28 PM, Sergey Svinarchuk
>  wrote:
> > This patch updates the mahout dependency and creates 2 profiles:
> >     - hadoop1: builds mahout with the Hadoop 1 dependency (used by default)
> >     - hadoop2: builds mahout with the Hadoop 2 dependency
> > Because if you build mahout now and then run it on a machine where
> > Hadoop 2 is installed, the mahout job will fail.
> >
> >
> > On Wed, Feb 19, 2014 at 4:21 PM, Sean Owen  wrote:
> >
> >> Sergey I think it already worked with 2.0, no? (Although it doesn't
> >> actually use the 2.x APIs). Is this for 2.2 and/or what are the
> >> high-level changes? I'd imagine mostly packaging stuff.
> >>
> >> On Wed, Feb 19, 2014 at 2:14 PM, Sergey Svinarchuk
> >>  wrote:
> >> > Today I updated the patch in M-1329 against trunk. It's the ticket that
> >> > adds hadoop2 support to Mahout.
> >> > I built Mahout with the patch and all unit tests passed for hadoop1 and
> >> > hadoop2.
> >> > I also tested examples/bin on both Hadoop versions.
> >> > Could somebody from the committers review and test the patch?
> >> >
> >> > Thanks,
> >> > Sergey!
> >> >

> >>
> >
>


Re: Hadoop 2 support

2014-02-19 Thread Sergey Svinarchuk
Thanks!
Will this patch be added in Mahout 1.0?


On Wed, Feb 19, 2014 at 5:39 PM, Suneel Marthi wrote:

> Thanks for the patch, Sergey. I tested this with Hadoop 1 and 2 and can
> confirm that all unit tests pass and the examples work.
>
>
>
>
>
>
> On Wednesday, February 19, 2014 9:39 AM, Sean Owen 
> wrote:
>
> Hmm I thought there was already a profile for this, but on second
> look, I only see a settable hadoop.version. It has both hadoop-core
> and hadoop-common dependencies which isn't right. I bet this patch
> clarifies the difference properly, and that's got to be good.
>
> I think I am thinking of how the CDH packaging does something like
> this. Thanks, I thought this was already in HEAD in some form. It does
> need to be, since the code does in fact run OK on Hadoop 2, modulo the
> odd bug due to behavior differences.
>
>
> On Wed, Feb 19, 2014 at 2:28 PM, Sergey Svinarchuk
>  wrote:
> > This patch updates the Mahout dependencies and creates 2 profiles:
> > - hadoop1: builds Mahout with the Hadoop 1 dependency (used by default)
> > - hadoop2: builds Mahout with the Hadoop 2 dependency
> > Without it, if you build Mahout now and then run it on a machine where
> > Hadoop 2 is installed, the Mahout job will fail.
> >
> >
> > On Wed, Feb 19, 2014 at 4:21 PM, Sean Owen  wrote:
> >
> >> Sergey, I think it already worked with 2.0, no? (Although it doesn't
> >> actually use the 2.x APIs). Is this for 2.2 and/or what are the
> >> high-level changes? I'd imagine mostly packaging stuff.
> >>
> >> On Wed, Feb 19, 2014 at 2:14 PM, Sergey Svinarchuk
> >>  wrote:
> >> > Today I updated the patch in M-1329 against trunk. It's the ticket that
> >> > adds hadoop2 support to Mahout.
> >> > I built Mahout with the patch and all unit tests passed for hadoop1 and
> >> > hadoop2.
> >> > I also tested examples/bin on both Hadoop versions.
> >> > Could somebody from the committers review and test the patch?
> >> >
> >> > Thanks,
> >> > Sergey!
> >> >
> >>
> >
>



Re: Hadoop 2 support

2014-02-19 Thread Suneel Marthi
Thanks for the patch, Sergey. I tested this with Hadoop 1 and 2 and can confirm 
that all unit tests pass and the examples work.






On Wednesday, February 19, 2014 9:39 AM, Sean Owen  wrote:
 
Hmm I thought there was already a profile for this, but on second
look, I only see a settable hadoop.version. It has both hadoop-core
and hadoop-common dependencies which isn't right. I bet this patch
clarifies the difference properly, and that's got to be good.

I think I am thinking of how the CDH packaging does something like
this. Thanks, I thought this was already in HEAD in some form. It does
need to be, since the code does in fact run OK on Hadoop 2, modulo the
odd bug due to behavior differences.


On Wed, Feb 19, 2014 at 2:28 PM, Sergey Svinarchuk
 wrote:
> This patch updates the Mahout dependencies and creates 2 profiles:
>     - hadoop1: builds Mahout with the Hadoop 1 dependency (used by default)
>     - hadoop2: builds Mahout with the Hadoop 2 dependency
> Without it, if you build Mahout now and then run it on a machine where
> Hadoop 2 is installed, the Mahout job will fail.
>
>
> On Wed, Feb 19, 2014 at 4:21 PM, Sean Owen  wrote:
>
>> Sergey, I think it already worked with 2.0, no? (Although it doesn't
>> actually use the 2.x APIs). Is this for 2.2 and/or what are the
>> high-level changes? I'd imagine mostly packaging stuff.
>>
>> On Wed, Feb 19, 2014 at 2:14 PM, Sergey Svinarchuk
>>  wrote:
>> > Today I updated the patch in M-1329 against trunk. It's the ticket that
>> > adds hadoop2 support to Mahout.
>> > I built Mahout with the patch and all unit tests passed for hadoop1 and
>> > hadoop2.
>> > I also tested examples/bin on both Hadoop versions.
>> > Could somebody from the committers review and test the patch?
>> >
>> > Thanks,
>> > Sergey!
>> >
>>
>

[jira] [Commented] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-19 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905560#comment-13905560
 ] 

Suneel Marthi commented on MAHOUT-1418:
---

Yeah, I deleted the comment - fine, shoot me down, guys.

> Removal of write access to anything but CMS for username isabel
> ---
>
> Key: MAHOUT-1418
> URL: https://issues.apache.org/jira/browse/MAHOUT-1418
> Project: Mahout
>  Issue Type: Task
>Reporter: Isabel Drost-Fromm
>Assignee: Isabel Drost-Fromm
>Priority: Trivial
>
> Hi,
> Please remove write access for user name "isabel" - effective mid-March. For 
> background, check the Mahout board report of October last year*.
> Don't worry - I'm not planning to go completely silent and offline by then. 
> However, I know from several years of doing Berlin Buzzwords that being 
> completely sleep-deprived is not a good state in which to commit to subversion - 
> except that when sleep-deprived I usually don't remember this insight. So 
> this is my safety net, forcing me to go through the regular cycle of submitting 
> a patch in JIRA and getting it reviewed and committed (except for documentation 
> changes).
> * 
> 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-19 Thread Manuel Blechschmidt (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905554#comment-13905554
 ] 

Manuel Blechschmidt commented on MAHOUT-1418:
-

Hi [~isabel],
the comment was deleted yesterday at 18:51 by its author. If you click on "All" 
you can see this, and I still have a copy in my inbox :-)

/Manuel

> Removal of write access to anything but CMS for username isabel
> ---
>
> Key: MAHOUT-1418
> URL: https://issues.apache.org/jira/browse/MAHOUT-1418
> Project: Mahout
>  Issue Type: Task
>Reporter: Isabel Drost-Fromm
>Assignee: Isabel Drost-Fromm
>Priority: Trivial
>
> Hi,
> Please remove write access for user name "isabel" - effective mid-March. For 
> background, check the Mahout board report of October last year*.
> Don't worry - I'm not planning to go completely silent and offline by then. 
> However, I know from several years of doing Berlin Buzzwords that being 
> completely sleep-deprived is not a good state in which to commit to subversion - 
> except that when sleep-deprived I usually don't remember this insight. So 
> this is my safety net, forcing me to go through the regular cycle of submitting 
> a patch in JIRA and getting it reviewed and committed (except for documentation 
> changes).
> * 
> 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (MAHOUT-1418) Removal of write access to anything but CMS for username isabel

2014-02-19 Thread Isabel Drost-Fromm (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost-Fromm resolved MAHOUT-1418.


Resolution: Fixed

I distinctly remember there being a comment by [~smarthi] earlier today - weird.

The INFRA issue is closed by now - so closing this one as well. Remember: this 
doesn't mean you've gotten rid of me - JIRA still works, as do mail and 
anything else linked to my Apache account. All it means is that my changes to 
artifacts checked into the Mahout svn repo must explicitly go through review.

> Removal of write access to anything but CMS for username isabel
> ---
>
> Key: MAHOUT-1418
> URL: https://issues.apache.org/jira/browse/MAHOUT-1418
> Project: Mahout
>  Issue Type: Task
>Reporter: Isabel Drost-Fromm
>Assignee: Isabel Drost-Fromm
>Priority: Trivial
>
> Hi,
> Please remove write access for user name "isabel" - effective mid-March. For 
> background, check the Mahout board report of October last year*.
> Don't worry - I'm not planning to go completely silent and offline by then. 
> However, I know from several years of doing Berlin Buzzwords that being 
> completely sleep-deprived is not a good state in which to commit to subversion - 
> except that when sleep-deprived I usually don't remember this insight. So 
> this is my safety net, forcing me to go through the regular cycle of submitting 
> a patch in JIRA and getting it reviewed and committed (except for documentation 
> changes).
> * 
> 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Hadoop 2 support

2014-02-19 Thread Sean Owen
Hmm I thought there was already a profile for this, but on second
look, I only see a settable hadoop.version. It has both hadoop-core
and hadoop-common dependencies which isn't right. I bet this patch
clarifies the difference properly, and that's got to be good.

I think I am thinking of how the CDH packaging does something like
this. Thanks, I thought this was already in HEAD in some form. It does
need to be, since the code does in fact run OK on Hadoop 2, modulo the
odd bug due to behavior differences.

On Wed, Feb 19, 2014 at 2:28 PM, Sergey Svinarchuk
 wrote:
> This patch updates the Mahout dependencies and creates 2 profiles:
> - hadoop1: builds Mahout with the Hadoop 1 dependency (used by default)
> - hadoop2: builds Mahout with the Hadoop 2 dependency
> Without it, if you build Mahout now and then run it on a machine where
> Hadoop 2 is installed, the Mahout job will fail.
>
>
> On Wed, Feb 19, 2014 at 4:21 PM, Sean Owen  wrote:
>
>> Sergey, I think it already worked with 2.0, no? (Although it doesn't
>> actually use the 2.x APIs). Is this for 2.2 and/or what are the
>> high-level changes? I'd imagine mostly packaging stuff.
>>
>> On Wed, Feb 19, 2014 at 2:14 PM, Sergey Svinarchuk
>>  wrote:
>> > Today I updated the patch in M-1329 against trunk. It's the ticket that
>> > adds hadoop2 support to Mahout.
>> > I built Mahout with the patch and all unit tests passed for hadoop1 and
>> > hadoop2.
>> > I also tested examples/bin on both Hadoop versions.
>> > Could somebody from the committers review and test the patch?
>> >
>> > Thanks,
>> > Sergey!
>> >
>>
>


Re: Hadoop 2 support

2014-02-19 Thread Sergey Svinarchuk
This patch updates the Mahout dependencies and creates 2 profiles:
- hadoop1: builds Mahout with the Hadoop 1 dependency (used by default)
- hadoop2: builds Mahout with the Hadoop 2 dependency
Without it, if you build Mahout now and then run it on a machine where
Hadoop 2 is installed, the Mahout job will fail.
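
A quick way to tell which Hadoop a job will actually run against is to check
the version on the classpath. The sketch below is illustrative only and is not
part of 1329.patch; org.apache.hadoop.util.VersionInfo is Hadoop's standard
API for this, and the typical symptom of a mismatched build is an
IncompatibleClassChangeError, since types such as
org.apache.hadoop.mapreduce.JobContext changed between the two major lines.

    // Illustrative sketch, not part of the patch: report which Hadoop
    // version is on the classpath before submitting a Mahout job.
    import org.apache.hadoop.util.VersionInfo;

    public class HadoopVersionCheck {
      public static void main(String[] args) {
        String v = VersionInfo.getVersion();   // e.g. "1.2.1" or "2.2.0"
        System.out.println("Hadoop on classpath: " + v);
        if (v.startsWith("2.")) {
          System.out.println("Use a Mahout built with the hadoop2 profile.");
        } else {
          System.out.println("Use a Mahout built with the hadoop1 profile (the default).");
        }
      }
    }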


On Wed, Feb 19, 2014 at 4:21 PM, Sean Owen  wrote:

> Sergey, I think it already worked with 2.0, no? (Although it doesn't
> actually use the 2.x APIs). Is this for 2.2 and/or what are the
> high-level changes? I'd imagine mostly packaging stuff.
>
> On Wed, Feb 19, 2014 at 2:14 PM, Sergey Svinarchuk
>  wrote:
> > Today I updated the patch in M-1329 against trunk. It's the ticket that
> > adds hadoop2 support to Mahout.
> > I built Mahout with the patch and all unit tests passed for hadoop1 and
> > hadoop2.
> > I also tested examples/bin on both Hadoop versions.
> > Could somebody from the committers review and test the patch?
> >
> > Thanks,
> > Sergey!
> >
>



Re: Hadoop 2 support

2014-02-19 Thread Sean Owen
Sergey, I think it already worked with 2.0, no? (Although it doesn't
actually use the 2.x APIs). Is this for 2.2 and/or what are the
high-level changes? I'd imagine mostly packaging stuff.

On Wed, Feb 19, 2014 at 2:14 PM, Sergey Svinarchuk
 wrote:
> Today I updated the patch in M-1329 against trunk. It's the ticket that adds
> hadoop2 support to Mahout.
> I built Mahout with the patch and all unit tests passed for hadoop1 and hadoop2.
> I also tested examples/bin on both Hadoop versions.
> Could somebody from the committers review and test the patch?
>
> Thanks,
> Sergey!
>


Hadoop 2 support

2014-02-19 Thread Sergey Svinarchuk
Today I updated the patch in M-1329 against trunk. It's the ticket that adds
hadoop2 support to Mahout.
I built Mahout with the patch and all unit tests passed for hadoop1 and hadoop2.
I also tested examples/bin on both Hadoop versions.
Could somebody from the committers review and test the patch?

Thanks,
Sergey!



[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Sergey Svinarchuk (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905464#comment-13905464
 ] 

Sergey Svinarchuk commented on MAHOUT-1329:
---

Updated the patch against trunk.
Build Mahout for Hadoop 1 -> mvn clean package
Build Mahout for Hadoop 2 -> mvn clean package -Dhadoop.profile=200

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Sergey Svinarchuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Svinarchuk updated MAHOUT-1329:
--

Attachment: 1329.patch

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Sergey Svinarchuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Svinarchuk updated MAHOUT-1329:
--

Attachment: (was: 1329-2.patch)

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Sergey Svinarchuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Svinarchuk updated MAHOUT-1329:
--

Attachment: (was: 1329.diff)

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Sergey Svinarchuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Svinarchuk updated MAHOUT-1329:
--

Attachment: 1329.diff

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Sergey Svinarchuk (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Svinarchuk updated MAHOUT-1329:
--

Attachment: (was: 1329.patch)

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>  Labels: patch
> Fix For: 1.0
>
>
> Update Mahout to work with Hadoop 2.x, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905075#comment-13905075
 ] 

Sean Owen edited comment on MAHOUT-1419 at 2/19/14 1:14 PM:


The significant change is computing 'split points' rather than considering all 
values as split points.

For numeric features, this means taking percentiles, rather than every distinct 
value, which was incredibly slow for any data set with a numeric feature.

Categorical split points were optimized -- was pointlessly allocating a count 
array for every datum, not every distinct category value.

Handling of the count arrays and counting occurrences was unified, and 
simplified; there was no point in making them members as they were not reused.

(There are a few micro-optimizations, such as to the entropy method.)
(Also fixed an NPE in BuildForest.)

The tests had to change as a result. The test for equivalence between 
OptIgSplit and DefaultIgSplit is no longer valid, as they intentionally do not 
necessarily behave the same way now. The VisualizerTest now results in 
different values, but I verified that it's a superficial difference. The trees 
chosen before and after are equivalent since the decision thresholds, while 
different, chop up the data identically on the tiny test data sets.

I end up observing a *50x* speedup with this change, although this is in a test 
that exercises only building, and on a data set that would be maximally 
affected by this bottleneck. 


was (Author: srowen):
The significant change is computing 'split points' rather than considering all 
values as split points.

For numeric features, this means taking percentiles, rather than every distinct 
value, which was incredibly slow for any data set with a numeric feature.

Categorical split points were optimized -- was pointlessly allocating a count 
array for every datum, not every distinct category value.

Handling of the count arrays and counting occurrences was unified, and 
simplified; there was no point in making them members as they were not reused.

(There are a few micro-optimizations, such as to the entropy method.)
(Also fixed an NPE in BuildForest.)

The tests had to change as a result. The test for equivalence between 
OptIgSplit and DefaultIgSplit is no longer valid, as they intentionally do 
not behave the same way. The VisualizerTest now results in different values, 
but I verified that it's actually a superficial difference. The trees chosen 
before and after are equivalent since the decision thresholds, while different, 
chop up the data identically.

I end up observing a *50x* speedup with this change, although this is in a test 
that exercises only building, and on a data set that would be maximally 
affected by this bottleneck. 

> Random decision forest is excessively slow on numeric features
> --
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sean Owen
> Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it 
> take an unreasonably long time on about 2GB of data -- like, >24 hours when 
> other RDF M/R implementations take 9 minutes. The difference is big enough to 
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A 
> split is tried for every distinct numeric value of the feature in the whole 
> data set. Since these are floating point values, they could (and in the 
> customer's case are) all distinct. 200K rows means 200K splits to evaluate 
> every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate 
> only those as splits. Really doing that efficiently would require a lot of 
> rewrite. However, there are some modest changes possible which get some of 
> the benefit, and appear to make it run about 3x faster. That is, on a data 
> set that exhibits this problem -- meaning one using numeric features which 
> are generally distinct -- which is not exotic.
> There are comparable but different problems with handling of categorical 
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating 
> only a sample of splits instead of every single possible one. In particular 
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit". 
> Although I think the point is that "optimized" may mean giving different 
> choices of split here, which could yield differing trees. So that test 
> probably has to go.

[jira] [Commented] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905395#comment-13905395
 ] 

Sean Owen commented on MAHOUT-1419:
---

Yes, you could compute summary statistics once and for all for each feature 
across the whole data set. That's efficient. 

The wrinkle is that, as you go down the tree, you're looking at a smaller 
subset of the data, whose percentiles for that feature may be very different 
from the global ones. This patch does the simplest thing, which is to recompute 
percentiles every time. That adapts to the data, but is more work. But hey, it's 
already a big win and a simple change.

Agree, the core change here is fairly simple -- just compute the split points 
intelligently, and then count stuff by split, then iterate over splits.

Here it's not picking random percentiles but trying x different evenly spaced 
percentiles in order. Pretty reasonable.
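
For concreteness, here is a minimal sketch of the evenly-spaced-percentile
idea (illustrative names, not the patch itself): sort the values the node
actually sees and take evenly spaced quantiles as candidate thresholds, so the
number of candidate splits stays fixed no matter how many distinct values the
feature has.

    // Minimal sketch of percentile-based candidate splits; names are
    // illustrative, not Mahout's. 'values' holds one numeric feature over
    // the data reaching the current tree node.
    import java.util.Arrays;

    public class PercentileSplits {
      static double[] candidateSplits(double[] values, int numSplits) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] splits = new double[numSplits];
        for (int i = 0; i < numSplits; i++) {
          // evenly spaced percentiles: 1/(n+1), 2/(n+1), ..., n/(n+1)
          int idx = (int) ((long) (i + 1) * sorted.length / (numSplits + 1));
          splits[i] = sorted[Math.min(idx, sorted.length - 1)];
        }
        return splits;
      }

      public static void main(String[] args) {
        double[] feature = {0.3, 9.1, 2.7, 4.4, 5.0, 1.2, 8.8, 6.5};
        // 3 candidate thresholds instead of one per distinct value
        System.out.println(Arrays.toString(candidateSplits(feature, 3)));
      }
    }

Recomputing the candidates per node, as described above, is then just a matter
of calling this on the subset of values that reaches each node.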

> Random decision forest is excessively slow on numeric features
> --
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sean Owen
> Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it 
> take an unreasonably long time on about 2GB of data -- like, >24 hours when 
> other RDF M/R implementations take 9 minutes. The difference is big enough to 
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A 
> split is tried for every distinct numeric value of the feature in the whole 
> data set. Since these are floating point values, they could (and in the 
> customer's case are) all distinct. 200K rows means 200K splits to evaluate 
> every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate 
> only those as splits. Really doing that efficiently would require a lot of 
> rewrite. However, there are some modest changes possible which get some of 
> the benefit, and appear to make it run about 3x faster. That is, on a data 
> set that exhibits this problem -- meaning one using numeric features which 
> are generally distinct -- which is not exotic.
> There are comparable but different problems with handling of categorical 
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating 
> only a sample of splits instead of every single possible one. In particular 
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit". 
> Although I think the point is that "optimized" may mean giving different 
> choices of split here, which could yield differing trees. So that test 
> probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the 
> code that added up to maybe a 3% speedup. And fixed an NPE too.)
> I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-1419:
--

Attachment: MAHOUT-1419.patch

> Random decision forest is excessively slow on numeric features
> --
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sean Owen
> Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it 
> take an unreasonably long time on about 2GB of data -- like, >24 hours when 
> other RDF M/R implementations take 9 minutes. The difference is big enough to 
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A 
> split is tried for every distinct numeric value of the feature in the whole 
> data set. Since these are floating point values, they could (and in the 
> customer's case are) all distinct. 200K rows means 200K splits to evaluate 
> every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate 
> only those as splits. Really doing that efficiently would require a lot of 
> rewrite. However, there are some modest changes possible which get some of 
> the benefit, and appear to make it run about 3x faster. That is, on a data 
> set that exhibits this problem -- meaning one using numeric features which 
> are generally distinct -- which is not exotic.
> There are comparable but different problems with handling of categorical 
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating 
> only a sample of splits instead of every single possible one. In particular 
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit". 
> Although I think the point is that "optimized" may mean giving different 
> choices of split here, which could yield differing trees. So that test 
> probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the 
> code that added up to maybe a 3% speedup. And fixed an NPE too.)
> I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-1419:
--

Attachment: (was: MAHOUT-1419.patch)

> Random decision forest is excessively slow on numeric features
> --
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sean Owen
> Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it 
> take an unreasonably long time on about 2GB of data -- like, >24 hours when 
> other RDF M/R implementations take 9 minutes. The difference is big enough to 
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A 
> split is tried for every distinct numeric value of the feature in the whole 
> data set. Since these are floating point values, they could (and in the 
> customer's case are) all distinct. 200K rows means 200K splits to evaluate 
> every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate 
> only those as splits. Really doing that efficiently would require a lot of 
> rewrite. However, there are some modest changes possible which get some of 
> the benefit, and appear to make it run about 3x faster. That is, on a data 
> set that exhibits this problem -- meaning one using numeric features which 
> are generally distinct -- which is not exotic.
> There are comparable but different problems with handling of categorical 
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating 
> only a sample of splits instead of every single possible one. In particular 
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit". 
> Although I think the point is that "optimized" may mean giving different 
> choices of split here, which could yield differing trees. So that test 
> probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the 
> code that added up to maybe a 3% speedup. And fixed an NPE too.)
> I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Mahout on Spark?

2014-02-19 Thread Sean Owen
To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.
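
For what it's worth, the Crunch point is about coding against an
engine-neutral pipeline API. A toy sketch follows (these are Crunch's real
classes, but the example itself is purely illustrative): the logic is written
against the Pipeline interface, so a different engine's Pipeline
implementation could in principle be swapped in without touching the rest of
the program.

    // Toy illustration of engine portability via Crunch's Pipeline API.
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;

    public class CrunchPortability {
      public static void main(String[] args) {
        // Swap this construction for another engine's Pipeline
        // implementation and the rest of the program stays the same.
        Pipeline pipeline = new MRPipeline(CrunchPortability.class);
        PCollection<String> lines = pipeline.readTextFile(args[0]);
        pipeline.writeTextFile(lines, args[1]);
        pipeline.done();
      }
    }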

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas  wrote:
> +100 for this, different execution engines, like the direction Pig and
> Crunch take
>
> Sent from my iPhone
>
>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan  wrote:
>>
>> I imagine Mahout offering an option to its users to select from
>> different execution engines (just like we currently do by offering M/R or
>> sequential options), starting with Spark. I am not sure what changes are
>> needed in the codebase, though. Maybe following MLI (or the like) and
>> implementing some more stuff, such as common interfaces for iterating over
>> data (the M/R way and the Spark way).
>>
>> IMO, another effort might be porting the pre-online machine learning (such as
>> transforming text into vectors based on the dictionary generated beforehand by
>> seq2sparse), the machine learning based on mini-batches, and the streaming
>> summarization stuff in Mahout to Spark Streaming.
>>
>> Best,
>> Gokhan
>>
>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov wrote:
>>
>>> PS I am moving along a cost optimizer for spark-backed DRMs on some
>>> multiplicative pipelines that is capable of figuring out different cost-based
>>> rewrites, and an R-like DSL that mixes in-core and distributed matrix
>>> representations and blocks, but it is painfully slow; i am really only doing
>>> it a couple of nights a month. It does not look like i will be doing it on
>>> company time any time soon (and even if i did, the company doesn't seem to
>>> be inclined to contribute anything new I do on their time). It is
>>> all painfully slow; there's no direct funding for it anywhere with no
>>> strings attached. That will probably be the primary reason why Mahout would
>>> not be able to get much traction compared to university-based contributions.
>>>
>>>
>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov wrote:
>>>
 Unfortunately methinks the prospects of something like a Mahout/MLlib merge
 seem very unlikely due to vastly diverged approaches to the basics of linear
 algebra (and other things). Just like one cannot grow a single tree out of
 two trunks -- not easily, anyway.

 It is fairly easy to port (and subsequently beat) MLlib at this point from
 a collection-of-algorithms point of view. But IMO the goal should be more
 MLI-like first, and port second. And be very careful with concepts.
 Something that i so far don't see happening with MLlib. MLlib seems to be
 an old-style Mahout-like rush to become a collection of basic algorithms
 rather than a coherent foundation. Admittedly, i haven't looked very closely.


 On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter wrote:

> I'm also convinced that Spark is a superior platform for executing
> distributed ML algorithms. We've had a discussion about a change from
> Hadoop to another platform some time ago, but at that point in time it
> was not clear which of the upcoming dataflow processing systems (Spark,
> Hyracks, Stratosphere) would establish itself amongst the users. To me
> it seems pretty obvious that Spark won the race.
>
> I concur with Ted, it would be great to have the communities work
> together. I know that at least 4 Mahout committers (including me) are
> already following Spark's mailing list and actively participating in the
> discussions.
>
> What are the ideas for what a fruitful cooperation could look like?
>
> Best,
> Sebastian
>
> PS:
>
> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> to Spark some time ago, but I haven't had time to test my code on a
> large dataset yet. I'd be happy to see someone help with that.
>
>
>
>
>
>
>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>
>> I know the Spark/MLlib devs can occasionally be quite set in their ways of
>> doing certain things, but we'd welcome as many Mahout devs as possible
>> to work together.
>>
>>
>> It may be too late, but perhaps a GSoC project to look at a port of
>> some stuff like the co-occurrence recommender and streaming k-means?
>>
>>
>>
>>
>> N
>> --
>> Sent from Mailbox for iPhone
>>
>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning 
>> wrote:
>>
>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> My (admittedly heavily biased) view is Spark is a superior platform
>>> overall for ML. If the two communities can work together to leverage the
>>> strengths

[jira] [Updated] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-1419:
--

Attachment: MAHOUT-1419.patch

> Random decision forest is excessively slow on numeric features
> --
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sean Owen
> Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it 
> take an unreasonably long time on about 2GB of data -- like, >24 hours when 
> other RDF M/R implementations take 9 minutes. The difference is big enough to 
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A 
> split is tried for every distinct numeric value of the feature in the whole 
> data set. Since these are floating point values, they could (and in the 
> customer's case are) all distinct. 200K rows means 200K splits to evaluate 
> every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate 
> only those as splits. Really doing that efficiently would require a lot of 
> rewrite. However, there are some modest changes possible which get some of 
> the benefit, and appear to make it run about 3x faster. That is, on a data 
> set that exhibits this problem -- meaning one using numeric features which 
> are generally distinct -- which is not exotic.
> There are comparable but different problems with handling of categorical 
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating 
> only a sample of splits instead of every single possible one. In particular 
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit". 
> Although I think the point is that "optimized" may mean giving different 
> choices of split here, which could yield differing trees. So that test 
> probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the 
> code that added up to maybe a 3% speedup. And fixed an NPE too.)
> I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905075#comment-13905075
 ] 

Sean Owen edited comment on MAHOUT-1419 at 2/19/14 8:57 AM:


The significant change is computing 'split points' rather than considering all 
values as split points.

For numeric features, this means taking percentiles, rather than every distinct 
value, which was incredibly slow for any data set with a numeric feature.

Categorical split points were optimized -- was pointlessly allocating a count 
array for every datum, not every distinct category value.

Handling of the count arrays and counting occurrences was unified, and 
simplified; there was no point in making them members as they were not reused.

(There are a few micro-optimizations, such as to the entropy method.)
(Also fixed an NPE in BuildForest.)

The tests had to change as a result. The test for equivalence between 
OptIgSplit and DefaultIgSplit is no longer valid, as they intentionally do 
not behave the same way. The VisualizerTest now results in different values, 
but I verified that it's actually a superficial difference. The trees chosen 
before and after are equivalent since the decision thresholds, while different, 
chop up the data identically.

I end up observing a *50x* speedup with this change, although this is in a test 
that exercises only building, and on a data set that would be maximally 
affected by this bottleneck. 


was (Author: srowen):
The significant change is computing 'split points' rather than considering all 
values as split points.

For numeric features, this means taking percentiles, rather than every distinct 
value, which was incredibly slow for any data set with a numeric feature.

Categorical split points were optimized -- was pointlessly allocating a count 
array for every datum, not every distinct category value.

Handling of the count arrays and counting occurrences was unified, and 
simplified; there was no point in making them members as they were not reused.

(There are a few micro-optimizations, such as to the entropy method.)
(Also fixed an NPE in BuildForest.)

> Random decision forest is excessively slow on numeric features
> --
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.7, 0.8, 0.9
>Reporter: Sean Owen
> Attachments: MAHOUT-1419.patch
>
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it 
> take an unreasonably long time on about 2GB of data -- like, >24 hours when 
> other RDF M/R implementations take 9 minutes. The difference is big enough to 
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A 
> split is tried for every distinct numeric value of the feature in the whole 
> data set. Since these are floating point values, they could (and in the 
> customer's case are) all distinct. 200K rows means 200K splits to evaluate 
> every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate 
> only those as splits. Really doing that efficiently would require a lot of 
> rewrite. However, there are some modest changes possible which get some of 
> the benefit, and appear to make it run about 3x faster. That is, on a data 
> set that exhibits this problem -- meaning one using numeric features which 
> are generally distinct -- which is not exotic.
> There are comparable but different problems with handling of categorical 
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating 
> only a sample of splits instead of every single possible one. In particular 
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit". 
> Although I think the point is that "optimized" may mean giving different 
> choices of split here, which could yield differing trees. So that test 
> probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the 
> code that added up to maybe a 3% speedup. And fixed an NPE too.)
> I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Mahout on Spark?

2014-02-19 Thread Gokhan Capan
I imagine Mahout offering an option to its users to select from
different execution engines (just like we currently do by offering M/R or
sequential options), starting with Spark. I am not sure what changes are
needed in the codebase, though. Maybe following MLI (or the like) and
implementing some more stuff, such as common interfaces for iterating over
data (the M/R way and the Spark way).

IMO, another effort might be porting the pre-online machine learning (such as
transforming text into vectors based on the dictionary generated beforehand by
seq2sparse), the machine learning based on mini-batches, and the streaming
summarization stuff in Mahout to Spark Streaming.

Best,
Gokhan
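
To make the common-interface idea above concrete, here is a purely
hypothetical sketch -- none of these types exist in Mahout -- of an
engine-neutral contract that an M/R-backed, Spark-backed, or sequential
implementation could each satisfy:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Hypothetical sketch only; these interfaces are not Mahout API.
    interface Fn<T, R> { R apply(T t); }

    // Algorithms would code against this; each execution engine
    // (M/R, Spark, sequential) supplies its own implementation.
    interface DistributedData<T> {
      Iterator<T> iterateLocal();              // sequential view of the data
      <R> DistributedData<R> map(Fn<T, R> fn); // engine-specific transformation
    }

    // The trivial "sequential engine", standing in for M/R or Spark backends.
    class LocalData<T> implements DistributedData<T> {
      private final List<T> data;
      LocalData(List<T> data) { this.data = data; }
      public Iterator<T> iterateLocal() { return data.iterator(); }
      public <R> DistributedData<R> map(Fn<T, R> fn) {
        List<R> out = new ArrayList<R>();
        for (T t : data) { out.add(fn.apply(t)); }
        return new LocalData<R>(out);
      }
    }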

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov wrote:

> PS I am moving along a cost optimizer for spark-backed DRMs on some
> multiplicative pipelines that is capable of figuring out different cost-based
> rewrites, and an R-like DSL that mixes in-core and distributed matrix
> representations and blocks, but it is painfully slow; i am really only doing
> it a couple of nights a month. It does not look like i will be doing it on
> company time any time soon (and even if i did, the company doesn't seem to
> be inclined to contribute anything new I do on their time). It is
> all painfully slow; there's no direct funding for it anywhere with no
> strings attached. That will probably be the primary reason why Mahout would
> not be able to get much traction compared to university-based contributions.
>
>
> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov wrote:
>
> > Unfortunately methinks the prospects of something like a Mahout/MLlib merge
> > seem very unlikely due to vastly diverged approaches to the basics of
> > linear algebra (and other things). Just like one cannot grow a single tree
> > out of two trunks -- not easily, anyway.
> >
> > It is fairly easy to port (and subsequently beat) MLlib at this point from
> > a collection-of-algorithms point of view. But IMO the goal should be more
> > MLI-like first, and port second. And be very careful with concepts.
> > Something that i so far don't see happening with MLlib. MLlib seems to be
> > an old-style Mahout-like rush to become a collection of basic algorithms
> > rather than a coherent foundation. Admittedly, i haven't looked very
> > closely.
> >
> >
> > On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter wrote:
> >
> >> I'm also convinced that Spark is a superior platform for executing
> >> distributed ML algorithms. We've had a discussion about a change from
> >> Hadoop to another platform some time ago, but at that point in time it
> >> was not clear which of the upcoming dataflow processing systems (Spark,
> >> Hyracks, Stratosphere) would establish itself amongst the users. To me
> >> it seems pretty obvious that Spark won the race.
> >>
> >> I concur with Ted, it would be great to have the communities work
> >> together. I know that at least 4 Mahout committers (including me) are
> >> already following Spark's mailing list and actively participating in the
> >> discussions.
> >>
> >> What are the ideas for what a fruitful cooperation could look like?
> >>
> >> Best,
> >> Sebastian
> >>
> >> PS:
> >>
> >> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> >> to Spark some time ago, but I haven't had time to test my code on a
> >> large dataset yet. I'd be happy to see someone help with that.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
> >>
> >>> I know the Spark/MLlib devs can occasionally be quite set in their ways of
> >>> doing certain things, but we'd welcome as many Mahout devs as possible
> >>> to work together.
> >>>
> >>>
> >>> It may be too late, but perhaps a GSoC project to look at a port of
> >>> some stuff like the co-occurrence recommender and streaming k-means?
> >>>
> >>>
> >>>
> >>>
> >>> N
> >>> --
> >>> Sent from Mailbox for iPhone
> >>>
> >>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning 
> >>> wrote:
> >>>
> >>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
> >>> nick.pentre...@gmail.com> wrote:
> >>>
> > My (admittedly heavily biased) view is Spark is a superior platform
> > overall for ML. If the two communities can work together to leverage the
> > strengths of Spark, and the large amount of good stuff in Mahout (as well
> > as the fantastic depth of experience of Mahout devs) I think a lot can be
> > achieved!
> >
> >  It makes a lot of sense that Spark would be better than Hadoop for ML
> >  purposes given that Hadoop was intended to do web-crawl kinds of things
> >  and Spark was intentionally built to support machine learning.
> >  Given that Spark has been announced by a majority of the Hadoop-based
> >  distribution vendors, it makes sense that maybe Mahout should jump in.
> >  I really would prefer it if the two communities (MLlib/MLI and Mahout)
> >  could work more closely together.  There is a lot of good to be had on
> >  both sides.
> 
> >

[jira] [Updated] (MAHOUT-1419) Random decision forest is excessively slow on numeric features

2014-02-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-1419:
--

Status: Patch Available  (was: Open)

The significant change is computing 'split points' rather than considering all 
values as split points.

For numeric features, this means taking percentiles, rather than every distinct 
value, which was incredibly slow for any data set with a numeric feature.

Categorical split points were optimized -- was pointlessly allocating a count 
array for every datum, not every distinct category value.

Handling of the count arrays and counting occurrences was unified, and 
simplified; there was no point in making them members as they were not reused.

(There are a few micro-optimizations, such as to the entropy method.)
(Also fixed an NPE in BuildForest.)
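
As an illustration of the kind of entropy micro-optimization meant here (a
sketch with made-up names, not the actual patch): compute entropy from the
per-class counts in a single pass, skipping zero counts, instead of deriving
probabilities per datum.

    // Sketch only -- not Mahout's IgSplit code. Entropy from label counts,
    // one pass over the counts array, zero counts skipped.
    public class Entropy {
      static double entropy(int[] counts, int total) {
        if (total == 0) {
          return 0.0;
        }
        double h = 0.0;
        double invTotal = 1.0 / total;
        for (int c : counts) {
          if (c > 0) {
            double p = c * invTotal;
            h -= p * Math.log(p);   // nats; divide by Math.log(2) for bits
          }
        }
        return h;
      }

      public static void main(String[] args) {
        System.out.println(entropy(new int[] {5, 3, 2}, 10));  // 5/3/2 class split
      }
    }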

> Random decision forest is excessively slow on numeric features
> --
>
> Key: MAHOUT-1419
> URL: https://issues.apache.org/jira/browse/MAHOUT-1419
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification
>Affects Versions: 0.9, 0.8, 0.7
>Reporter: Sean Owen
>
> Follow-up to MAHOUT-1417. There's a customer running this and observing it 
> take an unreasonably long time on about 2GB of data -- like, >24 hours when 
> other RDF M/R implementations take 9 minutes. The difference is big enough to 
> probably be considered a defect. MAHOUT-1417 got that down to about 5 hours. 
> I am trying to further improve it.
> One key issue seems to be how splits are evaluated over numeric features. A 
> split is tried for every distinct numeric value of the feature in the whole 
> data set. Since these are floating point values, they could (and in the 
> customer's case are) all distinct. 200K rows means 200K splits to evaluate 
> every time a node is built on the feature.
> A better approach is to sample percentiles out of the feature and evaluate 
> only those as splits. Really doing that efficiently would require a lot of 
> rewrite. However, there are some modest changes possible which get some of 
> the benefit, and appear to make it run about 3x faster. That is, on a data 
> set that exhibits this problem -- meaning one using numeric features which 
> are generally distinct -- which is not exotic.
> There are comparable but different problems with handling of categorical 
> features, but that's for a different patch.
> I have a patch, but it changes behavior to some extent since it is evaluating 
> only a sample of splits instead of every single possible one. In particular 
> it makes the output of "OptIgSplit" no longer match the "DefaultIgSplit". 
> Although I think the point is that "optimized" may mean giving different 
> choices of split here, which could yield differing trees. So that test 
> probably has to go.
> (Along the way I found a number of micro-optimizations in this part of the 
> code that added up to maybe a 3% speedup. And fixed an NPE too.)
> I will propose a patch shortly with all of this for thoughts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)