[jira] [Commented] (MAHOUT-1848) drmSampleKRows in FlinkEngine should generate a dense or sparse matrix

2016-05-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268053#comment-15268053
 ] 

Hudson commented on MAHOUT-1848:


SUCCESS: Integrated in Mahout-Quality #3354 (See 
[https://builds.apache.org/job/Mahout-Quality/3354/])
MAHOUT-1848: drmSampleKRows in FlinkEngine should generate a dense or (smarthi: 
rev 6ab5a8d6456dfc52d7a951ae71618a6417516a07)
* flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala


> drmSampleKRows in FlinkEngine should generate a dense or sparse matrix
> --
>
> Key: MAHOUT-1848
> URL: https://issues.apache.org/jira/browse/MAHOUT-1848
> Project: Mahout
>  Issue Type: Bug
>  Components: Flink
>Affects Versions: 0.12.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 0.12.1
>
>
> drmSampleKRows in FlinkEngine should generate a dense or sparse matrix based 
> on the type of vector in the sampled Dataset
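The decision the fix describes can be sketched in a few lines: pick the matrix backing from the kind of vector found in the sampled rows. This is a simplified, self-contained stand-in (the `Vec`, `DenseVec`, and `SparseVec` types and the returned names are hypothetical, not Mahout's actual classes or the FlinkEngine code):

```scala
// Hedged sketch: choose a dense or sparse matrix backing based on the
// type of vector in the sampled rows, as the issue description says.
// All types and names here are simplified stand-ins, not Mahout's.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(size: Int, nonZeros: Map[Int, Double]) extends Vec

def matrixKindFor(sampledRows: Seq[Vec]): String = {
  require(sampledRows.nonEmpty, "need at least one sampled row")
  sampledRows.head match {
    case _: DenseVec  => "DenseMatrix"      // dense rows -> dense backing
    case _: SparseVec => "SparseRowMatrix"  // sparse rows -> sparse backing
  }
}
```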



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build is back to normal : Mahout-Quality #3354

2016-05-02 Thread Apache Jenkins Server
See 



[jira] [Commented] (MAHOUT-1848) drmSampleKRows in FlinkEngine should generate a dense or sparse matrix

2016-05-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267995#comment-15267995
 ] 

ASF GitHub Bot commented on MAHOUT-1848:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/233




[jira] [Resolved] (MAHOUT-1848) drmSampleKRows in FlinkEngine should generate a dense or sparse matrix

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved MAHOUT-1848.
---
Resolution: Fixed



[jira] [Commented] (MAHOUT-1848) drmSampleKRows in FlinkEngine should generate a dense or sparse matrix

2016-05-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267994#comment-15267994
 ] 

ASF GitHub Bot commented on MAHOUT-1848:


GitHub user smarthi opened a pull request:

https://github.com/apache/mahout/pull/233

MAHOUT-1848: drmSampleKRows in FlinkEngine should generate a dense or 
sparse matrix



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smarthi/mahout MAHOUT-1848

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #233


commit 0e5a1d4fbee9c13d2352cb88cb6b2c74379a7bdc
Author: smarthi 
Date:   2016-05-03T03:03:10Z

MAHOUT-1848: drmSampleKRows in FlinkEngine should generate a dense or 
sparse matrix






[jira] [Work started] (MAHOUT-1848) drmSampleKRows in FlinkEngine should generate a dense or sparse matrix

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1848 started by Suneel Marthi.
-


Re: stochastic nature

2016-05-02 Thread Dmitriy Lyubimov
also, Mahout does have an optimizer that simply decides on the degree of
parallelism of the _product_. I.e., if it computes C = A'B, then it figures
that the final result should be split N ways. But it doesn't apply the
partition function -- it just uses the usual hash partitioner to forward
the keys; I don't think we ever override that.
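The "usual hash partitioner" behaviour mentioned above can be sketched as follows: a key is routed to partition `hashCode mod N`, adjusted to stay non-negative. This is a simplified stand-alone version mirroring the contract of Spark's default hash partitioning, not Spark's own class:

```scala
// Simplified stand-in for the default hash partitioner: each key goes to
// partition (key.hashCode mod numPartitions), shifted into [0, N). For
// well-mixed keys this behaves like a uniform multinomial assignment.
def hashPartition(key: Any, numPartitions: Int): Int = {
  require(numPartitions > 0, "numPartitions must be positive")
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}
```

The same key always lands in the same partition, which is why no custom partition function is needed just to forward keys.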

On Mon, May 2, 2016 at 9:39 AM, Dmitriy Lyubimov  wrote:

> by probabilistic algorithms I mostly mean inference involving Monte Carlo
> type mechanisms (Gibbs sampling LDA, which I think might still be part of
> our MR collection, might be an example, as well as its faster counterpart,
> variational Bayes inference).
>
> the parallelization strategies are just standard Spark mechanisms (in the
> case of Spark), mostly using their standard hash samplers (which in math
> speak are really uniform multinomial samplers).
>
> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim 
> wrote:
>
>> Hey Dmitriy -
>>
>> Yes, I meant probabilistic algorithms. If Mahout doesn’t use
>> probabilistic algos, then how does it accomplish a degree of optimal
>> parallelization? Wouldn’t you need randomization to spread out the
>> processing of tasks?
>>
>> > On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov 
>> wrote:
>> >
>> > yes, Mahout has stochastic SVD and PCA, which are described at length
>> > in the Samsara book. The book examples in Andrew Palumbo's github also
>> > contain an example of computing a k-means|| sketch.
>> >
>> > if you mean _probabilistic_ algorithms: although I have done some
>> > things outside the public domain, nothing has been contributed.
>> >
>> > You are very welcome to try something if you don't have big constraints
>> > on OSS contribution.
>> >
>> > -d
>> >
>> > On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim wrote:
>> >
>> >> Hey All,
>> >>
>> >> I’d like to know if Mahout uses any randomized algorithms. I’m
>> >> thinking it probably does. Can somebody point me to the packages that
>> >> utilize randomized algos?
>> >>
>> >> Thanks,
>> >>
>> >> Khurrum
>> >>
>> >>
>>
>>
>


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-05-02 Thread Dmitriy Lyubimov
graph = graft, sorry. Graft just the AtB class into the 0.12 codebase.

On Mon, May 2, 2016 at 9:06 AM, Dmitriy Lyubimov  wrote:

> ok.
>
> Nikaash,
> could you perhaps do one more experiment and graph the 0.10 a'b code into
> 0.12 code (or whatever branch you say is not working the same) so we can
> confirm that the culprit change is indeed AB'?
>
> thank you very much.
>
> -d
>
> On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri 
> wrote:
>
>> Hi,
>>
>> I tried commenting out those lines and it did marginally improve the
>> performance, although the 0.10 version still significantly outperforms it.
>>
>> Here is a screenshot of the saveAsTextFile job (attached as selection1).
>> The AtB step took about 34 mins, which is significantly more than using
>> 0.10. Similarly, the saveAsTextFile action takes about 9 mins as well.
>>
>> The selection2 file is a screenshot of the flatMap at AtB.scala job,
>> which ran for 34 minutes,
>>
>> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
>> would take time, while subsequent such operations for the other indicators
>> would be orders of magnitude faster. In the current job, the subsequent
>> AtB operations take time similar to the first one.
>>
>> A snapshot of my code is as follows:
>>
>> var existingRowIDs: Option[BiDictionary] = None
>>
>> // The first action named in the sequence is the "primary" action and begins 
>> to fill up the user dictionary
>> for (actionDescription <- actionInput) {
>>   // grab the path to actions
>>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
>> actionDescription._2,
>> schema = DefaultIndexedDatasetElementReadSchema,
>> existingRowIDs = existingRowIDs)
>>   existingRowIDs = Some(action.rowIDs)
>>
>>   ...
>> }
>>
>> which seems fairly standard, so I hope I'm not making a mistake here.
>>
>> It looks like the 0.11-onward version is using computeAtBZipped3 for
>> performing the multiplication in atb_nograph_mmul, unlike 0.10, which was
>> using atb_nograph. Though I'm not really sure whether that makes much of a
>> difference.
>>
>> Thank you,
>> Nikaash Puri
>>
>> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel 
>> wrote:
>>
>>> Right, will do. But Nikaash, if you could just comment out those lines
>>> and see if it has an effect, it would be informative and perhaps even solve
>>> your problem sooner than my changes. No great rush. Playing around with
>>> different values, as Dmitriy says, might yield better results, and for that
>>> you can mess with the code or wait for my changes.
>>>
>>> Yeah, it’s fast enough in most cases. The main work is the optimized
>>> A’A, A’B stuff in the BLAS optimizer Dmitriy put in. It is something like
>>> 10x faster than a similar algo in Hadoop MR. This particular calc and
>>> generalization is not in any other Spark or now Flink lib that I know of.
>>>
>>>
>>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov 
>>> wrote:
>>>
>>> Nikaash,
>>>
>>> yes, unfortunately you may need to play with parallelism manually for
>>> your particular load/cluster to get the best out of it. I guess Pat will
>>> be adding the option.
>>>
>>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri 
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Sure, I’ll do some more detailed analysis of the jobs on the UI and
>>> > share screenshots if possible.
>>> >
>>> > Pat, yup, I’ll only be able to get to this on Monday, though. I’ll
>>> > comment out the line and see the difference in performance.
>>> >
>>> > Thanks so much for helping guys, I really appreciate it.
>>> >
>>> > Also, the algorithm implementation for LLR is extremely performant, at
>>> > least as of Mahout 0.10. I ran some tests for around 61 days of data
>>> > (which in our case is a fair amount) and the model was built in about 20
>>> > minutes, which is pretty amazing. This was using a pretty decent sized
>>> > cluster, though.
>>> >
>>> > Thank you,
>>> > Nikaash Puri
>>> >
>>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel  wrote:
>>> >
>>> > There are some other changes I want to make for the next rev so I’ll do
>>> > that.
>>> >
>>> > Nikaash, it would still be nice to verify this fixes your problem;
>>> > also, if you want to create a Jira, it will guarantee I don’t forget.
>>> >
>>> >
>>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov 
>>> wrote:
>>> >
>>> yes -- I would do it as an optional parameter -- just like par does: do
>>> nothing, try auto, or try an exact number of splits
>>> >
>>> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel 
>>> wrote:
>>> >
>>> >> It’s certainly easy to put this in the driver, taking it out of the
>>> >> algo.
>>> >>
>>> >> Dmitriy, is it a candidate for an Option param to the algo? That would
>>> >> catch cases where people rely on it now (like my old DStream example)
>>> >> but easily allow it to be overridden to None to imitate pre-0.11, or
>>> >> passed in when the app knows better.
>>> >>
>>> >> Nikaash, are you in a position to comment out the .par(auto=true

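The optional parallelism control the thread above converges on (do nothing, try auto, or force an exact number of splits) could be modelled along these lines. This is a hypothetical sketch of the option's shape, not Mahout's actual API:

```scala
// Hypothetical model of the proposed optional parallelism parameter:
//   None              -> do nothing (the pre-0.11 behaviour)
//   Some(Auto)        -> let the optimizer pick the number of splits
//   Some(Exact(n))    -> force exactly n splits
sealed trait ParHint
case object Auto extends ParHint
final case class Exact(splits: Int) extends ParHint

def describePar(hint: Option[ParHint]): String = hint match {
  case None           => "engine default (pre-0.11 behaviour)"
  case Some(Auto)     => "auto-chosen splits"
  case Some(Exact(n)) => s"exactly $n splits"
}
```

Defaulting to `None` would preserve existing behaviour while letting callers who know their load/cluster override it.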
Jenkins build is back to normal : mahout-nightly » Mahout Integration #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Apache Mahout #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Math Scala bindings #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Map-Reduce #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Flink bindings #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Spark bindings #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly #2080

2016-05-02 Thread Apache Jenkins Server
See 



Jenkins build is back to normal : mahout-nightly » Mahout Math #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Build Tools #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Release Package #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Examples #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout HDFS #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout Spark bindings shell #2080

2016-05-02 Thread Apache Jenkins Server
See 




Jenkins build is back to normal : mahout-nightly » Mahout H2O backend #2080

2016-05-02 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Mahout-Quality #3353

2016-05-02 Thread Apache Jenkins Server
See 

Changes:

[smarthi] NoJira: Fix Javadocs Warnings

--
[...truncated 112258 lines...]
05/03/2016 00:05:47 Combine (Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(21/24)
 switched to CANCELING 
05/03/2016 00:05:47 Combine (Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(22/24)
 switched to CANCELING 
05/03/2016 00:05:47 Combine (Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(23/24)
 switched to CANCELING 
05/03/2016 00:05:47 Combine (Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(24/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(1/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(2/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(7/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(8/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(9/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(10/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(11/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(12/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(14/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(15/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(16/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(17/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(18/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(19/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(20/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(21/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(22/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(23/24)
 switched to CANCELING 
05/03/2016 00:05:47 Reduce(Reduce at 
org.apache.mahout.flinkbindings.blas.FlinkOpAt$.sparseTrick(FlinkOpAt.scala:61))(24/24)
 switched to CANCELING 
05/03/2016 00:05:47 CHAIN Join(Join at 
org.apache.flink.api.scala.UnfinishedJoinOperation.createJoinFunctionAssigner(joinDataSet.scala:278))
 -> FlatMap (FlatMap at 
org.apache.mahout.flinkbindings.blas.FlinkOpAtB$.notZippable(FlinkOpAtB.scala:52))(1/24)
 switched to CANCELED 
05/03/2016 00:05:47 CHAIN Join(Join at 
org.apache.flink.api.scala.UnfinishedJoinOperation.createJoinFunctionAssigner(joinDataSet.scala:278))
 -> FlatMap (FlatMap at 
org.apache.mahout.flinkbindings.blas.FlinkOpAtB$.notZippable(FlinkOpAtB.scala:52))(2/24)
 switched to CANCELED 
05/03/2016 00:05:47 CHAIN Join(Join at 
org.apache.flink.api.scala.UnfinishedJoinOperation.createJoinFunctionAssigner(joinDataSet.scala:278))
 -> FlatMap (FlatMap at 
org.apache.mahout.flinkbindings.blas.FlinkOpAtB$.notZippable(FlinkOpAtB.scala:52))(3/24)
 switched to CANCELED 
05/03/2016 00:05:47 CHAIN Join(Join at 
org.apache.flink.api.scala.UnfinishedJoinOperation.createJoinFunctionAssigner(joinDataSet.scala:278))
 -> FlatMap (FlatMap at 
org.apache.mahout.flinkbindings.blas.FlinkOpAtB$.notZippable(FlinkOpAtB.scala:52))(4/24)
 switched to CANCELED 
05/03/2016 00:05:47 CHAIN Join(Join at 
org.apache.flink.api.scala.UnfinishedJoinOperation.createJoinFunctionAssigner(joinDataSe

[jira] [Created] (MAHOUT-1848) drmSampleKRows in FlinkEngine should generate a dense or sparse matrix

2016-05-02 Thread Suneel Marthi (JIRA)
Suneel Marthi created MAHOUT-1848:
-

 Summary: drmSampleKRows in FlinkEngine should generate a dense or 
sparse matrix
 Key: MAHOUT-1848
 URL: https://issues.apache.org/jira/browse/MAHOUT-1848
 Project: Mahout
  Issue Type: Bug
  Components: Flink
Affects Versions: 0.12.0
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.12.1


drmSampleKRows in FlinkEngine should generate a dense or sparse matrix based on 
the type of vector in the sampled Dataset





[jira] [Issue Comment Deleted] (MAHOUT-1830) Publish scaladocs for Mahout 0.12.0 release

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1830:
--
Comment: was deleted

(was: [~sslavic]   ?? This is on the critical path.)

> Publish scaladocs for Mahout 0.12.0 release
> ---
>
> Key: MAHOUT-1830
> URL: https://issues.apache.org/jira/browse/MAHOUT-1830
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 0.12.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Critical
> Fix For: 0.12.1
>
>
> Need to publish scaladocs for Mahout 0.12.0, present scaladocs out there are 
> from 0.10.2 release.





[jira] [Assigned] (MAHOUT-1830) Publish scaladocs for Mahout 0.12.0 release

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi reassigned MAHOUT-1830:
-

Assignee: Suneel Marthi  (was: Stevo Slavic)



Build failed in Jenkins: Mahout-Quality #3352

2016-05-02 Thread Apache Jenkins Server
See 

Changes:

[smarthi] MAHOUT-1847: drmSampleRows in FlinkEngine doesn't wrap Int Keys when

--
[...truncated 93171 lines...]
05/02/2016 23:21:14 DataSink 
(org.apache.flink.api.java.Utils$CollectHelper@1ca2865b)(1/1) switched to 
DEPLOYING 
05/02/2016 23:21:14 CHAIN DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat) -> Map (Map at 
org.apache.mahout.flinkbindings.drm.CheckpointedFlinkDrm.dim$lzycompute(CheckpointedFlinkDrm.scala:79))
 -> Combine (Reduce at 
org.apache.mahout.flinkbindings.drm.CheckpointedFlinkDrm.dim$lzycompute(CheckpointedFlinkDrm.scala:83))(3/4)
 switched to FINISHED 
05/02/2016 23:21:14 Reduce (Reduce at 
org.apache.mahout.flinkbindings.drm.CheckpointedFlinkDrm.dim$lzycompute(CheckpointedFlinkDrm.scala:83))(1/1)
 switched to FINISHED 
05/02/2016 23:21:14 DataSink 
(org.apache.flink.api.java.Utils$CollectHelper@1ca2865b)(1/1) switched to 
RUNNING 
05/02/2016 23:21:14 DataSink 
(org.apache.flink.api.java.Utils$CollectHelper@1ca2865b)(1/1) switched to 
FINISHED 
05/02/2016 23:21:14 Job execution switched to status FINISHED.
(40,40)
05/02/2016 23:21:15 Job execution switched to status RUNNING.
05/02/2016 23:21:15 DataSource (at 
org.apache.mahout.flinkbindings.blas.FlinkOpTimesRightMatrix$.drmTimesInCore(FlinkOpTimesRightMatrix.scala:50)
 (org.apache.flink.api.java.io.Collec)(1/1) switched to SCHEDULED 
05/02/2016 23:21:15 DataSource (at 
org.apache.mahout.flinkbindings.blas.FlinkOpTimesRightMatrix$.drmTimesInCore(FlinkOpTimesRightMatrix.scala:50)
 (org.apache.flink.api.java.io.Collec)(1/1) switched to DEPLOYING 
05/02/2016 23:21:15 DataSource (at 
org.apache.mahout.flinkbindings.blas.FlinkOpTimesRightMatrix$.drmTimesInCore(FlinkOpTimesRightMatrix.scala:50)
 (org.apache.flink.api.java.io.Collec)(1/1) switched to SCHEDULED 
05/02/2016 23:21:15 DataSource (at 
org.apache.mahout.flinkbindings.blas.FlinkOpTimesRightMatrix$.drmTimesInCore(FlinkOpTimesRightMatrix.scala:50)
 (org.apache.flink.api.java.io.Collec)(1/1) switched to DEPLOYING 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(1/4) switched to 
SCHEDULED 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(1/4) switched to 
DEPLOYING 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(2/4) switched to 
SCHEDULED 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(2/4) switched to 
DEPLOYING 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(3/4) switched to 
SCHEDULED 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(3/4) switched to 
DEPLOYING 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(4/4) switched to 
SCHEDULED 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(4/4) switched to 
DEPLOYING 
05/02/2016 23:21:15 DataSource (at 
org.apache.mahout.flinkbindings.blas.FlinkOpTimesRightMatrix$.drmTimesInCore(FlinkOpTimesRightMatrix.scala:50)
 (org.apache.flink.api.java.io.Collec)(1/1) switched to RUNNING 
05/02/2016 23:21:15 DataSource (at 
org.apache.mahout.flinkbindings.blas.FlinkOpTimesRightMatrix$.drmTimesInCore(FlinkOpTimesRightMatrix.scala:50)
 (org.apache.flink.api.java.io.Collec)(1/1) switched to RUNNING 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(2/4) switched to 
RUNNING 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (org.apache.flink.api.java.io.TypeSerializerInputFormat)(3/4) switched to 
RUNNING 
05/02/2016 23:21:15 DataSource (at 
org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:396)
 (or

[jira] [Commented] (MAHOUT-1847) drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type Int

2016-05-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267720#comment-15267720
 ] 

Hudson commented on MAHOUT-1847:


FAILURE: Integrated in Mahout-Quality #3352 (See 
[https://builds.apache.org/job/Mahout-Quality/3352/])
MAHOUT-1847: drmSampleRows in FlinkEngine doesn't wrap Int Keys when (smarthi: 
rev 4c85d6a48bcdd5c161f50f85ec1e7d278d1dbae0)
* flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala


> drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type 
> Int
> ---
>
> Key: MAHOUT-1847
> URL: https://issues.apache.org/jira/browse/MAHOUT-1847
> Project: Mahout
>  Issue Type: Bug
>  Components: Flink
>Affects Versions: 0.12.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 0.12.1
>
>
> drmSampleKRows in FlinkEngine doesn't rekey with Integer keys when wrapping 
> the resulting DataSet into a DRM for a classTag of type Int.
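The missing rekeying the fix adds can be sketched as: when the key ClassTag is Int, re-index the sampled rows with consecutive integer keys before wrapping them into a DRM. Simplified types and a hypothetical function name, not the actual FlinkEngine code:

```scala
// Hedged sketch of the rekeying described above: give the sampled rows
// fresh consecutive Int keys (0, 1, 2, ...) so the resulting DRM is
// int-keyed as its ClassTag promises. Row is a placeholder type param.
def rekeyWithIntKeys[K, Row](sampled: Seq[(K, Row)]): Seq[(Int, Row)] =
  sampled.zipWithIndex.map { case ((_, row), i) => (i, row) }
```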





[jira] [Resolved] (MAHOUT-1847) drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type Int

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved MAHOUT-1847.
---
Resolution: Fixed



[jira] [Commented] (MAHOUT-1847) drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type Int

2016-05-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267672#comment-15267672
 ] 

ASF GitHub Bot commented on MAHOUT-1847:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/232




[jira] [Commented] (MAHOUT-1847) drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type Int

2016-05-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267666#comment-15267666
 ] 

ASF GitHub Bot commented on MAHOUT-1847:


GitHub user smarthi opened a pull request:

https://github.com/apache/mahout/pull/232

MAHOUT-1847: drmSampleRows in FlinkEngine doesn't wrap Int Keys when 
ClassTag is of type Int



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/smarthi/mahout MAHOUT-1847

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/232.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #232


commit c5b11422f5a068cadbcde9c3f533d4ec11f9c8b9
Author: smarthi 
Date:   2016-05-02T22:47:44Z

MAHOUT-1847: drmSampleRows in FlinkEngine doesn't wrap Int Keys when 
ClassTag is of type Int






[jira] [Updated] (MAHOUT-1847) drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type Int

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1847:
--
Description: drmSampleKRows in FlinkEngine doesn't rekey with Integer keys 
when wrapping the resulting DataSet into a DRM for a classTag of type Int.  
(was: drmSampleKRows in FlinkEngine doesn't rekey with Integer keys when 
wrapping the resulting DataSet into a DRM)

> drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type 
> Int
> ---
>
> Key: MAHOUT-1847
> URL: https://issues.apache.org/jira/browse/MAHOUT-1847
> Project: Mahout
>  Issue Type: Bug
>  Components: Flink
>Affects Versions: 0.12.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 0.12.1
>
>
> drmSampleKRows in Flinkengine doesn't rekey with Integer keys when wrapping 
> the resulting DataSet into a DRM for a classTag of type Int.





[jira] [Updated] (MAHOUT-1847) drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type Int

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1847:
--
Summary: drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag 
is of type Int  (was: drmSampleRows in FlinkEngine)

> drmSampleRows in FlinkEngine doesn't wrap Int Keys when ClassTag is of type 
> Int
> ---
>
> Key: MAHOUT-1847
> URL: https://issues.apache.org/jira/browse/MAHOUT-1847
> Project: Mahout
>  Issue Type: Bug
>  Components: Flink
>Affects Versions: 0.12.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 0.12.1
>
>
> drmSampleKRows in Flinkengine doesn't rekey with Integer keys when wrapping 
> the resulting DataSet into a DRM





[jira] [Created] (MAHOUT-1847) drmSampleRows in FlinkEngine

2016-05-02 Thread Suneel Marthi (JIRA)
Suneel Marthi created MAHOUT-1847:
-

 Summary: drmSampleRows in FlinkEngine
 Key: MAHOUT-1847
 URL: https://issues.apache.org/jira/browse/MAHOUT-1847
 Project: Mahout
  Issue Type: Bug
  Components: Flink
Affects Versions: 0.12.0
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 0.12.1


drmSampleKRows in Flinkengine doesn't rekey with Integer keys when wrapping the 
resulting DataSet into a DRM





[jira] [Work started] (MAHOUT-1847) drmSampleRows in FlinkEngine

2016-05-02 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1847 started by Suneel Marthi.
-
> drmSampleRows in FlinkEngine
> 
>
> Key: MAHOUT-1847
> URL: https://issues.apache.org/jira/browse/MAHOUT-1847
> Project: Mahout
>  Issue Type: Bug
>  Components: Flink
>Affects Versions: 0.12.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 0.12.1
>
>
> drmSampleKRows in Flinkengine doesn't rekey with Integer keys when wrapping 
> the resulting DataSet into a DRM





Re: stochastic nature

2016-05-02 Thread Andrew Palumbo
Hi Khurrum,

To expand upon what Dmitriy was saying regarding k-means|| sketching in the 
github repo for the samsara book, please see:

https://github.com/andrewpalumbo/mahout-samsara-book/blob/master/myMahoutApp/src/main/scala/myMahoutApp/BahmaniSketch.scala#L48

Mahout has a sampling API based on the underlying engine's sampling methods, in 
this case Spark's, as described by Dmitriy below.

See: 
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L138

and its implementation in the Spark Module:

https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala#L247

With these sampling methods, along with the statistics available for DRMs, most 
of the tools are available to implement Monte Carlo style algorithms, and we 
would be interested in including some implementations in the upcoming releases.
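For a concrete picture, here is a minimal sketch of those sampling calls in the Samsara DSL (method names are taken from the drm package linked above; the context setup and the data are illustrative, so treat this as a sketch rather than a tested program):

```scala
import org.apache.mahout.math.Matrix
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._

object SamplingSketch extends App {
  // Distributed context backed by a local Spark master (illustrative).
  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "sampling-sketch")

  // A tiny DRM to sample from.
  val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9)), numPartitions = 2)

  // drmSampleRows keeps a random fraction of the rows as a (smaller) DRM;
  // drmSampleKRows pulls exactly k sampled rows back to the driver as an
  // in-core Matrix -- per MAHOUT-1848, dense or sparse depending on the
  // row vectors of the sampled dataset.
  val sampledDrm = drmSampleRows(drmA, fraction = 0.5)
  val inCore: Matrix = drmSampleKRows(drmA, numKRows = 2, replacement = false)
}
```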


Andy



From: Khurrum Nasim 
Sent: Monday, May 2, 2016 12:47:17 PM
To: dev@mahout.apache.org
Subject: Re: stochastic nature

Thanks for the insight Dimitri.   I will look further into spark to understand 
how it handles parallelization and distributed processing.


> On May 2, 2016, at 12:39 PM, Dmitriy Lyubimov  wrote:
>
> by probabilistic algorithms i mostly mean inference involving monte carlo
> type mechanisms (Gibbs sampling LDA which i think might still be part of
> our MR collection might be an example, as well as its faster counterpart,
> variational Bayes inference.
>
> the parallelization strategies are just standard spark mechanisms (in
> case of spark), mostly using their standard hash samplers (which in
> math speak are really uniform multinomial samplers).
>
> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim 
> wrote:
>
>> Hey Dimitri -
>>
>> Yes I meant probabilistic algorithms.  If mahout doesn’t use probabilistic
>> algos then how does it accomplish a degree of optimal parallelization ?
>> Wouldn’t you need randomization to spread out the processing of tasks.
>>
>>> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov  wrote:
>>>
>>> yes mahout has stochastic svd and pca which are described at length in
>> the
>>> samsara book. The book examples in Andrew Palumbo's github also contain
>> an
>>> example of computing k-means|| sketch.
>>>
>>> if you mean _probabilistic_ algorithms, although i have done some things
>>> outside the public domain, nothing has been contributed.
>>>
>>> You are very welcome to try something if you don't have big constraints
>> on
>>> oss contribution.
>>>
>>> -d
>>>
>>> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim 
>>> wrote:
>>>
 Hey All,

 I’d like to know if Mahout uses any randomized algorithms.   I’m
>> thinking
 it probably does.  Can somebody point me to the packages that utilized
 randomized algos.

 Thanks,

 Khurrum


>>
>>



Re: stochastic nature

2016-05-02 Thread Khurrum Nasim
Thanks for the insight Dimitri.   I will look further into spark to understand 
how it handles parallelization and distributed processing.


> On May 2, 2016, at 12:39 PM, Dmitriy Lyubimov  wrote:
> 
> by probabilistic algorithms i mostly mean inference involving monte carlo
> type mechanisms (Gibbs sampling LDA which i think might still be part of
> our MR collection might be an example, as well as its faster counterpart,
> variational Bayes inference.
> 
> the parallelization strategies are just standard spark mechanisms (in
> case of spark), mostly using their standard hash samplers (which in
> math speak are really uniform multinomial samplers).
> 
> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim 
> wrote:
> 
>> Hey Dimitri -
>> 
>> Yes I meant probabilistic algorithms.  If mahout doesn’t use probabilistic
>> algos then how does it accomplish a degree of optimal parallelization ?
>> Wouldn’t you need randomization to spread out the processing of tasks.
>> 
>>> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov  wrote:
>>> 
>>> yes mahout has stochastic svd and pca which are described at length in
>> the
>>> samsara book. The book examples in Andrew Palumbo's github also contain
>> an
>>> example of computing k-means|| sketch.
>>> 
>>> if you mean _probabilistic_ algorithms, although i have done some things
>>> outside the public domain, nothing has been contributed.
>>> 
>>> You are very welcome to try something if you don't have big constraints
>> on
>>> oss contribution.
>>> 
>>> -d
>>> 
>>> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim 
>>> wrote:
>>> 
 Hey All,
 
 I’d like to know if Mahout uses any randomized algorithms.   I’m
>> thinking
 it probably does.  Can somebody point me to the packages that utilized
 randomized algos.
 
 Thanks,
 
 Khurrum
 
 
>> 
>> 



Re: stochastic nature

2016-05-02 Thread Khurrum Nasim
Hey Dimitri - 

Yes I meant probabilistic algorithms.  If mahout doesn’t use probabilistic 
algos, then how does it accomplish a degree of optimal parallelization? 
Wouldn’t you need randomization to spread out the processing of tasks?

> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov  wrote:
> 
> yes mahout has stochastic svd and pca which are described at length in the
> samsara book. The book examples in Andrew Palumbo's github also contain an
> example of computing k-means|| sketch.
> 
> if you mean _probabilistic_ algorithms, although i have done some things
> outside the public domain, nothing has been contributed.
> 
> You are very welcome to try something if you don't have big constraints on
> oss contribution.
> 
> -d
> 
> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim 
> wrote:
> 
>> Hey All,
>> 
>> I’d like to know if Mahout uses any randomized algorithms.   I’m thinking
>> it probably does.  Can somebody point me to the packages that utilized
>> randomized algos.
>> 
>> Thanks,
>> 
>> Khurrum
>> 
>> 



Re: stochastic nature

2016-05-02 Thread Dmitriy Lyubimov
by probabilistic algorithms i mostly mean inference involving monte carlo
type mechanisms (Gibbs sampling LDA, which i think might still be part of
our MR collection, might be an example, as well as its faster counterpart,
variational Bayes inference).

the parallelization strategies are just standard spark mechanisms (in
case of spark), mostly using their standard hash samplers (which in
math speak are really uniform multinomial samplers).
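To make "hash samplers are uniform multinomial samplers" concrete, here is a small self-contained Scala sketch (not Mahout or Spark code, just the underlying idea): hashing a key into one of n buckets is effectively a draw from an n-outcome multinomial with roughly equal probabilities.

```scala
object HashSamplerSketch extends App {
  // Hash-partitioning assigns a record to one of numPartitions buckets;
  // over many keys the assignment behaves like a uniform multinomial draw,
  // with each bucket an outcome of probability ~1/numPartitions.
  def partitionOf(key: String, numPartitions: Int): Int =
    math.floorMod(key.hashCode, numPartitions)

  val numPartitions = 4
  val counts = new Array[Int](numPartitions)
  (0 until 100000).foreach { i =>
    counts(partitionOf(s"record-$i", numPartitions)) += 1
  }

  // Each bucket should end up with roughly a quarter of the keys.
  counts.zipWithIndex.foreach { case (n, p) => println(s"partition $p: $n") }
}
```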

On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim 
wrote:

> Hey Dimitri -
>
> Yes I meant probabilistic algorithms.  If mahout doesn’t use probabilistic
> algos then how does it accomplish a degree of optimal parallelization ?
> Wouldn’t you need randomization to spread out the processing of tasks.
>
> > On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov  wrote:
> >
> > yes mahout has stochastic svd and pca which are described at length in
> the
> > samsara book. The book examples in Andrew Palumbo's github also contain
> an
> > example of computing k-means|| sketch.
> >
> > if you mean _probabilistic_ algorithms, although i have done some things
> > outside the public domain, nothing has been contributed.
> >
> > You are very welcome to try something if you don't have big constraints
> on
> > oss contribution.
> >
> > -d
> >
> > On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim 
> > wrote:
> >
> >> Hey All,
> >>
> >> I’d like to know if Mahout uses any randomized algorithms.   I’m
> thinking
> >> it probably does.  Can somebody point me to the packages that utilized
> >> randomized algos.
> >>
> >> Thanks,
> >>
> >> Khurrum
> >>
> >>
>
>


Re: stochastic nature

2016-05-02 Thread Dmitriy Lyubimov
yes mahout has stochastic svd and pca which are described at length in the
samsara book. The book examples in Andrew Palumbo's github also contain an
example of computing k-means|| sketch.
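For anyone looking for the stochastic SVD/PCA entry points mentioned here, they live in org.apache.mahout.math.decompositions; a sketch of the call shape follows (drmA and the distributed context are assumed to already exist, and the k/p/q values are illustrative, not recommendations):

```scala
import org.apache.mahout.math.drm._
import org.apache.mahout.math.decompositions._

// Sketch only: drmA is an existing DrmLike[Int] in an implicit distributed context.
// k = target rank, p = oversampling parameter, q = number of power iterations.
val (drmU, drmV, s) = dssvd(drmA, k = 40, p = 15, q = 1)

// Stochastic PCA has the same shape; it handles mean-centering internally
// without densifying a sparse input.
val (drmUPca, drmVPca, sPca) = dspca(drmA, k = 40, p = 15, q = 1)
```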

if you mean _probabilistic_ algorithms, although i have done some things
outside the public domain, nothing has been contributed.

You are very welcome to try something if you don't have big constraints on
oss contribution.

-d

On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim 
wrote:

> Hey All,
>
> I’d like to know if Mahout uses any randomized algorithms.   I’m thinking
> it probably does.  Can somebody point me to the packages that utilized
> randomized algos.
>
> Thanks,
>
> Khurrum
>
>


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-05-02 Thread Dmitriy Lyubimov
ok.

Nikaash,
could you perhaps do one more experiment and graft the 0.10 A'B code into the
0.12 code (or whatever branch you say is not working the same) so we could
confirm that the culprit change is indeed A'B?

thank you very much.

-d

On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri  wrote:

> Hi,
>
> I tried commenting out those lines and it did marginally improve the
> performance. Although, the 0.10 version still significantly outperforms it.
>
> Here is a screenshot of the saveAsTextFile job (attached as selection1).
> The AtB step took about 34 mins, which is significantly more than using
> 0.10. Similarly, the saveAsTextFile action takes about 9 mins as well.
>
> The selection2 file is a screenshot of the flatMap at AtB.scala job, which
> ran for 34 minutes,
>
> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
> would take time, while subsequent such operations for the other indicators
> would be orders of magnitudes faster. In the current job, the subsequent
> AtB operations take time similar to the first one.
>
> A snapshot of my code is as follows:
>
> var existingRowIDs: Option[BiDictionary] = None
>
> // The first action named in the sequence is the "primary" action and begins
> // to fill up the user dictionary
> for (actionDescription <- actionInput) {
>   // grab the path to actions
>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
> actionDescription._2,
> schema = DefaultIndexedDatasetElementReadSchema,
> existingRowIDs = existingRowIDs)
>   existingRowIDs = Some(action.rowIDs)
>
>   ...
> }
>
> which seems fairly standard, so I hope I'm not making a mistake here.
>
> It looks like the 0.11 onward version is using computeAtBZipped3 for
> performing the multiplication in atb_nograph_mmul unlike 0.10 which was
> using atb_nograph. Though I'm not really sure whether that makes much of a
> difference.
>
> Thank you,
> Nikaash Puri
>
> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel  wrote:
>
>> Right, will do. But Nakaash if you could just comment out those lines and
>> see if it has an effect it would be informative and even perhaps solve your
>> problem sooner than my changes. No great rush. Playing around with
>> different values, as Dmitriy says, might yield better results and for that
>> you can mess with the code or wait for my changes.
>>
>> Yeah, it’s fast enough in most cases. The main work is the optimized A’A,
>> A’B stuff in the BLAS optimizer Dmitriy put in. It is something like 10x
>> faster than a similar algo in Hadoop MR. This particular calc and
>> generalization is not in any other Spark or now Flink lib that I know of.
>>
>>
>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov  wrote:
>>
>> Nikaash,
>>
>> yes unfortunately you may need to play with parallelism for your
>> particular
>> load/cluster manually to get the best out of it. I guess Pat will be
>> adding
>> the option.
>>
>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri 
>> wrote:
>>
>> > Hi,
>> >
>> > Sure, I’ll do some more detailed analysis of the jobs on the UI and
>> share
>> > screenshots if possible.
>> >
>> > Pat, yup, I’ll only be able to get to this on Monday, though. I’ll
>> comment
>> > out the line and see the difference in performance.
>> >
>> > Thanks so much for helping guys, I really appreciate it.
>> >
>> > Also, the algorithm implementation for LLR is extremely performant, at
>> > least as of Mahout 0.10. I ran some tests for around 61 days of data
>> (which
>> > in our case is a fair amount) and the model was built in about 20
>> minutes,
>> > which is pretty amazing. This was using a pretty decent sized cluster,
>> > though.
>> >
>> > Thank you,
>> > Nikaash Puri
>> >
>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel  wrote:
>> >
>> > There are some other changes I want to make for the next rev so I’ll do
>> > that.
>> >
>> > Nikaash, it would still be nice to verify this fixes your problem, also
>> if
>> > you want to create a Jira it will guarantee I don’t forget.
>> >
>> >
>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov 
>> wrote:
>> >
>> > yes -- i would do it as an optional option -- just like par does -- do
>> > nothing; try auto, or try exact number of splits
>> >
>> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel 
>> wrote:
>> >
>> >> It’s certainly easy to put this in the driver, taking it out of the
>> algo.
>> >>
>> >> Dmitriy, is it a candidate for an Option param to the algo? That would
>> >> catch cases where people rely on it now (like my old DStream example)
>> but
>> >> easily allow it to be overridden to None to imitate pre 0.11, or
>> passed in
>> >> when the app knows better.
>> >>
>> >> Nikaash, are you in a position to comment out the .par(auto=true) and
>> see
>> >> if it makes a difference?
>> >>
>> >>
>> >> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov 
>> wrote:
>> >>
>> >> can you please look into spark UI and write down how many split the job
>> >> generates in the first stage of the pipeline, or anywhere else there's

stochastic nature

2016-05-02 Thread Khurrum Nasim
Hey All,

I’d like to know if Mahout uses any randomized algorithms.  I’m thinking it 
probably does.  Can somebody point me to the packages that utilize randomized 
algos?

Thanks,

Khurrum



Re: Mahout contributions

2016-05-02 Thread Khurrum Nasim
@Saikat - One thing I will say is that REST is slow.  There is latency because 
of deserialization overhead.  For very large datasets it is probably not a good 
idea to use REST.


> On Apr 30, 2016, at 2:35 PM, Saikat Kanjilal  wrote:
> 
> Andrew et al, I wanted to ask about a few items while I'm researching my dev 
> proposal. What I'm looking to build is a streaming analytics platform to 
> do things like collaborative filtering and anomaly detection on large amounts 
> of streaming data that are either generated from events (kafka) or through a 
> firehose like Amazon Kinesis, my initial thinking is that this pipe of 
> events/data would be connected to a rest API that sits on top of mahout, the 
> backend underneath mahout would use a hybrid form of spark as well as spark 
> streaming, I'm wondering whether Samsara was designed from the ground up to 
> deal with large amounts of streaming data or whether this is not a use case 
> targeted yet.  My goal is to build a platform with several data sources/sinks 
> and produce intermediate checkpoints where transformations are applied to the 
> data before once again sending to a set of sinks/sources.  Therefore the 
> potential fits into and out of mahout include:
> 1) A rest API that leverages spray and akka and invokes one or more 
> algorithms in mahout
> 2) A runtime environment with scala actors that allows one to either ingest 
> data or perform transformations on data through the use of various 
> classification and clustering algorithms; the runtime environment would 
> ingest algorithms using mahout as a library
> 3) A rich set of actors dealing with various no sql and graph based 
> datastores (cassandra/neo4j/titan/mongo)
> 
> Some insight into Samsara would be great as I'm trying to understand the 
> entry points into mahout.
> Thanks in advance.
> 
>> From: ap@outlook.com
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> Date: Thu, 28 Apr 2016 21:43:19 +
>> 
>> I don't  think that this sort of of integration work would be a good fit 
>> directly to the Mahout project.  Mahout is more about math, algorithms and 
>> an environment to develop algorithms.  We stay away from direct platform 
>> integration.  In the past we did have some elasticsearch/mahout integration 
>> work that is not in the code base for this exact reason.  I would suggest 
>> that better places to contribute something like this may be: PIO 
>> (https://prediction.io/), or even directly as a package for spark 
>> http://spark-packages.org/ .
>> 
>> Recent projects integrating Mahout have recently been added to PIO: 
>> https://github.com/PredictionIO/template-scala-parallel-universal-recommendation.
>>   
>> 
>> I think that the project that you are proposing would be a better fit there.
>> 
>> Thanks,
>> 
>> Andy
>> 
>> 
>> 
>> From: Saikat Kanjilal 
>> Sent: Thursday, April 28, 2016 1:50 PM
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> 
>> I want to start with social data as an example, for example data returned 
>> from FB graph API as well user Twitter data, will send some samples later if 
>> you're interested.
>> 
>> Sent from my iPhone
>> 
>>> On Apr 28, 2016, at 10:41 AM, Khurrum Nasim  
>>> wrote:
>>> 
>>> 
>>> What type of JSON payload size are we talking about here ?
>>> 
 On Apr 28, 2016, at 1:32 PM, Saikat Kanjilal  wrote:
 
 Because EL gives you the visualization and non Lucene type query 
 constructs as well and also that it already has a rest API that I plan on 
 tying into mahout.  I plan on wrapping some of the clustering algorithms 
 that I implement using Mahout and Spark as a service which can then make 
 calls into other services (namely elasticsearch and neo4j graph service).
 
 Sent from my iPhone
 
> On Apr 28, 2016, at 10:22 AM, Khurrum Nasim  
> wrote:
> 
> @Saikat- why use EL instead of Lucene directly.
> 
> 
> 
>> On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal  
>> wrote:
>> 
>> This is great information thank you, based on this recommendation I 
>> won't create a JIRA but start work on my project and when the code 
>> approaches the percentages you are describing I will create the 
>> appropriate JIRA's and put together a proposal to send to the list, 
>> sound ok?  Based on your latest updates to the wiki i will work on a 
>> handful of the clustering algorithms since I see that the Spark 
>> implementations for these are not yet complete.
>> Thank you again
>> 
>>> From: ap@outlook.com
>>> To: dev@mahout.apache.org
>>> Subject: Re: Mahout contributions
>>> Date: Thu, 28 Apr 2016 01:31:09 +
>>> 
>>> Saikat,
>>> 
>>> One other thing that I should say is that you do not need clearance or 
>>> input from the committers to begin work on your project, and the 
>>> interest can and should come from the commun

Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-05-02 Thread Nikaash Puri
Hi,

I tried commenting out those lines and it did marginally improve the
performance. Although, the 0.10 version still significantly outperforms it.

Here is a screenshot of the saveAsTextFile job (attached as selection1).
The AtB step took about 34 mins, which is significantly more than using
0.10. Similarly, the saveAsTextFile action takes about 9 mins as well.

The selection2 file is a screenshot of the flatMap at AtB.scala job, which
ran for 34 minutes.

Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB would
take time, while subsequent such operations for the other indicators would
be orders of magnitude faster. In the current job, the subsequent AtB
operations take time similar to the first one.

A snapshot of my code is as follows:

var existingRowIDs: Option[BiDictionary] = None

// The first action named in the sequence is the "primary" action and
// begins to fill up the user dictionary
for (actionDescription <- actionInput) {
  // grab the path to actions
  val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
actionDescription._2,
schema = DefaultIndexedDatasetElementReadSchema,
existingRowIDs = existingRowIDs)
  existingRowIDs = Some(action.rowIDs)

  ...
}

which seems fairly standard, so I hope I'm not making a mistake here.

It looks like the 0.11 onward version is using computeAtBZipped3 for
performing the multiplication in atb_nograph_mmul unlike 0.10 which was
using atb_nograph. Though I'm not really sure whether that makes much of a
difference.

Thank you,
Nikaash Puri
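For reference, the knob being discussed in this thread is Samsara's `par` operator on DRMs; a sketch of its variants follows (drmA/drmB are assumed to already exist in a distributed context, and the split counts are illustrative, not recommendations):

```scala
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// .par(auto = true) lets the optimizer choose the split count (what the
// driver does as of 0.11); min/exact force a floor or an exact number of
// splits when the automatic choice hurts a particular load/cluster:
val drmAuto  = drmA.par(auto = true)
val drmExact = drmA.par(exact = 400)
val drmMin   = drmA.par(min = 300)

// The A'B product at the heart of spark-itemsimilarity, with the
// parallelism pinned up front and the result checkpointed:
val drmC = (drmExact.t %*% drmB).checkpoint()
```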

On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel  wrote:

> Right, will do. But Nakaash if you could just comment out those lines and
> see if it has an effect it would be informative and even perhaps solve your
> problem sooner than my changes. No great rush. Playing around with
> different values, as Dmitriy says, might yield better results and for that
> you can mess with the code or wait for my changes.
>
> Yeah, it’s fast enough in most cases. The main work is the optimized A’A,
> A’B stuff in the BLAS optimizer Dmitriy put in. It is something like 10x
> faster than a similar algo in Hadoop MR. This particular calc and
> generalization is not in any other Spark or now Flink lib that I know of.
>
>
> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov  wrote:
>
> Nikaash,
>
> yes unfortunately you may need to play with parallelism for your particular
> load/cluster manually to get the best out of it. I guess Pat will be adding
> the option.
>
> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri 
> wrote:
>
> > Hi,
> >
> > Sure, I’ll do some more detailed analysis of the jobs on the UI and share
> > screenshots if possible.
> >
> > Pat, yup, I’ll only be able to get to this on Monday, though. I’ll
> comment
> > out the line and see the difference in performance.
> >
> > Thanks so much for helping guys, I really appreciate it.
> >
> > Also, the algorithm implementation for LLR is extremely performant, at
> > least as of Mahout 0.10. I ran some tests for around 61 days of data
> (which
> > in our case is a fair amount) and the model was built in about 20
> minutes,
> > which is pretty amazing. This was using a pretty decent sized cluster,
> > though.
> >
> > Thank you,
> > Nikaash Puri
> >
> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel  wrote:
> >
> > There are some other changes I want to make for the next rev so I’ll do
> > that.
> >
> > Nikaash, it would still be nice to verify this fixes your problem, also
> if
> > you want to create a Jira it will guarantee I don’t forget.
> >
> >
> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov  wrote:
> >
> > yes -- i would do it as an optional option -- just like par does -- do
> > nothing; try auto, or try exact number of splits
> >
> > On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel 
> wrote:
> >
> >> It’s certainly easy to put this in the driver, taking it out of the
> algo.
> >>
> >> Dmitriy, is it a candidate for an Option param to the algo? That would
> >> catch cases where people rely on it now (like my old DStream example)
> but
> >> easily allow it to be overridden to None to imitate pre 0.11, or passed
> in
> >> when the app knows better.
> >>
> >> Nikaash, are you in a position to comment out the .par(auto=true) and
> see
> >> if it makes a difference?
> >>
> >>
> >> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov 
> wrote:
> >>
> >> can you please look into spark UI and write down how many split the job
> >> generates in the first stage of the pipeline, or anywhere else there's
> >> signficant variation in # of splits in both cases?
> >>
> >> the row similarity is a very short pipeline (in comparison with what would
> >> normally be on average). so only the first input re-splitting is critical.
> >>
> >> The splitting along the products is adjusted by the optimizer automatically
> >> to match the amount of data segments observed on average in the input(s).
> >> e.g.
> >> if you compute val C = A %*% B and A has 500 elements per split and