Re: Renumbering new releases

2019-04-30 Thread Dmitriy Lyubimov
I am ok with 1.0


On Thu, Apr 18, 2019 at 6:10 PM Andrew Musselman 
wrote:

> We've been discussing moving to a 1.0 release for a few years now. This
> past quarter we had a comment on our board report about whether we would
> consider getting out of the 0.x releases.
>
> I think it makes sense especially since we've had major overhauls a
> couple/few times now, which is one sign of a matured project.
>
> Instead of moving to 1.0, however, I would like to propose moving to 14.1
> for our next release. This would recognize that for several years the
> project has been ready to use, while keeping a similar historical release
> ordering.
>
> Any thoughts on moving to that numbering scheme? Further releases could
> increment the minor version and have hot fixes in point releases, or we
> could just increment the major version for a yearly release, say.
>


Re: Hangouts

2018-07-31 Thread Dmitriy Lyubimov
I am on vacation this week fyi

On Tue, Jul 31, 2018 at 11:36 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Cool, I'll shoot for something on Friday early Pacific time and put an
> invite in here; looking forward to it!
>
> On Sat, Jul 28, 2018 at 9:26 AM Shannon Quinn  wrote:
>
> > Weekdays are better for me, so my vote would be a Friday morning.
> >
> > On 7/27/18 5:55 PM, Ivan Serdyuk wrote:
> > > Works for me.
> > >
> > > On Sat, Jul 28, 2018 at 12:14 AM, Andrew Musselman <
> > > andrew.mussel...@gmail.com> wrote:
> > >
> > >> Weekends okay for people? Or Friday, morning Pacific Time?
> > >>
> > >> On Tue, Jul 24, 2018 at 3:20 PM Trevor Grant <
> trevor.d.gr...@gmail.com>
> > >> wrote:
> > >>
> > >>> yea sounds good.
> > >>>
> > >>> On Mon, Jul 23, 2018 at 12:32 PM, Andrew Musselman <
> > >>> andrew.mussel...@gmail.com> wrote:
> > >>>
> >  Hi all, any interest in a hangout meeting next month to catch up on
> a
> >  release/blockers?
> > 
> >
> >
>


Re: Congrats Palumbo and Holden

2018-05-02 Thread Dmitriy Lyubimov
Congrats!

On Wed, May 2, 2018 at 1:25 PM, Trevor Grant 
wrote:

> Both were just elected new ASF members!!
>
> https://s.apache.org/D6iz
>


Re: Board Report

2018-04-24 Thread Dmitriy Lyubimov
LGTM
-d

On Tue, Apr 24, 2018 at 9:48 AM, Andrew Palumbo  wrote:

> Hello all,
> The Mahout PMC would like to involve the community more in filling out
> board reports.  This will hopefully help us to learn some of the needs of
> Mahout devs and users.
>
>
> https://docs.google.com/document/d/1q7nOWMOzwgR18mnutvbnwBskn-
> EJMzMCvADWFI4Pemk/edit?usp=drivesdk
>
> The above gdoc is for a report which will be filed in 2 weeks.  Please do
> look over.  Comments and questions are welcome.
>
> Thank you,
>
> Andy
>


Re: Backlog - Reordered Top->Down

2018-03-06 Thread Dmitriy Lyubimov
Thank you.

On Sat, Mar 3, 2018 at 1:45 PM, André Santi Oliveira 
wrote:

> Things which were in the backlog were organized (top->down) using criteria:
>
>   - Fix Version(s)
>   - Priority
>   - Type
>
>
> If you don't agree with the order of some things which are there, feel
> free to move them around, but I would like to understand why, so that I
> can organize better next time. We have a lot of things "not tagged" which
> I put at the bottom of the backlog, but that does not mean they are less
> important; it just means I was not able to classify them yet.
>
> I still need to understand how you classify things in terms of versions.
> In my head the workflow is that things get done and then we tag them to
> say "it will be delivered in this version for sure", but it's just a
> detail for now; as I said, I'm still learning the whole process and how
> you work.
>
>
> Cheers,
> André
>


Re: Updating Wikipedia

2018-02-19 Thread Dmitriy Lyubimov
I think Suneel was modifying it...

On Sun, Feb 18, 2018 at 7:02 AM, Trevor Grant 
wrote:

> Is anyone good at Wikipedia?
>
> We're still listed as being primarily running on Hadoop there.
>
> https://en.wikipedia.org/wiki/Apache_Mahout
>
> If anyone has some skills/time- an update would be cool...
>


Re: MathJax not rendering on Website

2017-09-12 Thread Dmitriy Lyubimov
PS
http://docs.mathjax.org/en/latest/start.html

"We retired our self-hosted CDN at cdn.mathjax.org in April, 2017. We
recommend using cdnjs.com which uses the same provider. [...]"

On Tue, Sep 12, 2017 at 9:52 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> mathjax we use runs on cdn. We can either re-host it ourselves to avoid
> the dependency (and also shield from incompatibilities introduced by
> version update), or follow whatever they use.
>
> their currently advertised cdn location seems to be (which indeed looks
> different to me from before)
>
>
> src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML">
>
>
> On Mon, Sep 11, 2017 at 1:59 AM, Andrew Palumbo <ap@outlook.com>
> wrote:
>
>> Thanks Ted, I haven't been able to follow the website migration closely,
>> but this seems likely as we've changed paths I believe.
>>
>> @Dustin - does this seem like the issue we're facing?
>>
>> --andy
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Ted Dunning <ted.dunn...@gmail.com>
>> Date: 09/10/2017 8:56 PM (GMT-08:00)
>> To: dev@mahout.apache.org
>> Subject: Re: MathJax not rendering on Website
>>
>> This has happened periodically to my sites. The answer is usually that the
>> canonical location of the mathJax JavaScript library has changed.
>>
>> On Sep 10, 2017 7:58 PM, "Andrew Palumbo" <ap@outlook.com> wrote:
>>
>> > It looks like MathJax is not rendering tex on the site:
>> >
>> >
>> > Eg.:
>> >
>> >
>> > https://mahout.apache.org/users/algorithms/d-ssvd.html
>> >
>> > Ideas to get this going while site is being redone?
>> >
>> >
>> >
>>
>
>


Re: MathJax not rendering on Website

2017-09-12 Thread Dmitriy Lyubimov
mathjax we use runs on cdn. We can either re-host it ourselves to avoid the
dependency (and also shield from incompatibilities introduced by version
update), or follow whatever they use.

their currently advertised cdn location seems to be (which indeed looks
different to me from before)

src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML">


On Mon, Sep 11, 2017 at 1:59 AM, Andrew Palumbo  wrote:

> Thanks Ted, I haven't been able to follow the website migration closely,
> but this seems likely as we've changed paths I believe.
>
> @Dustin - does this seem like the issue we're facing?
>
> --andy
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Ted Dunning 
> Date: 09/10/2017 8:56 PM (GMT-08:00)
> To: dev@mahout.apache.org
> Subject: Re: MathJax not rendering on Website
>
> This has happened periodically to my sites. The answer is usually that the
> canonical location of the mathJax JavaScript library has changed.
>
> On Sep 10, 2017 7:58 PM, "Andrew Palumbo"  wrote:
>
> > It looks like MathJax is not rendering tex on the site:
> >
> >
> > Eg.:
> >
> >
> > https://mahout.apache.org/users/algorithms/d-ssvd.html
> >
> > Ideas to get this going while site is being redone?
> >
> >
> >
>


Re: [DISCUSS] New feature - DRM and in-core matrix sort and required test suites for modules.

2017-09-05 Thread Dmitriy Lyubimov
The last thing I want to do is overcomplicate things, though.

On Tue, Sep 5, 2017 at 4:02 PM, Andrew Palumbo <ap@outlook.com> wrote:

> > PS technically, some "flavor" of the dataset still can be attributed and
>  passed on in the pipeline, e.g., that's what i do with partitioning kind.
> if another operator messes that flavor up, this gets noted in the
> carry-over property (that's how optimizer knows if operands in a binary
> logical operator are coming in identically partitioned or not, for
> example). similar thing can be done to "sorted-ness" flavor and being
> tracked around, and operators that break "sorted-ness" would note that also
> on the tree nodes, but that only makes sense if we have "consumer"
> operators that care about sortedness, of which we have none at the moment
> (it is possible that we will, perhaps). I am just saying this problem may
> benefit from some more broad thinking of the issue in optimization tree
> sense, i.e., why we do it, which things will use it and which things will
> preserve/mess it up etc.
>
>
> Agreed re: more broad thinking yes- just getting the conversation
> started.  Thanks.
>
> 
> From: Dmitriy Lyubimov <dlie...@gmail.com>
> Sent: Tuesday, September 5, 2017 6:06:35 PM
> To: dev@mahout.apache.org
> Subject: Re: [DISCUSS] New feature - DRM and in-core matrix sort and
> required test suites for modules.
>
> PS technically, some "flavor" of the dataset still can be attributed and
>  passed on in the pipeline, e.g., that's what i do with partitioning kind.
> if another operator messes that flavor up, this gets noted in the
> carry-over property (that's how optimizer knows if operands in a binary
> logical operator are coming in identically partitioned or not, for
> example). similar thing can be done to "sorted-ness" flavor and being
> tracked around, and operators that break "sorted-ness" would note that also
> on the tree nodes, but that only makes sense if we have "consumer"
> operators that care about sortedness, of which we have none at the moment
> (it is possible that we will, perhaps). I am just saying this problem may
> benefit from some more broad thinking of the issue in optimization tree
> sense, i.e., why we do it, which things will use it and which things will
> preserve/mess it up etc.
>
> On Tue, Sep 5, 2017 at 3:01 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > In general, +1, don't see why not.
> >
> > Q -- is it something that you have encountered while doing algebra? I.e.,
> > do you need the sorted DRM to continue algebraic operations between
> > optimizer barriers, or you just need an RDD as the outcome of all this?
> >
> > if it is just an RDD, then you could just do a spark-supported sort,
> > that's why we have a drm.rdd barrier (Spark-specific). Barrier out to
> > spark RDD and then continue doing whatever spark already supports.
> >
> > Another potential issue is that matrices do not generally imply ordering
> > or formation of intermediate products, i.e., inside the optimizer, you
> > might
> > build a pipeline that implies ordered RDD in Spark sense, but there is no
> > algebraic operator consuming sorted rdds, and no operator that guarantees
> > preserving it (even if it just a checkpoint). This may create ambiguities
> > as more rewriting rules are added. This is not a major concern.
> >
> > On Tue, Sep 5, 2017 at 2:24 PM, Trevor Grant <trevor.d.gr...@gmail.com>
> > wrote:
> >
> >> Ever since we moved Flink to its own profile, I have been thinking we
> >> ought to do the same to H2O but haven't been too motivated because it
> >> was never causing anyone any problems.
> >>
> >> Maybe it's time to drop H2O "official support" and move Flink Batch / H2O
> >> into a "mahout/community/engines" folder.
> >>
> >> I've been doing a lot of Flink Streaming the last couple weeks and
> >> already bootlegged a few of the "Algorithms" into Flink.  Pretty sure
> >> we could support those easily- and I _think_ we could do the same with
> >> the distributed (e.g. wrap a DataStream[(Key, MahoutVector)] and
> >> implement the Operators on that).
> >>
> >> I'd put FlinkStreaming as another community engine.
> >>
> >> If we did that, I'd say- by convention we need a Markdown document in
> >> mahout/community/engines that has a table of what is implemented on
> what.
> >>
> >> That is to say, even if we only were able to implement the "algos" on
> >> Flink Streaming- there would still be a lot of value to that for many
> >> applications (esp considering the state of FlinkML).

Re: [DISCUSS] New feature - DRM and in-core matrix sort and required test suites for modules.

2017-09-05 Thread Dmitriy Lyubimov
PS technically, some "flavor" of the dataset still can be attributed and
 passed on in the pipeline, e.g., that's what i do with partitioning kind.
if another operator messes that flavor up, this gets noted in the
carry-over property (that's how optimizer knows if operands in a binary
logical operator are coming in identically partitioned or not, for
example). similar thing can be done to "sorted-ness" flavor and being
tracked around, and operators that break "sorted-ness" would note that also
on the tree nodes, but that only makes sense if we have "consumer"
operators that care about sortedness, of which we have none at the moment
(it is possible that we will, perhaps). I am just saying this problem may
benefit from some more broad thinking of the issue in optimization tree
sense, i.e., why we do it, which things will use it and which things will
preserve/mess it up etc.
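The "flavor" carry-over idea described above can be illustrated with a toy sketch. This is not Mahout's actual optimizer API; the operator names (`Source`, `MapRows`, `Repartition`) and the single boolean property are illustrative assumptions, standing in for optimizer tree nodes that note whether they preserve sorted-ness:

```scala
// Toy model of the "flavor" carry-over described above: each logical
// operator records whether it preserves sorted-ness, and an operator
// that reshuffles rows notes that it breaks it.
sealed trait Op { def sorted: Boolean }
case class Source(sorted: Boolean) extends Op                              // leaf with a known flavor
case class MapRows(in: Op) extends Op { val sorted: Boolean = in.sorted }  // order-preserving
case class Repartition(in: Op) extends Op { val sorted: Boolean = false }  // breaks ordering

val plan = MapRows(Repartition(Source(sorted = true)))
println(plan.sorted)  // false: the repartition broke sorted-ness
```

A planner walking such a tree can then tell whether an operand arrives sorted, which is exactly the kind of information a "consumer" operator that cares about sortedness would need.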

On Tue, Sep 5, 2017 at 3:01 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> In general, +1, don't see why not.
>
> Q -- is it something that you have encountered while doing algebra? I.e.,
> do you need the sorted DRM to continue algebraic operations between
> optimizer barriers, or you just need an RDD as the outcome of all this?
>
> if it is just an RDD, then you could just do a spark-supported sort,
> that's why we have a drm.rdd barrier (Spark-specific). Barrier out to
> spark RDD and then continue doing whatever spark already supports.
>
> Another potential issue is that matrices do not generally imply ordering
> or formation of intermediate products, i.e., inside the optimizer, you might
> build a pipeline that implies ordered RDD in Spark sense, but there is no
> algebraic operator consuming sorted rdds, and no operator that guarantees
> preserving it (even if it just a checkpoint). This may create ambiguities
> as more rewriting rules are added. This is not a major concern.
>
> On Tue, Sep 5, 2017 at 2:24 PM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
>> Ever since we moved Flink to its own profile, I have been thinking we
>> ought to do the same to H2O but haven't been too motivated because it
>> was never causing anyone any problems.
>>
>> Maybe it's time to drop H2O "official support" and move Flink Batch / H2O
>> into a "mahout/community/engines" folder.
>>
>> I've been doing a lot of Flink Streaming the last couple weeks and already
>> bootlegged a few of the "Algorithms" into Flink.  Pretty sure we could
>> support those easily- and I _think_ we could do the same with the
>> distributed (e.g. wrap a DataStream[(Key, MahoutVector)] and implement
>> the Operators on that).
>>
>> I'd put FlinkStreaming as another community engine.
>>
>> If we did that, I'd say- by convention we need a Markdown document in
>> mahout/community/engines that has a table of what is implemented on what.
>>
>> That is to say, even if we only were able to implement the "algos" on
>> Flink
>> Streaming- there would still be a lot of value to that for many
>> applications (esp considering the state of FlinkML).  Also beats having a
>> half cooked engine sitting on a feature branch.
>>
>> Beam does something similar to that for their various engines.
>>
>> Speaking of Beam, I've heard rumblings here and there of people talking
>> about making a Beam engine- this might motivate people to get started (no
>> one person feels responsible for "boiling the ocean" and throwing down an
>> entire engine in one go- but instead can hack out the portions they need).
>>
>>
>> My .02
>>
>> tg
>>
>> On Tue, Sep 5, 2017 at 4:04 PM, Andrew Palumbo <ap@outlook.com>
>> wrote:
>>
>> > I've found a need for sorting a DRM as well as in-core matrices,
>> > something like e.g. DrmLike.sortByColumn(...). I would like to
>> > implement this at the math-scala engine-neutral level with
>> > pass-through functions to underlying back ends.
>> >
>> >
>> > In-core would be engine neutral by current design (in-core matrices are
>> > all Mahout matrices with the exception of h2o.. which causes some
>> concern.)
>> >
>> >
>> > For Spark, we can use  RDD.sortBy(...).
>> >
>> >
>> > For Flink we can use DataSet.sortPartition(...).setParallelism(1).
>> > (There may be a better method; I will look deeper.)
>> >
>> >
>> > h2o has an implementation, I'm sure, but this brings me to a more
>> > important point: If we want to stub out a method in a back end module,
>> > e.g. h2o, which test suites do we want to make a requirement?
>> >
>> >
>> > We've not set any specific rules for which test suites must pass for
>> each
>> > module. We've had a soft requirement for inheriting and passing all test
>> > suites from math-scala.
>> >
>> >
>> > Setting a rule for this is something that we need to do, IMO.
>> >
>> >
>> > An easy option that I'm thinking would be to set the current core
>> > math-scala suites as a requirement, and then allow for an optional suite
>> > for methods which will be stubbed out.
>> >
>> >
>> > Thoughts?
>> >
>> >
>> > --andy
>> >
>> >
>> >
>>
>
>


Re: [DISCUSS] New feature - DRM and in-core matrix sort and required test suites for modules.

2017-09-05 Thread Dmitriy Lyubimov
In general, +1, don't see why not.

Q -- is it something that you have encountered while doing algebra? I.e.,
do you need the sorted DRM to continue algebraic operations between
optimizer barriers, or you just need an RDD as the outcome of all this?

if it is just an RDD, then you could just do a spark-supported sort, that's
why we have a drm.rdd barrier (Spark-specific). Barrier out to the Spark RDD
and then continue doing whatever spark already supports.

Another potential issue is that matrices do not generally imply ordering or
formation of intermediate products, i.e., inside the optimizer, you might
build a pipeline that implies ordered RDD in Spark sense, but there is no
algebraic operator consuming sorted rdds, and no operator that guarantees
preserving it (even if it just a checkpoint). This may create ambiguities
as more rewriting rules are added. This is not a major concern.
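The sort-then-barrier approach can be sketched in self-contained form. Plain Scala collections stand in for an RDD here so the sketch has no Spark or Mahout dependencies; in real code the sort would run via `drm.rdd` followed by `RDD.sortBy`, and the name `sortByColumn` is the proposed (not existing) API:

```scala
// Conceptual stand-in for the proposed DrmLike.sortByColumn: a row-keyed
// "matrix" held as (key, row) pairs, analogous to Spark's RDD[(K, Vector)].
val drmRows: Seq[(Int, Array[Double])] = Seq(
  (0, Array(3.0, 1.0)),
  (1, Array(1.0, 2.0)),
  (2, Array(2.0, 0.5))
)
val sortCol = 0
// Order rows by the value found in column `sortCol`; on Spark this would
// be drmData.rdd.sortBy { case (_, row) => row(sortCol) }.
val sorted = drmRows.sortBy { case (_, row) => row(sortCol) }
println(sorted.map(_._1).mkString(","))  // 1,2,0
```

This also shows why the barrier route is attractive: the sort is entirely the engine's native keyed sort, and no algebraic operator needs to know the result is ordered.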

On Tue, Sep 5, 2017 at 2:24 PM, Trevor Grant 
wrote:

> Ever since we moved Flink to its own profile, I have been thinking we ought
> to do the same to H2O but haven't been too motivated because it was never
> causing anyone any problems.
>
> Maybe it's time to drop H2O "official support" and move Flink Batch / H2O
> into a "mahout/community/engines" folder.
>
> I've been doing a lot of Flink Streaming the last couple weeks and already
> bootlegged a few of the "Algorithms" into Flink.  Pretty sure we could
> support those easily- and I _think_ we could do the same with the
> distributed (e.g. wrap a DataStream[(Key, MahoutVector)] and implement
> the Operators on that).
>
> I'd put FlinkStreaming as another community engine.
>
> If we did that, I'd say- by convention we need a Markdown document in
> mahout/community/engines that has a table of what is implemented on what.
>
> That is to say, even if we only were able to implement the "algos" on Flink
> Streaming- there would still be a lot of value to that for many
> applications (esp considering the state of FlinkML).  Also beats having a
> half cooked engine sitting on a feature branch.
>
> Beam does something similar to that for their various engines.
>
> Speaking of Beam, I've heard rumblings here and there of people talking
> about making a Beam engine- this might motivate people to get started (no
> one person feels responsible for "boiling the ocean" and throwing down an
> entire engine in one go- but instead can hack out the portions they need).
>
>
> My .02
>
> tg
>
> On Tue, Sep 5, 2017 at 4:04 PM, Andrew Palumbo  wrote:
>
> > I've found a need for sorting a DRM as well as in-core matrices,
> > something like e.g. DrmLike.sortByColumn(...). I would like to
> > implement this at the math-scala engine-neutral level with
> > pass-through functions to underlying back ends.
> >
> >
> > In-core would be engine neutral by current design (in-core matrices are
> > all Mahout matrices with the exception of h2o.. which causes some
> concern.)
> >
> >
> > For Spark, we can use  RDD.sortBy(...).
> >
> >
> > For Flink we can use DataSet.sortPartition(...).setParallelism(1).
> > (There may be a better method; I will look deeper.)
> >
> >
> > h2o has an implementation, I'm sure, but this brings me to a more
> > important point: If we want to stub out a method in a back end module,
> > e.g. h2o, which test suites do we want to make a requirement?
> >
> >
> > We've not set any specific rules for which test suites must pass for each
> > module. We've had a soft requirement for inheriting and passing all test
> > suites from math-scala.
> >
> >
> > Setting a rule for this is something that we need to do, IMO.
> >
> >
> > An easy option that I'm thinking would be to set the current core
> > math-scala suites as a requirement, and then allow for an optional suite
> > for methods which will be stubbed out.
> >
> >
> > Thoughts?
> >
> >
> > --andy
> >
> >
> >
>


Re: Looking for help with a talk

2017-08-10 Thread Dmitriy Lyubimov
Yes sure, as before, I can review.

On Fri, Aug 4, 2017 at 1:12 AM, Isabel Drost-Fromm 
wrote:

> Hi,
>
> I have a first draft of a narrative and slide deck. If anyone has time it
> would be lovely to bounce some ideas back and forth, have the draft of the
> deck reviewed.
>
>
> Isabel
>
>


Re: Proposal for changing Mahout's Git branching rules

2017-07-20 Thread Dmitriy Lyubimov
Guys,

as you know, my ability to contribute is very limited lately, so I don't
feel like my opinion is worth as much as that of a regular committer or
contributor. In the end people who contribute things should decide what
works for them.

I just put forward a warning that while normally this workflow would not be
a problem IF people are aware of the flow and start their work off the dev
branch, based on my git/github experience, a newbie WILL fork from master
to a private PR branch of her/his own to commence contribution work.

Which, according to the proposed scheme, WILL be quite behind the dev branch
that she will then be asked to merge to.

Which WILL catch the unsuspecting contributor unawares. They will find
they'd have a significant divergence to overcome in order to attain the
mergeability of their work.

On Thu, Jul 20, 2017 at 9:06 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>
>
> On Fri, Jun 23, 2017 at 8:23 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> I don’t know where to start here. Git flow does not address the merge
>> conflict problems you talk about. They have nothing to do with the process
>> and are made no easier or harder by following it.
>>
>
> I thought I did demonstrate that it does make conflicts much more probable
> per below. The point where you start your work and the point where you
> merge it do matter. This process does increase the gap between those
> (which implies a higher chance of conflicts and deeper divergence from the
> start). This is the same reason why people try to merge the most recent
> commit stack back as often as possible.
>
>
>>
>> >> For example:
>> >> Master is at A
>> >> Dev branch is at A - B -C ... F.
>> >>
>> >> if I start working at master (A) then I will generate conflicts if I
>> >> have changed the same files (lines) as in B, C, ..., or F.
>> >>
>> >> If I start working at dev (F) then I will not have a chance to generate
>> >> conflicts with B,C,..F but only with commits that happened after i had
>> >> started.
>> >>
>> >> Also, if I start working at master (A) then GitHub flow will suggest
>> >> that I merge into master during the PR. I guarantee 100% of first-time
>> >> PRs will trip on that in GitHub, even if you put "start your work off
>> >> dev not master" 20 times into the project readme.
>> >>
>> >> And then you will face the dilemma whether to ask people to resolve
>> merge
>> >> issues w.r.t. master and resubmit, which will result to high
>> contribtors'
>> >> attrition, or resolve them yourself without deep knowledge of the
>> author's
>> >> intent, which will result in delays and plain errors.
>> >>
>> >> -d
>> >>
>> >
>> >
>>
>>
>


Re: Proposal for changing Mahout's Git branching rules

2017-07-20 Thread Dmitriy Lyubimov
On Fri, Jun 23, 2017 at 8:23 AM, Pat Ferrel  wrote:

> I don’t know where to start here. Git flow does not address the merge
> conflict problems you talk about. They have nothing to do with the process
> and are made no easier or harder by following it.
>

I thought I did demonstrate that it does make conflicts much more probable
per below. The point where you start your work and the point where you
merge it do matter. This process does increase the gap between those
(which implies a higher chance of conflicts and deeper divergence from the
start). This is the same reason why people try to merge the most recent
commit stack back as often as possible.


>
> >> For example:
> >> Master is at A
> >> Dev branch is at A - B -C ... F.
> >>
> >> if I start working at master (A) then I will generate conflicts if I have
> >> changed the same files (lines) as in B, C, ..., or F.
> >>
> >> If I start working at dev (F) then I will not have a chance to generate
> >> conflicts with B,C,..F but only with commits that happened after i had
> >> started.
> >>
> >> Also, if I start working at master (A) then GitHub flow will suggest
> >> that I merge into master during the PR. I guarantee 100% of first-time
> >> PRs will trip on that in GitHub, even if you put "start your work off
> >> dev not master" 20 times into the project readme.
> >>
> >> And then you will face the dilemma whether to ask people to resolve
> >> merge issues w.r.t. master and resubmit, which will result in high
> >> contributors' attrition, or resolve them yourself without deep
> >> knowledge of the author's intent, which will result in delays and
> >> plain errors.
> >>
> >> -d
> >>
> >
> >
>
>


Re: [DISCUSS] Naming convention for multiple spark/scala combos

2017-07-07 Thread Dmitriy Lyubimov
It would seem the 2nd option is preferable, if doable. Any option that has
the most desirable combinations prebuilt is preferable, I guess. Spark
itself also releases tons of Hadoop-profile binary variations, so I don't
have to build one myself.

On Fri, Jul 7, 2017 at 8:57 AM, Trevor Grant 
wrote:

> Hey all,
>
> Working on releasing 0.13.1 with multiple spark/scala combos.
>
> Afaik, there is no 'standard' for multiple spark versions (but I may be
> wrong, I don't claim expertise here).
>
> One approach is simply only release binaries for:
> Spark-1.6 + Scala 2.10
> Spark-2.1 + Scala 2.11
>
> OR
>
> We could do like dl4j
>
> org.apache.mahout:mahout-spark_2.10:0.13.1_spark_1
> org.apache.mahout:mahout-spark_2.11:0.13.1_spark_1
>
> org.apache.mahout:mahout-spark_2.10:0.13.1_spark_2
> org.apache.mahout:mahout-spark_2.11:0.13.1_spark_2
>
> OR
>
> some other option I don't know of.
>


Re: Density based Clustering in Mahout

2017-07-06 Thread Dmitriy Lyubimov
PS Maybe we should say: if you can provide kryo serialization, it can be
assumed platform agnostic, and we can provide an api for embedding that
further. In practice all backends (except, I guess, H2O, which is going
extinct if it hasn't already) currently support kryo, and new potential
ones could easily add it too (after all, it is just a bunch of bytes after
serialization; can't get any more basic than that).
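The "just a bunch of bytes after serialization" point can be made concrete with a round-trip sketch. Plain `java.io` serialization stands in for Kryo here (an assumption made so the example is dependency-free); with Kryo the mechanics differ but the portable artifact is the same byte array:

```scala
// Round-trip sketch: a model payload becomes an opaque byte array that any
// backend can carry, then comes back intact. java.io serialization stands
// in for Kryo so the example needs no external dependencies.
import java.io._

val weights = Array(0.1, 0.2)

val bos = new ByteArrayOutputStream()
new ObjectOutputStream(bos).writeObject(weights)
val bytes: Array[Byte] = bos.toByteArray  // the platform-agnostic payload

val back = new ObjectInputStream(new ByteArrayInputStream(bytes))
  .readObject().asInstanceOf[Array[Double]]
println(back.mkString(","))  // 0.1,0.2
```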

On Thu, Jul 6, 2017 at 11:21 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>
>
> On Thu, Jul 6, 2017 at 9:45 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
>> To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
>> structure, that will exist only within the algorithm for doing in-core
>> operations, (or maybe it lives slightly outside of the algorithm so we
>> don't need to recreate trees for DBSCAN, Random Forests, other tree-based
>> algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
>> Trees only exist WITHIN the in-core, i.e. we don't want to modify the
>> allReduceBlock to accept Matrices OR Trees, that will get out of hand
>> fast.  Please anyone chime in to correct me/argue against.
>>
>
> +1. that's exactly what i meant.
>
>
>> So really, we've stumbled into a more important philosophical question-
>> and
>> that is: Is it acceptable to create objects which make the internals of
>> algorithms easier to read and work with, so long as they may be serialized
>> to incore matrices/vectors? I am +1, and if it is decided this is not
>> acceptable, I need to go back and alter (or drop) things like the CanopyFn
>> [2] of the Canopy Clustering Algorithm.
>>
>
> +1 too if it is practical.
> The dilemma here is that if one wants to stay platform agnostic then the
> algorithm has to use platform-agnostic persistence/serialization, of which
> samsara provides only that of DRM/Matrix/Vector. So yes, if it is naturally
> mapping to record-tagged numerical information, it is preferable (and
> that's what I actually did a lot of when encoding models).
>
> In practice, however, in a particular application setting it is often
> such that people couldn't care less about backend compatibility, in which
> case a custom serialization is totally ok. But in the public Mahout
> version it would run against the party line of staying backend agnostic,
> so if at all possible, with a little overhead, we try to avoid it.
>


Re: Density based Clustering in Mahout

2017-07-06 Thread Dmitriy Lyubimov
On Thu, Jul 6, 2017 at 9:45 AM, Trevor Grant 
wrote:

> To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
> structure, that will exist only within the algorithm for doing in-core
> operations, (or maybe it lives slightly outside of the algorithm so we
> don't need to recreate trees for DBSCAN, Random Forests, other tree-based
> algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
> Trees only exist WITHIN the in-core, i.e. we don't want to modify the
> allReduceBlock to accept Matrices OR Trees, that will get out of hand
> fast.  Please anyone chime in to correct me/argue against.
>

+1. that's exactly what i meant.


> So really, we've stumbled into a more important philosophical question- and
> that is: Is it acceptable to create objects which make the internals of
> algorithms easier to read and work with, so long as they may be serialized
> to incore matrices/vectors? I am +1, and if it is decided this is not
> acceptable, I need to go back and alter (or drop) things like the CanopyFn
> [2] of the Canopy Clustering Algorithm.
>

+1 too if it is practical.
The dilemma here is that if one wants to stay platform agnostic then the
algorithm has to use platform-agnostic persistence/serialization, of which
samsara provides only that of DRM/Matrix/Vector. So yes, if it is naturally
mapping to record-tagged numerical information, it is preferable (and
that's what I actually did a lot of when encoding models).

In practice, however, in a particular application setting it is often such
that people couldn't care less about backend compatibility, in which case
a custom serialization is totally ok. But in the public Mahout version it
would run against the party line of staying backend agnostic, so if at all
possible, with a little overhead, we try to avoid it.


Re: Density based Clustering in Mahout

2017-07-05 Thread Dmitriy Lyubimov
PS I read a few papers, including, I believe, Google's, on partitioning of
the DBScan problem for parallelization. They did not fit my purposes
though, as they inherently assumed that every clustering problem had
enough centroids to be efficiently partitioned. In effect it amounted to
similar effects as the probabilistic sketch techniques mentioned in the
book, but with much more headache for the bang. I eventually turned to
solving problems pre-sketched one way or another (including for density
clustering problems).

On Wed, Jul 5, 2017 at 8:59 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> (1) I abandoned any attempts at DBScan and implemented another density
> algorithm (can't say which, subject to patent restrictions). The reason
> being, I couldn't immediately figure out how to parallelize it
> efficiently (aside from data structure discussions); the base algorithm
> is inherently iterative.
>
> (2) Samsara provides R-base-level algebra, not general data structure
> support. IMO it would not pay to adjust it to do that, any more than trying
> to fit R base to do it. I did implement spatial structures standardized on
> using Samsara for carrying out computations (mostly in-memory), but
> those were still data structures in their own right.
>
> (3) Like I said before, experience tells me that using a collection of 2d
> tensors (esp. sparse tensors) is fine instead of trying to introduce an n-d
> tensor. The fundamental difference is mostly in sparse operations,
> which an n-d sparse tensor could execute intelligently. But this is not
> supported properly by pretty much any library I know, so in most cases all
> the difference in libraries that claim to support it directly is just a
> tensor api over a collection of tensors. The practicality of dense n-d
> tensors is also a bit questionable, since it immediately increases the
> memory requirements for processing a single tensor instance, whereas a
> collection of tensors can be represented so as to retain streaming
> properties, etc. etc. Overall it sounds to me like a solution in search of
> a problem (given my admittedly very limited experience as a practitioner in math).
>
>
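The "collection of 2-d tensors instead of an n-d tensor" point above can be sketched in plain Scala (illustrative only, nothing beyond the standard library): a 3-d tensor is kept as a sequence of 2-d slices, and slice-wise operations stream one matrix at a time.

```scala
// Sketch: a 3-d tensor T(i, j, k) kept as a sequence of 2-d slices, one per i.
val slices: IndexedSeq[Array[Array[Double]]] =
  IndexedSeq.tabulate(2)(i => Array.tabulate(3, 4)((j, k) => i + j + k.toDouble))

// Element access T(i, j, k) is just slices(i)(j)(k) ...
val t = slices(1)(2)(3)   // 1 + 2 + 3 = 6.0

// ... and a slice-wise op (here, summing over i) consumes one matrix at a
// time, which is the streaming property mentioned above.
val summed: Array[Array[Double]] = slices.reduce { (a, b) =>
  a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }
}
```

Any "tensor API" layered on this is exactly the thin wrapper over a collection of matrices described in the email.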
> On Wed, Jul 5, 2017 at 12:09 AM, Aditya <adityasarma...@gmail.com> wrote:
>
>> *Important: do read*
>> Hello everyone,
>>
>> Trevor and I have been discussing as to how to effectively represent an
>> R-Tree in Mahout. Turns out there is a method to represent a Binary Search
>> Tree (BST) in the form of an ancestor matrix. This
>> <http://www.geeksforgeeks.org/construct-ancestor-matrix-from-a-given-binary-tree/>
>> and this
>> <http://www.geeksforgeeks.org/construct-tree-from-ancestor-matrix/>
>> show the conversion logic from a tree representation to a matrix
>> representation and vice versa.
>>
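A hedged plain-Scala sketch of the ancestor-matrix encoding linked above (illustrative names, not a proposed Mahout API): for a tree on nodes 0..n-1, M(i)(j) = 1 iff node i is an ancestor of node j.

```scala
// Illustrative sketch of the ancestor-matrix encoding for a small tree.
case class Node(key: Int, left: Option[Node] = None, right: Option[Node] = None)

def ancestorMatrix(root: Node, n: Int): Array[Array[Int]] = {
  val m = Array.ofDim[Int](n, n)
  def walk(node: Node, ancestors: List[Int]): Unit = {
    // mark every ancestor on the path to this node
    ancestors.foreach(a => m(a)(node.key) = 1)
    (node.left.toList ++ node.right.toList).foreach(walk(_, node.key :: ancestors))
  }
  walk(root, Nil)
  m
}

//      2
//     / \      node keys double as matrix indices
//    0   3
//     \
//      1
val tree = Node(2, Some(Node(0, right = Some(Node(1)))), Some(Node(3)))
val m = ancestorMatrix(tree, 4)
```

Recovering the tree from the matrix (the second linked article) inverts this: a node's parent is the ancestor with the smallest ancestor set containing it.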
>> In a distributed scenario, I know of the following design
>> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
>> which is fairly efficient and intuitive to understand.
>>
>> Now the point for debate is this:
>> **Please read the design provided in the link above before proceeding**
>> The R-Tree will always be a local entity; nowhere in the algorithm is
>> there a need for a distributed R-Tree kind of scenario. On
>> the other hand, the data points as well as the union-find data structure
>> need to be stored in a DRM-like data structure, and they very well can be
>> represented in the form of a matrix. (A union-find data structure
>> can basically be implemented using a vector.)
>>
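The parenthetical claim that a union-find can be implemented with a vector can be sketched like this (plain Scala, illustrative only): the whole structure is a single integer vector `parent`, which is why it maps naturally onto a row of a DRM-like structure.

```scala
// Union-find as one Int vector: parent(i) points to i's parent; a root
// points to itself. Each point starts in its own singleton set.
val parent = Array.tabulate(6)(identity)

def find(i: Int): Int =
  if (parent(i) == i) i
  else { parent(i) = find(parent(i)); parent(i) }  // path compression

def union(a: Int, b: Int): Unit = parent(find(a)) = find(b)

union(0, 1); union(1, 2); union(4, 5)
// points 0, 1, 2 now share one root; 4 and 5 share another; 3 is alone
```

Merging clusters during the clustering pass is then just `union` calls, and cluster membership queries are `find` calls on the vector.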
>> 1. Why not build an R-Tree module in the form of a normal tree with two
>> children and a key-value pair, since an R-Tree will always be an in-core
>> entity? (I'm not sure if this is allowed in Mahout, so veterans please
>> chip in.)
>>
>> 2. If 1. is not done, then the method used for the matrix
>> representation of a BST should be followed; only the elements and
>> conditions will change. In an abstract sense, the matrix representation of
>> an R-Tree and the matrix representation of a Binary Search Tree are
>> analogous. But in this case, the construction and search costs for the
>> matrix version of an R-Tree will be higher.
>>
>>
>> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
>> could offer your insights.*
>>
>> Thanks,
>> Aditya
>>
>>
>> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <trevor.d.gr...@gmail.com>
>> wrote:
>>
>> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?
>> (e.g.
>> >

Re: Proposal for changing Mahout's Git branching rules

2017-06-22 Thread Dmitriy Lyubimov
and contributors' convenience should be golden IMO. I remember experiencing
a mild irritation when I was asked to resolve conflicts on Spark PRs,
because I felt they arose solely because the committer was taking too long
to review my PR and OK it. But if it were the result of the project not
following the simple KISS GitHub PR workflow, it probably would be a bigger
turn-off.

And then imagine the overhead of explaining to every newcomer that they
should PR not against master but against something else, and why, when
every other ASF project accepts PRs against master...

I dunno... when working on github, any deviation from github commonly
accepted PR flows imo would be a fatal wound to the process.

On Thu, Jun 22, 2017 at 4:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> should read
>
> And then you will face the dilemma whether to ask people to resolve merge
> issues w.r.t. *dev* and resubmit against *dev*, which will result in high
> contributor attrition, or resolve them yourself without deep knowledge of
> the author's intent, which will result in delays and plain errors.
>
> On Thu, Jun 22, 2017 at 2:48 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>>
>>
>> On Wed, Jun 21, 2017 at 3:00 PM, Pat Ferrel <p...@occamsmachete.com>
>> wrote:
>>
>>> Which is an optional part of git flow, but maybe take a look at a better
>>> explanation than mine:
>>> <http://nvie.com/posts/a-successful-git-branching-model/>
>>>
>>> I still don’t see how this complicates resolving conflicts. It just
>>> removes the resolution from being a blocker. If some conflict is pushed to
>>> master the project is dead until it is resolved (how often have we seen
>>> this?)
>>
>>
>> This is completely detached from github reality.
>>
>> In this model, all contributors work actually on the same branch. In
>> github, every contributor will fork off their own dev branch.
>>
>> In this model, people start with a fork off the dev branch and push to
>> dev branch. In github, a contributor will fork off the master branch and
>> will PR against the master branch. This is the default behavior, and my
>> gut feeling is that no amount of forewarning is going to change that
>> w.r.t. contributors. And if one starts off work on one branch with intent
>> to commit to another, then a conflict is guaranteed every time he or she
>> changes a file that has been changed on the branch to be merged to.
>>
>> For example:
>> Master is at A
>> Dev branch is at A - B -C ... F.
>>
>> If I start working at master (A) then I will generate conflicts if I have
>> changed the same files (lines) as in B, C, ... or F.
>>
>> If I start working at dev (F) then I will not have a chance to generate
>> conflicts with B, C, ... F, but only with commits that happened after I
>> started.
>>
>> Also, if I start working at master (A) then GitHub flow will suggest that
>> I merge into master during the PR. I guarantee 100% of first-time PRs will
>> trip on that in GitHub, even if you put "start your work off dev, not
>> master" 20 times into the project readme.
>>
>> And then you will face the dilemma whether to ask people to resolve merge
>> issues w.r.t. master and resubmit, which will result in high contributor
>> attrition, or resolve them yourself without deep knowledge of the author's
>> intent, which will result in delays and plain errors.
>>
>> -d
>>
>
>


Re: Proposal for changing Mahout's Git branching rules

2017-06-22 Thread Dmitriy Lyubimov
should read

And then you will face the dilemma whether to ask people to resolve merge
issues w.r.t. *dev* and resubmit against *dev*, which will result in high
contributor attrition, or resolve them yourself without deep knowledge of
the author's intent, which will result in delays and plain errors.

On Thu, Jun 22, 2017 at 2:48 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>
>
> On Wed, Jun 21, 2017 at 3:00 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> Which is an optional part of git flow, but maybe take a look at a better
>> explanation than mine:
>> <http://nvie.com/posts/a-successful-git-branching-model/>
>>
>> I still don’t see how this complicates resolving conflicts. It just
>> removes the resolution from being a blocker. If some conflict is pushed to
>> master the project is dead until it is resolved (how often have we seen
>> this?)
>
>
> This is completely detached from github reality.
>
> In this model, all contributors work actually on the same branch. In
> github, every contributor will fork off their own dev branch.
>
> In this model, people start with a fork off the dev branch and push to dev
> branch. In github, a contributor will fork off the master branch and will
> PR against the master branch. This is the default behavior, and my gut
> feeling is that no amount of forewarning is going to change that w.r.t.
> contributors. And if one starts off work on one branch with intent to
> commit to another, then a conflict is guaranteed every time he or she
> changes a file that has been changed on the branch to be merged to.
>
> For example:
> Master is at A
> Dev branch is at A - B -C ... F.
>
> If I start working at master (A) then I will generate conflicts if I have
> changed the same files (lines) as in B, C, ... or F.
>
> If I start working at dev (F) then I will not have a chance to generate
> conflicts with B, C, ... F, but only with commits that happened after I
> started.
>
> Also, if I start working at master (A) then GitHub flow will suggest that
> I merge into master during the PR. I guarantee 100% of first-time PRs will
> trip on that in GitHub, even if you put "start your work off dev, not
> master" 20 times into the project readme.
>
> And then you will face the dilemma whether to ask people to resolve merge
> issues w.r.t. master and resubmit, which will result in high contributor
> attrition, or resolve them yourself without deep knowledge of the author's
> intent, which will result in delays and plain errors.
>
> -d
>


Re: Proposal for changing Mahout's Git branching rules

2017-06-22 Thread Dmitriy Lyubimov
On Wed, Jun 21, 2017 at 3:00 PM, Pat Ferrel  wrote:

> Which is an optional part of git flow, but maybe take a look at a better
> explanation than mine:
> <http://nvie.com/posts/a-successful-git-branching-model/>
>
> I still don’t see how this complicates resolving conflicts. It just
> removes the resolution from being a blocker. If some conflict is pushed to
> master the project is dead until it is resolved (how often have we seen
> this?)


This is completely detached from github reality.

In this model, all contributors work actually on the same branch. In
github, every contributor will fork off their own dev branch.

In this model, people start with a fork off the dev branch and push to dev
branch. In github, a contributor will fork off the master branch and will
PR against the master branch. This is the default behavior, and my gut
feeling is that no amount of forewarning is going to change that w.r.t.
contributors. And if one starts off work on one branch with intent to
commit to another, then a conflict is guaranteed every time he or she
changes a file that has been changed on the branch to be merged to.

For example:
Master is at A
Dev branch is at A - B -C ... F.

If I start working at master (A) then I will generate conflicts if I have
changed the same files (lines) as in B, C, ... or F.

If I start working at dev (F) then I will not have a chance to generate
conflicts with B, C, ... F, but only with commits that happened after I
started.

Also, if I start working at master (A) then GitHub flow will suggest that I
merge into master during the PR. I guarantee 100% of first-time PRs will
trip on that in GitHub, even if you put "start your work off dev, not
master" 20 times into the project readme.

And then you will face the dilemma whether to ask people to resolve merge
issues w.r.t. master and resubmit, which will result in high contributor
attrition, or resolve them yourself without deep knowledge of the author's
intent, which will result in delays and plain errors.

-d


Re: Proposal for changing Mahout's Git branching rules

2017-06-21 Thread Dmitriy Lyubimov
On Wed, Jun 21, 2017 at 2:17 PM, Pat Ferrel  wrote:

Since merges are done by committers, it’s easy to retarget a contributor’s
> PRs but committers would PR against develop,

IMO it is anything but easy to resolve conflicts, let alone somebody
else's. Spark just asks me to resolve them myself. But if you don't have
a proper target, you can't ask the contributor.

and some projects like PredictionIO make develop the default branch on
> github so it's the one contributors get by default.
>
That would fix it, but I am not sure if we have access to HEAD on the
GitHub mirror. It might involve INFRA to do it. And in that case it would
amount to little more than renaming. It would seem it is much easier to
create a branch, "stable master" or something, and consider master to be
the ongoing PR base.

-1 on the former, -0 on the latter. Judging from the point of view of both
contributor and committer (I am both), it will not make my life easier on
either end.


Re: Looking for help with a talk

2017-06-21 Thread Dmitriy Lyubimov
Isabel, you obviously always can call me with any questions.
-D

On Sat, May 27, 2017 at 2:01 PM, Isabel Drost-Fromm 
wrote:

> Hi,
>
> I've been invited to give a machine learning centric keynote at FrOSCon
> (free and open source conference, the little sister of FOSDEM, roughly 2500
> attendees of all skill levels) in August this year. Content should be less
> technical but focus on big picture, implications and some such.
>
> Would be great to get some help from you.
>
> Anyone here who has time and interest to help out? Anyone planning to
> attend the event already?
>
> Isabel
>
> --
> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
> gesendet.


Re: [DISCUSS] remove getMahoutHome from sparkbindings

2017-06-21 Thread Dmitriy Lyubimov
I think the idea was that Mahout has control over application jars and
guarantees that the minimum number of Mahout jars is added to the
application. In order to do that, it needs to be able to figure out the
home (even if it is an uber jar, as some devs have been pushing for for
some time now, so it might actually happen).

On Sun, Jun 18, 2017 at 7:04 AM, Trevor Grant 
wrote:

> getMahoutHome()
>
> https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b4
> 5863df3e53/spark/src/main/scala/org/apache/mahout/
> sparkbindings/package.scala#L208
>
> seems to not be used anywhere other than the Flink bindings.
>
> I personally don't like the idea of REQUIRING mahout to be fully installed-
> and at the moment, at least for Spark- we don't need to. One can simply use
> maven coordinates in POM or add them when launching spark
> job/shell/zeppelin.
>
> I'm assuming this method existed to support the old spark-shell, but I'd
> like to drop it to prevent anyone from introducing new code that relies on
> it.
>
> However since I didn't put it there and don't know all use cases, I want to
> open up for discussion before I start a JIRA ticket.
>
> thoughts?
>
> tg
>


Re: new committer: Dustin Vanstee

2017-06-21 Thread Dmitriy Lyubimov
welcome!

On Wed, Jun 21, 2017 at 2:07 PM, Pat Ferrel  wrote:

> Welcome Dustin!
>
> Nice work so far, much needed.
>
>
> On Jun 21, 2017, at 12:08 PM, Andrew Palumbo  wrote:
>
> Welcome Dustin!
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Andrew Musselman 
> Date: 06/20/2017 4:18 PM (GMT-08:00)
> To: dev@mahout.apache.org
> Subject: Re: new committer: Dustin Vanstee
>
> Welcome Dustin; glad to have you on board!
>
> On Tue, Jun 20, 2017 at 2:50 PM, Trevor Grant 
> wrote:
>
> > The Project Management Committee (PMC) for Apache Mahout has invited
> > Dustin Vanstee to become a committer and we are pleased to announce
> > that he has accepted.
> >
> > Dustin has taken superb initiative in the website-jekyll migration,
> > as well as showing interest by successfully completing JIRAs in the
> > "precanned" algorithms framework.
> >
> >
> > Being a committer enables easier contribution to the project since
> > there is no need to go via the patch submission process. This should
> > enable better productivity. Being a PMC member enables assistance with
> > the management and to guide the direction of the project.
> >
>
>


Re: Proposal for changing Mahout's Git branching rules

2017-06-21 Thread Dmitriy Lyubimov
so people need to make sure their PR merges to develop instead of master?
Do they need to PR against the develop branch, and if not, who is
responsible for the conflict resolution that will then arise from diffing
and merging into different targets?

On Tue, Jun 20, 2017 at 10:09 AM, Pat Ferrel  wrote:

> As I said I was sure there would be Jenkins issues but they must be small
> since it’s just renaming of target branches. Releases are still made from
> master so I don’t see the issue there at all. Only intermediate CI tasks
> are triggered on other branches. But they would have to be in your examples
> too so I don’t see the benefit of using an ad hoc method in terms of CI.
> We’ve used this method for years with Apache PredictionIO with minimal CI
> issues.
>
> No the process below is not equivalent, treating master as develop removes
> the primary (in my mind) benefit. In git flow the master is always stable
> and the reflection of the last primary/core/default release with only
> critical inter-release fixes. If someone wants to work with stable
> up-to-date source, where do they go with the current process? I would claim
> that there actually may be no place to find such a thing except by tracking
> down some working commit number. It would depend on what stage the project
> is in, in git flow there is never a question—master is always stable. Git
> flow also accounts for all the process exceptions and complexities you
> mention below but in a standardized way that is documented so anyone can
> read the rules and follow them. We/Mahout doesn’t even have to write them,
> they can just be referenced.
>
> But we are re-arguing something I thought was already voted on and that is
> another issue. If we need to re-debate this let’s make it stick one way or
> the other.
>
> I really appreciate you being release master and the thought and work
> you’ve put into this and if we decide to stick with it, fine. But it should
> be a project decision that release masters follow, not up to each release
> master. We are now embarking on a much more complex release than before
> with multiple combinations of dependencies for binaries and so multiple
> artifacts. We need to make the effort tame the complexity somehow or it
> will just multiply.
>
> Given the short nature of the current point release I’d even suggest that
> we target putting our decision in practice after the release, which is a
> better time to make a change if we are to do so.
>
>
> On Jun 19, 2017, at 9:04 PM, Trevor Grant 
> wrote:
>
> First issue, one does not simply just start using a develop branch.  CI
> only triggers off the 'main' branch, which is master by default.  If we
> move to the way you propose, then we need to file a ticket with INFRA I
> believe.  That can be done, but it's not like we just start doing it one
> day.
>
> The current method is, when we cut a release- we make a new branch of that
> release. Master is treated like dev. If you want the latest stable, you
> would check out branch-0.13.0 .  This is the way most major projects
> (citing Spark, Flink, Zeppelin), including Mahout up to version 0.10.x
> worked.  To your point, there being a lack of a recent stable- that's fair,
> but partly that's because no one created branches with the release for
> 0.10.? - 0.12.2.
>
> For all intents and purposes, we are (now once again) following what you
> propose; the only difference is we are treating master as dev, and
> "branch-0.13.0" as master (e.g. last stable).  Larger features go on their
> own branch until they are ready to merge- e.g. ATM there is just one
> feature branch CUDA.  That was the big take away from this discussion last
> time- there needed to be feature branches, as opposed to everyone running
> around either working off WIP PRs or half baked merges, etc.  To that end-
> "website" was a feature branch, and iirc there has been one other feature
> branch that has merged in the last couple of months but I forget what it
> was at the moment.
>
>
>
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Mon, Jun 19, 2017 at 8:02 PM, Pat Ferrel  wrote:
>
> > Perhaps there is a misunderstanding about where a release comes
> > from—master. So any release tools we have should work fine. It’s just
> that
> > until you are ready to pull the trigger, development is in develop or
> more
> > strictly a “getting a release ready” branch called a release branch. This
> > sounds like a lot of branches but in practice it’s trivial to merge and
> > purge. Everything stays clean and rapid fire last minute fixes are
> isolated
> > to the release branch before going into master.
> >
> > The original reason I brought this up is that our Git tools now allow
> > committers to delete old cruft laden 

Re: Redesign

2017-05-10 Thread Dmitriy Lyubimov
I am actually totally blown away! Thanks!

On Wed, May 10, 2017 at 10:32 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Massive thanks to Trevor and Dustin for the work redesigning/implementing
> the website; for the last mile of look and feel I reached out to a designer
> who's interested in working on open-source projects for his portfolio.
>
> I'll keep the team posted if/when I hear back; in the meantime if anyone
> has a designer friend who wants to pitch in please let us know.
>
> Thanks!
>


[jira] [Commented] (MAHOUT-1946) ViennaCL not being picked up by JNI

2017-05-09 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003767#comment-16003767
 ] 

Dmitriy Lyubimov commented on MAHOUT-1946:
--

It seems so. A library-load failure should equate to the backend being unavailable.


> ViennaCL not being picked up by JNI
> ---
>
> Key: MAHOUT-1946
> URL: https://issues.apache.org/jira/browse/MAHOUT-1946
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Musselman
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Using the PR for MAHOUT-1938 but probably in master as well:
> scala> :load ./examples/bin/SparseSparseDrmTimer.mscala
> Loading ./examples/bin/SparseSparseDrmTimer.mscala...
> timeSparseDRMMMul: (m: Int, n: Int, s: Int, para: Int, pctDense: Double, 
> seed: Long)Long
> scala> timeSparseDRMMMul(100,100,100,1,.02,1234L)
> [INFO] Creating org.apache.mahout.viennacl.opencl.GPUMMul solver
> [INFO] Successfully created org.apache.mahout.viennacl.opencl.GPUMMul solver
> gpuRWCW
> 17/02/26 13:18:54 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.UnsatisfiedLinkError: no jniViennaCL in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:726)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:501)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:434)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.loadLib(Context.scala:63)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.<init>(Context.scala:65)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.<clinit>(Context.scala)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$.org$apache$mahout$viennacl$opencl$GPUMMul$$gpuRWCW(GPUMMul.scala:171)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$$anonfun$11.apply(GPUMMul.scala:77)
>   at 
> org.apache.mahout.viennacl.opencl.GPUMMul$$anonfun$11.apply(GPUMMul.scala:77)
>   at org.apache.mahout.viennacl.opencl.GPUMMul$.apply(GPUMMul.scala:127)
>   at org.apache.mahout.viennacl.opencl.GPUMMul$.apply(GPUMMul.scala:33)
>   at 
> org.apache.mahout.math.scalabindings.RLikeMatrixOps.$percent$times$percent(RLikeMatrixOps.scala:37)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$.org$apache$mahout$sparkbindings$blas$ABt$$mmulFunc$1(ABt.scala:98)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$6.apply(ABt.scala:113)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$6.apply(ABt.scala:113)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$pairwiseApply$1.apply(ABt.scala:209)
>   at 
> org.apache.mahout.sparkbindings.blas.ABt$$anonfun$pairwiseApply$1.apply(ABt.scala:209)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/02/26 13:18:54 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.UnsatisfiedLinkError: no jniViennaCL in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:726)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:501)
>   at org.bytedeco.javacpp.Loader.load(Loader.java:434)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.loadLib(Context.scala:63)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.<init>(Context.scala:65)
>   at 
> org.apache.mahout.viennacl.opencl.javacpp.Context$.<clinit>(Context.scala)
>   at 
&

[jira] [Commented] (MAHOUT-1946) ViennaCL not being picked up by JNI

2017-05-09 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003640#comment-16003640
 ] 

Dmitriy Lyubimov commented on MAHOUT-1946:
--

Native backends should be able to recover gracefully from an inability to load 
native code (just report up that the backend is unavailable, rather than 
throwing a class-loading exception) .. guess I'm saying the obvious :) sorry
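A minimal sketch of the graceful recovery suggested here (the names `ViennaClProbe` and `chooseSolver` are illustrative, not the actual Mahout classes): probe the native library once and report the backend as unavailable instead of letting the `UnsatisfiedLinkError` escape.

```scala
// Probe the native library exactly once; a load failure just means the
// native backend is unavailable, it is not a fatal error.
object ViennaClProbe {
  lazy val available: Boolean =
    try { System.loadLibrary("jniViennaCL"); true }   // library name from the trace
    catch { case _: UnsatisfiedLinkError => false }
}

// Solver selection falls back to the JVM solver when the probe fails.
def chooseSolver(): String =
  if (ViennaClProbe.available) "GPUMMul" else "JvmMMul"

// On a machine without the native lib this prints the fallback, not a crash.
println(chooseSolver())
```

The `lazy val` also caches the result, so the (potentially slow) load attempt happens at most once per JVM.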

> ViennaCL not being picked up by JNI
> ---
>
> Key: MAHOUT-1946
> URL: https://issues.apache.org/jira/browse/MAHOUT-1946
> Project: Mahout
>  Issue Type: Bug
>Reporter: Andrew Musselman
>Assignee: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.13.0
>
>

Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-04-21 Thread Dmitriy Lyubimov
There appears to be a bug in Spark transposition operator w.r.t.
aggregating semantics which appears in cases where the same cluster (key)
is present more than once in the same block. The fix is one character long
(+ better test for aggregation).
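A plain-Scala sketch (not the Mahout API) of the aggregating semantics the fix restores: when several rows carry the same key, as happens once every point is keyed by its closest centroid, the transpose must sum them rather than keep only one.

```scala
// Rows keyed by cluster id; here all four points landed in cluster 1,
// matching the example further down the thread.
val keyed = Seq(
  1 -> Array(1.0, 1.0, 1.0, 3.0),
  1 -> Array(1.0, 2.0, 3.0, 4.0),
  1 -> Array(1.0, 3.0, 4.0, 5.0),
  1 -> Array(1.0, 4.0, 5.0, 6.0))

// Aggregating semantics: same-key rows are summed element-wise.
val aggregated: Map[Int, Array[Double]] =
  keyed.groupBy(_._1).map { case (k, rows) =>
    k -> rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  }
// aggregated(1) sums to (4.0, 10.0, 13.0, 18.0): the point count in column 0
// and the summed coordinates, exactly what the centroid-update step needs.
```

Dividing the summed coordinates by the count column then yields the new centroid, which is why the aggregation bug broke the k-means update.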



On Fri, Apr 21, 2017 at 1:06 PM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> One is the cluster ID of the centroid index to which the data point should
> be assigned.
> As per what is given in chapter 4 of this book
> <http://www.amazon.in/Apache-Mahout-Mapreduce-Dmitriy-Lyubimov/dp/1523775785>
> about the aggregating transpose:
> from what I have understood, rows having the same key will be added when
> we take the aggregating transpose of the matrix.
> So I think there should be a way to assign new values to row keys, and I
> think Dmitriy has also mentioned the same thing in the approach he has
> outlined in this mail chain.
> Correct me if I am wrong.
>
>
> Thanks
> Parth Khatwani
>
>
>
>
>
>
> On Sat, Apr 22, 2017 at 1:54 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
> > Got it- in short no.
> >
> > Think of the keys like a dictionary or HashMap.
> >
> > That's why everything is ending up on row 1.
> >
> > What are you trying to achieve by creating keys of 1?
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, Apr 21, 2017 at 2:26 PM, KHATWANI PARTH BHARAT <
> > h2016...@pilani.bits-pilani.ac.in> wrote:
> >
> > > @Trevor
> > >
> > >
> > >
> > > I was trying to write *Kmeans* using the Mahout DRM as per the
> > > algorithm outlined by Dmitriy.
> > > I was facing the problem of assigning cluster ids to the row keys.
> > > For example, consider the below matrix, where columns 1 to 3 are the
> > > data points and column 0 contains the count of the point:
> > > {
> > >  0 => {0:1.0,1: 1.0,2: 1.0,   3: 3.0}
> > >  1 => {0:1.0,1: 2.0,2: 3.0,   3: 4.0}
> > >  2 => {0:1.0,1: 3.0,2: 4.0,   3: 5.0}
> > >  3 => {0:1.0,1: 4.0,2: 5.0,   3: 6.0}
> > > }
> > >
> > > Now, after calculating the centroid which is closest to the data point
> > > at the zeroth index, i am trying to assign the centroid index to the
> > > *row key*.
> > >
> > > Now suppose that every data point is assigned to the centroid at
> > > index 1, so key=1 is assigned to each and every row
> > >
> > > using the  code below
> > >
> > >  val drm2 = A.mapBlock() {
> > >    case (keys, block) =>
> > >      // assigning 1 to each row index
> > >      for (row <- 0 until keys.size) {
> > >        keys(row) = 1
> > >      }
> > >      (keys, block)
> > >  }
> > >
> > >
> > >
> > > I want the above matrix to be in this form:
> > >
> > >
> > > {
> > >  1 => {0:1.0,1: 1.0,2: 1.0,   3: 3.0}
> > >  1 => {0:1.0,1: 2.0,2: 3.0,   3: 4.0}
> > >  1 => {0:1.0,1: 3.0,2: 4.0,   3: 5.0}
> > >  1 => {0:1.0,1: 4.0,2: 5.0,   3: 6.0}
> > > }
> > >
> > >
> > >
> > >
> > > It turns out to be this:
> > > {
> > >  0 => {}
> > >  1 => {0:1.0,1:4.0,2:5.0,3:6.0}
> > >  2 => {}
> > >  3 => {}
> > > }
> > >
> > >
> > >
> > > I am confused whether assigning the new key values to the row index is
> > > done
> > > through the following code line
> > >
> > >   // assigning 1 to each row index
> > >   keys(row) = 1
> > >
> > >
> > > or is there any other way.
> > >
> > >
> > >
> > > I am not able to find any useful links or references on the internet;
> > > even Andrew
> > > and Dmitriy's book does not have any proper reference for the
> > > above-mentioned issue.
> > >
> > >
> > >
> > > Thanks & Regards
> > > Parth Khatwani
> > >
> > >
> > >
> > > On Fri, Apr 21, 2017 at 10:06 PM, Trevor Grant <
> trevor.d.gr...@gmail.com
> > >
> > > wrote:
> > >
> > > > OK, i dug into this before i r

Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-04-12 Thread Dmitriy Lyubimov
 {
> closest = tempDist
> index = row
>   }
> }
> index
>   }
>
>//calculating the sum of squared distance between the points(Vectors)
>   def ssr(a: Vector, b: Vector): Double = {
> (a - b) ^= 2 sum
>   }
>
>   //method used to create (1|D)
>   def addCentriodColumn(arg: Array[Double]): Array[Double] = {
> val newArr = new Array[Double](arg.length + 1)
> newArr(0) = 1.0;
> for (i <- 0 until (arg.size)) {
>   newArr(i + 1) = arg(i);
> }
> newArr
>   }
>
>
> Thanks & Regards
> Parth Khatwani
>
>
>
> On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
> >
> > -- Forwarded message --
> > From: Dmitriy Lyubimov <dlie...@gmail.com>
> > Date: Fri, Mar 31, 2017 at 11:34 PM
> > Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
> > Samsara"
> > To: "dev@mahout.apache.org" <dev@mahout.apache.org>
> >
> >
> > ps1 this assumes row-wise construction of A based on training set of m
> > n-dimensional points.
> > ps2 since we are doing multiple passes over A it may make sense to make
> > sure it is committed to spark cache (by using checkpoint api), if spark
> is
> > used
> >
> > On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > here is the outline. For details of APIs, please refer to samsara
> manual
> > > [2], i will not be be repeating it.
> > >
> > > Assume your training data input is m x n matrix A. For simplicity let's
> > > assume it's a DRM with int row keys, i.e., DrmLike[Int].
> > >
> > > Initialization:
> > >
> > > First, classic k-means starts by selecting initial clusters, by
> sampling
> > > them out. You can do that by using sampling api [1], thus forming a k
> x n
> > > in-memory matrix C (current centroids). C is therefore of Mahout's
> Matrix
> > > type.
> > >
> > > You then proceed by alternating between cluster assignments and
> > > recomputing centroid matrix C till convergence based on some test or
> > > simply limited by epoch count budget, your choice.
> > >
> > > Cluster assignments: here, we go over current generation of A and
> > > recompute centroid indexes for each row in A. Once we recompute index,
> we
> > > put it into the row key . You can do that by assigning centroid indices
> > to
> > > keys of A using operator mapblock() (details in [2], [3], [4]). You
> also
> > > need to broadcast C in order to be able to access it in efficient
> manner
> > > inside mapblock() closure. Examples of that are plenty given in [2].
> > > Essentially, in mapblock, you'd reform the row keys to reflect cluster
> > > index in C. while going over A, you'd have a "nearest neighbor" problem
> > to
> > > solve for the row of A and centroids C. This is the bulk of computation
> > > really, and there are a few tricks there that can speed this step up in
> > > both exact and approximate manner, but you can start with a naive
> search.
> > >
> > > Centroid recomputation:
> > > once you have assigned centroids to the keys of matrix A, you'd want
> > > to do an aggregating transpose of A to compute essentially the average
> > > of rows of A grouped by the centroid key. The trick is to do a
> > > computation of (1|A)' which will result in a matrix of the shape
> > > (counts/sums of cluster rows). This is the part i find difficult to
> > > explain without latex graphics.
> > >
> > > In Samsara, construction of (1|A)' corresponds to DRM expression
> > >
> > > (1 cbind A).t (again, see [2]).
> > >
> > > So when you compute, say,
> > >
> > > B = (1 | A)',
> > >
> > > then B is (n+1) x k, so each column contains a vector corresponding to
> a
> > > cluster 1..k. In such column, the first element would be # of points in
> > the
> > > cluster, and the rest of it would correspond to sum of all points. So
> in
> > > order to arrive to an updated matrix C, we need to collect B into
> memory,
> > > and slice out counters (first row) from the rest of it.
> > >
> > > So, to compute C:
> > >
> > > C <- B (2:,:) each row divided by B(1,:)
> > >
> > > (watch out for empty clusters with 0 elements, this will cause lack of
> > > convergence and NaNs in the newly computed C).

Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread Dmitriy Lyubimov
ps1 this assumes row-wise construction of A based on training set of m
n-dimensional points.
ps2 since we are doing multiple passes over A it may make sense to make
sure it is committed to spark cache (by using checkpoint api), if spark is
used

On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:

> here is the outline. For details of APIs, please refer to samsara manual
> [2], i will not be be repeating it.
>
> Assume your training data input is m x n matrix A. For simplicity let's
> assume it's a DRM with int row keys, i.e., DrmLike[Int].
>
> Initialization:
>
> First, classic k-means starts by selecting initial clusters, by sampling
> them out. You can do that by using sampling api [1], thus forming a k x n
> in-memory matrix C (current centroids). C is therefore of Mahout's Matrix
> type.
>
> You then proceed by alternating between cluster assignments and
> recomputing centroid matrix C till convergence based on some test or
> simply limited by epoch count budget, your choice.
>
> Cluster assignments: here, we go over current generation of A and
> recompute centroid indexes for each row in A. Once we recompute the index,
> we put it into the row key. You can do that by assigning centroid indices to
> keys of A using operator mapblock() (details in [2], [3], [4]). You also
> need to broadcast C in order to be able to access it in efficient manner
> inside mapblock() closure. Examples of that are plenty given in [2].
> Essentially, in mapblock, you'd reform the row keys to reflect cluster
> index in C. while going over A, you'd have a "nearest neighbor" problem to
> solve for the row of A and centroids C. This is the bulk of computation
> really, and there are a few tricks there that can speed this step up in
> both exact and approximate manner, but you can start with a naive search.
>
> Centroid recomputation:
> once you have assigned centroids to the keys of matrix A, you'd want to do
> an aggregating transpose of A to compute essentially the average of rows of
> A grouped by the centroid key. The trick is to do a computation of (1|A)'
> which will result in a matrix of the shape (counts/sums of cluster rows).
> This is the part i find difficult to explain without latex graphics.
>
> In Samsara, construction of (1|A)' corresponds to DRM expression
>
> (1 cbind A).t (again, see [2]).
>
> So when you compute, say,
>
> B = (1 | A)',
>
> then B is (n+1) x k, so each column contains a vector corresponding to a
> cluster 1..k. In such column, the first element would be # of points in the
> cluster, and the rest of it would correspond to sum of all points. So in
> order to arrive to an updated matrix C, we need to collect B into memory,
> and slice out counters (first row) from the rest of it.
>
> So, to compute C:
>
> C <- B (2:,:) each row divided by B(1,:)
>
> (watch out for empty clusters with 0 elements, this will cause lack of
> convergence and NaNs in the newly computed C).
>
> This operation obviously uses subblocking and row-wise iteration over B,
> for which i am again making reference to [2].
>
>
> [1] https://github.com/apache/mahout/blob/master/math-scala/
> src/main/scala/org/apache/mahout/math/drm/package.scala#L149
>
> [2], Samsara manual, a bit dated but viable, http://apache.github.
> io/mahout/doc/ScalaSparkBindings.html
>
> [3] scaladoc, again, dated but largely viable for the purpose of this
> exercise:
> http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm
>
> [4] mapblock etc. http://apache.github.io/mahout/0.10.1/docs/mahout-
> math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
>
> On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
>> @Dmitriy, can you please again tell me the approach to move ahead.
>>
>>
>> Thanks
>> Parth Khatwani
>>
>>
>> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
>> h2016...@pilani.bits-pilani.ac.in> wrote:
>>
>> > yes i am unable to figure out the way ahead.
>> > Like how to create the augmented matrix A := (0|D) which you have
>> > mentioned.
>> >
>> >
>> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> > wrote:
>> >
>> >> has my reply to your post on @user been a bit confusing?
>> >>
>> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
>> >> h2016...@pilani.bits-pilani.ac.in> wrote:
>> >>
>> >> > Sir,
>> >> > I am trying to write the kmeans clustering algorithm using Mahout
>> >> Samsara
>> >> > but i am bit confused
>> >> > about how to leverage Distributed Row Matrix for the same. Can
>> anybody
>> >> help
>> >> > me with same.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > Thanks
>> >> > Parth Khatwani
>> >> >
>> >>
>> >
>> >
>>
>
>


Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread Dmitriy Lyubimov
here is the outline. For details of APIs, please refer to samsara manual
[2], i will not be be repeating it.

Assume your training data input is m x n matrix A. For simplicity let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

Initialization:

First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using sampling api [1], thus forming a k x n
in-memory matrix C (current centroids). C is therefore of Mahout's Matrix
type.

You then proceed by alternating between cluster assignments and recomputing
centroid matrix C till convergence based on some test or simply limited by
epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute centroid indexes for each row in A. Once we recompute the index,
we put it into the row key. You can do that by assigning centroid indices to
keys of A
using operator mapblock() (details in [2], [3], [4]). You also need to
broadcast C in order to be able to access it in efficient manner inside
mapblock() closure. Examples of that are plenty given in [2]. Essentially,
in mapblock, you'd reform the row keys to reflect the cluster index in C.
While going over A, you'd have a "nearest neighbor" problem to solve for the
row of A and the centroids C. This is the bulk of the computation really, and
there are
a few tricks there that can speed this step up in both exact and
approximate manner, but you can start with a naive search.

Centroid recomputation:
once you have assigned centroids to the keys of matrix A, you'd want to do an
aggregating transpose of A to compute essentially the average of rows of A
grouped by the centroid key. The trick is to do a computation of (1|A)' which
will result in a matrix of the shape (counts/sums of cluster rows). This is
the part i find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to DRM expression

(1 cbind A).t (again, see [2]).

So when you compute, say,

B = (1 | A)',

then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such column, the first element would be # of points in the
cluster, and the rest of it would correspond to sum of all points. So in
order to arrive to an updated matrix C, we need to collect B into memory,
and slice out counters (first row) from the rest of it.

So, to compute C:

C <- B (2:,:) each row divided by B(1,:)

(watch out for empty clusters with 0 elements, this will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B,
for which i am again making reference to [2].


[1]
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149

[2], Samsara manual, a bit dated but viable,
http://apache.github.io/mahout/doc/ScalaSparkBindings.html

[3] scaladoc, again, dated but largely viable for the purpose of this
exercise:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm

[4] mapblock etc.
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
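
Putting the whole recipe together, one iteration might look roughly like the sketch below. This is an untested adaptation of the outline above: `drmA` (a DrmLike[Int]), `k`, and `maxEpochs` are assumed inputs, the nearest-centroid search is the naive one mentioned, and the API usage follows [1]-[4]:

```scala
import org.apache.mahout.math.{Matrix, Vector}
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings._
import RLikeOps._

// squared Euclidean distance between two in-core vectors
def dist2(a: Vector, b: Vector): Double = ((a - b) ^= 2).sum

// initialization: sample k rows of A as the in-core centroid matrix C
var c: Matrix = drmSampleKRows(drmA, k)

for (epoch <- 1 to maxEpochs) {
  val bcastC = drmBroadcast(c)

  // cluster assignment: rewrite each row key to its nearest centroid index
  val drmAssigned = drmA.mapBlock() { case (keys, block) =>
    val cLocal = bcastC.value
    val newKeys = Array.tabulate(block.nrow) { r =>
      (0 until cLocal.nrow).minBy(i => dist2(block(r, ::), cLocal(i, ::)))
    }
    (newKeys, block)
  }

  // centroid recomputation: B = (1 | A)' is (n+1) x k, with cluster counts
  // in row 0 and per-cluster coordinate sums below
  val b = (1 cbind drmAssigned).t.collect
  c = b(1 until b.nrow, ::).t.cloned
  for (i <- 0 until c.nrow if b(0, i) > 0) c(i, ::) /= b(0, i)
  // clusters with zero members are left untouched to avoid NaNs
}
```

As noted elsewhere in the thread, `drmA` is worth checkpointing to the Spark cache, since it is traversed once per epoch.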

On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> @Dmitriy, can you please again tell me the approach to move ahead.
>
>
> Thanks
> Parth Khatwani
>
>
> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
> > yes i am unable to figure out the way ahead.
> > Like how to create the augmented matrix A := (0|D) which you have
> > mentioned.
> >
> >
> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> >> has my reply to your post on @user been a bit confusing?
> >>
> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
> >> h2016...@pilani.bits-pilani.ac.in> wrote:
> >>
> >> > Sir,
> >> > I am trying to write the kmeans clustering algorithm using Mahout
> >> Samsara
> >> > but i am bit confused
> >> > about how to leverage Distributed Row Matrix for the same. Can anybody
> >> help
> >> > me with same.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > Thanks
> >> > Parth Khatwani
> >> >
> >>
> >
> >
>


Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread Dmitriy Lyubimov
Has my reply to your post on @user been a bit confusing?

On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> Sir,
> I am trying to write the kmeans clustering algorithm using Mahout Samsara
> but i am a bit confused
> about how to leverage the Distributed Row Matrix for the same. Can anybody
> help me with the same.
>
>
>
>
>
> Thanks
> Parth Khatwani
>


Re: Marketing

2017-03-24 Thread Dmitriy Lyubimov
On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel  wrote:

> The multiple backend support is such a waste of time IMO. The DSL and GPU
> support is super important and should be made even more distributed. The
> current (as I understand it) single threaded GPU per VM is only the first
> step in what will make Mahout important for a long time to come.
>

This seems a bit self-contradictory. Multiple backends are the only thing
that remedies it for me. By that i mean both distributed (i/o) backends and
the in-memory ones.

Good CPU and GPU plugins will be important, as well as communication layer
alternatives to spark. Spark is not working out well for interconnected
problems, and H20 and Flink, well, I'd just forget about them. I'd
certainly drop H20 for now. But ability to plug in new communication
backend primitives seems to be critical in my experience, as well as
variety of cpu/gpu chipset support. (I do use both in-memory and i/o custom
backends that IMO are a must).

In that sense, it is super-important that custom backends are easy to plug
(even if you are absolutely legitimately dissatisfied with the existing
ones).


> Think of Mahout in 5 years what will be important? H2O? Hadoop Mapreduce?
> Flink? I’ll stake my dollar on no. GPUs yes and up the stakes. Streaming
> online learning (kappa style) yes but not sure Mahout is made for this
> right now.
>
> Or if we are talking about web site revamp +1, I’d be happy to upgrade my
> section and have only held off waiting to see a redesign or moving to
> Jekyll.
>
> As to a new mascot, ok, but the old one fits the name. We tried sub-naming
> Mahout-Samsara to symbolize the changing nature and rebirth of the project,
> maybe we should drop the name Mahout altogether. the name Mahout, like the
> blue man, is not relevant to the project anymore and maybe renaming, is
> good for marketing.
>
> On Mar 24, 2017, at 7:37 AM, Nikolai Sakharnykh 
> wrote:
>
> Agree that the website feels outdated. I would add Samsara code example on
> the front page, list of key algorithms implemented, supported backends,
> github & download links, and cut down the news part especially towards the
> end with flat release numbers and dates. Also probably reorganize the tabs.
>
> If we go with honey badger as a mascot do we have any ideas on the logo
> itself? Honey badger biting/eating a snake?)
>
> -Original Message-
> From: Trevor Grant [mailto:trevor.d.gr...@gmail.com]
> Sent: Thursday, March 23, 2017 8:53 PM
> To: dev@mahout.apache.org
> Cc: u...@mahout.apache.org
> Subject: Re: Marketing
>
> A student once asked his teacher, "Master, what is enlightenment?"
>
> The master replied, "When hungry, eat. When tired, sleep."
>
> Sounds like the honey badger to me...
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Thu, Mar 23, 2017 at 5:43 PM, Pat Ferrel  wrote:
>
> > The little blue man (the mahout) was reborn (samsara) as a honey-badger?
> > He must be close indeed to reaching true enlightenment, or is that
> Buddhism?
> >
> >
> > On Mar 23, 2017, at 12:42 PM, Andrew Palumbo  wrote:
> >
> > +1 on revamp.
> >
> >
> >
> > Sent from my Verizon Wireless 4G LTE smartphone
> >
> >
> >  Original message 
> > From: Trevor Grant 
> > Date: 03/23/2017 12:36 PM (GMT-08:00)
> > To: u...@mahout.apache.org, dev@mahout.apache.org
> > Subject: Marketing
> >
> > Hey user and dev,
> >
> > With 0.13.0 the Apache Mahout project has added some significant updates.
> >
> > The website is starting to feel 'dated' I think it could use a reboot.
> >
> > The blue person riding the elephant has less significance in
> > Mahout-Samsara's modular backends.
> >
> > Would like to open the floor to discussion on website reboot (and who
> > might be willing to take on such a project), as well as new mascot.
> >
> > To kick off- in an offline talk there was the idea of A honey badger
> > (bc honey-badger don't care, just like mahout don't care what back end
> > or native solvers you are using, and also bc a cobra bites a honey
> > badger and he takes a little nap then wakes up and finishes eating the
> > cobra. honey badger eats snakes, and does all the work while the other
> > animals pick up the scraps.
> > see this short documentary on the honey badger:
> > https://www.youtube.com/watch?v=4r7wHMg5Yjg ) ^^audio not safe for
> > work
> >
> > Con: its almost tooo jokey.
> >
> > Other idea: are coy-wolfs.
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."
> > -Virgil*
> >
> >
>
> 

Re: Contributing an algorithm for samsara

2017-03-03 Thread Dmitriy Lyubimov
And by formula yes i mean R syntax.

A possible use case would be to take a Spark DataFrame and a formula (say,
`age ~ . -1`) and produce DrmLike[Int] outputs (a distributed matrix type)
that convert into predictors and a target.

In this particular case, this formula means that the predictor matrix (X)
would have all original variables except `age` (for categorical variables
factor extraction is applied), with no bias column.
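
As a concrete illustration of that contract (everything here -- the frame, its columns, and the encoding -- is made up for the example and is not an existing Mahout API), `age ~ . -1` over a frame with columns (age, height, city) would extract:

```scala
// target: the age column; predictors: everything else, with the categorical
// `city` expanded into factor (one-hot) columns and, per "-1", no bias column
case class Row(age: Double, height: Double, city: String)
val frame = Seq(Row(23, 1.70, "NY"), Row(31, 1.80, "SF"), Row(40, 1.60, "NY"))

val levels = frame.map(_.city).distinct.sorted   // factor extraction
val y = frame.map(_.age).toArray                 // target vector
val x = frame.map { r =>                         // predictor rows:
  Array(r.height) ++ levels.map(l => if (r.city == l) 1.0 else 0.0)
}                                                // (height, city=NY, city=SF)
```

A distributed implementation would emit `x` and `y` as DrmLike[Int] rather than in-core collections.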

Some knowledge of R and SAS is required to pin the compatibility nuances
there.

Maybe we could have reasonable simplifications or omissions compared to the
R stuff, if we can be reasonably convinced it is actually better that way
than the vanilla R contract, but IMO it would be really useful to retain 100%
compatibility there, since one of the ideas is to retain R-like-ness with
these things.


On Fri, Mar 3, 2017 at 12:31 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>
>
> On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski <j...@jagunet.com> wrote:
>>
>>>
>>>
>>
>>> >
>>> > 3) On the feature extraction per R like formula can you elaborate more
>>> here, are you talking about feature extraction using R like dataframes and
>>> operators?
>>>
>>
>>
> Yes. I would start by doing a generic formula parser and then the specific
> part that works with backend-specific data frames. For Spark, i don't see
> any reason to write our own; we'd just have an adapter for the Spark native
> data frames.
>


Re: Contributing an algorithm for samsara

2017-03-03 Thread Dmitriy Lyubimov
On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski  wrote:
>
>>
>>
>
>> >
>> > 3) On the feature extraction per R like formula can you elaborate more
>> here, are you talking about feature extraction using R like dataframes and
>> operators?
>>
>
>
Yes. I would start by doing a generic formula parser and then the specific
part that works with backend-specific data frames. For Spark, i don't see any
reason to write our own; we'd just have an adapter for the Spark native data
frames.


Re: Contributing an algorithm for samsara

2017-03-03 Thread Dmitriy Lyubimov
I am getting a little bit lost about who asked what here; replies inline.

On Fri, Mar 3, 2017 at 4:09 AM, Jim Jagielski  wrote:

>
>
> Would it make sense to keep them as-is, and "pull them out", as
> it were, should they prove to be wanted/needed by the other algo users?
>

I would hope it is of some help (especially math and in-memory prototype)
for something to look back to. I would really try to plot it all anew, I
found it usually helps my focus if I work with my own code from the ground
up.

So no, i would not just try to take it as is. Not without careful review.

Also, if you noticed, the distributed version is quasi-algebraic, i.e., it
contains direct Spark dependencies and code that relies on Spark. As such,
it cannot be put into our decompositions package in mahout-math-scala
module, where most of other distributed decompositions sit.

I suspect it could be made 100% algebraic with current primitives available
in Samsara. This is necessary condition to get it into mahout-math-scala.
If it can't be done, then it has to live in mahout-spark module as one
backend implementation only.


>
> >
> > 3) On the feature extraction per R like formula can you elaborate more
> here, are you talking about feature extraction using R like dataframes and
> operators?
>


> >
> >
> >
> > More later as I read through the papers.
>

I would really start there before anything else. (Moreover, this is the
most fun part of all of it, as far as i am concerned:) ).

Also, my adapted formulas are attached to the issue, like i mentioned. I
would look through the math to see if it is clear (for interpretation); if
not, let's discuss any questions.


> >


Re: Contributing an algorithm for samsara

2017-02-17 Thread Dmitriy Lyubimov
in particular, this is the Samsara implementation of double-weighted ALS:
https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626


On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Jim,
>
> if ALS is of interest, and as far as weighed ALS is concerned (since we
> already have trivial regularized ALS in the "decompositions" package),
> here's uncommitted samsara-compatible patch from a while back:
> https://issues.apache.org/jira/browse/MAHOUT-1365
>
> it combines weights on both data points (a.k.a "implicit feedback" als)
> and regularization rates  (paper references are given). We combine both
> approaches in one (which is novel, i guess, but yet simple enough).
> Obviously the final solver can also be used as pure reg rate regularized if
> wanted, making it equivalent to one of the papers.
>
> You may know implicit feedback paper from mllib's implicit als, but unlike
> it was done over there (as a use case sort problem that takes input before
> even features were extracted), we split the problem into pure algebraic
> solver (double-weighed ALS math) and leave the feature extraction outside
> of this issue per se (it can be added as a separate adapter).
>
> The reason for that is that the specific use-case oriented implementation
> does not necessarily leave the space for feature extraction that is
> different from described use case of partially consumed streamed videos in
> the paper. (e.g., instead of videos one could count visits or clicks or
> add-to-cart events which may need additional hyperparameter found for them
> as part of feature extraction and converting observations into "weghts").
>
> The biggest problem with these ALS methods however is that all
> hyperparameters require multidimensional crossvalidation and optimization.
> I think i mentioned it before as list of desired solutions, as it stands,
> Mahout does not have a hyperparameter fitting routine.
>
> In practice, when using these kind of ALS, we have a case of
> multidimensional hyperparameter optimization. One of them comes from the
> fitter (reg rate, or base reg rate in case of weighed regularization), and
> the others come from feature extraction process. E.g., in original paper
> they introduce (at least) 2 formulas to extract measure weights from the
> streaming video observations, and each of them had one parameter, alpha,
> which in context of the whole problem becomes effectively yet another
> hyperparameter to fit. In other use cases when your confidence measurement
> may be coming from different sources and observations, the confidence
> extraction may actually have even more hyperparameters to fit than just
> one. And when we have a multidimensional case, simple approaches (like grid
> or random search) become either cost prohibitive or ineffective, due to the
> curse of dimensionality.
>
> At the time i was contributing that method, i was using it in conjunction
> with multidimensional bayesian optimizer, but the company that i wrote it
> for did not have it approved for contribution (unlike weighed als) at that
> time.
>
> Anyhow, perhaps you could read the algebra in both ALS papers there and
> ask questions, and we could worry about hyperparameter optimization a bit
> later and performance a bit later.
>
> On the feature extraction front (as in implicit feedback als per Koren
> etc.), this is an ideal use case for more general R-like formula approach,
> which is also on desired list of things to have.
>
> So i guess we have 3 problems really here:
> (1) double-weighed ALS
> (2) bayesian optimization and crossvalidation in an n-dimensional
> hyperparameter space
> (3) feature extraction per (preferrably R-like) formula.
>
>
> -d
>
>
> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap@outlook.com>
> wrote:
>
>> +1 to glms
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Trevor Grant <trevor.d.gr...@gmail.com>
>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Contributing an algorithm for samsara
>>
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>>
>> Buyer beware- GLMs will be a bit of work- doable, but that would be
>> jumping
>> in neck first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=projec
>> t%20%3D%20MAHOU

Re: Contributing an algorithm for samsara

2017-02-17 Thread Dmitriy Lyubimov
Jim,

if ALS is of interest, and as far as weighted ALS is concerned (since we
already have trivial regularized ALS in the "decompositions" package),
here's an uncommitted Samsara-compatible patch from a while back:
https://issues.apache.org/jira/browse/MAHOUT-1365

it combines weights on both data points (a.k.a. "implicit feedback" ALS) and
regularization rates (paper references are given). We combine both
approaches in one (which is novel, i guess, but simple enough).
Obviously the final solver can also be used as purely reg-rate regularized if
wanted, making it equivalent to one of the papers.
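
To make the algebra concrete, here is one reading of what each half-step solves per row, as a plain-Scala illustration (this is not the MAHOUT-1365 code; `v` is the item-factor matrix, `c` the per-point confidence weights, `r` the observed preferences, and `lambda` the -- possibly row-dependent -- regularization rate):

```scala
// Solves (V' C V + lambda * I) u = V' C r for one user/row, where
// C = diag(c).  The system is a small f x f one, so naive elimination is fine.
def alsRowUpdate(v: Array[Array[Double]],  // items x f factor matrix V
                 c: Array[Double],         // confidence weight per item
                 r: Array[Double],         // observed preference per item
                 lambda: Double): Array[Double] = {
  val f = v(0).length
  val a = Array.fill(f, f)(0.0)            // accumulates V' C V + lambda I
  val b = Array.fill(f)(0.0)               // accumulates V' C r
  for (j <- v.indices; p <- 0 until f) {
    b(p) += v(j)(p) * c(j) * r(j)
    for (q <- 0 until f) a(p)(q) += v(j)(p) * c(j) * v(j)(q)
  }
  for (p <- 0 until f) a(p)(p) += lambda
  // Gauss-Jordan elimination; the system is SPD for lambda > 0
  for (p <- 0 until f) {
    val piv = a(p)(p)
    for (q <- p until f) a(p)(q) /= piv
    b(p) /= piv
    for (j <- 0 until f if j != p) {
      val factor = a(j)(p)
      for (q <- p until f) a(j)(q) -= factor * a(p)(q)
      b(j) -= factor * b(p)
    }
  }
  b                                        // the updated factor row u
}
```

Making `lambda` depend on the row is the second of the two weightings combined in the patch.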

You may know the implicit feedback paper from mllib's implicit ALS, but
unlike how it was done over there (as a use-case-specific problem that takes
input before features were even extracted), we split the problem into a pure
algebraic solver (the double-weighted ALS math) and leave the feature
extraction outside of this issue per se (it can be added as a separate
adapter).

The reason for that is that the specific use-case-oriented implementation
does not necessarily leave space for feature extraction that is different
from the described use case of partially consumed streamed videos in the
paper (e.g., instead of videos one could count visits or clicks or
add-to-cart events, which may need an additional hyperparameter found for
them as part of feature extraction and converting observations into
"weights").

The biggest problem with these ALS methods, however, is that all
hyperparameters require multidimensional cross-validation and optimization.
I think i mentioned it before in the list of desired solutions; as it
stands, Mahout does not have a hyperparameter fitting routine.

In practice, when using this kind of ALS, we have a case of
multidimensional hyperparameter optimization. One hyperparameter comes from
the fitter (the reg rate, or the base reg rate in the case of weighted
regularization), and the others come from the feature extraction process.
E.g., in the original paper they introduce (at least) two formulas to extract
measure weights from the streaming video observations, and each of them has
one parameter, alpha, which in the context of the whole problem effectively
becomes yet another hyperparameter to fit. In other use cases, where your
confidence measurements may come from different sources and observations,
the confidence extraction may actually have even more hyperparameters to fit
than just one. And when we have a multidimensional case, simple approaches
(like grid or random search) become either cost-prohibitive or ineffective,
due to the curse of dimensionality.
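A toy sketch of that dimensionality argument — illustrative names only; the "CV score" below is a stand-in for a real cross-validation run over a (reg rate, alpha) pair, not a Mahout API. A grid of g points per axis costs g^d model fits in d dimensions, while random search spends a fixed budget no matter the dimensionality:

```scala
// Stand-in for a cross-validation score over (lambda, alpha); best at (0.1, 40).
val rng = new scala.util.Random(42)

def toyCvScore(lambda: Double, alpha: Double): Double =
  math.pow(lambda - 0.1, 2) + math.pow(alpha - 40.0, 2) / 1e4

// Grid search: g points per axis costs g^d model fits in d dimensions.
def gridSearchCost(pointsPerAxis: Int, dims: Int): Long =
  math.pow(pointsPerAxis.toDouble, dims.toDouble).toLong

// Random search: a fixed budget of fits regardless of dimensionality.
def randomSearch(budget: Int): (Double, Double, Double) =
  (1 to budget).map { _ =>
    val l = rng.nextDouble()         // lambda sampled from [0, 1)
    val a = rng.nextDouble() * 100.0 // alpha sampled from [0, 100)
    (l, a, toyCvScore(l, a))
  }.minBy(_._3)
```

With 6 hyperparameters and 10 grid points each, the grid already needs 10^6 fits — which is why the paragraph above points toward Bayesian optimization instead.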

At the time I was contributing that method, I was using it in conjunction
with a multidimensional Bayesian optimizer, but the company that I wrote it
for did not have it approved for contribution (unlike the weighted ALS) at
that time.

Anyhow, perhaps you could read the algebra in both ALS papers there and ask
questions, and we could worry about hyperparameter optimization and
performance a bit later.

On the feature extraction front (as in implicit feedback ALS per Koren
et al.), this is an ideal use case for a more general R-like formula
approach, which is also on the desired list of things to have.

So I guess we really have 3 problems here:
(1) double-weighted ALS;
(2) Bayesian optimization and cross-validation in an n-dimensional
hyperparameter space;
(3) feature extraction per a (preferably R-like) formula.


-d


On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo  wrote:

> +1 to glms
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Trevor Grant 
> Date: 02/17/2017 6:56 AM (GMT-08:00)
> To: dev@mahout.apache.org
> Subject: Re: Contributing an algorithm for samsara
>
> Jim is right, and I would take it one further and say, it would be best to
> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
> from there a Logistic regression is a trivial extension.
>
> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
> in neck first for both Jim and Saikat...
>
> MAHOUT-1928 and MAHOUT-1929
>
> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=
> project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%
> 20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%
> 20DESC%2C%20created%20ASC
>
> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
> in there.
>
> If you have an algorithm you are particularly intimate with, or explicitly
> need/want- feel free to open a JIRA and assign to yourself.
>
> There is also a case to be made for implementing the ALS...
>
> 1) It's a much better 'beginner' project.
> 2) Mahout has some world-class recommenders; a toy ALS implementation might
> help us think through how the other recommenders (e.g. CCO) will 'fit' into
> the framework. E.g. ALS being the toy-prototype recommender that helps us
> think through building out that section of the framework.
>
>
>
> Trevor Grant
> Data 

[jira] [Commented] (MAHOUT-1940) Provide a Java API to SimilarityAnalysis and any other needed APIs

2017-02-14 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866853#comment-15866853
 ] 

Dmitriy Lyubimov commented on MAHOUT-1940:
--

Normally, someone writing in Java does not really have to port anything 
from Scala. 
For example, Spark's Java APIs are in fact implemented in Scala. 

There are normally two ways of going about this: 
(1) write the API in Java and implement it in Scala (the way Spark does), or 
(2) write Java-compatible traits in Scala and then implement them in Scala as 
well (which is what I do, as it reduces complexity a bit). 

To take approach (2), the APIs must only use Java-compatible types. That is, 
no Scala libraries (such as collections) or incompatible language constructs 
(such as implicits, curried functions, generic context bounds, etc.). 
Implementing the API interfaces in Java just verifies this a bit better and 
allows avoiding a mixed build (which can sometimes be a problem due to 
circular dependencies between Java and Scala code).
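A minimal sketch of approach (2) — illustrative names only, not Mahout's actual SimilarityAnalysis signatures: the trait below restricts itself to Java-compatible types (primitive arrays, no implicits or Scala collections), so it compiles to a plain JVM interface that Java code can call or implement directly.

```scala
// A Java-compatible trait: Java sees a plain interface with the method
//   double similarity(double[] a, double[] b)
trait JavaFriendlySimilarity {
  def similarity(a: Array[Double], b: Array[Double]): Double
}

// ...and the implementation stays in Scala as well (approach (2)):
class CosineSimilarity extends JavaFriendlySimilarity {
  override def similarity(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.indices.map(i => a(i) * b(i)).sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }
}
```

From Java this is just `JavaFriendlySimilarity sim = new CosineSimilarity();` — no Scala runtime idioms leak into the call site.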


> Provide a Java API to  SimilarityAnalysis and any other needed APIs
> ---
>
> Key: MAHOUT-1940
> URL: https://issues.apache.org/jira/browse/MAHOUT-1940
> Project: Mahout
>  Issue Type: New Feature
>  Components: Algorithms, cooccurrence
>Reporter: James Mackey
>
> We want to port the functionality from 
> org.apache.mahout.math.cf.SimilarityAnalysis.scala to Java for easy 
> integration with a Java project we will be creating that derives a similarity 
> measure from the co-occurrence and cross-occurrence matrices. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [jira] [Updated] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-13 Thread Dmitriy Lyubimov
FWIW, I am not sure that shading fastutil is generally the best approach, as
it blows up the size of math-scala 10x, but I did verify that it worked with
CDH.


On Mon, Feb 13, 2017 at 10:49 AM, Andrew Palumbo (JIRA) <j...@apache.org>
wrote:

>
>  [ https://issues.apache.org/jira/browse/MAHOUT-1939?page=
> com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Andrew Palumbo updated MAHOUT-1939:
> ---
> Sprint: Jan/Feb-2017
>
> > fastutil version clash with spark distributions
> > ---
> >
> > Key: MAHOUT-1939
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> > Project: Mahout
> >  Issue Type: Bug
> >Reporter: Dmitriy Lyubimov
> >Priority: Blocker
> > Fix For: 0.13.0
> >
> >
> > Version difference in fast util breaks sparse algebra (specifically,
> RandomAccessSparseVector in assign, e.g., vec *= 5).
> > observed version in CDH:
> > file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/
> jars/fastutil-6.3.jar
> > mahout uses 7.0.12
> > java.lang.UnsupportedOperationException
> > at it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$
> BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> > at org.apache.mahout.math.RandomAccessSparseVector$
> RandomAccessElement.set(RandomAccessSparseVector.java:235)
> > at org.apache.mahout.math.VectorView$DecoratorElement.
> set(VectorView.java:181)
> > at org.apache.mahout.math.AbstractVector.assign(
> AbstractVector.java:536)
> > at org.apache.mahout.math.scalabindings.RLikeVectorOps.$
> div$eq(RLikeVectorOps.scala:45)
> > ...
>
>
>
>


Re: [jira] [Commented] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-13 Thread Dmitriy Lyubimov
I don't know exactly. The current, more or less recent, version seems to be
broken (with Spark 1.6.0).

On Sun, Feb 12, 2017 at 5:28 PM, Andrew Palumbo (JIRA) <j...@apache.org>
wrote:

>
> [ https://issues.apache.org/jira/browse/MAHOUT-1939?page=
> com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel=15863101#comment-15863101 ]
>
> Andrew Palumbo commented on MAHOUT-1939:
> 
>
> [~dlyubimov] How many CDH Spark versions does this break?  Is this a
> blocker for the upcoming release?  Seems so.
>
>
> > fastutil version clash with spark distributions
> > ---
> >
> > Key: MAHOUT-1939
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> >     Project: Mahout
> >  Issue Type: Bug
> >Reporter: Dmitriy Lyubimov
> >Priority: Critical
> >
> > Version difference in fast util breaks sparse algebra (specifically,
> RandomAccessSparseVector in assign, e.g., vec *= 5).
> > observed version in CDH:
> > file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/
> jars/fastutil-6.3.jar
> > mahout uses 7.0.12
> > java.lang.UnsupportedOperationException
> > at it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$
> BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> > at org.apache.mahout.math.RandomAccessSparseVector$
> RandomAccessElement.set(RandomAccessSparseVector.java:235)
> > at org.apache.mahout.math.VectorView$DecoratorElement.
> set(VectorView.java:181)
> > at org.apache.mahout.math.AbstractVector.assign(
> AbstractVector.java:536)
> > at org.apache.mahout.math.scalabindings.RLikeVectorOps.$
> div$eq(RLikeVectorOps.scala:45)
> > ...
>
>
>
>


[jira] [Comment Edited] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862045#comment-15862045
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1939 at 2/11/17 12:33 AM:


perhaps mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala.
like this (in mahout-math):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <artifactSet>
          <includes>
            <include>it.unimi.dsi:fastutil</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi.fastutil</pattern>
            <shadedPattern>shaded.it.unimi.dsi.fastutil</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>



was (Author: dlyubimov):
perhaps mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala.
like this (in mahout-math):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi.fastutil</pattern>
            <shadedPattern>shaded.it.unimi.dsi.fastutil</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>


> fastutil version clash with spark distributions
> ---
>
> Key: MAHOUT-1939
> URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> Project: Mahout
>  Issue Type: Bug
>    Reporter: Dmitriy Lyubimov
>Priority: Critical
>
> Version difference in fast util breaks sparse algebra (specifically, 
> RandomAccessSparseVector in assign, e.g., vec *= 5).
> observed version in CDH:
> 
> file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar
> mahout uses 7.0.12
> java.lang.UnsupportedOperationException
> at 
> it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> at 
> org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
> at 
> org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
> at 
> org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
> at 
> org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
> ...





[jira] [Comment Edited] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862045#comment-15862045
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1939 at 2/11/17 12:16 AM:


perhaps mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala.
like this (in mahout-math):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi.fastutil</pattern>
            <shadedPattern>shaded.it.unimi.dsi.fastutil</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>



was (Author: dlyubimov):
perhaps mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala.

> fastutil version clash with spark distributions
> ---
>
> Key: MAHOUT-1939
> URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> Project: Mahout
>  Issue Type: Bug
>    Reporter: Dmitriy Lyubimov
>Priority: Critical
>
> Version difference in fast util breaks sparse algebra (specifically, 
> RandomAccessSparseVector in assign, e.g., vec *= 5).
> observed version in CDH:
> 
> file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar
> mahout uses 7.0.12
> java.lang.UnsupportedOperationException
> at 
> it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> at 
> org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
> at 
> org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
> at 
> org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
> at 
> org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
> ...





[jira] [Commented] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862045#comment-15862045
 ] 

Dmitriy Lyubimov commented on MAHOUT-1939:
--

perhaps mahout should include fastutil in a shaded form in mahout-math or 
mahout-math-scala.

> fastutil version clash with spark distributions
> ---
>
> Key: MAHOUT-1939
> URL: https://issues.apache.org/jira/browse/MAHOUT-1939
> Project: Mahout
>  Issue Type: Bug
>    Reporter: Dmitriy Lyubimov
>Priority: Critical
>
> Version difference in fast util breaks sparse algebra (specifically, 
> RandomAccessSparseVector in assign, e.g., vec *= 5).
> observed version in CDH:
> 
> file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar
> mahout uses 7.0.12
> java.lang.UnsupportedOperationException
> at 
> it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
> at 
> org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
> at 
> org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
> at 
> org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
> at 
> org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
> ...





[jira] [Created] (MAHOUT-1939) fastutil version clash with spark distributions

2017-02-10 Thread Dmitriy Lyubimov (JIRA)
Dmitriy Lyubimov created MAHOUT-1939:


 Summary: fastutil version clash with spark distributions
 Key: MAHOUT-1939
 URL: https://issues.apache.org/jira/browse/MAHOUT-1939
 Project: Mahout
  Issue Type: Bug
Reporter: Dmitriy Lyubimov
Priority: Critical


Version difference in fast util breaks sparse algebra (specifically, 
RandomAccessSparseVector in assign, e.g., vec *= 5).

observed version in CDH:

file:/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.21/jars/fastutil-6.3.jar

mahout uses 7.0.12

java.lang.UnsupportedOperationException
at 
it.unimi.dsi.fastutil.ints.AbstractInt2DoubleMap$BasicEntry.setValue(AbstractInt2DoubleMap.java:146)
at 
org.apache.mahout.math.RandomAccessSparseVector$RandomAccessElement.set(RandomAccessSparseVector.java:235)
at 
org.apache.mahout.math.VectorView$DecoratorElement.set(VectorView.java:181)
at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:536)
at 
org.apache.mahout.math.scalabindings.RLikeVectorOps.$div$eq(RLikeVectorOps.scala:45)
...





Re: Intro from a lurker

2017-02-09 Thread Dmitriy Lyubimov
Jim, let me start by stating that it's an (unexpected, on my side) honor. Are
you willing to get hands-on with numerical problems at this point (or do you
have resources that can get hands-on)?

Short modern Mahout story (as short as it is possible to make it):

Most nagging problem: lack of support by industry and/or academia. We have
capable committers but less capable backers, in terms of willingness to
sanction contributions.

Current Mahout development goes two ways: (a) the platform (aka `samsara`);
and (b) useful, preferably end-to-end use-case scenarios, or just
methodology implementations. Note that while (b) is intended to use (a) (and
gain backend portability as a bonus), it is not strictly required, as long
as the backend-specific code can be fairly easily ported to other
backends. Still, if we come across a need for custom code, we try to
analyze whether it is something that might be a fairly common
abstraction, so we can add it to the list of formalisms we have in the
platform and avoid repetition in the future. A platform primer can be found
on the site; I won't get into that now.

In the platform, problem #1 currently is performance. Not that it
is generally bad, but some pieces are limited by the back-ends. We did some
in-memory work to integrate more performant backends, but the effort
is constrained by our immediate capacity to contribute, and the most
glaring issue (as one of the visitors duly noted in JIRA) is that the
distributed backends we are trying to run are severely limited in terms of
interconnected algebraic problems. We have ideas about what to do here, though.

It is precisely the distributed performance of interconnected numerical
problems on the current backends (Flink, Spark) that precludes Mahout from
being a pragmatic platform for implementing deep learning at scale, for
example. I suppose in-memory performance should be OK for that purpose once
we have added GPU support and DL-specific GPU primitives. The in-memory
improvements are not complete for everything that would be ideal, but there
has been some notable progress there.

With methodologies, well, there's no single most pressing problem; it
is really just defined by the pragmatic problem one has at hand. Currently,
Trevor does most of this outstanding work. It simply, and preferably,
should be more edgy than what most distributed packages offer.

E.g., decent-to-good Bayesian optimization for hyperparameters; or, say, I
have been suggesting for a few years that we experiment with LRFM
recommendation techniques, as they significantly expand the types of
predictors the method can take, and their treatment, compared to things like
CCO or implicit-feedback behavior-based recommenders. Another example:
there's no good coverage of clustering in terms of the _type_ of clustering
-- mixture, density, spectral, not just traditional centroid-type methods.
Visualization techniques, even ones as simple as 2-D density estimators for
big datasets, are also in demand. Generally speaking, industry has stepped
far ahead of what is commonly available in open source software in terms of
visualization approaches. Bottom line, the only guidance I see here is:
"don't be trivial; seek a unique value proposition". But the most guiding
principle so far has been people's pragmatism: "I have an actual production
use case and/or very specific requirements, I want to use methodology X for
it, and I can't seem to find it elsewhere under the management of a
distributed platform Y".
-d


On Thu, Feb 9, 2017 at 6:34 AM, Jim Jagielski  wrote:

>
> > On Feb 8, 2017, at 11:50 PM, Suneel Marthi  wrote:
> >
> > Curious JimJag,
> > Did some dude from CapitalOne poke u about Mahout
> >
>
> Not really, no...
>


[jira] [Resolved] (MAHOUT-1916) mahout bug

2017-01-26 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov resolved MAHOUT-1916.
--
Resolution: Invalid

> mahout bug
> --
>
> Key: MAHOUT-1916
> URL: https://issues.apache.org/jira/browse/MAHOUT-1916
> Project: Mahout
>  Issue Type: Bug
>Reporter: sarra sarra
>






Re: request to point out for my error

2017-01-11 Thread Dmitriy Lyubimov
so what's the error?

On Wed, Jan 11, 2017 at 10:02 AM, Andrew Palumbo  wrote:

> Welcome Cherryko.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Suneel Marthi 
> Date: 01/11/2017 9:30 AM (GMT-08:00)
> To: mahout 
> Cc: apailn...@gmail.com
> Subject: Fwd: request to point out for my error
>
> Forwarding to dev@
>
>
>
> -- Forwarded message --
> From: Cherry Ko >
> Date: Wed, Jan 11, 2017 at 12:24 PM
> Subject: request to point out for my error
> To: smar...@apache.org
>
>
> Dear Smarthi,
> May I introduce myself. I'm Cherryko, a Ph.D. candidate at UCSY, Myanmar.
> I'm studying Mahout Samsara and trying to propose an ML algorithm on
> Samsara, but I'm a beginner at Samsara. When I was running Samsara code, I
> got the following error. Although I tried as much as I could, I couldn't
> find the main error.
> I used Mahout version 0.12.3, Scala 2.10.4, Spark 1.5.2, and Hadoop 2.6.0.
> So, if you have time, please help me!
> I won't forget your kindness. I'm looking forward to your reply.
> When I run the Mahout Samsara code in IntelliJ, the following errors are
> seen:
>
>
>


Re: Newbie. Getting Started with Mahout. Found an issue in the documentation + Potential Fix

2016-12-21 Thread Dmitriy Lyubimov
We use something called Apache CMS.
We also host documentation via GitHub (the gh-pages branch).

On Sat, Dec 17, 2016 at 7:53 PM, Abhishek Goswami 
wrote:

> Hi Folks.
>
> I made some progress getting started with Jekyll. I am fairly new to
> Jekyll, so to start off I tried porting my existing blog from Wordpress.com
> to Jekyll. I was able to get that working.
>
> P.S. Here is the GitHub repo, which uses Jekyll; I used GitHub Pages to
> host it. Kindly take a look: https://abgoswam.github.io/
>
> I realized Jekyll has some importers for importing sites built with other
> frameworks to Jekyll. I used the wordpress.com importer for my trial (to
> import my wordpress.com blog to Jekyll). It worked out nicely.
>
> Could you shed some light on how the current Mahout web pages are hosted?
> Based on that we can figure out how to import the existing site to Jekyll,
> etc.
>
>
> Regards,
> Abhishek.
>
>
> On Thu, Dec 15, 2016 at 6:16 PM, Abhishek Goswami 
> wrote:
>
> > Hi Andrew.
> >
> > Thanks for your email. I am planning to spend time on this this weekend.
> >
> > Will keep you folks posted how my trails go, or if I hit any blockers
> etc.
> >
> > Regards
> > Abhishek.
> >
> > On Thu, Dec 15, 2016 at 3:34 PM Andrew Musselman <
> > andrew.mussel...@gmail.com> wrote:
> >
> >> Hey Abishek, just checking in to see how you're getting along; let us
> >> know if we can help.
> >>
> >> On Wed, Dec 14, 2016 at 5:55 PM, Andrew Palumbo 
> >> wrote:
> >>
> >> > +1
> >> >
> >> > Sent from my Verizon Wireless 4G LTE smartphone
> >> >
> >> >  Original message 
> >> > From: Trevor Grant 
> >> > Date: 12/13/2016 7:22 AM (GMT-08:00)
> >> > To: dev@mahout.apache.org
> >> > Subject: Re: Newbie. Getting Started with Mahout. Found an issue in
> >> > the documentation + Potential Fix
> >> >
> >> > So excited to see the web site moved over!!
> >> >
> >> > Thanks Abhishek!
> >> >
> >> > Trevor Grant
> >> > Data Scientist
> >> > https://github.com/rawkintrevo
> >> > http://stackexchange.com/users/3002022/rawkintrevo
> >> > http://trevorgrant.org
> >> >
> >> > *"Fortunate is he, who is able to know the causes of things." -Virgil*
> >> >
> >> > On Mon, Dec 12, 2016 at 7:07 PM, Andrew Musselman <
> >> > andrew.mussel...@gmail.com> wrote:
> >> >
> >> > > Great; nice meeting you at the meetup last week too :)
> >> > >
> >> > > On Mon, Dec 12, 2016 at 4:58 PM Abhishek Goswami <abgos...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Sounds good. Thanks Andrew. I will take a stab at it (using Jekyll
> >> > > > as Suneel suggested)
> >> > > >
> >> > > > On Mon, Dec 12, 2016 at 4:51 PM, Andrew Musselman <
> >> > > > andrew.mussel...@gmail.com> wrote:
> >> > > >
> >> > > > > Awesome, thanks for offering to help out! I think we'll all be
> >> > > > > learning as we go, so I'd say go ahead and get a sense of how
> >> > > > > much in the existing site would transfer over easily, and how
> >> > > > > much would need to be adjusted, etc.
> >> > > > >
> >> > > > > As you progress you can open a pull request so we can guide/help
> >> > > > > out.
> >> > > > >
> >> > > > > On Mon, Dec 12, 2016 at 3:34 PM Abhishek Goswami <
> >> > > > > abgos...@gmail.com> wrote:
> >> > > > >
> >> > > > > > Thanks Suneel. Yes that sounds reasonable to have the web site
> >> > > > > > as part of the source github.
> >> > > > > >
> >> > > > > > I have forked the project into my local repo:
> >> > > > > > https://github.com/abgoswam/mahout
> >> > > > > >
> >> > > > > > Looking forward to getting some tips on the next steps.
> >> > > > > >
> >> > > > > > Regards,
> 

[jira] [Commented] (MAHOUT-1856) Create a framework for new Mahout Clustering, Classification, and Optimization Algorithms

2016-12-21 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768521#comment-15768521
 ] 

Dmitriy Lyubimov commented on MAHOUT-1856:
--

One thing -- we usually squash working branches before merging a PR to master, 
so that we preferably have one commit per issue. This is much easier to manage 
(and to hot-fix stuff later if needed).

> Create a framework for new Mahout Clustering, Classification, and 
> Optimization  Algorithms
> --
>
> Key: MAHOUT-1856
> URL: https://issues.apache.org/jira/browse/MAHOUT-1856
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.1
>Reporter: Andrew Palumbo
>Assignee: Trevor Grant
>Priority: Critical
> Fix For: 0.13.0
>
>
> To ensure that Mahout does not become "a loose bag of algorithms", create 
> basic traits with functions common to each class of algorithm. 





Re: Git branching policy

2016-12-16 Thread Dmitriy Lyubimov
We work with a much simpler PR model, which I don't think this fits very well.

I take it PRs would have to be the "feature" branches and would have to be
posted against that develop branch instead of master. This would complicate
things unnecessarily, IMO. It is probably an OK model for tightly knit teams,
but for OS I would keep things very simple, i.e., what GitHub already does.

In other words, we only manage master, PRs, and minor release branches.

For new features, which are the majority, we don't care how devs come up
with the PRs, and we merge them to master, which is really our accumulator
for the next major release.

For hot fixes, we PR them against master as well, but we merge them to both
master and the release branch for the next minor hot fix.

At least that's how the majority of Apache projects on GitHub handle it, AFAIK.

The crucial point is to always PR against master rather than some obscure
special label. It would be very difficult to work with contributors in any
way other than the fork-off-master / PR-against-master cycle.


On Thu, Dec 15, 2016 at 8:16 AM, Pat Ferrel  wrote:

> I have changes in the master that are needed for some users of Mahout.
> However the master is often chaotic due to being the branch that is the
> SNAPSHOT of all partial or not well tested changes. The key feature of the
> branching model described in the blog is that master is stable and contains
> only the last release and fixes of features considered critical to users.
> The SNAPSHOT is in a branch called “develop". We have used this method in
> PredictionIO for years and it is quite nice. The older method used in
> Mahout leaves me in the situation of recommending that users pull a
> specific commit rather than the master. With this new model we would be
> able to tag and rename artifacts and any pull from master should be far
> more stable than it is now.
>
> I’d like to propose we use this model for future development. The primary
> difference would be that the current master would become “develop” and the
> master would contain the last stable release plus any “hotfix”es. The
> changes I made to support a user would be proposed for inclusion in the
> “hotfix” branch and merged with master and develop. Master would get a tag
> and perhaps new artifact name like 0.12.2-hotfix.1 and would be documented
> as a source only “hotfix” it would not require a full maven-verse binary
> release. The benefit to all users is a stable master and a way to get
> changes in a sane way with reliable artifact names.
>
> http://nvie.com/posts/a-successful-git-branching-model/ <
> http://nvie.com/posts/a-successful-git-branching-model/>
>
> Thoughts?


[jira] [Commented] (MAHOUT-1892) Can't broadcast vector in Mahout-Shell

2016-11-15 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668009#comment-15668009
 ] 

Dmitriy Lyubimov commented on MAHOUT-1892:
--

Shell is a mystery. Obviously it tries to drag A itself into the mapBlock 
closure, but why is escaping me.

What happens if we remove the implicit conversion (i.e. use bcastV.value 
explicitly inside the closure)? Is it still happening?
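The suspicion above — that the enclosing object gets dragged into the closure through a field or implicit reference — can be reproduced outside Mahout. This is a generic sketch (illustrative names, no Mahout/Spark dependency) of the dereference-into-a-local-val workaround, with plain JVM serialization standing in for Spark's ClosureCleaner check:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// True iff obj survives plain Java serialization.
def isSerializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }

// Plays the role of the shell's line object holding A; deliberately NOT
// Serializable.
class Holder(val data: Array[Double]) {
  // Bad: the lambda reads the field through `this`, dragging Holder along.
  def badClosure: Int => Double = i => data(i)

  // Good: copy the field into a local val first; only the array is captured.
  // (Analogous to doing `val v = bcastV.value` before the mapBlock closure.)
  def goodClosure: Int => Double = {
    val local = data
    i => local(i)
  }
}
```

If the shell problem follows this pattern, dereferencing `bcastV.value` into a local val outside the `mapBlock` body should make the closure serializable.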

> Can't broadcast vector in Mahout-Shell
> --
>
> Key: MAHOUT-1892
> URL: https://issues.apache.org/jira/browse/MAHOUT-1892
> Project: Mahout
>  Issue Type: Bug
>Reporter: Trevor Grant
>
> When attempting to broadcast a Vector in Mahout's spark-shell with `mapBlock` 
> we get serialization errors.  **NOTE** scalars can be broadcast without issue.
> I did some testing in the "Zeppelin Shell" for lack of a better term.  See 
> https://github.com/apache/zeppelin/pull/928
> The same `mapBlock` code I ran in the spark-shell below also generated 
> errors.  However, wrapping a mapBlock into a function in a compiled jar 
> https://github.com/apache/mahout/pull/246/commits/ccb5da65330e394763928f6dc51d96e38debe4fb#diff-4a952e8e09ae07e0b3a7ac6a5d6b2734R25
>  and then running said function from the Mahout Shell or in the "Zeppelin 
> Shell" (using Spark or Flink as a runner) works fine.  
> Consider
> ```
> mahout> val inCoreA = dense((1, 2, 3), (3, 4, 5))
> val A = drmParallelize(inCoreA)
> val v: Vector = dvec(1,1,1)
> val bcastV = drmBroadcast(v)
> val drm2 = A.mapBlock() {
> case (keys, block) =>
> for(row <- 0 until block.nrow) block(row, ::) -= bcastV
> keys -> block
> }
> drm2.checkpoint()
> ```
> Which emits the stack trace:
> ```
> org.apache.spark.SparkException: Task not serializable
> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
> at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
> at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
> at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
> at org.apache.spark.rdd.RDD.map(RDD.scala:317)
> at org.apache.mahout.sparkbindings.blas.MapBlock$.exec(MapBlock.scala:33)
> at org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:338)
> at org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:116)
> at org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:41)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:78)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:88)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:90)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:92)
> at $iwC$$iwC$$iwC$$iwC$$iw

Re: [DISCUSS] More meaningful error when running on Spark 2.0

2016-11-15 Thread Dmitriy Lyubimov
+1 on version checking.
And there's a little bug as well: this error is technically generated by
something like

dense(Set.empty[Vector]),

i.e., it cannot form a matrix out of an empty collection of vectors. While
this is true, I suppose it needs a `require(...)` insert there to generate
a more meaningful response instead of letting Scala complain about an
empty collection.

-d
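A minimal sketch of the kind of guard meant here (the toy `dense` below is illustrative only, not Mahout's actual scalabindings code):

```scala
// Illustrative fail-fast guard: reject an empty row collection up front
// instead of letting Scala's collections throw "next on empty iterator" later.
def dense(rows: Seq[Array[Double]]): Array[Array[Double]] = {
  require(rows.nonEmpty, "dense() needs at least one row vector; got an empty collection")
  val ncol = rows.head.length
  require(rows.forall(_.length == ncol), "all rows must have the same length")
  rows.toArray
}
```

With the guard in place, `dense(Seq.empty)` throws an `IllegalArgumentException` with a readable message rather than the opaque iterator error quoted in this thread.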


On Mon, Nov 14, 2016 at 7:32 AM, Andrew Palumbo  wrote:

> +1
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Trevor Grant 
> Date: 11/14/2016 6:49 AM (GMT-08:00)
> To: dev@mahout.apache.org
> Subject: [DISCUSS] More meaningful error when running on Spark 2.0
>
> Hi,
>
> currently when running on Spark 2.0 the user will hit some sort of error,
> one such error is:
>
> java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
> at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
> at org.apache.mahout.math.scalabindings.package$$anonfun$1.apply(package.scala:155)
> at org.apache.mahout.math.scalabindings.package$$anonfun$1.apply(package.scala:133)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.mahout.math.scalabindings.package$.dense(package.scala:133)
> at org.apache.mahout.sparkbindings.SparkEngine$.drmSampleKRows(SparkEngine.scala:289)
> at org.apache.mahout.math.drm.package$.drmSampleKRows(package.scala:149)
> at org.apache.mahout.math.drm.package$.drmSampleToTSV(package.scala:165)
> ... 58 elided
>
> With the recent Zeppelin-Mahout integration, there are going to be a lot of
> users unknowingly attempting to run Mahout on Spark 2.0.  I think it
> would be simple to implement, yet save a lot of time on the Zeppelin and
> Mahout mailing lists, to do something like:
>
> if sc.version > 1.6.2 then:
>     error("Spark version ${sc.version} isn't supported. Please see
> MAHOUT-... (appropriate jira info)")
>
> I'd like to put something together and, depending on how many issues people
> have on the Zeppelin list, be prepared to do a hotfix on 0.12.2 if it becomes
> prudent. Everyone complaining that Zeppelin doesn't work because of
> some mystical error is bad PR. It DOES say in the notebook and elsewhere
> that we're not 2.0 compliant; however, one of the advantages/drawbacks of
> Zeppelin is that, without having to really know what you're doing, you can
> get a functional local cluster of Flink, Spark, etc. all going.
>
> So we could easily have a situation where someone read none of the docs and is
> whining. Surely few if any would ever do such a thing, but I still think it's a
> prudent fix to have in the back pocket.
>
> tg
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
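The proposed check can be sketched as a small version gate; the cutoff, message wording and JIRA reference below are placeholders, not the actual fix:

```scala
// Hedged sketch of a version gate; assumes only that the context exposes
// a "major.minor.patch" version string (as SparkContext.version does).
def requireSupportedSpark(version: String): Unit = {
  val Array(major, minor) = version.split("\\.").take(2).map(_.toInt)
  // Supported: Spark 1.6.x and earlier, per the discussion above.
  val supported = major < 1 || (major == 1 && minor <= 6)
  require(supported,
    s"Spark version $version isn't supported. Please see MAHOUT-... for details.")
}

requireSupportedSpark("1.6.2")   // passes silently
// requireSupportedSpark("2.0.1") would throw IllegalArgumentException
```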


[jira] [Comment Edited] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546663#comment-15546663
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1884 at 10/4/16 9:10 PM:
---

drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.

Loading into memory would happen anyway. Caching is not necessary -- but it is 
not guaranteed not to happen; there's no such contract. 

Materially it only makes any difference if the input is larger than available 
cluster capacity, which I have yet to encounter, as algebraic tasks are CPU and 
IO bound, but not memory bound. Usually we run out of IO and CPU much sooner 
than we run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We 
don't have an explicit caching API except for checkpoint "hints", but even that 
is only a hint, not guaranteed. Giving it some heuristics about a dataset 
doesn't guarantee that it won't compute others, or won't cache or sample for 
some other reason, now or in the future. 

This situation is fine, as it is one of the functions of the optimizer, as much 
as choosing degrees of parallelization, product task sizes or operators to 
execute. Making those choices automatically is, actually, the point. As long as 
the optimizer does right enough things, that should be ok. 

Bottom line, I don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I do not see a tangible benefit either. There's possibly only 
a slight benefit right now (no no-cache or no-sample guarantee), which will 
likely only decrease in the future. I am fine with it as long as it's 
understood there's no "no-cache" contract anywhere.



was (Author: dlyubimov):
drmWrap is not internal in the least (which is why it is not package-private). 
it is public and intended for plugging external general sources into input 
barrier of the optimizer/

loading in memory would happen anyway. Caching is not necessarily -- but it is 
not guaranteed not to happen, there's no such contract. 

Materially it only makes any difference if the input is larger than avaialble 
cluster capacity. Which is I am yet to encounter as algebraic tasks are CPU and 
io bound, but not memory. Usually we run out of IO and CPU much sooner that we 
run out of memory, which makes this situation pragmatically unrealistic. 

note that optimizer should --and will -- retain control over caching. we don't 
have explicit caching api except for checkpoint "hints" but even that is only a 
hint, not guaranteed. Giving it some heuristics about dataset doesn't guarantee 
that it won't compute others or won't cache or sample for some other reason, 
now or in the future. 

This siutation is fine as it is one of the function of optimizer, as much as 
choosing degrees of parallelization, product task sizes or operators to 
execute. Making those choices automatically is, actually, the point. As long as 
optimizer does right enough things, that should be ok. 

Bottom line, i don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I do not see a tangible benefit either. There's possibly only 
a slight benefit right now (no no-cache or no-sample guarantee), which likely 
only decrease in the future. I am fine with it as understood there's no 
"no-cache" contract anywhere.


> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
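For context, the drmWrap() route referred to in this thread looks roughly like the following; the parameter names and defaults are quoted from memory and may differ between releases, so treat them as assumptions and check the sparkbindings source:

```scala
// Sketch: wrapping an existing RDD as a DRM while supplying known geometry,
// so the optimizer does not need to scan the data to discover nrow/ncol.
val drmA = drmWrap(rdd = myRdd, nrow = 1000000L, ncol = 50)
```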


[jira] [Commented] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-03 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543437#comment-15543437
 ] 

Dmitriy Lyubimov commented on MAHOUT-1884:
--



Which API is this about, specifically?

Wrapping an existing RDD (the drmWrap() API) supports this. 

Also note that for DRMs off disk, these are one-pass computations that cost no 
more than RDD.count(). Since for any dataset we call dfsRead() on, the obvious 
intent is to use it, so loading & caching does no harm, as that's what would 
happen anyway.

Also, matrix dimensions are the most obvious heuristics, but not everything the 
optimizer may need to analyze about the dataset (lazily). There are more 
heuristics about datasets that drmWrap() accepts (and even more that it 
doesn't). 

If we are talking about cases where drmWrap() cannot be used for some reason, 
we probably should request metadata equivalent to what drmWrap() does, not just 
ncol, nrow.

> Allow specification of dimensions of a DRM
> --
>
> Key: MAHOUT-1884
> URL: https://issues.apache.org/jira/browse/MAHOUT-1884
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.12.2
>Reporter: Sebastian Schelter
>Assignee: Sebastian Schelter
>Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Created] (MAHOUT-1884) Allow specification of dimensions of a DRM

2016-10-03 Thread Dmitriy Lyubimov
this has been covered by the drmWrap() signature from the very beginning.
I vote this a non-issue.

On Sun, Oct 2, 2016 at 11:51 PM, Sebastian Schelter (JIRA) 
wrote:

> Sebastian Schelter created MAHOUT-1884:
> --
>
>  Summary: Allow specification of dimensions of a DRM
>  Key: MAHOUT-1884
>  URL: https://issues.apache.org/jira/browse/MAHOUT-1884
>  Project: Mahout
>   Issue Type: Improvement
> Affects Versions: 0.12.2
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Priority: Minor
>
>
> Currently, in many cases, a DRM must be read to compute its dimensions
> when a user calls nrow or ncol. This also implicitly caches the
> corresponding DRM.
>
> In some cases, the user actually knows the matrix dimensions (e.g., when
> the matrices are synthetically generated, or when some metadata about them
> is known). In such cases, the user should be able to specify the dimensions
> upon creating the DRM and the caching should be avoided.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


Re: Hi, How could I get involved into mahout?

2016-09-26 Thread Dmitriy Lyubimov
Do you want to approach these problems from a mostly algebraic angle vs.,
e.g., a graph-based solution?

On Wed, Sep 21, 2016 at 10:08 PM, Tiramisu Ling <saberge...@gmail.com>
wrote:

> Hi Dmitriy,
>
> Thank you for your reply! I'm a postgraduate student of computer science
> and the research direction of mine is Deep learning. And the focus point of
> my research is use DBN to do the link(between network node) prediction,
> which is the major reason makes want to get involved into mahout and do
> some contribution. Most of my program knowledge is about Python and Matlab
> and, honestly, I only have basic level of Java programing skill. But I
> believe I could learn more about how to use Java by reading the codebase of
> mahout, trust me ;).
>
> Best Regards,
> MikeLing
>
> 2016-09-22 6:12 GMT+08:00 Dmitriy Lyubimov <dlie...@gmail.com>:
>
> > ps another way to approach it, which in fact seems to be most common
> > motivator here, is to start with a pragmatic problem one already has at
> > hand. Abstract tinkering  rarely produces strategically useful
> > contributions, it seems.
> >
> > On Wed, Sep 21, 2016 at 3:09 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > if you can tell us about your background a little bit, perhaps we could
> > > have ideas. frankly we have a pretty sprawling roadmap. At least a set
> of
> > > ideas. It's frankly more than we can realistically do, we can use help,
> > yes.
> > >
> > > On Sat, Sep 17, 2016 at 8:52 AM, Tiramisu Ling <saberge...@gmail.com>
> > > wrote:
> > >
> > >> Hey everyone, I'm new to mahout and I would like to contribute to it.
> In
> > >> general, I had read the how to contribute page in [1], and I had clone
> > the
> > >> repo from github. So what should I do next? Are there any issue like
> > 'good
> > >> first bug' to work with? Thank you very much!:)
> > >>
> > >> [1]http://mahout.apache.org/developers/how-to-contribute.html
> > >>
> > >> Best Regards,
> > >> MikeLing
> > >>
> > >
> > >
> >
>


Re: Hi, How could I get involved into mahout?

2016-09-21 Thread Dmitriy Lyubimov
ps another way to approach it, which in fact seems to be most common
motivator here, is to start with a pragmatic problem one already has at
hand. Abstract tinkering  rarely produces strategically useful
contributions, it seems.

On Wed, Sep 21, 2016 at 3:09 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> if you can tell us about your background a little bit, perhaps we could
> have ideas. frankly we have a pretty sprawling roadmap. At least a set of
> ideas. It's frankly more than we can realistically do, we can use help, yes.
>
> On Sat, Sep 17, 2016 at 8:52 AM, Tiramisu Ling <saberge...@gmail.com>
> wrote:
>
>> Hey everyone, I'm new to mahout and I would like to contribute to it. In
>> general, I had read the how to contribute page in [1], and I had clone the
>> repo from github. So what should I do next? Are there any issue like 'good
>> first bug' to work with? Thank you very much!:)
>>
>> [1]http://mahout.apache.org/developers/how-to-contribute.html
>>
>> Best Regards,
>> MikeLing
>>
>
>


Re: Hi, How could I get involved into mahout?

2016-09-21 Thread Dmitriy Lyubimov
if you can tell us about your background a little bit, perhaps we could
have ideas. frankly we have a pretty sprawling roadmap. At least a set of
ideas. It's frankly more than we can realistically do, we can use help, yes.

On Sat, Sep 17, 2016 at 8:52 AM, Tiramisu Ling  wrote:

> Hey everyone, I'm new to mahout and I would like to contribute to it. In
> general, I had read the how to contribute page in [1], and I had clone the
> repo from github. So what should I do next? Are there any issue like 'good
> first bug' to work with? Thank you very much!:)
>
> [1]http://mahout.apache.org/developers/how-to-contribute.html
>
> Best Regards,
> MikeLing
>


Re: Recommenders and MABs

2016-09-21 Thread Dmitriy Lyubimov
there's been a great blog post on that on the RichRelevance engineering
blog... but I have a vague feeling, based on what you are saying, that it may
all be old news to you...

[1] http://engineering.richrelevance.com/bandits-recommendation-systems/
and there's more in the series

On Sat, Sep 17, 2016 at 3:10 PM, Pat Ferrel  wrote:

> I’ve been thinking about how one would implement an application that only
> shows recommendations. This is partly because people want to build such
> things.
>
> There are many problems with this including cold start and overfit.
> However these problems also face MABs and are solved with sampling schemes.
> So imagine that you have several models from which to draw recommendations:
> 1) CF based recommender, 2) random recommendations, 3) popular recs (by
> some measure). If we look at each individual as facing an MAB with a
> sampling algo trained by them to pull recs from the 3 (or more) arms. This
> implies an MAB per user.
>
> The very first visit to the application would randomly draw from the
> choices and since there is no user data the recs engine would have to be
> able to respond (perhaps with random recs) the same would have to be true
> of the popular model (returning random), and random is always happy. The
> problem with this is that none of the arms are completely independent and
> the model driving each arm will change over time.
>
> The first time a user visits will result in a new MAB for them and will
> randomly draw from all arms but may get better responses from popular (with
> no user specific data yet in the system for cf). So the sampling will start
> to favor popular but will still explore other methods. When enough data is
> accumulated to start making good recs, the recommender will start to
> outperform popular and will get more of the user’s reinforcement.
>
> This seems to work with several unanswered questions and one problem to
> avoid—overfit. We would need a sampling method that would never fully
> converge or the user would never get a chance to show their
> expanding/changing preferences. The cf recommender will also overfit if
> non-cf items are not mixed in. Of the sampling methods I’ve seen for MABs,
> Greedy will not work but even  with some form of Bayesian/Thompson sampling
> the question is how to parameterize the sampling. With too little
> convergence we get sub-optimal exploit but we get the same with too much
> convergence and this will also overfit the cf recs.
>
> I imagine we could train a meta-model on the mature explore amount by
> trying different parameterization and finding if there is one answer for
> all or we could resort to heuristic rules—even business rules.
>
> If anyone has read this far, any ideas or comments?


Re: Machine Learning algorithm implementation

2016-09-21 Thread Dmitriy Lyubimov
We primarily think in a platform-independent, R-like way now.
http://mahout.apache.org/users/sparkbindings/home.html

We hope it should be a good news for algebraic algorithm implementers like
you.

Samsara is mapped onto Spark, Flink and H2O as it stands (no MapReduce; you
are correct in that).

We recognize that the existing set of optimized algebra operators may not
always be enough, and so we expect that some part of an algorithm can be
done for a particular backend, but we usually hope it is abstracted enough
so that non-samsara parts can then be ported to other backends if need be.

-d
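For a flavor of what "platform-independent, R-like" means, a canonical Samsara snippet looks like the following (mirroring the style of the docs linked above):

```scala
// The same DSL expression runs on the Spark, Flink or H2O bindings:
// it builds a logical plan, which each backend's optimizer compiles
// down to physical operators.
val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5)))
val gram = (drmA.t %*% drmA).collect  // distributed A'A, gathered in-core
```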


On Tue, Sep 20, 2016 at 8:05 PM, María José Basgall 
wrote:

> Hi all,
> I am a doctorate student in Computer Science and we are developing a
> Self-Organizing Map (SOM) algorithm on MapReduce. I want to know about what
> ML algorithm implementation is missing, because we want to make a
> contribution to this project.
> We checked out this page: https://mahout.apache.org/user
> s/basics/algorithms.html and we figured out that the most of algorithms in
> the MapReduce column are deprecated, what is the reason for it? Do we need
> to think in Spark instead of MapReduce implementations?
>
> Thanks,
> MJ
>


Re: Mahout distro Size

2016-09-06 Thread Dmitriy Lyubimov
I dunno. They build a shaded assembly artifact, it seems, and are happy with
this approach. It would seem we'd just need to handle the legacy deps in a
similar way.

On Tue, Sep 6, 2016 at 4:48 PM, Andrew Palumbo <ap@outlook.com> wrote:

> bq.
>
> 4: other projects do something too. spark (at least it used to) to produce
> tons of lib-managed deps as the result of its build, they probably still
> have?
>
>
> Do you mean using something like Spark's dependency resolver?
>
> ________
> From: Dmitriy Lyubimov <dlie...@gmail.com>
> Sent: Tuesday, September 6, 2016 4:46:24 PM
> To: dev@mahout.apache.org
> Subject: Re: Mahout distro Size
>
> 2 + 1
> 3 + 1
>
> 4: other projects do something too. spark (at least it used to) to produce
> tons of lib-managed deps as the result of its build, they probably still
> have?
>
> On the other hand, the samsara only dependencies are really light. backends
> are really always "provided", and the rest of it is fairly small enough not
> to be an issue either way.  but we probably definitely should drop local
> support for MR stuff (MR local mode didn't work correctly anyway, last time
> I checked)
>
> On Tue, Sep 6, 2016 at 1:33 PM, Andrew Palumbo <ap@outlook.com> wrote:
>
> > The current apache-mahout-distribution-0.12.2.tar.gz<http://mirror.
> > stjschools.org/public/apache/mahout/0.12.2/apache-mahout-
> > distribution-0.12.2.tar.gz> is 224M. we need to look for ways to get this
> > size down.
> >
> >   1.  A few Possibilities:
> >
> >   2.  Drop h2o (binary only) from Distro? (18M - unused)
> >
> >   3.  MAHOUT-1865<https://issues.apache.org/jira/browse/MAHOUT-1865>:
> > Remove Hadoop 1 support. could save us some space.
> >
> >   4.  MAHOUT-1706<https://issues.apache.org/jira/browse/MAHOUT-1706>:
> > Remove dependency jars from /lib in mahout binary distribution. Should
> also
> > save space.
> >
> >   5.  Having dropped support for MAHOUT_LOCAL we can now likely set a lot
> > of dependencies to provided scope, we can revisit: MAHOUT-1705<
> > https://issues.apache.org/jira/browse/MAHOUT-1705>: Verify dependencies
> > in job jar for mahout-examples.
> >
> >  *   16M./lib/hadoop
> >
> >  *   85M./lib/
> >
> > *   Many of the jars in /lib/ and possibly /lib/hadoop are
> already
> > packaged into the mahout-examples jar and adding them to the classpath
> from
> > /lib/ is therefore redundant. As well many may be provided.
> >
>


Re: Mahout distro Size

2016-09-06 Thread Dmitriy Lyubimov
PS i probably should not say "probably definitely" next to each other.
Definitely just definitely :)

On Tue, Sep 6, 2016 at 1:46 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> 2 + 1
> 3 + 1
>
> 4: other projects do something too. spark (at least it used to) to produce
> tons of lib-managed deps as the result of its build, they probably still
> have?
>
> On the other hand, the samsara only dependencies are really light.
> backends are really always "provided", and the rest of it is fairly small
> enough not to be an issue either way.  but we probably definitely should
> drop local support for MR stuff (MR local mode didn't work correctly
> anyway, last time I checked)
>
> On Tue, Sep 6, 2016 at 1:33 PM, Andrew Palumbo <ap@outlook.com> wrote:
>
>> The current apache-mahout-distribution-0.12.2.tar.gz<http://mirror.stjsc
>> hools.org/public/apache/mahout/0.12.2/apache-mahout-distribu
>> tion-0.12.2.tar.gz> is 224M. we need to look for ways to get this size
>> down.
>>
>>   1.  A few Possibilities:
>>
>>   2.  Drop h2o (binary only) from Distro? (18M - unused)
>>
>>   3.  MAHOUT-1865<https://issues.apache.org/jira/browse/MAHOUT-1865>:
>> Remove Hadoop 1 support. could save us some space.
>>
>>   4.  MAHOUT-1706<https://issues.apache.org/jira/browse/MAHOUT-1706>:
>> Remove dependency jars from /lib in mahout binary distribution. Should also
>> save space.
>>
>>   5.  Having dropped support for MAHOUT_LOCAL we can now likely set a lot
>> of dependencies to provided scope, we can revisit: MAHOUT-1705<
>> https://issues.apache.org/jira/browse/MAHOUT-1705>: Verify dependencies
>> in job jar for mahout-examples.
>>
>>  *   16M./lib/hadoop
>>
>>  *   85M./lib/
>>
>> *   Many of the jars in /lib/ and possibly /lib/hadoop are
>> already packaged into the mahout-examples jar and adding them to the
>> classpath from /lib/ is therefore redundant. As well many may be provided.
>>
>
>


Re: Mahout distro Size

2016-09-06 Thread Dmitriy Lyubimov
2 + 1
3 + 1

4: other projects do something too. spark (at least it used to) to produce
tons of lib-managed deps as the result of its build, they probably still
have?

On the other hand, the samsara only dependencies are really light. backends
are really always "provided", and the rest of it is fairly small enough not
to be an issue either way.  but we probably definitely should drop local
support for MR stuff (MR local mode didn't work correctly anyway, last time
I checked)

On Tue, Sep 6, 2016 at 1:33 PM, Andrew Palumbo  wrote:

> The current apache-mahout-distribution-0.12.2.tar.gz<http://mirror.
> stjschools.org/public/apache/mahout/0.12.2/apache-mahout-
> distribution-0.12.2.tar.gz> is 224M. we need to look for ways to get this
> size down.
>
>   1.  A few Possibilities:
>
>   2.  Drop h2o (binary only) from Distro? (18M - unused)
>
>   3.  MAHOUT-1865<https://issues.apache.org/jira/browse/MAHOUT-1865>:
> Remove Hadoop 1 support. could save us some space.
>
>   4.  MAHOUT-1706<https://issues.apache.org/jira/browse/MAHOUT-1706>:
> Remove dependency jars from /lib in mahout binary distribution. Should also
> save space.
>
>   5.  Having dropped support for MAHOUT_LOCAL we can now likely set a lot
> of dependencies to provided scope, we can revisit: MAHOUT-1705<
> https://issues.apache.org/jira/browse/MAHOUT-1705>: Verify dependencies
> in job jar for mahout-examples.
>
>  *   16M./lib/hadoop
>
>  *   85M./lib/
>
> *   Many of the jars in /lib/ and possibly /lib/hadoop are already
> packaged into the mahout-examples jar and adding them to the classpath from
> /lib/ is therefore redundant. As well many may be provided.
>
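A hedged sketch of the "provided" scoping idea from this thread (hypothetical coordinates; the actual Mahout POM entries may differ):

```xml
<!-- Marking a backend dependency "provided" keeps it on the compile
     classpath but out of the binary distribution, since the target
     cluster is expected to supply it at runtime. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
```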


Re: What's the ETA for next Mahout release

2016-08-18 Thread Dmitriy Lyubimov
As a rule of thumb, we usually don't update dependencies in maintenance
releases. I would say it should go with 0.13.
Do you have a sense of urgency for a release? Can you use the PR branch
meanwhile?

On Thu, Aug 18, 2016 at 10:19 AM, Raviteja Lokineni <
raviteja.lokin...@gmail.com> wrote:

> Can we call a vote for a maintenance release with the lucene compatibility
> upgraded to 5.x from 4.x.
>
> I am assuming it should deserve it's own release. What do the other devs
> think about a maintenance release.
>
> On Thu, Aug 18, 2016 at 1:07 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > There's no consensus. There are a couple features coming that would
> inspire
> > 0.13 but it's hard to tell when they are going to be completed due to
> > uncertainties in contributor's schedule.
> >
> > current estimate september-october for the 0.13.
> >
> > Maintenance releases could be cut pretty much at will (assuming there are
> > compelling bugs found).
> >
> > On Wed, Aug 17, 2016 at 12:05 PM, Raviteja Lokineni <
> > raviteja.lokin...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Just wanted to check if there is any ETA planned out for the next
> Mahout
> > > release. Either 0.12.3 or 0.13.
> > >
> > > Thanks,
> > > --
> > > *Raviteja Lokineni* | Business Intelligence Developer
> > > TD Ameritrade
> > >
> > > E: raviteja.lokin...@gmail.com
> > >
> > > [image: View Raviteja Lokineni's profile on LinkedIn]
> > > <http://in.linkedin.com/in/ravitejalokineni>
> > >
> >
>
>
>
> --
> *Raviteja Lokineni* | Business Intelligence Developer
> TD Ameritrade
>
> E: raviteja.lokin...@gmail.com
>
> [image: View Raviteja Lokineni's profile on LinkedIn]
> <http://in.linkedin.com/in/ravitejalokineni>
>


Re: What's the ETA for next Mahout release

2016-08-18 Thread Dmitriy Lyubimov
There's no consensus. There are a couple of features coming that would inspire
0.13 but it's hard to tell when they are going to be completed due to
uncertainties in contributors' schedules.

current estimate september-october for the 0.13.

Maintenance releases could be cut pretty much at will (assuming there are
compelling bugs found).

On Wed, Aug 17, 2016 at 12:05 PM, Raviteja Lokineni <
raviteja.lokin...@gmail.com> wrote:

> Hi all,
>
> Just wanted to check if there is any ETA planned out for the next Mahout
> release. Either 0.12.3 or 0.13.
>
> Thanks,
> --
> *Raviteja Lokineni* | Business Intelligence Developer
> TD Ameritrade
>
> E: raviteja.lokin...@gmail.com
>
> [image: View Raviteja Lokineni's profile on LinkedIn]
> 
>


Re: Traits for a mahout algorithm Library.

2016-07-21 Thread Dmitriy Lyubimov
On Thu, Jul 21, 2016 at 12:35 PM, Trevor Grant 
wrote:

>
>
> Finally, re data-frames.  Why not leave it as vectors and matrices?
>

Short answer: because (imo) data frames are not vectors and matrices.

Longer argumentation:

Some capabilities expected of data frames are as follows.

DFs are columnar tables where columns are either named vectors or named
factors (in R sense).

Also, operationally DFs usually lean more toward providing relational
algebra capabilities (joins etc.) than numerical algebra (blas3).

A factor (or, perhaps a better term, a categorical feature) is
fundamentally non-numerical data. It's a representation of categorical
data which could be bounded or unbounded in the number of categories.

Furthermore, there is more than one way to vectorize a factor or a group
of factors, which is what formulas and other mechanisms are for.

Now you might view all these formulas, factors and hash tricks as feature
preparation activity and say that learning process is not bothered by that.
In the end, every fitting is essentially working on a numerical input.

That, unfortunately, may not be quite true.

Model search (step-wise GLM, for example) is not necessarily a
numerical-only thing since it essentially manages factor vectorization.

That said, i think we can safely say that individual learner could be a
numerical-only thing. But as soon as we go up the chain to transformations,
vectorizations and searching for parameters of vectorizations, dataframes
are usually input sources for all those.

An excellent example of those concerns (which failed to get properly
architected in another OSS project) is the implicit feedback recommender.

In fact, there are two problems here -- one is parameterized feature
extraction and another is fitting the decomposition.

Each of the problems has its own parameters. In the vanilla paper
implementation there were two suggested ways of feature extraction that
offered one parameter each, which were then suggested to be searched for via
CV along with the fitter hyperparameters (learning rate, regularization).

What this means is that hyperparameter search may overarch feature extraction
_and_ fitting and essentially may require a data frame as an input in the most
general case (and I ran into such a practical case before).

Finally, some goodness of fit metrics work on pre-vectorized factors.

This is all standard, but it is all pretty expensive to do, unfortunately. I
have a big problem discarding the notion of dataframe support as part of the
fitting/search process for some areas of computational statistics.
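A minimal sketch of a hyperparameter search that overarches feature extraction and fitting, as described above. The toy data, the `alpha` extraction knob, and the ridge term are all invented for illustration; this is not the implicit-feedback recommender itself:

```python
import itertools

# toy training/holdout pairs (x, y); purely illustrative
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
holdout = [(4.0, 8.1)]

def extract(x, alpha):
    # hypothetical feature-extraction parameter (e.g. a confidence weighting)
    return alpha * x

def fit(data, alpha, reg):
    # closed-form ridge slope for y ~ w * extract(x, alpha)
    num = sum(extract(x, alpha) * y for x, y in data)
    den = sum(extract(x, alpha) ** 2 for x, _ in data) + reg
    return num / den

def sse(w, data, alpha):
    return sum((y - w * extract(x, alpha)) ** 2 for x, y in data)

# the search sweeps the extraction parameter AND the fitter
# hyperparameter jointly -- neither can be tuned in isolation
grid = itertools.product([0.5, 1.0, 2.0], [0.0, 0.1])
best_alpha, best_reg = min(grid, key=lambda p: sse(fit(train, *p), holdout, p[0]))
```

Note that the holdout error depends on both knobs at once, which is the reason the search must see the pre-vectorized (dataframe-like) input rather than a fixed numerical matrix.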


Re: Traits for a mahout algorithm Library.

2016-07-21 Thread Dmitriy Lyubimov
sk-learn learner, transformer and predictor features sound good to me,
tried-and-proven

most importantly, imo we need a strong, established type system and should not
repeat what i view as a problem in some other offerings. If the type system is
strict and limited in size, then there's much less need for data adapters,
or none at all.

so what we have:
-- double precision tensor types (but not n-d arrays though)
what we don't have:
-- data frames

What we may want to have
-- formula support, especially for non-linear glm ("linear generalized
linear", does this make sense at all?) ok, non-linear regressions
formula normally acts on data-frame-y data, not on tensor data, albeit it
produces tensor data. Herein lies a conundrum. I don't see mahout taking on
data frames, this is just too big. but good formula and "factor" (in R
sense) support is nice to have for down-to-earth problems.

perhaps a tactical solution here is to integrate some foreign engine data
frames but mahout native formula support. But i didn't give it much
thought, because, although formulas and step-wise non-linear model searches
are the first thing to happen to any analytics (but somehow it hasn't
happened well enough elsewhere), i don't see how it can be made cheaply in an
engine-agnostic way. I still commonly view mahout as an under-funded
project, so choices of new things should be smart -- small in volume, great
in the bang. Dataframes are not small in the volume, esp. since i am
increasingly turning away from Spark in my personal endeavors, so i won't
support just integrating sparkql for this purpose.

Big area that people actually need (IMO) and what hasn't been done well
elsewhere (IMO) are model and model parameter searches. This "ML optimizer"
idea has been in AMPLab for as long as i can remember, and is still
very popular, but I don't think there are good offerings that actually solve
this problem in OSS. One of the reasons: modern OSS is pretty slow for the
volume required by the task. If we get some unique improvements to the
framework, we can think of getting into this business. This shouldn't be that
difficult, assuming throughput is not an issue. GPU clusters are
increasingly common, we can hope we'll get there in the future.

on algorithm side, i would love to see something with 2d inputs, cnns or
something, for image processing.
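A minimal sketch of the "factor" (in the R sense) vectorization a formula layer would perform. Plain illustrative Python; one-hot is just one of the several possible vectorizations mentioned (a hashing-trick encoding would be another), and none of this is a Mahout API:

```python
def one_hot(values, levels=None):
    """Vectorize a categorical factor into numeric indicator columns."""
    levels = levels if levels is not None else sorted(set(values))
    index = {lvl: i for i, lvl in enumerate(levels)}
    return [[1.0 if index[v] == j else 0.0 for j in range(len(levels))]
            for v in values]

# a bounded factor with two categories becomes a 2-column design block
rows = one_hot(["red", "blue", "red"])
```

For an unbounded factor the `levels` could not be enumerated up front, which is exactly why a hashing-style vectorization becomes an alternative worth searching over.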




On Thu, Jul 21, 2016 at 8:08 AM, Trevor Grant 
wrote:

> I was thinking so too.  Most ML frameworks are at least loosely based on the
> Sklearn paradigm.  For those not familiar, at a very abstract level-
>
> model1 = new Algo // e.g. K-Means, Random Forest, Neural Net
>
> model1.fit(trainingData)
>
> // then depending on the goal of the algorithm you have either (or both)
> preds = model1.predict( testData)  // which returns a vector of predictions
> for each obs point in testing data
>
> // or sometimes
> newVals = model1.transform( testData) // which returns a new dataset like
> object, as this makes more sense for things like neural nets, or when
> you're not just predicting a single value per observation
>
>
> In addition to the above, pre-processing operations then also have a
> transform method such as
>
> preprocess1 = new Normalizer
>
> preprocess1.fit( trainingData )  // in this phase calculates the mean and
> variance of the training data set
>
> preprocessedTrainingData = preprocess1.transform( trainingData)
> preprocessTestingData = preprocess1.transform( testingData)
>
> I think this is a reasonable approach bc A) it makes sense and B) is a
> standard of sorts across ML libraries (bc of A)
>
> We have two high level bucket types, based on what the output is:
>
> Predictors and Transformers
>
> Predictors: anything that returns a single value per observation; this is
> classifiers and regressors
>
> Transformers: anything that returns a vector per observation
> - Pre-processing operations
> - Classifiers, in that usually there is a probability vector for each
> observation as to which class it belongs to; the 'predict' method then
> just picks the most likely class
> - Neural nets ( though with one small tweak can be extended to regression
> or classification )
> - Any unsupervised learning application (e.g. clustering)
> - etc.
>
> And so really we have something like:
>
> class LearningFunction
>   def fit()
>
> class Transformer extends LearningFunction:
>   def transform
>
> class Predictor extends Transformer:
>   def predict
>
>
> This paradigm also lends its self nicely to pipelines...
>
> pipeline1 = new Pipeline
>.add( transformer1 )
>.add(  transformer2 )
>.add( model1 )
>
> pipeline1.fit( trainingData )
> pipelin1.predict( testingData )
>
> I have to read up on recommenders a bit more to figure out how those play in,
> or if we need another class.
>
> In addition to that I think we would have an optimizers section that allows
> for the various flavors of SGD, but also allows other types of 
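The class sketch in the message above could be fleshed out as follows. This is an illustrative Python analogue of the proposed traits (the thread is about Scala traits; only the names LearningFunction/Transformer/Predictor and the Normalizer/Pipeline examples come from the message, everything else is assumption):

```python
class LearningFunction:
    def fit(self, data):
        raise NotImplementedError

class Transformer(LearningFunction):
    def transform(self, data):
        raise NotImplementedError

class Predictor(Transformer):
    def predict(self, data):
        # default: pick the most likely class from each probability row
        return [max(range(len(row)), key=row.__getitem__)
                for row in self.transform(data)]

class Normalizer(Transformer):
    """Pre-processing transformer: learns the mean, then subtracts it."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def transform(self, data):
        return [x - self.mean for x in data]

class Pipeline(LearningFunction):
    """Chains transformers, ending with a final learner."""
    def __init__(self, *stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages[:-1]:
            data = stage.fit(data).transform(data)
        self.stages[-1].fit(data)
        return self

normalized = Normalizer().fit([1.0, 2.0, 3.0]).transform([4.0])
```

The `Predictor extends Transformer` relationship encodes the observation in the message that a classifier's probability vector is itself a transform, with `predict` just collapsing it to the most likely class.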

Re: Location of JARs

2016-06-02 Thread Dmitriy Lyubimov
Thank you, Trevor, for doing this. i think it is tremendously useful for
this project.

On Thu, Jun 2, 2016 at 11:00 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> I agree and have been thinking so more and more over the last couple of
> days.
>
> I'm going to start tinkering with that idea this afternoon / remainder of
> week.
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Thu, Jun 2, 2016 at 12:23 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > i already looked. my main concern is that it meddles with spark
> interpreter
> > code too much which may create friction with spark interpreters in
> future.
> > it may be hard to have two products integration code coherent in one
> > component (in this case, the same interpreter class/file). I don't want
> to
> > put this comment to zeppelin discussion, but internally i think it should
> > be a concern for us.
> >
> > Is it possible to have a standalone mahout-spark interpreter but use the
> > same spark configuration as configured for spark interpreter? If yes, i
> > would very much like not to have spark-alone and spark+mahout code
> > intermingled in same interpreter class.
> >
> > visually, it probably also would be preferable to have a block that would
> > require boilerplate of something like
> >
> > %spark.mahout
> >
> > ... blah 
> >
> > On Thu, Jun 2, 2016 at 8:24 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> > wrote:
> >
> > > Would you mind having a look at
> > > https://github.com/apache/incubator-zeppelin/pull/928/files
> > > to see if I'm missing anything critical.
> > >
> > > The idea is the user specifies a directory containing the necessary (to
> > be
> > > covered in the setup documentation), and the jars are loaded from
> there.
> > > Also adds some configuration settings (mainly Kyro) when 'spark.mahout'
> > is
> > > true.  Finally imports the mahout and sets up the sdc from the already
> > > declared sc.
> > >
> > > Based on my testing that works in local and cluster mode.
> > >
> > > Thanks,
> > > tg
> > >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Wed, Jun 1, 2016 at 12:48 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > > wrote:
> > >
> > > > On Wed, Jun 1, 2016 at 10:46 AM, Dmitriy Lyubimov <dlie...@gmail.com
> >
> > > > wrote:
> > > >
> > > > >
> > > > >
> > > > > On Wed, Jun 1, 2016 at 7:47 AM, Trevor Grant <
> > trevor.d.gr...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > >>
> > > > >> Other approaches?
> > > > >>
> > > > >> For background, Zeppelin starts a Spark Shell and we need to make
> > sure
> > > > all
> > > > >> of the required Mahout jars get loaded in the class path when
> spark
> > > > >> starts.
> > > > >> The question is where do all of these JARs relatively live.
> > > > >>
> > > > >
> > > > > How does zeppelin cope with extra dependencies for other
> > interpreters
> > > > > (even spark itself)? I guess we should follow the same practice
> > there.
> > > > >
> > > > > Release independence of location algorithm largely depends on jar
> > > filters
> > > > > (again, see filters in the spark binding package). It is possible
> > that
> > > > > artifacts required may change but not very likely (i don't think
> they
> > > > ever
> > > > > changed since 0.10). so it should be possible to build (mahout)
> > > > > release-independent logic to locate, filter and assert the
> necessary
> > > > jars.
> > > > >
> > > >
> > > > PS this  may change soon though if/when custom javacpp code is built,
> > we
> > > > may probably want to keep all native things as separate release
> > > artifacts,
> > > > as they are basically treated as optionally available accelerators
> and
> > > may
> > > > or may not be properly loaded in all situations. hence they may
> > warrant a
> > > > separate jar vehicle.
> > > >
> > > > >
> > > > >
> > > > >>
> > > > >> Thanks for any feedback,
> > > > >> tg
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> Trevor Grant
> > > > >> Data Scientist
> > > > >> https://github.com/rawkintrevo
> > > > >> http://stackexchange.com/users/3002022/rawkintrevo
> > > > >> http://trevorgrant.org
> > > > >>
> > > > >> *"Fortunate is he, who is able to know the causes of things."
> > > -Virgil*
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>


Re: Location of JARs

2016-06-02 Thread Dmitriy Lyubimov
i already looked. my main concern is that it meddles with the spark interpreter
code too much, which may create friction with spark interpreters in the future.
it may be hard to keep two products' integration code coherent in one
component (in this case, the same interpreter class/file). I don't want to
put this comment in the zeppelin discussion, but internally i think it should
be a concern for us.

Is it possible to have a standalone mahout-spark interpreter but use the
same spark configuration as configured for spark interpreter? If yes, i
would very much like not to have spark-alone and spark+mahout code
intermingled in same interpreter class.

visually, it probably also would be preferable to have a block that would
require boilerplate of something like

%spark.mahout

... blah 

On Thu, Jun 2, 2016 at 8:24 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> Would you mind having a look at
> https://github.com/apache/incubator-zeppelin/pull/928/files
> to see if I'm missing anything critical.
>
> The idea is the user specifies a directory containing the necessary (to be
> covered in the setup documentation), and the jars are loaded from there.
> Also adds some configuration settings (mainly Kyro) when 'spark.mahout' is
> true.  Finally imports the mahout and sets up the sdc from the already
> declared sc.
>
> Based on my testing that works in local and cluster mode.
>
> Thanks,
> tg
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Wed, Jun 1, 2016 at 12:48 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > On Wed, Jun 1, 2016 at 10:46 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > >
> > >
> > > On Wed, Jun 1, 2016 at 7:47 AM, Trevor Grant <trevor.d.gr...@gmail.com
> >
> > > wrote:
> > >
> > >>
> > >> Other approaches?
> > >>
> > >> For background, Zeppelin starts a Spark Shell and we need to make sure
> > all
> > >> of the required Mahout jars get loaded in the class path when spark
> > >> starts.
> > >> The question is where do all of these JARs relatively live.
> > >>
> > >
> > > How does zeppelin cope with extra dependencies for other interpreters
> > > (even spark itself)? I guess we should follow the same practice there.
> > >
> > > Release independence of location algorithm largely depends on jar
> filters
> > > (again, see filters in the spark binding package). It is possible that
> > > artifacts required may change but not very likely (i don't think they
> > ever
> > > changed since 0.10). so it should be possible to build (mahout)
> > > release-independent logic to locate, filter and assert the necessary
> > jars.
> > >
> >
> > PS this  may change soon though if/when custom javacpp code is built, we
> > may probably want to keep all native things as separate release
> artifacts,
> > as they are basically treated as optionally available accelerators and
> may
> > or may not be properly loaded in all situations. hence they may warrant a
> > separate jar vehicle.
> >
> > >
> > >
> > >>
> > >> Thanks for any feedback,
> > >> tg
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Trevor Grant
> > >> Data Scientist
> > >> https://github.com/rawkintrevo
> > >> http://stackexchange.com/users/3002022/rawkintrevo
> > >> http://trevorgrant.org
> > >>
> > >> *"Fortunate is he, who is able to know the causes of things."
> -Virgil*
> > >>
> > >
> > >
> >
>


Re: Location of JARs

2016-06-01 Thread Dmitriy Lyubimov
On Wed, Jun 1, 2016 at 10:46 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>
>
> On Wed, Jun 1, 2016 at 7:47 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
>>
>> Other approaches?
>>
>> For background, Zeppelin starts a Spark Shell and we need to make sure all
>> of the required Mahout jars get loaded in the class path when spark
>> starts.
>> The question is where do all of these JARs relatively live.
>>
>
> How does zeppelin cope with extra dependencies for other interpreters
> (even spark itself)? I guess we should follow the same practice there.
>
> Release independence of location algorithm largely depends on jar filters
> (again, see filters in the spark binding package). It is possible that
> artifacts required may change but not very likely (i don't think they ever
> changed since 0.10). so it should be possible to build (mahout)
> release-independent logic to locate, filter and assert the necessary jars.
>

PS this may change soon though if/when custom javacpp code is built; we
may probably want to keep all native things as separate release artifacts,
as they are basically treated as optionally available accelerators and may
or may not be properly loaded in all situations. hence they may warrant a
separate jar vehicle.

>
>
>>
>> Thanks for any feedback,
>> tg
>>
>>
>>
>>
>>
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>>
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>
>
>


Re: Location of JARs

2016-06-01 Thread Dmitriy Lyubimov
On Wed, Jun 1, 2016 at 7:47 AM, Trevor Grant 
wrote:

>
> Other approaches?
>
> For background, Zeppelin starts a Spark Shell and we need to make sure all
> of the required Mahout jars get loaded in the class path when spark starts.
> The question is where do all of these JARs relatively live.
>

How does zeppelin cope with extra dependencies for other interpreters
(even spark itself)? I guess we should follow the same practice there.

Release independence of location algorithm largely depends on jar filters
(again, see filters in the spark binding package). It is possible that
artifacts required may change but not very likely (i don't think they ever
changed since 0.10). so it should be possible to build (mahout)
release-independent logic to locate, filter and assert the necessary jars.


>
> Thanks for any feedback,
> tg
>
>
>
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>


Re: Location of JARs

2016-06-01 Thread Dmitriy Lyubimov
I am just going to give you some design intents in the existing code.

as far as i can recollect, mahout context gives complete flexibility. You
can control the behavior but various degrees of overriding the default
behavior and doing more or less work on context setup on your own. (I
assume we are talking specifically about sparkbindings).

By default, the mahoutSparkContext() helper of the sparkbindings package
tries to locate the jars in whatever MAHOUT_HOME/bin/classpath -spark tells
it. (btw this part can be rewritten much more elegantly and robustly with
scala.sys.process._ capabilities of Scala; it's just that this code is
more than 3 years old now and i was not that deep into Scala back then to
know its shell DSL in such detail).

the logic of MAHOUT_HOME/bin/classpath -spark is admittedly pretty
convoluted, and there are location variations between the binary distribution
and maven-built source locations. I can't say i understand the underlying
structure or its motivation very well.

(1) E.g. you can tell it to ignore automatically adding these jars to
context and instead use your own algorithm to locate those (e.g. in
Zeppelin home or something). You also can do it in more than one way:
(1a) set addMahoutJars = false. the correct behavior should then ignore the
MAHOUT_HOME requirement; subsequently, the necessary
mahout jars can be supplied from your custom location in the `customJars`
parameter;
(1b) or you can also set addMahoutJars=false and add them via a supplied
custom sparkConf (which is the base configuration for everything before
mahout tries to add its own requirements to the configuration).

(2) finally, you can completely take over spark context creation and wrap
already existing context into a mahout context via implicit (or explicit)
conversion given in the same package, `sc2sdc`. E.g. you can do it
implicity:

import o.a.m.sparkbindings._

val mahoutContext:SparkDistributedContext = sparkContext // this is of type
o.a.spark.SparkContext

that's it.

Note that in that case you have to take on more work than just
adjusting the context JAR classpath. you will have to do all the customizations
mahout does to the context, such as ensuring the minimum requirements of kryo
serialization (you can see in the code what currently is enforced, but i think
this is largely just the kryo serialization requirement).

Now, if you want to do custom classpath: naturally you don't need all
mahout jars. In case of spark backend execution, you need to filter to
include only mahout-math, mahout-math-scala and mahout-spark.

I am fairly sure that the modern state of the project also requires
mahout-spark-[blah]-dependency-reduced.jar to be redistributed to the backend
as well (it contains the minimal 3rd-party shaded dependencies apparently
engaged by some algorithms in the backend -- it used to be absent from
backend requirements though).

-d



On Wed, Jun 1, 2016 at 7:47 AM, Trevor Grant 
wrote:

> I'm trying to refactor the Mahout dependency from the pom.xml of the Spark
> interpreter (adding Mahout integration to Zeppelin)
>
> Assuming MAHOUT_HOME is available, I see that the jars in source build live
> in a different place than the jars in the binary distribution.
>
> I'm to the point where I'm trying to come up with a good place to pick up
> the required jars while allowing for:
> 1. flexibility in Mahout versions
> 2. Not writing a huge block of code designed to scan several conceivable
> places throughout the file system.
>
> One thought was to put the onus on the user to move the desired jars to a
> local repo within the Zeppelin directory.
>
> Wanted to open up to input from users and dev as I consider this.
>
> Is documentation specifying which JARs need to be moved to a specific
> directory and places you are likely to find them to much to ask of users?
>
> Other approaches?
>
> For background, Zeppelin starts a Spark Shell and we need to make sure all
> of the required Mahout jars get loaded in the class path when spark starts.
> The question is where do all of these JARs relatively live.
>
> Thanks for any feedback,
> tg
>
>
>
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>


Re: Hadoop 1 Support Going forward

2016-05-26 Thread Dmitriy Lyubimov
I think this is not an issue of our choice, but an issue of capability. As
far as i have witnessed during past 2 years, capability to do anything with
MR is at the very least lacking on the grandest of scales, if not
completely gone from this project.

On Thu, May 26, 2016 at 12:37 PM, Andrew Palumbo  wrote:

>
> I would also suggest that we should try to find out if there is anybody
> actually using Hadoop 1 with Mahout.
> 
> From: Isabel Drost-Fromm 
> Sent: Thursday, May 26, 2016 4:20:08 AM
> To: dev@mahout.apache.org
> Subject: Re: Hadoop 1 Support Going forward
>
> On Thu, May 26, 2016 at 01:22:26AM +, Andrew Palumbo wrote:
> > Currently I don't believe that there Is a reason not to, aside from
> regular Jenkins hiccups and some minor addition to the complexity of the
> example script
>
> I think there are two questions we should be asking ourselves here:
>
> If tomorrow someone comes along and asks us to fix a major bug/security
> issue,
> is there someone within our community who has enough insight to be able to
> do
> that?
>
> Assuming we aren't making any modifications to this part of the code base
> anyway, what is the advantage for downstream users to keep that code
> around in
> newer versions as opposed to using an older release of Mahout? E.g. do they
> automatically benefit from performance improvements made elsewhere?
>
>
> Isabel
>
>


Re: Future Mahout - Zeppelin work

2016-05-20 Thread Dmitriy Lyubimov
k) and an R
> > > >>> library
> > > >>>>> containing some functions which will pull the data out of the
> > > resource
> > > >>>> pool
> > > >>>>> and spit out a dataframe.
> > > >>>>>
> > > >>>>> Once its in a Dataframe in R- go nuts with any plotting package
> you
> > > >>> like.
> > > >>>>> Likewise, it should be possible to do the same thing with
> > matplotlib
> > > >>> and
> > > >>>>> python (
> https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
> > > >>>>>
> > > >>>>> All of this doesn't necessarily require any changing of the
> > Zeppelin
> > > >>>> source
> > > >>>>> code, and isn't very intrusive or difficult to set up, I'll make
> a
> > > > blog
> > > >>>>> post but it's almost a textbook tutorial on using imports
> in
> > > >>>>> Zeppelin. (e.g. a tutorial would be just as at home on the
> Zeppelin
> > > >>> site
> > > >>>> as
> > > >>>>> it would on the Mahout site).
> > > >>>>>
> > > >>>>> Now, there has been some talk of using Zeppelin's angularJS.
> > Things
> > > >>> get
> > > >>>> a
> > > >>>>> little more hairy in that case, but we could make an optional
> build
> > > >>>> profile
> > > >>>>> that would make zeppelin recognize matrices as tables and expose
> > all
> > > > of
> > > >>>> the
> > > >>>>> built in charting features of Zeppelin.
> > > >>>>>
> > > >>>>> If you're not adding a bunch of custom charts to Zeppelin (which
> > > would
> > > >>> be
> > > >>>>> somewhat tedious), you're going to end up with a lot of examples
> > > where
> > > >>>> you
> > > >>>>> create a table in Mahout/Spark pass it to AngularJS then some
> > > > AngularJS
> > > >>>>> code charts it for you.  At that point however, you're doing just
> > as
> > > >>> much
> > > >>>>> work, if not more than it would be to simply pass to R or Python
> > and
> > > >>> let
> > > >>> ggplot or matplotlib do the work for you.
> > > >>>>>
> > > >>>>> Finally, I haven't run into any errors yet using Kryo (which in
> > part
> > > > is
> > > >>>>> what makes me fear I'm not doing this right... it was too
> easy...)
> > If
> > > >>>>> anything seems redundant or missing, please call it out.
> > > >>>>>
> > > >>>>> Add Properties to Spark interp:
> > > >>>>>
> > > >>>>> spark.kryo.registrator
> > > >>>>> org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
> > > >>>>> spark.serializer org.apache.spark.serializer.KryoSerializer
> > > >>>>>
> > > >>>>> Add artifacts (need to change these to maven not local, also need
> > to
> > > >>>>> add/change one jar per below, however this does run):
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>
> > > >
> > >
> >
> /home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
> > > >>>>>
> > > >>>
> > > >
> > >
> >
> /home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
> > > >>>>>
> > > >>>
> > > >
> > >
> >
> /home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
> > > >>>>>
> > > >>>
> > > >
> > >
> >
> /home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
> > > >>>>> Add following code to first paragraph of notebook:
> > > >>>>> ```
> &g
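
As a sanity check, the interpreter properties quoted above are the same settings one would pass to a plain `spark-shell`. A sketch follows; the jar names are placeholders for wherever the jars were actually found, while the property names and registrator class are the ones quoted in the message.

```shell
#!/bin/sh
# Sketch: the Kryo settings from the thread expressed as spark-shell flags.
# Jar names are placeholders; the property names and the registrator class
# are the ones quoted above.
MAHOUT_JARS="mahout-math-0.12.1-SNAPSHOT.jar,mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar,mahout-spark_2.10-0.12.1-SNAPSHOT.jar,mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar"

build_spark_args() {
  # Print one argument per line so callers can inspect them easily.
  printf '%s\n' \
    --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    --conf "spark.kryo.registrator=org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator" \
    --jars "$MAHOUT_JARS"
}

# Example: spark-shell $(build_spark_args)
```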

Re: Future Mahout - Zeppelin work

2016-05-19 Thread Dmitriy Lyubimov
still in reply to the blog: i wish though zeppelin had a true mahout
interpreter. all it basically requires is to reuse spark settings but
execute proper imports and context init, and provide this "tablify" routine
somehow.


On Thu, May 19, 2016 at 1:55 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Trevor, terrific job on zeppelin post btw. thanks!
>
> On Thu, May 19, 2016 at 1:54 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>> Trevor, left a comment on your blog before realizing i should've really
>> be commenting here...
>>
>> -d
>>
>> On Wed, May 18, 2016 at 9:05 PM, Andrew Palumbo <ap@outlook.com>
>> wrote:
>>
>>> In mahout 0.13 we'll be looking at row reduction methods other than just
>>> sampling to transform DRM -> matrix so that it fits in memory.  This is
>>> great!
>>>
>>>  Original message 
>>> From: Andrew Palumbo <ap@outlook.com>
>>> Date: 05/19/2016 12:02 AM (GMT-05:00)
>>> To: dev@mahout.apache.org
>>> Subject: RE: Future Mahout - Zeppelin work
>>>
>>> Well done, Trevor!  I've not yet had a chance to try this in zeppelin
>>> but I just read the blog which is great!
>>>
>>>  Original message 
>>> From: Trevor Grant <trevor.d.gr...@gmail.com>
>>> Date: 05/18/2016 2:44 PM (GMT-05:00)
>>> To: dev@mahout.apache.org
>>> Subject: Re: Future Mahout - Zeppelin work
>>>
>>> Ah thank you.
>>>
>>> Fixing now.
>>>
>>>
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/raw<
>>> http://stackexchange.com/users/3002022/rawkintrevo>
>>>
>>
>>
>


Re: Future Mahout - Zeppelin work

2016-05-19 Thread Dmitriy Lyubimov
Trevor, terrific job on zeppelin post btw. thanks!

On Thu, May 19, 2016 at 1:54 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Trevor, left a comment on your blog before realizing i should've really be
> commenting here...
>
> -d
>
> On Wed, May 18, 2016 at 9:05 PM, Andrew Palumbo <ap@outlook.com>
> wrote:
>
>> In mahout 0.13 we'll be looking at row reduction methods other than just
>> sampling to transform DRM -> matrix so that it fits in memory.  This is
>> great!
>>
>>  Original message 
>> From: Andrew Palumbo <ap@outlook.com>
>> Date: 05/19/2016 12:02 AM (GMT-05:00)
>> To: dev@mahout.apache.org
>> Subject: RE: Future Mahout - Zeppelin work
>>
>> Well done, Trevor!  I've not yet had a chance to try this in zeppelin but
>> I just read the blog which is great!
>>
>>  Original message 
>> From: Trevor Grant <trevor.d.gr...@gmail.com>
>> Date: 05/18/2016 2:44 PM (GMT-05:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Future Mahout - Zeppelin work
>>
>> Ah thank you.
>>
>> Fixing now.
>>
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/raw<
>> http://stackexchange.com/users/3002022/rawkintrevo>
>>
>
>


Re: Future Mahout - Zeppelin work

2016-05-19 Thread Dmitriy Lyubimov
Trevor, left a comment on your blog before realizing i should've really be
commenting here...

-d

On Wed, May 18, 2016 at 9:05 PM, Andrew Palumbo  wrote:

> In mahout 0.13 we'll be looking at row reduction methods other than just
> sampling to transform DRM -> matrix so that it fits in memory.  This is
> great!
>
>  Original message 
> From: Andrew Palumbo 
> Date: 05/19/2016 12:02 AM (GMT-05:00)
> To: dev@mahout.apache.org
> Subject: RE: Future Mahout - Zeppelin work
>
> Well done, Trevor!  I've not yet had a chance to try this in zeppelin but
> I just read the blog which is great!
>
>  Original message 
> From: Trevor Grant 
> Date: 05/18/2016 2:44 PM (GMT-05:00)
> To: dev@mahout.apache.org
> Subject: Re: Future Mahout - Zeppelin work
>
> Ah thank you.
>
> Fixing now.
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/raw<
> http://stackexchange.com/users/3002022/rawkintrevo>
>


[jira] [Comment Edited] (MAHOUT-1791) Automatic threading for java based mmul in the front end and the backend.

2016-05-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271648#comment-15271648
 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1791 at 5/5/16 12:03 AM:
---

experiments show that native solvers + 2 backend threads per task creates good 
cpu saturation balance. 
ditto all-core threads for the front-end


was (Author: dlyubimov):
experiments show that native solvers + 2x backend threads per task creates good 
cpu saturation balance. 
ditto all-core threads for the front-end

> Automatic threading for java based mmul in the front end and the backend.
> -
>
> Key: MAHOUT-1791
> URL: https://issues.apache.org/jira/browse/MAHOUT-1791
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1, 0.12.0, 0.11.2
>    Reporter: Dmitriy Lyubimov
>Assignee: Andrew Musselman
> Fix For: 0.12.1
>
>
> As we know, we are still struggling with decisions which path to take for 
> bare metal accelerations in in-core math. 
> Meanwhile, a simple no-brainer improvement though is to add decision paths 
> and apply multithreaded matrix-matrix multiplication (and maybe even others; 
> but mmul perhaps is the most prominent beneficiary here at the moment which 
> is both easy to do and to have a statistically significant improvement) 
> So multithreaded logic addition to mmul is one path. 
> Another path is automatic adjustment of multithreading. 
> In front end, we probably want to utilize all cores available. 
> in the backend, we can oversubscribe cores but probably doing so by more than 
> 2x or 3x is unadvisable because of point of diminishing returns driven by 
> growing likelihood of context switching overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
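
The heuristic in the comment above reduces to simple arithmetic: use every core in the front end, about 2 threads per backend task, and keep total backend oversubscription within roughly 2x-3x of the core count. The sketch below is illustrative only; the function name and the one-task-per-core assumption are made up, and these are not real Mahout knobs.

```shell
#!/bin/sh
# Sketch of the threading heuristic from the JIRA comment above:
# front end uses all cores; backend uses ~2 threads per task, capped so
# total oversubscription stays within ~3x the core count. Illustrative
# only -- plan_threads is a hypothetical helper, not a Mahout setting.
plan_threads() {
  cores=$1
  tasks=$2
  frontend=$cores                 # front end: all available cores
  per_task=2                      # backend: ~2 threads per task
  cap=$(( cores * 3 ))            # beyond ~3x, context switching dominates
  backend=$(( tasks * per_task ))
  if [ "$backend" -gt "$cap" ]; then
    backend=$cap
  fi
  printf 'frontend=%s backend=%s\n' "$frontend" "$backend"
}

# Example: plan_threads "$(getconf _NPROCESSORS_ONLN)" 8
```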


[jira] [Commented] (MAHOUT-1791) Automatic threading for java based mmul in the front end and the backend.

2016-05-04 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271648#comment-15271648
 ] 

Dmitriy Lyubimov commented on MAHOUT-1791:
--

experiments show that native solvers + 2x backend threads per task creates good 
cpu saturation balance. 
ditto all-core threads for the front-end

> Automatic threading for java based mmul in the front end and the backend.
> -
>
> Key: MAHOUT-1791
> URL: https://issues.apache.org/jira/browse/MAHOUT-1791
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.11.1, 0.12.0, 0.11.2
>    Reporter: Dmitriy Lyubimov
>Assignee: Andrew Musselman
> Fix For: 0.12.1
>
>
> As we know, we are still struggling with decisions which path to take for 
> bare metal accelerations in in-core math. 
> Meanwhile, a simple no-brainer improvement though is to add decision paths 
> and apply multithreaded matrix-matrix multiplication (and maybe even others; 
> but mmul perhaps is the most prominent beneficiary here at the moment which 
> is both easy to do and to have a statistically significant improvement) 
> So multithreaded logic addition to mmul is one path. 
> Another path is automatic adjustment of multithreading. 
> In front end, we probably want to utilize all cores available. 
> in the backend, we can oversubscribe cores but probably doing so by more than 
> 2x or 3x is unadvisable because of point of diminishing returns driven by 
> growing likelihood of context switching overhead.





Re: stochastic nature

2016-05-02 Thread Dmitriy Lyubimov
also, mahout does have an optimizer that simply decides on the degree of
parallelism of the _product_. I.e., if it computes C=A'B then it figures
that final results should be split N ways. but it doesn't apply the
partition function -- it just uses the usual hash partitioner to forward
the keys, i don't think we ever override that.

On Mon, May 2, 2016 at 9:39 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> by probabilistic algorithms i mostly mean inference involving monte carlo
> type mechanisms (Gibbs sampling LDA which i think might still be part of
> our MR collection might be an example, as well as its faster counterpart,
> variational Bayes inference).
>
> the parallelization strategies are just standard spark mechanisms (in
> case of spark), mostly using their standard hash samplers (which in
> math speak are really uniform multinomial samplers).
>
> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim <khurrum.na...@useitc.com>
> wrote:
>
>> Hey Dimitri -
>>
>> Yes I meant probabilistic algorithms.  If mahout doesn’t use
>> probabilistic algos then how does it accomplish a degree of optimal
>> parallelization? Wouldn’t you need randomization to spread out the
>> processing of tasks.
>>
>> > On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> wrote:
>> >
>> > yes mahout has stochastic svd and pca which are described at length in
>> the
>> > samsara book. The book examples in Andrew Palumbo's github also contain
>> an
>> > example of computing k-means|| sketch.
>> >
>> > if you mean _probabilistic_ algorithms, although i have done some things
>> > outside the public domain, nothing has been contributed.
>> >
>> > You are very welcome to try something if you don't have big constraints
>> on
>> > oss contribution.
>> >
>> > -d
>> >
>> > On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim <khurrum.na...@useitc.com
>> >
>> > wrote:
>> >
>> >> Hey All,
>> >>
>> >> I’d like to know if Mahout uses any randomized algorithms.   I’m
>> thinking
>> >> it probably does.  Can somebody point me to the packages that utilize
>> >> randomized algos.
>> >>
>> >> Thanks,
>> >>
>> >> Khurrum
>> >>
>> >>
>>
>>
>


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-05-02 Thread Dmitriy Lyubimov
graph = graft, sorry. Graft just the AtB class into 0.12 codebase.

On Mon, May 2, 2016 at 9:06 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> ok.
>
> Nikaash,
> could you perhaps do one more experiment and graph the 0.10 a'b code into
> 0.12 code (or whatever branch you say is not working the same) so we could
> quite confirm that the culprit change is indeed AB'?
>
> thank you very much.
>
> -d
>
> On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri <nikaashp...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I tried commenting out those lines and it did marginally improve the
>> performance. Although, the 0.10 version still significantly outperforms it.
>>
>> Here is a screenshot of the saveAsTextFile job (attached as selection1).
>> The AtB step took about 34 mins, which is significantly more than using
>> 0.10. Similarly, the saveAsTextFile action takes about 9 mins as well.
>>
>> The selection2 file is a screenshot of the flatMap at AtB.scala job,
>> which ran for 34 minutes,
>>
>> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
>> would take time, while subsequent such operations for the other indicators
>> would be orders of magnitude faster. In the current job, the subsequent
>> AtB operations take time similar to the first one.
>>
>> A snapshot of my code is as follows:
>>
>> var existingRowIDs: Option[BiDictionary] = None
>>
>> // The first action named in the sequence is the "primary" action and begins 
>> to fill up the user dictionary
>> for (actionDescription <- actionInput) {
>>   // grab the path to actions
>>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
>> actionDescription._2,
>> schema = DefaultIndexedDatasetElementReadSchema,
>> existingRowIDs = existingRowIDs)
>>   existingRowIDs = Some(action.rowIDs)
>>
>>   ...
>> }
>>
>> which seems fairly standard, so I hope I'm not making a mistake here.
>>
>> It looks like the 0.11 onward version is using computeAtBZipped3 for
>> performing the multiplication in atb_nograph_mmul unlike 0.10 which was
>> using atb_nograph. Though I'm not really sure whether that makes much of a
>> difference.
>>
>> Thank you,
>> Nikaash Puri
>>
>> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel <p...@occamsmachete.com>
>> wrote:
>>
>>> Right, will do. But Nakaash if you could just comment out those lines
>>> and see if it has an effect it would be informative and even perhaps solve
>>> your problem sooner than my changes. No great rush. Playing around with
>>> different values, as Dmitriy says, might yield better results and for that
>>> you can mess with the code or wait for my changes.
>>>
>>> Yeah, it’s fast enough in most cases. The main work is the optimized
>>> A’A, A’B stuff in the BLAS optimizer Dmitriy put in. It is something like
>>> 10x faster than a similar algo in Hadoop MR. This particular calc and
>>> generalization is not in any other Spark or now Flink lib that I know of.
>>>
>>>
>>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>>
>>> Nikaash,
>>>
>>> yes unfortunately you may need to play with parallelism for your
>>> particular
>>> load/cluster manually to get the best out of it. I guess Pat will be
>>> adding
>>> the option.
>>>
>>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <nikaashp...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Sure, I’ll do some more detailed analysis of the jobs on the UI and
>>> share
>>> > screenshots if possible.
>>> >
>>> > Pat, yup, I’ll only be able to get to this on Monday, though. I’ll
>>> comment
>>> > out the line and see the difference in performance.
>>> >
>>> > Thanks so much for helping guys, I really appreciate it.
>>> >
>>> > Also, the algorithm implementation for LLR is extremely performant, at
>>> > least as of Mahout 0.10. I ran some tests for around 61 days of data
>>> (which
>>> > in our case is a fair amount) and the model was built in about 20
>>> minutes,
>>> > which is pretty amazing. This was using a pretty decent sized cluster,
>>> > though.
>>> >
>>> > Thank you,
>>> > Nikaash Puri
>>> >
>>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel <p...@occa

Re: stochastic nature

2016-05-02 Thread Dmitriy Lyubimov
by probabilistic algorithms i mostly mean inference involving monte carlo
type mechanisms (Gibbs sampling LDA, which i think might still be part of
our MR collection, might be an example, as well as its faster counterpart,
variational Bayes inference).

the parallelization strategies are just standard spark mechanisms (in
case of spark), mostly using their standard hash samplers (which in
math speak are really uniform multinomial samplers).

On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim <khurrum.na...@useitc.com>
wrote:

> Hey Dimitri -
>
> Yes I meant probabilistic algorithms.  If mahout doesn’t use probabilistic
> algos then how does it accomplish a degree of optimal parallelization?
> Wouldn’t you need randomization to spread out the processing of tasks.
>
> > On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > yes mahout has stochastic svd and pca which are described at length in
> the
> > samsara book. The book examples in Andrew Palumbo's github also contain
> an
> > example of computing k-means|| sketch.
> >
> > if you mean _probabilistic_ algorithms, although i have done some things
> > outside the public domain, nothing has been contributed.
> >
> > You are very welcome to try something if you don't have big constraints
> on
> > oss contribution.
> >
> > -d
> >
> > On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim <khurrum.na...@useitc.com>
> > wrote:
> >
> >> Hey All,
> >>
> >> I’d like to know if Mahout uses any randomized algorithms.   I’m
> thinking
> >> it probably does.  Can somebody point me to the packages that utilize
> >> randomized algos.
> >>
> >> Thanks,
> >>
> >> Khurrum
> >>
> >>
>
>


Re: stochastic nature

2016-05-02 Thread Dmitriy Lyubimov
yes mahout has stochastic svd and pca which are described at length in the
samsara book. The book examples in Andrew Palumbo's github also contain an
example of computing k-means|| sketch.

if you mean _probabilistic_ algorithms, although i have done some things
outside the public domain, nothing has been contributed.

You are very welcome to try something if you don't have big constraints on
oss contribution.

-d

On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim 
wrote:

> Hey All,
>
> I’d like to know if Mahout uses any randomized algorithms.   I’m thinking
> it probably does.  Can somebody point me to the packages that utilize
> randomized algos.
>
> Thanks,
>
> Khurrum
>
>


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-05-02 Thread Dmitriy Lyubimov
ok.

Nikaash,
could you perhaps do one more experiment and graph the 0.10 a'b code into
0.12 code (or whatever branch you say is not working the same) so we could
quite confirm that the culprit change is indeed AB'?

thank you very much.

-d

On Mon, May 2, 2016 at 3:35 AM, Nikaash Puri <nikaashp...@gmail.com> wrote:

> Hi,
>
> I tried commenting out those lines and it did marginally improve the
> performance. Although, the 0.10 version still significantly outperforms it.
>
> Here is a screenshot of the saveAsTextFile job (attached as selection1).
> The AtB step took about 34 mins, which is significantly more than using
> 0.10. Similarly, the saveAsTextFile action takes about 9 mins as well.
>
> The selection2 file is a screenshot of the flatMap at AtB.scala job, which
> ran for 34 minutes,
>
> Also, I'm using multiple indicators. As of Mahout 0.10, the first AtB
> would take time, while subsequent such operations for the other indicators
> would be orders of magnitude faster. In the current job, the subsequent
> AtB operations take time similar to the first one.
>
> A snapshot of my code is as follows:
>
> var existingRowIDs: Option[BiDictionary] = None
>
> // The first action named in the sequence is the "primary" action and begins 
> to fill up the user dictionary
> for (actionDescription <- actionInput) {
>   // grab the path to actions
>   val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
> actionDescription._2,
> schema = DefaultIndexedDatasetElementReadSchema,
> existingRowIDs = existingRowIDs)
>   existingRowIDs = Some(action.rowIDs)
>
>   ...
> }
>
> which seems fairly standard, so I hope I'm not making a mistake here.
>
> It looks like the 0.11 onward version is using computeAtBZipped3 for
> performing the multiplication in atb_nograph_mmul unlike 0.10 which was
> using atb_nograph. Though I'm not really sure whether that makes much of a
> difference.
>
> Thank you,
> Nikaash Puri
>
> On Sat, Apr 30, 2016 at 12:36 AM Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> Right, will do. But Nakaash if you could just comment out those lines and
>> see if it has an effect it would be informative and even perhaps solve your
>> problem sooner than my changes. No great rush. Playing around with
>> different values, as Dmitriy says, might yield better results and for that
>> you can mess with the code or wait for my changes.
>>
>> Yeah, it’s fast enough in most cases. The main work is the optimized A’A,
>> A’B stuff in the BLAS optimizer Dmitriy put in. It is something like 10x
>> faster than a similar algo in Hadoop MR. This particular calc and
>> generalization is not in any other Spark or now Flink lib that I know of.
>>
>>
>> On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>> Nikaash,
>>
>> yes unfortunately you may need to play with parallelism for your
>> particular
>> load/cluster manually to get the best out of it. I guess Pat will be
>> adding
>> the option.
>>
>> On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <nikaashp...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Sure, I’ll do some more detailed analysis of the jobs on the UI and
>> share
>> > screenshots if possible.
>> >
>> > Pat, yup, I’ll only be able to get to this on Monday, though. I’ll
>> comment
>> > out the line and see the difference in performance.
>> >
>> > Thanks so much for helping guys, I really appreciate it.
>> >
>> > Also, the algorithm implementation for LLR is extremely performant, at
>> > least as of Mahout 0.10. I ran some tests for around 61 days of data
>> (which
>> > in our case is a fair amount) and the model was built in about 20
>> minutes,
>> > which is pretty amazing. This was using a pretty decent sized cluster,
>> > though.
>> >
>> > Thank you,
>> > Nikaash Puri
>> >
>> > On 29-Apr-2016, at 10:18 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> >
>> > There are some other changes I want to make for the next rev so I’ll do
>> > that.
>> >
>> > Nikaash, it would still be nice to verify this fixes your problem, also
>> if
>> > you want to create a Jira it will guarantee I don’t forget.
>> >
>> >
>> > On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>> wrote:
>> >
>> > yes -- i would do it as an optional option -- just like par does -- do
>> > nothing; try auto, or try exact number

Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-29 Thread Dmitriy Lyubimov
Nikaash,

yes unfortunately you may need to play with parallelism for your particular
load/cluster manually to get the best out of it. I guess Pat will be adding
the option.

On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <nikaashp...@gmail.com>
wrote:

> Hi,
>
> Sure, I’ll do some more detailed analysis of the jobs on the UI and share
> screenshots if possible.
>
> Pat, yup, I’ll only be able to get to this on Monday, though. I’ll comment
> out the line and see the difference in performance.
>
> Thanks so much for helping guys, I really appreciate it.
>
> Also, the algorithm implementation for LLR is extremely performant, at
> least as of Mahout 0.10. I ran some tests for around 61 days of data (which
> in our case is a fair amount) and the model was built in about 20 minutes,
> which is pretty amazing. This was using a pretty decent sized cluster,
> though.
>
> Thank you,
> Nikaash Puri
>
> On 29-Apr-2016, at 10:18 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> There are some other changes I want to make for the next rev so I’ll do
> that.
>
> Nikaash, it would still be nice to verify this fixes your problem, also if
> you want to create a Jira it will guarantee I don’t forget.
>
>
> On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> yes -- i would do it as an optional option -- just like par does -- do
> nothing; try auto, or try exact number of splits
>
> On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> It’s certainly easy to put this in the driver, taking it out of the algo.
>>
>> Dmitriy, is it a candidate for an Option param to the algo? That would
>> catch cases where people rely on it now (like my old DStream example) but
>> easily allow it to be overridden to None to imitate pre 0.11, or passed in
>> when the app knows better.
>>
>> Nikaash, are you in a position to comment out the .par(auto=true) and see
>> if it makes a difference?
>>
>>
>> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>> can you please look into spark UI and write down how many splits the job
>> generates in the first stage of the pipeline, or anywhere else there's
>> significant variation in # of splits in both cases?
>>
>> the row similarity is a very short pipeline (in comparison with what would
>> normally be on average). so only the first input re-splitting is critical.
>>
>> The splitting along the products is adjusted by optimizer automatically to
>> match the amount of data segments observed on average in the input(s).
>> e.g.
>> if you compute val C = A %*% B and A has 500 elements per split and B has
>> 5000 elements per split then C would approximately have 5000 elements per
>> split (the larger average in binary operator cases).  That's approximately
>> how it works.
>>
>> However, the par() that has been added, is messing with initial
>> parallelism
>> which would naturally affect the rest of pipeline per above. I now doubt
>> it
>> was a good thing -- when i suggested Pat to try this, i did not mean to
>> put
>> it _inside_ the algorithm itself, rather, into the accurate input
>> preparation code in his particular case. However, I don't think it will
>> work in any given case. Actually sweet spot parallelism for
>> multiplication
>> unfortunately depends on tons of factors -- network bandwidth and hardware
>> configuration, so it is difficult to give it a good guess universally.
>> More
>> likely, for cli-based prepackaged algorithms (I don't use CLI but rather
>> assemble pipelines in scala via scripting and scala application code) the
>> initial parallelization adjustment options should probably be provided to
>> CLI.
>>
>>
>
>
>


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-29 Thread Dmitriy Lyubimov
yes -- i would do it as an optional option -- just like par does -- do
nothing; try auto, or try exact number of splits

On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> It’s certainly easy to put this in the driver, taking it out of the algo.
>
> Dmitriy, is it a candidate for an Option param to the algo? That would
> catch cases where people rely on it now (like my old DStream example) but
> easily allow it to be overridden to None to imitate pre 0.11, or passed in
> when the app knows better.
>
> Nikaash, are you in a position to comment out the .par(auto=true) and see
> if it makes a difference?
>
>
> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> can you please look into spark UI and write down how many splits the job
> generates in the first stage of the pipeline, or anywhere else there's
> significant variation in # of splits in both cases?
>
> the row similarity is a very short pipeline (in comparison with what would
> normally be on average). so only the first input re-splitting is critical.
>
> The splitting along the products is adjusted by optimizer automatically to
> match the amount of data segments observed on average in the input(s). e.g.
> if you compute val C = A %*% B and A has 500 elements per split and B has
> 5000 elements per split then C would approximately have 5000 elements per
> split (the larger average in binary operator cases).  That's approximately
> how it works.
>
> However, the par() that has been added, is messing with initial parallelism
> which would naturally affect the rest of pipeline per above. I now doubt it
> was a good thing -- when i suggested Pat to try this, i did not mean to put
> it _inside_ the algorithm itself, rather, into the accurate input
> preparation code in his particular case. However, I don't think it will
> work in any given case. Actually sweet spot parallelism for multiplication
> unfortunately depends on tons of factors -- network bandwidth and hardware
> configuration, so it is difficult to give it a good guess universally. More
> likely, for cli-based prepackaged algorithms (I don't use CLI but rather
> assemble pipelines in scala via scripting and scala application code) the
> initial parallelization adjustment options should probably be provided to
> CLI.
>
>
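
The split-size rule quoted above ("the larger average in binary operator cases") can be checked with back-of-the-envelope arithmetic. A worked sketch with made-up element counts: the product inherits the larger of the operands' average elements-per-split, which in turn fixes its number of splits.

```shell
#!/bin/sh
# Worked sketch of the optimizer's split-size rule described above:
# C = A %*% B targets the larger of A's and B's average elements per
# split. All counts here are made-up illustration values.
elems_per_split_a=500
elems_per_split_b=5000

target=$elems_per_split_a
if [ "$elems_per_split_b" -gt "$target" ]; then
  target=$elems_per_split_b       # take the larger average
fi

total_elems_c=1000000             # assumed element count of the product C
splits_c=$(( (total_elems_c + target - 1) / target ))   # ceiling division
echo "target_elems_per_split=$target splits_c=$splits_c"
# prints: target_elems_per_split=5000 splits_c=200
```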


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-29 Thread Dmitriy Lyubimov
I was replying to Nikaash.

Sorry -- list keeps rejecting replies because of the size, i had to remove
the content

On Fri, Apr 29, 2016 at 9:05 AM, Khurrum Nasim <khurrum.na...@useitc.com>
wrote:

> Is that for me Dimitry ?
>
>
>
> > On Apr 29, 2016, at 11:53 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
> >
> > can you please look into spark UI and write down how many splits the job
> > generates in the first stage of the pipeline, or anywhere else there's
> > significant variation in # of splits in both cases?
> >
> > the row similarity is a very short pipeline (in comparison with what
> would
> > normally be on average). so only the first input re-splitting is
> critical.
> >
> > The splitting along the products is adjusted by optimizer automatically
> to
> > match the amount of data segments observed on average in the input(s).
> e.g.
> > if you compute val C = A %*% B and A has 500 elements per split and B
> has
> > 5000 elements per split then C would approximately have 5000 elements per
> > split (the larger average in binary operator cases).  That's
> approximately
> > how it works.
> >
> > However, the par() that has been added is messing with initial
> parallelism
> > which would naturally affect the rest of pipeline per above. I now doubt
> it
> > was a good thing -- when i suggested Pat to try this, i did not mean to
> put
> > it _inside_ the algorithm itself, rather, into the accurate input
> > preparation code in his particular case. However, I don't think it will
> > work in any given case. Actually sweet spot parallelism for
> multiplication
> > unfortunately depends on tons of factors -- network bandwidth and
> hardware
> > configuration, so it is difficult to give it a good guess universally.
> More
> > likely, for cli-based prepackaged algorithms (I don't use CLI but rather
> > assemble pipelines in scala via scripting and scala application code) the
> > initial parallelization adjustment options should probably be provided to
> > CLI.
>
>


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-29 Thread Dmitriy Lyubimov
can you please look into spark UI and write down how many split the job
generates in the first stage of the pipeline, or anywhere else there's
significant variation in # of splits in both cases?

the row similarity is a very short pipeline (in comparison with what would
normally be on average). so only the first input re-splitting is critical.

The splitting along the products is adjusted by optimizer automatically to
match the amount of data segments observed on average in the input(s). e.g.
if you compute val C = A %*% B and A has 500 elements per split and B has
5000 elements per split then C would approximately have 5000 elements per
split (the larger average in binary operator cases).  That's approximately
how it works.

However, the par() that has been added is messing with initial parallelism,
which would naturally affect the rest of the pipeline per above. I now doubt
it was a good thing -- when i suggested Pat try this, i did not mean to put
it _inside_ the algorithm itself, but rather into the input preparation
code in his particular case. However, I don't think it will
work in any given case. Actually, the sweet-spot parallelism for multiplication
unfortunately depends on tons of factors -- network bandwidth and hardware
configuration, so it is difficult to give it a good guess universally. More
likely, for cli-based prepackaged algorithms (I don't use CLI but rather
assemble pipelines in scala via scripting and scala application code) the
initial parallelization adjustment options should probably be provided to
CLI.
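The split-sizing rule described above can be sketched as a tiny standalone model. The helper name below is hypothetical -- the real logic lives in Mahout's Spark-bindings optimizer, and `par()` is the Samsara API for adjusting parallelism at input preparation:

```scala
object SplitHeuristicSketch {
  // Model of the rule above: for C = A %*% B, the optimizer sizes C's splits
  // to roughly the larger average element count per split of the two operands.
  def productElemsPerSplit(aPerSplit: Long, bPerSplit: Long): Long =
    math.max(aPerSplit, bPerSplit)
}
```

With the numbers from the message, `productElemsPerSplit(500L, 5000L)` comes out at 5000 elements per split for the product. In real Samsara code the recommendation above amounts to adjusting parallelism during input preparation, e.g. something like `drmA.par(min = n)`, rather than inside the algorithm itself.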


Re: Mahout contributions

2016-04-28 Thread Dmitriy Lyubimov
there might be a concept of a "contrib" subproject with a totally separate
code tree; some asf projects do that. that way it is easy to keep it around
if it turns out to be useful, and easy to strip off if it becomes
unsupported (sorry for pragmatic cynicism)

On Thu, Apr 28, 2016 at 2:48 PM, Khurrum Nasim 
wrote:

> I agree with Andrew.   Mahout should remain indigenous.
>
>
> Prakash - you may want to create your own project on github using the
> mahout library.
>
>
> > On Apr 28, 2016, at 5:43 PM, Andrew Palumbo  wrote:
> >
> > I don't think that this sort of integration work would be a good fit
> directly to the Mahout project.  Mahout is more about math, algorithms and
> an environment to develop algorithms.  We stay away from direct platform
> integration.  In the past we did have some elasticsearch/mahout integration
> work that is not in the code base for this exact reason.  I would suggest
> that better places to contribute something like this may be: PIO (
> https://prediction.io/), or even directly as a package for spark
> http://spark-packages.org/ .
> >
> > Projects integrating Mahout have recently been added to PIO:
> https://github.com/PredictionIO/template-scala-parallel-universal-recommendation
> .
> >
> > I think that the project that you are proposing would be a better fit
> there.
> >
> > Thanks,
> >
> > Andy
> >
> >
> > 
> > From: Saikat Kanjilal 
> > Sent: Thursday, April 28, 2016 1:50 PM
> > To: dev@mahout.apache.org
> > Subject: Re: Mahout contributions
> >
> > I want to start with social data as an example, for example data
> returned from FB graph API as well user Twitter data, will send some
> samples later if you're interested.
> >
> > Sent from my iPhone
> >
> >> On Apr 28, 2016, at 10:41 AM, Khurrum Nasim 
> wrote:
> >>
> >>
> >> What type of JSON payload size are we talking about here ?
> >>
> >>> On Apr 28, 2016, at 1:32 PM, Saikat Kanjilal 
> wrote:
> >>>
> >>> Because EL gives you visualization and non-Lucene-type query
> constructs as well and also that it already has a rest API that I plan on
> tying into mahout.  I plan on wrapping some of the clustering algorithms
> that I implement using Mahout and Spark as a service which can then make
> calls into other services (namely elasticsearch and neo4j graph service).
> >>>
> >>> Sent from my iPhone
> >>>
>  On Apr 28, 2016, at 10:22 AM, Khurrum Nasim 
> wrote:
> 
>  @Saikat- why use EL instead of Lucene directly.
> 
> 
> 
> > On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal 
> wrote:
> >
> > This is great information thank you, based on this recommendation I
> won't create a JIRA but start work on my project and when the code
> approaches the percentages you are describing I will create the appropriate
> JIRA's and put together a proposal to send to the list, sound ok?  Based on
> your latest updates to the wiki i will work on a handful of the clustering
> algorithms since I see that the Spark implementations for these are not yet
> complete.
> > Thank you again
> >
> >> From: ap@outlook.com
> >> To: dev@mahout.apache.org
> >> Subject: Re: Mahout contributions
> >> Date: Thu, 28 Apr 2016 01:31:09 +
> >>
> >> Saikat,
> >>
> >> One other thing that I should say is that you do not need clearance
> or input from the committers to begin work on your project, and the
> interest can and should come from the community as a whole. You can write
> proposal as you've done, and if you don't see any "+1"s or responses from
> the community as a whole within a few days, you may want to explain in more
> detail, give examples and use cases.  If you are still not seeing +1s or
> any responses from others then I think you can assume that there may not be
> interest; this is usually how things work.
> >>
> >> However, if it's something that you're passionate about and you feel
> like you can deliver this should not to stop you.  People do not always
> read the dev@ emails or have time to respond.  You can still move forward
> with your proposed contribution by following the steps laid out in my
> previous email; follow the protocol at:
> >>
> >> http://mahout.apache.org/developers/how-to-contribute.html
> >>
> >> and create a JIRA.  When you have reached a significant amount of
> completion (around 70-80%), open a PR for review; this way you can explain
> in more detail.
> >>
> >> But please realize that when you open a JIRA for a new issue there
> is some expectation of a commitment on your part to complete it.
> >>
> >> For example, I am currently investigating some new plotting
> features.  I have spent a good deal of time this week and last already and
> am even mocking up code as a sketch of what may become an 

Re: About reuters-fkmeans-centroids

2016-04-28 Thread Dmitriy Lyubimov
Prakash,

(1) to be clear, the ASF trademark and branding policy is not to endorse
views of 3rd party publications and to ask 3rd party writers to make a
disclosure that their views are not endorsed by the ASF project. To that end,
an ASF project can't really tell you that some publication is
"(in)appropriate". 3rd party publications are of their own account and
cannot be by default tied to the ASF views. That said, committers have
their opinions, which of course exhibit certain variation, and some things
do get linked on the site or mentioned on Twitter via Mahout account. But
some do not. Best practice is always to ask for pointers on the list first.

(2) I am not sure what your definition of "appropriate" is, but on personal
note, most of these links were quite "appropriate" at the time in the sense
that they were published prior to release 0.10 (February 2014 or earlier),
and therefore were describing what was in the project at that time. Thus,
MIA fuzzy k-means example in your very link is dated back of June 2011 and
is relevant to release circa 0.6 or 0.7. So if you mean whether those
algorithms were "in the fold" back then, the answer is yes, they were. I
see no contradiction between these publications and the current reality.

(3) If something deprecated reasonably works for a particular purpose, I
think there's no reason not to use/write about it.

*However, I just don't think most of these particular deprecated Java-based
MR algorithms work for the purposes of an established benchmark or a
standard in a research -- modern edgy ML is usually much more faster (and
often, more convenient too). *

Don't mean to come across as preachy, but research is usually held to quite
different standard as it comes to claims, than an ad-hoc industrial
application or a blog entry. I simply can't see how any of MR stuff can
work for that purpose today.

(4) if your "appropriate"-ness question is really about why they were
deprecated, well, there are two main reasons for that. First, it seems that
the realization of MR limitations w.r.t. iterative applications quickly
caught up with both users and contributors, and, second, most contributors
abandoned their MR contributions (most likely for the same reason). I
contributed a couple of MR algorithms back in 2010-2011 but i am absolutely
fine with them being deprecated and written off the books. If something is
not being used, or people (exactly as your case has demonstrated) don't get
answers to their questions, or bugs are not being fixed, it is difficult to
justify keeping the code. It is much easier to focus on what is actually
being used and maintained instead. Hence the very banal and boring reason
for the deprecations.

(5) Finally, if your goal is simply to learn "how the project works", just
like Suneel said, i'd suggest following the release notes and the project site
(news and howtos) -- your last link in fact should perhaps be your first.
And the list, of course.

As you probably can tell by release notes, the last two years were
practically exclusively about multiplatform Mahout involvement with Spark,
Flink and H2O backends, as well as the Samsara environment for general
numeric analysis (but no MR stuff beyond very nominal fixes).

I also agree that it looks like the Mahout site perhaps should be more
clear about the status of MR algorithms (it used to be more clear, I think,
but all news eventually becomes old news).

Hope this clarifies.

-d

On Thu, Apr 28, 2016 at 12:02 PM, Prakash Poudyal 
wrote:

> Hi!
>
> Thank you for your emails !!
>
> Actually, I  need to use fuzzy clustering to cluster the sentence in my
> research. This is my goal.
>
> I started using Fuzzy K-Means clustering in Mahout last week !!! I
> found several blog links, and many other helpful documents I was
> going through; being new, I realized this was the best, easiest and fastest
> way to learn how Mahout works. In my opinion, many newcomers do the same as
> I do. Only after getting used to the tools do people focus on the work and
> go deeply.
>
> I had gone through many blogs and sites to know about Mahout, some of them
> are below :
>
> http://technobium.com/introduction-to-clustering-using-apache-mahout/
>
> http://tuxdna.github.io/pages/mahout.html
>
>
> https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/FuzzyKMeansExample.java
>
> http://www.programering.com/a/MDNwgTMwATI.html
>
>
> https://www.safaribooksonline.com/library/view/apache-mahout-clustering/9781783284436/ch04.html
>
> https://ymnliu.wordpress.com/2015/11/05/install-apache-mahout-in-eclipse/
>
> https://mahout.apache.org/
>
> What do you say about these sites !! Are these sites not appropriate
> ???
>
> I raised my problem several times, on the mailing list and even IRC, but I
> only got a response today :(
>
> So finally, it would be great if you could answer my
> following question.
>
> Is Apache Mahout appropriate tool for clustering 

Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-28 Thread Dmitriy Lyubimov
(sorry for repetition, the list rejects my previous replies due to quoted
message size)

"Auto" just reclusters the input per the given _configured cluster capacity_
(there's some safeguard there, i think, that avoids blowing up the # of
splits when the initial number of splits is ridiculously small -- e.g. it
won't recluster a 2-split problem into a 300-split problem).

For some algorithms, this is appropriate.

For others such as mmul-bound (A'B) problems, there's a "sweet spot" that i
mentioned, due to I/O bandwidth being a function of the parallelism -- which
technically doesn't have anything to do with available cluster capacity. It
is possible that if you do A.par(auto=true).t %*% B.par(auto=true) then you
get worse performance with a 500-task cluster than on a 60-task cluster
(depending on the size of operands and product).


> On Thu, Apr 28, 2016 at 11:55 AM, Pat Ferrel 
> wrote:
>
>> Actually, on your advice, Dmitriy, I think these changes went in around 0.11.
>> Before 0.11, par was not called. Any clue here?
>>
>> This was in relation to that issue when reading a huge number of part
>> files created by Spark Streaming, which probably trickled down to cause too
>> much parallelization. The auto=true fixed this issue for me but did it have
>> other effects?
>>
>>
>>
>>


Re: About reuters-fkmeans-centroids

2016-04-28 Thread Dmitriy Lyubimov
Prakash,

if you are using any Mahout Mapreduce algorithm for research, please make
sure to make this disclosure:

all Mahout MapReduce algorithms have been officially unsupported and deprecated
since February 2014 (IIRC). I can dig up the specific issue regarding this.
There has also been an announcement.

So before you really start drawing any comparisons, please be advised that
you are starting with algorithms 2+ years past their EOL (let alone their
inception).

Thanks.
-D

On Thu, Apr 28, 2016 at 11:05 AM, Prakash Poudyal 
wrote:

> Hi! Ted,
>
> You mean Mahout no longer supports "fuzzy K clustering for the
> sentences"? Can you clarify in more detail? :(
>
> Prakash
>
> On Thu, Apr 28, 2016 at 6:58 PM, Ted Dunning 
> wrote:
>
> > On Thu, Apr 28, 2016 at 10:54 AM, Prakash Poudyal <
> > prakashpoud...@gmail.com>
> > wrote:
> >
> > > Actually, I need to use fuzzy clustering to cluster the sentence in my
> > > research. I found  fuzzy k clustering algorithm in Apache Mahout,
> thus, I
> > > am trying to use it for my purpose.
> > >
> >
> > That's great.
> >
> > But that code is no longer supported.
> >
>
>
>
> --
>
> Regards
> Prakash Poudyal
>


Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

2016-04-27 Thread Dmitriy Lyubimov
0.11 targets 1.3+.

I don't quite have anything on top of my head affecting A'B specifically,
but i think there were some changes affecting in-memory multiplication
(which is of course used in distributed A'B).

I am not particularly familiar with, nor do i remember, the details of row
similarity off the top of my head; i really wish the original contributor would comment on
that. trying to see if i can come up with anything useful though.

what behavior do you see in this job -- cpu-bound or i/o bound?

there are a few pointers to look at:

(1)  I/O many times exceeds the input size, so spills are inevitable. So
tuning memory sizes and looking at spark spill locations to make sure disks
are not slow there is critical. Also, i think in spark 1.6 spark added a
lot of flexibility in managing task/cache/shuffle memory sizes, it may help
in some unexpected way.

(2) sufficient cache: many pipelines commit reused matrices into cache
(MEMORY_ONLY) which is the default mahout algebra behavior, assuming there
is enough cache memory there for only good things to happen. if it is not,
however, it will cause recomputation of results that were evicted. (not
saying it is a known case for row similarity in particular). make sure this
is not the case. For cases of scatter type exchanges it is especially super
bad.

(3) A'B -- try to hack and play with the implementation there in the AtB (spark
side) class. See if you can come up with a better arrangement.

(4) in-memory computations (MMul class) if that's the bottleneck can be in
practice quick-hacked with mutlithreaded multiplication and bridge to
native solvers (netlib-java) at least for dense cases. this is found to
improve performance of distributed multiplications a bit. Works best if you
get 2 threads in the backend and all threads in the front end.

There are other known things that can improve the multiplication speed of the
public mahout version, i hope mahout will improve on those in the future.
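Pointer (4) can be illustrated with a minimal standalone sketch of row-parallel dense multiplication. This is plain Scala over Java parallel streams, not Mahout code -- the real hook point would be Mahout's MMul class and/or a netlib-java bridge:

```scala
object ParallelMmulSketch {
  // Row-parallel dense multiply: each row of the product is computed
  // independently, so the outer loop parallelizes trivially across cores.
  def mmul(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    val n = a.length; val k = b.length; val m = b(0).length
    val c = Array.ofDim[Double](n, m)
    java.util.stream.IntStream.range(0, n).parallel().forEach { i =>
      var p = 0
      while (p < k) {                  // i-p-j loop order: streams over rows of b
        val aip = a(i)(p); val bRow = b(p)
        var j = 0
        while (j < m) { c(i)(j) += aip * bRow(j); j += 1 }
        p += 1
      }
    }
    c
  }
}
```

Since rows of the result never share writes, no synchronization is needed; a production version would also want blocking for cache locality and the dense/sparse dispatch that MMul does.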

-d

On Wed, Apr 27, 2016 at 6:14 AM, Nikaash Puri  wrote:

> Hi,
>
> I’ve been working with LLR in Mahout for a while now. Mostly using the
> SimilarityAnalysis.cooccurenceIDss function. I recently upgraded the Mahout
> libraries to 0.11, and subsequently also tried with 0.12 and the same
> program is running orders of magnitude slower (at least 3x based on initial
> analysis).
>
> Looking into the tasks more carefully, comparing 0.10 and 0.11 shows that
> the amount of Shuffle being done in 0.11 is significantly higher,
> especially in the AtB step. This could possibly be a reason for the
> reduction in performance.
>
> Although, I am working on Spark 1.2.0. So, it's possible that this could be
> causing the problem. It works fine with Mahout 0.10.
>
> Any ideas why this might be happening?
>
> Thank you,
> Nikaash Puri


Re: Congratulations to our new Chair

2016-04-20 Thread Dmitriy Lyubimov
congrats!

On Wed, Apr 20, 2016 at 4:55 PM, Suneel Marthi  wrote:

> Please join me in congratulating Andrew Palumbo on becoming our new Project
> Chair.
>
> As for me, it was a pleasure to serve as Chair starting with the Mahout
> 0.10.0 release and ending with the recent 0.12.0 release, and perhaps we
> will do it again someday.
>
> ​Congrats again, Andy!​
>

