The last thing I want to do is to overcomplicate things, though.

On Tue, Sep 5, 2017 at 4:02 PM, Andrew Palumbo <ap....@outlook.com> wrote:

> Agreed re: more broad thinking, yes - just getting the conversation
> started. Thanks.
>
> ________________________________
> From: Dmitriy Lyubimov <dlie...@gmail.com>
> Sent: Tuesday, September 5, 2017 6:06:35 PM
> To: dev@mahout.apache.org
> Subject: Re: [DISCUSS} New feature - DRM and in-core matrix sort and
> required test suites for modules.
>
> PS Technically, some "flavor" of the dataset can still be attributed and
> passed on in the pipeline; e.g., that's what I do with the partitioning
> kind. If another operator messes that flavor up, this gets noted in the
> carry-over property (that's how the optimizer knows whether the operands
> of a binary logical operator are coming in identically partitioned or
> not, for example). A similar thing can be done with a "sorted-ness"
> flavor and tracked around, and operators that break sorted-ness would
> note that on the tree nodes as well. But that only makes sense if we
> have "consumer" operators that care about sortedness, of which we have
> none at the moment (it's possible that we will, perhaps).
> I am just saying this problem may benefit from some broader thinking of
> the issue in the optimization-tree sense, i.e., why we do it, which
> things will use it, and which things will preserve or mess it up, etc.
>
> On Tue, Sep 5, 2017 at 3:01 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>> In general, +1; I don't see why not.
>>
>> Q -- is it something that you have encountered while doing algebra?
>> I.e., do you need the sorted DRM to continue algebraic operations
>> between optimizer barriers, or do you just need an RDD as the outcome
>> of all this?
>>
>> If it is just an RDD, then you could just do a Spark-supported sort;
>> that's why we have the drm.rdd barrier (Spark-specific). Barrier out to
>> a Spark RDD and then continue doing whatever Spark already supports.
>>
>> Another potential issue is that matrices do not generally imply
>> ordering in the formation of intermediate products. I.e., inside the
>> optimizer, you might build a pipeline that implies an ordered RDD in
>> the Spark sense, but there is no algebraic operator consuming sorted
>> RDDs, and no operator that guarantees preserving the ordering (even if
>> it is just a checkpoint). This may create ambiguities as more rewriting
>> rules are added. This is not a major concern, though.
>>
>> On Tue, Sep 5, 2017 at 2:24 PM, Trevor Grant <trevor.d.gr...@gmail.com>
>> wrote:
>>
>>> Ever since we moved Flink to its own profile, I have been thinking we
>>> ought to do the same to H2O, but I haven't been too motivated because
>>> it was never causing anyone any problems.
>>>
>>> Maybe it's time to drop H2O "official support" and move Flink Batch /
>>> H2O into a "mahout/community/engines" folder.
>>>
>>> I've been doing a lot of Flink Streaming the last couple of weeks and
>>> have already bootlegged a few of the "Algorithms" into Flink. I'm
>>> pretty sure we could support those easily, and I _think_ we could do
>>> the same with the distributed ones (e.g. wrap a
>>> DataStream[(Key, MahoutVector)] and implement the Operators on that).
>>>
>>> I'd put Flink Streaming in as another community engine.
>>>
>>> If we did that, I'd say that, by convention, we need a Markdown
>>> document in mahout/community/engines that has a table of what is
>>> implemented on what.
>>>
>>> That is to say, even if we were only able to implement the "algos" on
>>> Flink Streaming, there would still be a lot of value in that for many
>>> applications (especially considering the state of FlinkML). It also
>>> beats having a half-cooked engine sitting on a feature branch.
>>>
>>> Beam does something similar to that for their various engines.
>>>
>>> Speaking of Beam, I've heard rumblings here and there of people
>>> talking about making a Beam engine - this might motivate people to get
>>> started (no one person feels responsible for "boiling the ocean" and
>>> throwing down an entire engine in one go, but instead can hack out the
>>> portions they need).
>>>
>>> My .02
>>>
>>> tg
>>>
>>> On Tue, Sep 5, 2017 at 4:04 PM, Andrew Palumbo <ap....@outlook.com>
>>> wrote:
>>>
>>>> I've found a need for sorting a DRM as well as in-core matrices,
>>>> something like, e.g., DrmLike.sortByColumn(...). I would like to
>>>> implement this at the engine-neutral math-scala level, with
>>>> pass-through functions to the underlying back ends.
>>>>
>>>> In-core would be engine-neutral by current design (in-core matrices
>>>> are all Mahout matrices, with the exception of h2o, which causes some
>>>> concern).
>>>>
>>>> For Spark, we can use RDD.sortBy(...).
>>>>
>>>> For Flink, we can use DataSet.sortPartition(...).setParallelism(1).
>>>> (There may be a better method; I will look deeper.)
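The row-sort semantics discussed above can be illustrated with a minimal Scala sketch, using plain collections in place of a Spark RDD. The name `sortByColumn` and the `(key, vector)` row layout are assumptions for illustration, not existing Mahout API; on Spark the same ordering would come from `RDD.sortBy(_._2(col))`.

```scala
// Hedged sketch: engine-agnostic "sort a matrix by column" semantics,
// with a Seq standing in for a distributed row collection.
object SortByColumnSketch {
  // A row is a (row key, dense row vector) pair -- hypothetical layout.
  type Row = (Int, Vector[Double])

  // Order rows by the value in column `col`; this mirrors what
  // RDD.sortBy(_._2(col)) would produce on the Spark back end.
  def sortByColumn(rows: Seq[Row], col: Int): Seq[Row] =
    rows.sortBy { case (_, v) => v(col) }

  def main(args: Array[String]): Unit = {
    val m = Seq(
      (0, Vector(3.0, 9.0)),
      (1, Vector(1.0, 5.0)),
      (2, Vector(2.0, 7.0)))
    // Row keys reordered by the values in column 0 (1.0 < 2.0 < 3.0).
    println(sortByColumn(m, 0).map(_._1).mkString(","))  // prints 1,2,0
  }
}
```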
>>>>
>>>> h2o has an implementation, I'm sure, but this brings me to a more
>>>> important point: if we want to stub out a method in a back-end
>>>> module, e.g. h2o, which test suites do we want to make a requirement?
>>>>
>>>> We've not set any specific rules for which test suites must pass for
>>>> each module. We've had a soft requirement of inheriting and passing
>>>> all test suites from math-scala.
>>>>
>>>> Setting a rule for this is something that we need to do, IMO.
>>>>
>>>> An easy option, I'm thinking, would be to set the current core
>>>> math-scala suites as a requirement, and then allow for an optional
>>>> suite for methods which will be stubbed out.
>>>>
>>>> Thoughts?
>>>>
>>>> --andy
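The required-plus-optional suite convention proposed in the thread could be sketched roughly as below. All trait and class names here are hypothetical stand-ins (real Mahout modules use ScalaTest shared suites); the point is only the shape: one core suite every engine inherits, and an optional suite that an engine with a stubbed method skips rather than fails.

```scala
// Hedged sketch of the proposed test-suite convention, not Mahout code.
trait CoreMathSuite {
  def engineName: String
  // Required for every back end: a stand-in core check.
  def coreCheck(): Boolean = List(3, 1, 2).sum == 6
}

trait OptionalSortSuite { self: CoreMathSuite =>
  def sortSupported: Boolean
  // Engines that stub sort out skip this suite (None) instead of failing.
  def sortCheck(): Option[Boolean] =
    if (sortSupported) Some(List(3, 1, 2).sorted == List(1, 2, 3))
    else None
}

// Hypothetical engine modules opting in and out of the optional suite.
class SparkSuite extends CoreMathSuite with OptionalSortSuite {
  val engineName = "spark"
  val sortSupported = true
}

class H2OSuite extends CoreMathSuite with OptionalSortSuite {
  val engineName = "h2o"
  val sortSupported = false // sort stubbed out; optional suite is skipped
}
```

Under this convention, a stubbed method never turns the module's build red, while the core math-scala contract stays mandatory for every engine.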