In general, +1, don't see why not.

Q -- is it something that you have encountered while doing algebra? I.e.,
do you need the sorted DRM to continue algebraic operations between
optimizer barriers, or do you just need an RDD as the outcome of all this?

If it is just an RDD, then you could just do a Spark-supported sort; that's
why we have a drm.rdd barrier (Spark-specific). Barrier out to a Spark RDD
and then continue doing whatever Spark already supports.
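To make the barrier-out-and-sort idea concrete, here is a minimal sketch of the key-sort semantics, using plain Scala collections as a stand-in for an RDD. The `drm.checkpoint().rdd.sortBy(...)` pipeline in the comment is an assumption about how the Spark bindings would be used, not verified code:

```scala
// Hedged sketch: the semantics of "barrier out to a Spark RDD and use the
// Spark-native sort", illustrated with plain Scala collections.
// In real Mahout/Spark code this would look roughly like (hypothetical):
//   val sorted = drm.checkpoint().rdd.sortBy(_._1)
object SortSketch {
  // Stand-in for an RDD[(Int, Vector)]: keyed rows in arbitrary order.
  val rows: Seq[(Int, Seq[Double])] = Seq(
    (2, Seq(2.0, 0.0)),
    (0, Seq(0.0, 1.0)),
    (1, Seq(1.0, 3.0))
  )

  // Equivalent of rdd.sortBy(_._1): order rows by their integer key.
  def sortedByKey: Seq[(Int, Seq[Double])] = rows.sortBy(_._1)

  def main(args: Array[String]): Unit = {
    println(sortedByKey.map(_._1).mkString(","))
  }
}
```

The point is only that once you are out of the optimizer, ordering is the engine's problem, handled by whatever sort primitive the engine already ships.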

Another potential issue is that matrices do not generally imply ordering or
formation of intermediate products. I.e., inside the optimizer you might
build a pipeline that implies an ordered RDD in the Spark sense, but there
is no algebraic operator that consumes sorted RDDs, and no operator that
guarantees preserving the order (even if it is just a checkpoint). This may
create ambiguities as more rewriting rules are added. This is not a major
concern, though.
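For the in-core side of the sortByColumn idea in the quoted mail below, here is a hedged sketch of the intended semantics. It uses a plain Seq-of-rows stand-in for a matrix, and the `sortByColumn` name comes from the proposal; Mahout's actual Matrix API is richer than this:

```scala
// Hedged sketch: what a DrmLike.sortByColumn(col)-style operation would
// mean semantically. A Seq of rows stands in for an in-core matrix.
object SortByColumnSketch {
  type Row = Seq[Double]

  // Reorder the rows of a matrix by ascending value in column `col`.
  def sortByColumn(m: Seq[Row], col: Int): Seq[Row] = m.sortBy(_(col))

  val m: Seq[Row] = Seq(
    Seq(3.0, 9.0),
    Seq(1.0, 7.0),
    Seq(2.0, 8.0)
  )

  def main(args: Array[String]): Unit = {
    println(sortByColumn(m, 0).map(_.head).mkString(","))
  }
}
```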

On Tue, Sep 5, 2017 at 2:24 PM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> Ever since we moved Flink to its own profile, I have been thinking we ought
> to do the same for H2O, but haven't been too motivated because it was never
> causing anyone any problems.
>
> Maybe it's time to drop H2O "official support" and move Flink Batch / H2O
> into a "mahout/community/engines" folder.
>
> I've been doing a lot of Flink Streaming the last couple of weeks and have
> already bootlegged a few of the "Algorithms" into Flink.  Pretty sure we
> could support those easily, and I _think_ we could do the same with the
> distributed operators (e.g. wrap a DataStream[(Key, MahoutVector)] and
> implement the operators on that).
>
> I'd put FlinkStreaming as another community engine.
>
> If we did that, I'd say by convention we need a Markdown document in
> mahout/community/engines that has a table of what is implemented on what.
>
> That is to say, even if we were only able to implement the "algos" on Flink
> Streaming, there would still be a lot of value in that for many
> applications (especially considering the state of FlinkML).  It also beats
> having a half-cooked engine sitting on a feature branch.
>
> Beam does something similar to that for their various engines.
>
> Speaking of Beam, I've heard rumblings here and there of people talking
> about making a Beam engine; this might motivate people to get started (no
> one person feels responsible for "boiling the ocean" and throwing down an
> entire engine in one go, but instead can hack out the portions they need).
>
>
> My .02
>
> tg
>
> On Tue, Sep 5, 2017 at 4:04 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>
> > I've found a need for sorting a DRM as well as in-core matrices,
> > something like e.g. DrmLike.sortByColumn(...). I would like to implement
> > this at the math-scala, engine-neutral level with pass-through functions
> > to the underlying back ends.
> >
> >
> > In-core would be engine-neutral by current design (in-core matrices are
> > all Mahout matrices, with the exception of h2o, which causes some
> > concern).
> >
> >
> > For Spark, we can use RDD.sortBy(...).
> >
> >
> > For Flink, we can use DataSet.sortPartition(...).setParallelism(1).
> > (There may be a better method; I will look deeper.)
> >
> >
> > h2o has an implementation, I'm sure, but this brings me to a more
> > important point: if we want to stub out a method in a back-end module,
> > e.g. h2o, which test suites do we want to make requirements?
> >
> >
> > We've not set any specific rules for which test suites must pass for each
> > module. We've had a soft requirement of inheriting and passing all test
> > suites from math-scala.
> >
> >
> > Setting a rule for this is something that we need to do, IMO.
> >
> >
> > An easy option, I'm thinking, would be to set the current core
> > math-scala suites as a requirement, and then allow an optional suite
> > for methods which will be stubbed out.
> >
> >
> > Thoughts?
> >
> >
> > --andy
> >
> >
> >
>
