The last thing I want to do is to overcomplicate things, though.

On Tue, Sep 5, 2017 at 4:02 PM, Andrew Palumbo <ap....@outlook.com> wrote:

> Agreed re: more broad thinking, yes - just getting the conversation
> started. Thanks.
>
> ________________________________
> From: Dmitriy Lyubimov <dlie...@gmail.com>
> Sent: Tuesday, September 5, 2017 6:06:35 PM
> To: dev@mahout.apache.org
> Subject: Re: [DISCUSS} New feature - DRM and in-core matrix sort and
> required test suites for modules.
>
> PS Technically, some "flavor" of the dataset can still be attributed and
> passed on in the pipeline; e.g., that's what I do with the partitioning
> kind. If another operator messes that flavor up, this gets noted in the
> carry-over property (that's how the optimizer knows whether the operands
> of a binary logical operator are coming in identically partitioned or
> not, for example). A similar thing can be done with a "sorted-ness"
> flavor and tracked around, and operators that break sorted-ness would
> note that on the tree nodes as well. But that only makes sense if we
> have "consumer" operators that care about sortedness, of which we have
> none at the moment (it's possible that we will, perhaps).
> I am just saying this problem may benefit from some broader thinking of
> the issue in the optimization-tree sense, i.e., why we do it, which
> things will use it, and which things will preserve or mess it up, etc.
>
> On Tue, Sep 5, 2017 at 3:01 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
>> In general, +1; I don't see why not.
>>
>> Q -- is it something that you have encountered while doing algebra?
>> I.e., do you need the sorted DRM to continue algebraic operations
>> between optimizer barriers, or do you just need an RDD as the outcome
>> of all this?
>>
>> If it is just an RDD, then you could just do a Spark-supported sort;
>> that's why we have the drm.rdd barrier (Spark-specific). Barrier out to
>> a Spark RDD and then continue doing whatever Spark already supports.
>>
>> Another potential issue is that matrices do not generally imply
>> ordering in the formation of intermediate products. I.e., inside the
>> optimizer, you might build a pipeline that implies an ordered RDD in
>> the Spark sense, but there is no algebraic operator consuming sorted
>> RDDs, and no operator that guarantees preserving the ordering (even if
>> it is just a checkpoint). This may create ambiguities as more rewriting
>> rules are added. This is not a major concern, though.
>>
>> On Tue, Sep 5, 2017 at 2:24 PM, Trevor Grant <trevor.d.gr...@gmail.com>
>> wrote:
>>
>>> Ever since we moved Flink to its own profile, I have been thinking we
>>> ought to do the same to H2O, but I haven't been too motivated because
>>> it was never causing anyone any problems.
>>>
>>> Maybe it's time to drop H2O "official support" and move Flink Batch /
>>> H2O into a "mahout/community/engines" folder.
>>>
>>> I've been doing a lot of Flink Streaming the last couple of weeks and
>>> have already bootlegged a few of the "Algorithms" into Flink. I'm
>>> pretty sure we could support those easily, and I _think_ we could do
>>> the same with the distributed ones (e.g. wrap a
>>> DataStream[(Key, MahoutVector)] and implement the Operators on that).
>>>
>>> I'd put Flink Streaming in as another community engine.
>>>
>>> If we did that, I'd say that, by convention, we need a Markdown
>>> document in mahout/community/engines that has a table of what is
>>> implemented on what.
>>>
>>> That is to say, even if we were only able to implement the "algos" on
>>> Flink Streaming, there would still be a lot of value in that for many
>>> applications (especially considering the state of FlinkML). It also
>>> beats having a half-cooked engine sitting on a feature branch.
>>>
>>> Beam does something similar to that for their various engines.
>>>
>>> Speaking of Beam, I've heard rumblings here and there of people
>>> talking about making a Beam engine - this might motivate people to get
>>> started (no one person feels responsible for "boiling the ocean" and
>>> throwing down an entire engine in one go, but instead can hack out the
>>> portions they need).
>>>
>>> My .02
>>>
>>> tg
>>>
>>> On Tue, Sep 5, 2017 at 4:04 PM, Andrew Palumbo <ap....@outlook.com>
>>> wrote:
>>>
>>>> I've found a need for sorting a DRM as well as in-core matrices,
>>>> something like, e.g., DrmLike.sortByColumn(...). I would like to
>>>> implement this at the engine-neutral math-scala level, with
>>>> pass-through functions to the underlying back ends.
>>>>
>>>> In-core would be engine-neutral by current design (in-core matrices
>>>> are all Mahout matrices, with the exception of h2o, which causes some
>>>> concern).
>>>>
>>>> For Spark, we can use RDD.sortBy(...).
>>>>
>>>> For Flink, we can use DataSet.sortPartition(...).setParallelism(1).
>>>> (There may be a better method; I will look deeper.)
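The row-sort semantics discussed above can be illustrated with a minimal Scala sketch, using plain collections in place of a Spark RDD. The name `sortByColumn` and the `(key, vector)` row layout are assumptions for illustration, not existing Mahout API; on Spark the same ordering would come from `RDD.sortBy(_._2(col))`.

```scala
// Hedged sketch: engine-agnostic "sort a matrix by column" semantics,
// with a Seq standing in for a distributed row collection.
object SortByColumnSketch {
  // A row is a (row key, dense row vector) pair -- hypothetical layout.
  type Row = (Int, Vector[Double])

  // Order rows by the value in column `col`; this mirrors what
  // RDD.sortBy(_._2(col)) would produce on the Spark back end.
  def sortByColumn(rows: Seq[Row], col: Int): Seq[Row] =
    rows.sortBy { case (_, v) => v(col) }

  def main(args: Array[String]): Unit = {
    val m = Seq(
      (0, Vector(3.0, 9.0)),
      (1, Vector(1.0, 5.0)),
      (2, Vector(2.0, 7.0)))
    // Row keys reordered by the values in column 0 (1.0 < 2.0 < 3.0).
    println(sortByColumn(m, 0).map(_._1).mkString(","))  // prints 1,2,0
  }
}
```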
>>>>
>>>> h2o has an implementation, I'm sure, but this brings me to a more
>>>> important point: if we want to stub out a method in a back-end
>>>> module, e.g. h2o, which test suites do we want to make a requirement?
>>>>
>>>> We've not set any specific rules for which test suites must pass for
>>>> each module. We've had a soft requirement of inheriting and passing
>>>> all test suites from math-scala.
>>>>
>>>> Setting a rule for this is something that we need to do, IMO.
>>>>
>>>> An easy option, I'm thinking, would be to set the current core
>>>> math-scala suites as a requirement, and then allow for an optional
>>>> suite for methods which will be stubbed out.
>>>>
>>>> Thoughts?
>>>>
>>>> --andy
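The required-plus-optional suite convention proposed in the thread could be sketched roughly as below. All trait and class names here are hypothetical stand-ins (real Mahout modules use ScalaTest shared suites); the point is only the shape: one core suite every engine inherits, and an optional suite that an engine with a stubbed method skips rather than fails.

```scala
// Hedged sketch of the proposed test-suite convention, not Mahout code.
trait CoreMathSuite {
  def engineName: String
  // Required for every back end: a stand-in core check.
  def coreCheck(): Boolean = List(3, 1, 2).sum == 6
}

trait OptionalSortSuite { self: CoreMathSuite =>
  def sortSupported: Boolean
  // Engines that stub sort out skip this suite (None) instead of failing.
  def sortCheck(): Option[Boolean] =
    if (sortSupported) Some(List(3, 1, 2).sorted == List(1, 2, 3))
    else None
}

// Hypothetical engine modules opting in and out of the optional suite.
class SparkSuite extends CoreMathSuite with OptionalSortSuite {
  val engineName = "spark"
  val sortSupported = true
}

class H2OSuite extends CoreMathSuite with OptionalSortSuite {
  val engineName = "h2o"
  val sortSupported = false // sort stubbed out; optional suite is skipped
}
```

Under this convention, a stubbed method never turns the module's build red, while the core math-scala contract stays mandatory for every engine.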