> PS Technically, some "flavor" of the dataset can still be attributed and
> passed on in the pipeline; e.g., that's what I do with partitioning kind.
> If another operator messes that flavor up, this gets noted in the
> carry-over property (that's how the optimizer knows whether the operands
> of a binary logical operator are coming in identically partitioned or
> not, for example). A similar thing could be done for a "sorted-ness"
> flavor, tracking it around the pipeline; operators that break sorted-ness
> would note that on the tree nodes as well. But that only makes sense if
> we have "consumer" operators that care about sorted-ness, of which we
> have none at the moment (it's possible that we will, perhaps). I am just
> saying this problem may benefit from some broader thinking in an
> optimization-tree sense, i.e., why we do it, which things will use it,
> and which things will preserve or mess it up, etc.
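The flavor carry-over idea described above can be sketched in plain Scala. This is only an illustration of the concept: names like `SortFlavor`, `OpNode`, and `propagate` are invented for this sketch and are not actual Mahout optimizer classes.

```scala
// Hypothetical sketch of flavor carry-over on logical tree nodes.
// None of these names exist in Mahout; they only illustrate the idea
// of a property that operators either preserve or invalidate.
sealed trait SortFlavor
case object Unsorted extends SortFlavor
case class SortedBy(column: Int) extends SortFlavor

// A logical operator notes whether it preserves the incoming flavor.
case class OpNode(name: String, preservesSort: Boolean)

// Propagate the flavor down a pipeline: any operator that does not
// preserve sorted-ness resets the carry-over property to Unsorted.
def propagate(flavor: SortFlavor, ops: List[OpNode]): SortFlavor =
  ops.foldLeft(flavor) { (f, op) => if (op.preservesSort) f else Unsorted }

object FlavorDemo extends App {
  val pipeline = List(
    OpNode("elementwise", preservesSort = true),
    OpNode("repartition", preservesSort = false), // breaks sorted-ness
    OpNode("scale", preservesSort = true))
  // The repartition node wiped the flavor, so the result is Unsorted.
  println(propagate(SortedBy(0), pipeline))
}
```

A "consumer" operator that needs sorted input would then simply inspect the incoming flavor on its operand node, exactly as binary operators check partitioning today.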
Agreed re: more broad thinking, yes -- just getting the conversation
started. Thanks.

________________________________
From: Dmitriy Lyubimov <dlie...@gmail.com>
Sent: Tuesday, September 5, 2017 6:06:35 PM
To: dev@mahout.apache.org
Subject: Re: [DISCUSS] New feature - DRM and in-core matrix sort and required test suites for modules.

On Tue, Sep 5, 2017 at 3:01 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> In general, +1, don't see why not.
>
> Q -- is it something that you have encountered while doing algebra? I.e.,
> do you need the sorted DRM to continue algebraic operations between
> optimizer barriers, or do you just need an RDD as the outcome of all
> this?
>
> If it is just an RDD, then you could just do a Spark-supported sort;
> that's why we have the drm.rdd barrier (Spark-specific). Barrier out to a
> Spark RDD and then continue doing whatever Spark already supports.
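The "barrier out, then sort" suggestion can be modeled in plain Scala without a Spark runtime. In this sketch a `Seq` of (key, vector) pairs stands in for the row-wise RDD that the drm.rdd barrier mentioned above would expose; `sortRowsByColumn` is an invented helper, not a Mahout or Spark API.

```scala
// Plain-Scala model of "barrier out to an RDD, then sort".
// In real Mahout Spark bindings one would obtain the backing RDD via the
// drm.rdd barrier and then use Spark's own sortBy; here a local Seq of
// (rowKey, rowVector) pairs stands in for that RDD.
type Row = (Long, Array[Double])

// Sort rows by the value in a given column, as a Spark-side
// rdd.sortBy(_._2(col)) would after the barrier.
def sortRowsByColumn(rows: Seq[Row], col: Int): Seq[Row] =
  rows.sortBy { case (_, vec) => vec(col) }

object BarrierSortDemo extends App {
  val rows: Seq[Row] = Seq(
    (0L, Array(3.0, 1.0)),
    (1L, Array(1.0, 2.0)),
    (2L, Array(2.0, 0.0)))
  // After the "barrier", ordinary engine-level sorting applies;
  // this prints the row keys in ascending column-0 order.
  println(sortRowsByColumn(rows, 0).map(_._1))
}
```

The point of the barrier is exactly this separation: once outside the optimizer, the result is an ordinary engine dataset and no algebraic operator needs to know about its ordering.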
> Another potential issue is that matrices do not generally imply ordering
> or formation of intermediate products; i.e., inside the optimizer you
> might build a pipeline that implies an ordered RDD in the Spark sense,
> but there is no algebraic operator consuming sorted RDDs, and no operator
> that guarantees preserving it (even if it is just a checkpoint). This may
> create ambiguities as more rewriting rules are added. This is not a major
> concern.
>
> On Tue, Sep 5, 2017 at 2:24 PM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
>> Ever since we moved Flink to its own profile, I have been thinking we
>> ought to do the same to H2O, but haven't been too motivated because it
>> was never causing anyone any problems.
>>
>> Maybe it's time to drop H2O "official support" and move Flink Batch /
>> H2O into a "mahout/community/engines" folder.
>>
>> I've been doing a lot of Flink Streaming the last couple of weeks and
>> have already bootlegged a few of the "Algorithms" into Flink. Pretty
>> sure we could support those easily -- and I _think_ we could do the same
>> with the distributed ones (e.g., wrap a DataStream[(Key, MahoutVector)]
>> and implement the Operators on that).
>>
>> I'd put Flink Streaming as another community engine.
>>
>> If we did that, I'd say by convention we need a Markdown document in
>> mahout/community/engines that has a table of what is implemented on
>> what.
>>
>> That is to say, even if we were only able to implement the "algos" on
>> Flink Streaming, there would still be a lot of value in that for many
>> applications (esp. considering the state of FlinkML). It also beats
>> having a half-cooked engine sitting on a feature branch.
>>
>> Beam does something similar to that for their various engines.
>> Speaking of Beam, I've heard rumblings here and there of people talking
>> about making a Beam engine -- this might motivate people to get started
>> (no one person feels responsible for "boiling the ocean" and throwing
>> down an entire engine in one go, but can instead hack out the portions
>> they need).
>>
>> My .02
>>
>> tg
>>
>> On Tue, Sep 5, 2017 at 4:04 PM, Andrew Palumbo <ap....@outlook.com>
>> wrote:
>>
>> > I've found a need for sorting a DRM as well as in-core matrices,
>> > something like, e.g., DrmLike.sortByColumn(...). I would like to
>> > implement this at the math-scala, engine-neutral level with
>> > pass-through functions to the underlying back ends.
>> >
>> > In-core would be engine-neutral by current design (in-core matrices
>> > are all Mahout matrices, with the exception of h2o, which causes some
>> > concern).
>> >
>> > For Spark, we can use RDD.sortBy(...).
>> >
>> > For Flink, we can use DataSet.sortPartition(...).setParallelism(1).
>> > (There may be a better method; I will look deeper.)
>> >
>> > h2o has an implementation, I'm sure, but this brings me to a more
>> > important point: if we want to stub out a method in a back-end
>> > module, e.g., h2o, which test suites do we want to make a
>> > requirement?
>> >
>> > We've not set any specific rules for which test suites must pass for
>> > each module. We've had a soft requirement of inheriting and passing
>> > all test suites from math-scala.
>> >
>> > Setting a rule for this is something that we need to do, IMO.
>> >
>> > An easy option, I'm thinking, would be to set the current core
>> > math-scala suites as a requirement, and then allow for an optional
>> > suite for methods which will be stubbed out.
>> >
>> > Thoughts?
>> >
>> > --andy
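The proposed engine-neutral sort with back-end pass-throughs could be stubbed roughly as follows. This is a hypothetical sketch only: `SortableDrm` and `InCoreDrm` are invented names, not part of the actual math-scala `DrmLike` hierarchy, and real back ends would delegate to the engine primitives named in the thread (Spark's RDD.sortBy, Flink's DataSet.sortPartition with parallelism 1) rather than sort locally.

```scala
// Hypothetical sketch of an engine-neutral sort API with back-end
// pass-through. sortByColumn is a proposed method, not an existing one.
trait SortableDrm[K] {
  // Each engine back end would implement this with its native sort
  // (e.g., Spark via RDD.sortBy, Flink via sortPartition).
  def sortByColumn(col: Int): SortableDrm[K]
}

// A trivial "in-core engine": keyed rows held in local memory, sorted
// with plain Scala collections. Stands in for a real back-end binding.
case class InCoreDrm(rows: Vector[(Int, Array[Double])])
    extends SortableDrm[Int] {
  def sortByColumn(col: Int): InCoreDrm =
    InCoreDrm(rows.sortBy { case (_, v) => v(col) })
}

object SortApiDemo extends App {
  val m = InCoreDrm(Vector((0, Array(2.0, 5.0)), (1, Array(1.0, 7.0))))
  // Prints the row keys after sorting on column 0.
  println(m.sortByColumn(0).rows.map(_._1))
}
```

A stubbed back end (the h2o case raised above) would extend the same trait but throw an UnsupportedOperationException, which is exactly why the thread's question about an optional test suite for stubbed methods matters.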