On Thu, Jul 6, 2017 at 9:45 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
> To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
> structure, that will exist only within the algorithm for doing in-core
> operations, (or maybe it lives slightly outside of the algorithm so we
> don't need to recreate trees for DBSCAN, Random Forests, other tree-based
> algorithms- e.g. we can reuse the same trees for various algorithms.) BUT
> Trees only exist WITHIN the in-core, i.e. we don't want to modify the
> allReduceBlock to accept Matrices OR Trees, that will get out of hand
> fast. Please anyone chime in to correct me/argue against.
>

+1. That's exactly what I meant.

> So really, we've stumbled into a more important philosophical question- and
> that is: Is it acceptable to create objects which make the internals of
> algorithms easier to read and work with, so long as they may be serialized
> to incore matrices/vectors? I am +1, and if it is decided this is not
> acceptable, I need to go back and alter (or drop) things like the CanopyFn
> [2] of the Canopy Clustering Algorithm.
>

+1 too, if it is practical. The dilemma here is that if one wants to stay
platform agnostic, then the algorithm has to use platform-agnostic
persistence/serialization, and the only kind Samsara provides is that of
DRM/Matrix/Vector. So yes, if the object maps naturally to record-tagged
numerical information, that encoding is preferable (and that's what I
actually did a lot when encoding models).

In practice, of course, in a particular application setting people often
couldn't care less about backend compatibility, in which case a custom
serialization is totally OK. But in the public Mahout version it would run
against the party line of staying backend agnostic, so if it is at all
possible with a little overhead, we try to avoid it.
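
As a rough illustration of the "record-tagged numerical information" idea, here is a minimal sketch of flattening a small in-core tree into matrix-like rows and rebuilding it. It is plain Python rather than Samsara code: the `Node` type, the row layout `[feature, threshold, left_row, right_row]`, and both helper functions are assumptions made up for this example, not anything in Mahout. A nested list stands in for an in-core Matrix.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical in-core node type; nothing here is Mahout/Samsara API.
@dataclass
class Node:
    feature: int       # index of the split feature, -1 for a leaf
    threshold: float   # split threshold (for a leaf: the stored value)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_to_rows(root: Node) -> List[List[float]]:
    """Flatten a tree into rows [feature, threshold, left_row, right_row].

    Child slots hold row indices; -1 means "no child". Every entry is a
    float, so the result could be stored as a purely numeric matrix.
    """
    rows: List[List[float]] = []

    def visit(node: Node) -> int:
        idx = len(rows)
        rows.append([float(node.feature), node.threshold, -1.0, -1.0])
        if node.left is not None:
            rows[idx][2] = float(visit(node.left))
        if node.right is not None:
            rows[idx][3] = float(visit(node.right))
        return idx

    visit(root)
    return rows


def rows_to_tree(rows: List[List[float]], idx: int = 0) -> Node:
    """Rebuild the tree from the row encoding produced by tree_to_rows."""
    feature, threshold, left, right = rows[idx]
    return Node(
        feature=int(feature),
        threshold=threshold,
        left=rows_to_tree(rows, int(left)) if left >= 0 else None,
        right=rows_to_tree(rows, int(right)) if right >= 0 else None,
    )


# Round trip: the tree lives only in-core; what gets persisted is numeric rows.
tree = Node(0, 2.5, Node(-1, 10.0), Node(1, 0.5, Node(-1, 20.0), Node(-1, 30.0)))
rows = tree_to_rows(tree)
restored = rows_to_tree(rows)
```

The point of the sketch is only that the tree object never has to cross the backend boundary: the algorithm works with `Node` internally, and anything persisted or shipped between nodes is the numeric row form.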