Re: Density based Clustering in Mahout

Dmitriy Lyubimov Thu, 06 Jul 2017 11:22:10 -0700

On Thu, Jul 6, 2017 at 9:45 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:


> To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
> structure, that will exist only within the algorithm for doing in-core
> operations, (or maybe it lives slightly outside of the algorithm so we
> don't need to recreate trees for DBSCAN, Random Forests, other tree-based
> algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
> Trees only exist WITHIN the in-core, i.e. we don't want to modify the
> allReduceBlock to accept Matrices OR Trees, that will get out of hand
> fast.  Please anyone chime in to correct me/argue against.
>

+1. that's exactly what i meant.


> So really, we've stumbled into a more important philosophical question- and
> that is: Is it acceptable to create objects which make the internals of
> algorithms easier to read and work with, so long as they may be serialized
> to incore matrices/vectors? I am +1, and if it is decided this is not
> acceptable, I need to go back and alter (or drop) things like the CanopyFn
> [2] of the Canopy Clustering Algorithm.
>

+1 too if it is practical.
The dilemma here is that if one wants to stay platform agnostic, then the
algorithm has to use platform-agnostic persistence/serialization, of which
Samsara provides only that of DRM/Matrix/Vector. So yes, if the object maps
naturally to record-tagged numerical information, that is preferable (and
that's what I actually did a lot when encoding models).
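
To make "record-tagged numerical information" concrete, here is a rough
sketch of flattening an in-core tree into matrix rows and rebuilding it.
It is plain Python with nested lists, not Samsara code: the Node class, the
encode/decode helpers, the column layout and the -1 sentinel are all
assumptions made up for illustration, and a real version would put the rows
into an in-core Matrix instead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical in-core tree node: split one dimension at a threshold."""
    split_dim: int
    split_value: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


NO_CHILD = -1.0  # assumed sentinel row index meaning "no child"


def encode(root: Node) -> List[List[float]]:
    """Flatten a tree into rows of [left_row, right_row, split_dim, split_value].

    The row index acts as the record tag; the root is always row 0.
    """
    rows: List[List[float]] = []

    def visit(node: Node) -> int:
        idx = len(rows)
        rows.append([NO_CHILD, NO_CHILD, float(node.split_dim), node.split_value])
        if node.left is not None:
            rows[idx][0] = float(visit(node.left))
        if node.right is not None:
            rows[idx][1] = float(visit(node.right))
        return idx

    visit(root)
    return rows


def decode(rows: List[List[float]], idx: int = 0) -> Node:
    """Rebuild the tree from its row encoding, starting at row idx."""
    left, right, dim, value = rows[idx]
    return Node(
        split_dim=int(dim),
        split_value=value,
        left=decode(rows, int(left)) if left != NO_CHILD else None,
        right=decode(rows, int(right)) if right != NO_CHILD else None,
    )


tree = Node(0, 0.5, left=Node(1, 0.25), right=Node(1, 0.75, left=Node(0, 0.9)))
matrix = encode(tree)          # purely numeric, one row per node
assert decode(matrix) == tree  # round trip preserves the structure
```

The point is only that the tree itself never has to cross the backend
boundary: what gets persisted or shipped is a numeric matrix, and the tree
is rebuilt in-core wherever it is needed.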

In practice, of course, in a particular application setting people often
couldn't care less about backend compatibility, in which case custom
serialization is totally OK. But in the public Mahout version it would run
against the party line of staying backend agnostic, so if it is at all
possible with a little overhead, we try to avoid it.
