Re: Density based Clustering in Mahout

Dmitriy Lyubimov Thu, 06 Jul 2017 11:25:27 -0700

PS Maybe we should say that if you can provide Kryo serialization, the object
can be assumed platform agnostic, and provide an API for embedding that further.
In practice all backends (except, I guess, H2O, which is going extinct if it
has not already) currently support Kryo, and new potential ones could easily
add it too (after all, it is just a bunch of bytes after serialization; it
can't get any more basic than that).
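
To make the "bunch of bytes" point concrete, here is a minimal sketch of what such a contract could look like. Everything in it is an assumption for illustration: `ByteCodec`, `BoundingBox` and `BoundingBoxCodec` are hypothetical names, not Mahout or Kryo APIs, and plain `java.io` streams stand in for Kryo so the example has no dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical contract (not a Mahout API): anything that can round-trip
// through a byte array could be shipped by any backend.
interface ByteCodec<T> {
    byte[] toBytes(T value);
    T fromBytes(byte[] bytes);
}

// Hypothetical in-core helper object: an axis-aligned bounding box,
// the kind of thing an R-tree node might carry.
final class BoundingBox {
    final double[] min;
    final double[] max;

    BoundingBox(double[] min, double[] max) {
        if (min.length != max.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        this.min = min.clone();
        this.max = max.clone();
    }
}

// java.io streams stand in for Kryo here (assumption made for the sketch).
final class BoundingBoxCodec implements ByteCodec<BoundingBox> {
    @Override
    public byte[] toBytes(BoundingBox box) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(box.min.length);            // dimension first
            for (double v : box.min) out.writeDouble(v);
            for (double v : box.max) out.writeDouble(v);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    @Override
    public BoundingBox fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int dim = in.readInt();
            double[] min = new double[dim];
            double[] max = new double[dim];
            for (int i = 0; i < dim; i++) min[i] = in.readDouble();
            for (int i = 0; i < dim; i++) max[i] = in.readDouble();
            return new BoundingBox(min, max);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The backend would only ever see the `byte[]`, so it would not need to know anything about the object's structure.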


On Thu, Jul 6, 2017 at 11:21 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>
>
> On Thu, Jul 6, 2017 at 9:45 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
>> To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
>> structure, that will exist only within the algorithm for doing in-core
>> operations, (or maybe it lives slightly outside of the algorithm so we
>> don't need to recreate trees for DBSCAN, Random Forests, other tree-based
>> algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
>> Trees only exist WITHIN the in-core, i.e. we don't want to modify the
>> allReduceBlock to accept Matrices OR Trees, that will get out of hand
>> fast.  Please anyone chime in to correct me/argue against.
>>
>
> +1. that's exactly what i meant.
>
>
>> So really, we've stumbled into a more important philosophical question,
>> and that is: is it acceptable to create objects which make the internals of
>> algorithms easier to read and work with, so long as they may be serialized
>> to in-core matrices/vectors? I am +1, and if it is decided this is not
>> acceptable, I need to go back and alter (or drop) things like the CanopyFn
>> [2] of the Canopy Clustering Algorithm.
>>
>
> +1 too if it is practical.
> The dilemma here is that if one wants to stay platform agnostic, then the
> algorithm has to use platform-agnostic persistence/serialization, of which
> Samsara provides only that of DRM/Matrix/Vector. So yes, if the object
> naturally maps to record-tagged numerical information, that encoding is
> preferable (and that's what I actually did a lot when encoding models).
>
> In practice, however, in a particular application setting it is often the
> case that people couldn't care less about backend compatibility, in which
> case a custom serialization is totally OK. But in the public Mahout version
> it would run against the party line of staying backend agnostic, so if it is
> at all possible with a little overhead, we try to avoid it.
>
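
For illustration, the "record-tagged numerical information" encoding mentioned above could be sketched roughly as follows. This is only a sketch under stated assumptions: `TaggedCenter` and `CenterMatrixCodec` are hypothetical names (this is not the actual CanopyFn), and a plain `double[][]` stands in for an in-core Matrix. Each row carries a numeric tag in column 0 followed by the coordinates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical algorithm-internal object: a cluster center with an integer tag.
final class TaggedCenter {
    final int tag;
    final double[] coords;

    TaggedCenter(int tag, double[] coords) {
        this.tag = tag;
        this.coords = coords.clone();
    }
}

// Hypothetical codec; double[][] stands in for an in-core matrix.
final class CenterMatrixCodec {
    private CenterMatrixCodec() {}

    // One row per center: [tag, coord_0, coord_1, ...].
    static double[][] toMatrix(List<TaggedCenter> centers) {
        double[][] m = new double[centers.size()][];
        for (int r = 0; r < centers.size(); r++) {
            TaggedCenter c = centers.get(r);
            double[] row = new double[c.coords.length + 1];
            row[0] = c.tag;
            System.arraycopy(c.coords, 0, row, 1, c.coords.length);
            m[r] = row;
        }
        return m;
    }

    // Inverse of toMatrix: column 0 is read back as the tag.
    static List<TaggedCenter> fromMatrix(double[][] m) {
        List<TaggedCenter> centers = new ArrayList<>();
        for (double[] row : m) {
            double[] coords = new double[row.length - 1];
            System.arraycopy(row, 1, coords, 0, coords.length);
            centers.add(new TaggedCenter((int) row[0], coords));
        }
        return centers;
    }
}
```

Because the encoded form is purely numeric, it could travel through whatever matrix persistence the backend already supports, at the cost of the small encode/decode overhead discussed above.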
