Re: Cross-platform discrepancies

Alexander Saydakov Wed, 15 Apr 2020 10:11:27 -0700

Let's involve Druid developers as Gian suggested.


On Tue, Apr 14, 2020 at 4:44 PM leerho <lee...@gmail.com> wrote:

> Hi,
> It is pretty easy to graphically view the various cross-platform
> discrepancies that Alex is talking about in the Sketch Features Matrix
> <https://datasketches.apache.org/docs/Architecture/SketchFeaturesMatrix.html>.
> We still need to add the PostgreSQL System Integration column.
> Nevertheless, it is easy to see that where we have implemented "Off Java
> Heap" capability is directly correlated to our Druid integrations.
>
> To underscore Alex's point, many of the other sketches in this matrix that
> are not currently integrated with Druid have very dynamic memory
> requirements and may be impractical to implement in a contiguous-memory
> model.  They could still be implemented "off-heap", but we would need a
> "maloc / free"-like API similar to how PostgreSQL has implemented their
> aggregations API.
>
> One of the conclusions from this matrix is that Druid would have a much
> richer and more powerful sketching analytics capability to offer its
> customers if it had a more flexible aggregations API.
>
>
>
> On Tue, Apr 14, 2020 at 1:47 PM Alexander Saydakov
> <sayda...@verizonmedia.com.invalid> wrote:
>
>> Gian,
>> Thanks for your reply. I did not mean to suggest moving away from
>> off-heap memory. As I see it, the problem mostly is with preallocating
>> fixed blocks for each sketch of some maximum size, which, on the one hand,
>> is wasteful because most sketches will be small, and on the other hand,
>> requires operating in a "contiguous block" mode, which is very limiting.
>> Perhaps you might consider something like PostgreSQL does: provide an
>> allocator, so that the memory allocation is controlled and can be done in a
>> context of a query or transaction, but it does not impose a single
>> contiguous block for a data structure. So a custom aggregation function
>> allocates memory for its own state on the first call, and PostgreSQL passes
>> this state to the next call as a pointer and so on. At any moment the
>> aggregation can choose to reallocate and return a new pointer. And that
>> state can be a complex data structure with pointers to other blocks
>> allocated using the same mechanism, so it does not have to be a single
>> contiguous block.
>>
>> This will still require some changes to our library to support memory
>> allocation like this, but it seems to be less challenging then the current
>> Direct memory mode we have.
>>
>> There are some trade offs here, as usually. For instance, currently (if
>> implemented) wrapping a chunk of bytes as a fast alternative to creating an
>> on-heap object during deserialization is the same operation as interpreting
>> this "contiguous block" aggregation state. But if the aggregation state is
>> fragmented, it is not going to be equivalent to a serialized blob anymore.
>> This is how it is in the current C++ implementation - there is no
>> equivalent of wrap() as in Direct mode in Java. This is a potential
>> performance improvement that can be discussed separately. Some sketches can
>> choose to operate in this contiguous block mode if it does not go against
>> the algorithm.
>>
>> On Mon, Apr 13, 2020 at 10:43 PM Gian Merlino <g...@apache.org> wrote:
>>
>>> Hi Alexander,
>>>
>>> It sounds like integrating with Druid is an important concern here. It
>>> might be nice to cross post the discussion to dev@druid.
>>>
>>> Personally, I don't think it's likely the Druid community will move away
>>> from requiring that aggregators be able to work with raw memory. The
>>> requirement exists so it can use allocate an arena for aggregation space,
>>> minimizing GC load. But there have been some proposals floating around for
>>> adding ability to dynamically allocate new memory (perhaps out of the same
>>> arena or perhaps some other way — the proposals varied). I think it would
>>> be useful for Druid devs to understand more deeply what effect the
>>> allocator API has on the DataSketches implementation.
>>>
>>> (By the way, the Druid issue you're referring to might be
>>> https://github.com/apache/druid/issues/8126.)
>>>
>>> On Mon, Apr 13, 2020 at 5:11 PM Alexander Saydakov
>>> <sayda...@verizonmedia.com.invalid> wrote:
>>>
>>>> Hi everyone!
>>>> I would like to discuss some discrepancies we accumulated as results of
>>>> some decisions around prioritizing our development efforts in Java and C++,
>>>> and also around integration with data processing systems.
>>>>
>>>> Druid is one of the most important systems in which our library is
>>>> used. It has a strong requirement to have off-heap memory support and also
>>>> to operate in a single contiguous block. This was the main motivation
>>>> behind what we call Direct memory support in our Java library.
>>>>
>>>> Currently some sketches do not support Direct memory, in particular:
>>>> KLL quantiles sketch, Frequent Items sketch, CPC distinct counting sketch.
>>>> Therefore they are not integrated with Druid yet.
>>>>
>>>> Quantiles sketch:
>>>> In Java we have ItemsSketch<T> for any type and a specialized numeric
>>>> DoublesSketch with Direct memory support - this is the one integrated with
>>>> Druid. We consider this algorithm replaced by KLL quantiles sketch, but for
>>>> historical reasons in Java we have implemented KllFloatsSketch only - no
>>>> generic implementation and no Direct memory support. This contiguous memory
>>>> block mode is so limiting that we did not think that KLL algorithm can be
>>>> implemented efficiently in that mode. We might need to reconsider this.
>>>> Development is C++ happened later, so we did not implement the older
>>>> algorithm. We may want to do it if a strong use case arises to support
>>>> reading data prepared using Java library. On the other hand, in C++
>>>> kll_sketch is a template, and therefore it can be a container of any user
>>>> type. In Java we felt the need to have both: a generic implementation and a
>>>> specialized implementation for some numeric type to avoid object overhead.
>>>>
>>>> To summarize the situation with quantiles sketches:
>>>> Java: KllFloatsSketch, older quantiles ItemsSketch<T> and DoublesSketch
>>>> C++: kll_sketch<T> (kll_sketch<float> is compatible with Java), no
>>>> older quantiles sketch.
>>>> Druid: quantiles DoublesSketch
>>>> PostgreSQL: kll_sketch<float>
>>>>
>>>> Regarding distinct counting sketches:
>>>> Druid: Theta, HLL
>>>> PostgreSQL: Theta, HLL, CPC
>>>>
>>>> Regarding frequent items:
>>>> Druid: none
>>>> PostgreSQL: frequent_items_sketch<std::string>
>>>>
>>>> It seems to me that we need to do something to improve the situation.
>>>> Possible tasks:
>>>> - Discuss alternatives to Direct memory mode in Druid with Druid
>>>> developers (again). They were not quite happy with the current approach
>>>> either. It leads to gross overallocation of BufferAggredator memory. There
>>>> was an open issue in Druid repo about this I believe.
>>>> - Find some way to have KLL sketch in Druid
>>>> - Find some way to have Frequent Items (frequent strings?) sketch in
>>>> Druid
>>>> - Consider KllItemsSketch<T> in Java
>>>> - Consider legacy quantiles_sketch<T> in C++ (and then in PostgreSQL -
>>>> with double values, compatible with Druid)
>>>>
>>>> I would appreciate your thoughts about this.
>>>> Thank you.
>>>> Alexander Saydakov
>>>>
>>>

Re: Cross-platform discrepancies

Reply via email to