Re: [DISCUSS][Java] Design of the algorithm module

Micah Kornfield Sun, 13 Oct 2019 14:46:53 -0700

Hi Liya Fan,

> I think the algorithms should be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine, because they only provide primitive functionalities (e.g. vector
> sort).
> Instead, they can be used as building blocks for query engines.  The major
> benefit of the micro-algorithms is their generality: they can be used in
> a wide range of common scenarios. They can be used in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.
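
(For readers following the thread: a rough sketch of what a
"micro-algorithm" in the quoted sense might look like. This is not code
from the Arrow Java algorithm module. A plain int[] stands in for an Arrow
vector, and the names MicroSortSketch and sortedIndices are hypothetical.
It computes a sort order over a column without moving the data, which is
the kind of primitive an engine could build on.)

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only: a plain int[] stands in for an Arrow vector.
class MicroSortSketch {

    /**
     * Returns the positions of {@code values} ordered by ascending value.
     * The input array itself is not modified.
     */
    public static int[] sortedIndices(int[] values) {
        // Box the positions so they can be sorted with a custom comparator.
        Integer[] positions = new Integer[values.length];
        for (int i = 0; i < positions.length; i++) {
            positions[i] = i;
        }
        // Arrays.sort on an object array is stable, so equal values keep
        // their original relative order.
        Arrays.sort(positions, Comparator.comparingInt((Integer i) -> values[i]));

        int[] result = new int[positions.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = positions[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] column = {30, 10, 20};
        // Prints [1, 2, 0]: position 1 holds the smallest value, then 2, then 0.
        System.out.println(Arrays.toString(sortedIndices(column)));
    }
}
```

An engine-level operator would still need pipelining, memory management,
and so on around a primitive like this.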



I agree the algorithms can be generally useful.  But I still have concerns
about who is going to use them.

I think there are two categories the algorithms fall into:
1.  Algorithms directly related to Arrow specification features.  For
these, I agree some of the functionality will be needed as a reference
implementation.  At least for existing functionality, I think there is
already sufficient coverage and in some cases (e.g. dictionary encoding)
there is already duplicate coverage.

2.  Other algorithms -  I think these fall into "data analysis, data
mining, etc.", and for these I think it goes back to the question of
whether developers/users would use the given algorithms to build their own
one-off analysis or use already existing tools like Apache Spark or a
SQL engine that already incorporates the algorithms.

I'm a little disappointed that more maintainers/developers haven't given
their input on this topic.  I hope some will help with the work involved in
reviewing them if they find them valuable.

Thanks,
Micah


On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fan_li...@aliyun.com> wrote:

> Hi Micah and Praveen,
>
> Thanks a lot for your valuable feedback.
>
> My thoughts on the problems:
>
> 1. About the audience of the algorithms:
>
> I think the algorithms should be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine, because they only provide primitive functionalities (e.g. vector
> sort).
> Instead, they can be used as building blocks for query engines.  The major
> benefit of the micro-algorithms is their generality: they can be used in
> a wide range of common scenarios. They can be used in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.
>
> 2. About performance improvements:
>
> Code generation and template types are powerful tools. In addition, JIT is
> also a powerful tool, as it can inline megamorphic virtual functions for
> many scenarios, if the algorithm is implemented appropriately.
> IMO, code generation is applicable to almost all scenarios to achieve good
> performance, if we are willing to pay the price of code readability.
> I will try to detail the principles for choosing these tools for
> performance improvements later.
>
> Best,
> Liya Fan
>
> ------------------------------------------------------------------
> From: Praveen Kumar <prav...@dremio.com>
> Sent: Friday, October 4, 2019 19:20
> To: Micah Kornfield <emkornfi...@gmail.com>
> Cc: Fan Liya <liya.fa...@gmail.com>; dev <dev@arrow.apache.org>
> Subject: Re: [DISCUSS][Java] Design of the algorithm module
>
> Hi Micah,
>
> I agree with 1. I think what an end user would really want is a
> query/data processing engine. I am not sure how easy to use or relevant the
> algorithms will be in the absence of the engine. For example, most of these
> operators would need to be pipelined and to handle memory, distribution, etc.
> So bundling them with the engine makes a lot more sense; the interfaces
> required might be a bit different for that, too.
>
> Thx.
>
>
>
> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Liya Fan,
> > Thanks again for writing this up.  I think it provides a road-map for
> > intended features.  I commented on the document but I wanted to raise a few
> > high-level concerns here as well to get more feedback from the community.
> >
> > 1.  It isn't clear to me who the users of this will be.  My perception
> > is that in the Java ecosystem there aren't use-cases for the algorithms
> > outside of specific compute engines.  I'm not super involved in open-source
> > Java these days so I would love to hear others' opinions. For instance, I'm
> > not sure if Dremio would switch to using these algorithms instead of the
> > ones they've already open-sourced [1], and Apache Spark, I believe, is only
> > using Arrow for interfacing with Python (they similarly have their own
> > compute pipeline).  I think you mentioned in the past that these are being
> > used internally on an engine that your company is working on, but if that
> > is the only consumer it makes me wonder if the algorithm development might
> > be better served as part of that engine.
> >
> > 2.  If we do move forward with this, we also need a plan for how to
> > optimize the algorithms to avoid virtual calls.  There are two high-level
> > approaches: template-based and (byte)code-generation-based.  Neither is
> > applicable in all situations, but it would be good to come to consensus on
> > when (and when not) to use each.
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
> >
> > On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > > Hi Micah,
> > >
> > > Thanks for your effort and precious time.
> > > Looking forward to receiving more valuable feedback from you.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi Liya Fan,
> > >> I started reviewing but haven't gotten all the way through it. I will
> > >> try to leave more comments over the next few days.
> > >>
> > >> Thanks again for the write-up; I think it will help frame a productive
> > >> conversation.
> > >>
> > >> -Micah
> > >>
> > >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >>
> > >>> Hi Micah,
> > >>>
> > >>> Thanks for your kind reminder. Comments are enabled now.
> > >>>
> > >>> Best,
> > >>> Liya Fan
> > >>>
> > >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <emkornfi...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi Liya Fan,
> > >>>> Thank you for this write-up.  It doesn't look like comments are enabled
> > >>>> on the document.  Could you allow for them?
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <liya.fa...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>> > Dear all,
> > >>>> >
> > >>>> > We have prepared a document for discussing the requirements, design
> > >>>> > and implementation issues for the algorithm module of Java:
> > >>>> >
> > >>>> > https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
> > >>>> >
> > >>>> > So far, we have finished the initial draft for sort, search and
> > >>>> > dictionary encoding algorithms. Discussions for more algorithms may be
> > >>>> > added in the future. This document will keep evolving to reflect the
> > >>>> > latest discussion results in the community and the latest code changes.
> > >>>> >
> > >>>> > Please give your valuable feedback.
> > >>>> >
> > >>>> > Best,
> > >>>> > Liya Fan
> > >>>> >
> > >>>>
> > >>>
> >
>
>
>
