Hi Liya Fan,

> I think the algorithms should be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine, because they only provide primitive functionalities (e.g. vector
> sort).
> Instead, they can be used as building blocks for query engines. The major
> benefit of the micro-algorithms is their generality: they can be used in
> wide ranges of common scenarios. They can be used in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.

I agree the algorithms can be generally useful. But I still have concerns
about who is going to use them. I think there are two categories the
algorithms fall into:

1. Algorithms directly related to Arrow specification features. For these, I
agree some of the functionality will be needed as a reference implementation.
At least for existing functionality, I think there is already sufficient
coverage, and in some cases (i.e. dictionary encoding) there is already
duplicate coverage.

2. Other algorithms - I think these fall into "data analysis, data mining,
etc.", and for these it goes back to the question of whether developers/users
would use the given algorithms to build their own one-off analysis, or use
existing tools like Apache Spark or a SQL engine that already incorporate the
algorithms.

I'm a little disappointed that more maintainers/developers haven't given
their input on this topic. I hope some will help with the work involved in
reviewing them if they find them valuable.

Thanks,
Micah

On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fan_li...@aliyun.com> wrote:

> Hi Micah and Praveen,
>
> Thanks a lot for your valuable feedback.
>
> My thoughts on the problems:
>
> 1. About the audience of the algorithms:
>
> I think the algorithms should be better termed "micro-algorithms". They
> are termed "micro" in the sense that they do not directly compose a query
> engine, because they only provide primitive functionalities (e.g. vector
> sort).
> Instead, they can be used as building blocks for query engines. The major
> benefit of the micro-algorithms is their generality: they can be used in
> wide ranges of common scenarios. They can be used in more than one query
> engine. In addition, there are other common scenarios, like vector data
> compression/decompression (e.g. dictionary encoding and RLE encoding, as we
> have already supported/discussed), IPC communication, data analysis, data
> mining, etc.
>
> 2. About performance improvements:
>
> Code generation and template types are powerful tools. In addition, JIT is
> also a powerful tool, as it can inline megamorphic virtual functions for
> many scenarios, if the algorithm is implemented appropriately.
> IMO, code generation is applicable to almost all scenarios to achieve good
> performance, if we are willing to pay the price of code readability.
> I will try to detail the principles for choosing these tools for
> performance improvements later.
>
> Best,
> Liya Fan
>
> ------------------------------------------------------------------
> From: Praveen Kumar <prav...@dremio.com>
> Sent: Friday, October 4, 2019, 19:20
> To: Micah Kornfield <emkornfi...@gmail.com>
> Cc: Fan Liya <liya.fa...@gmail.com>; dev <dev@arrow.apache.org>
> Subject: Re: [DISCUSS][Java] Design of the algorithm module
>
> Hi Micah,
>
> I agree with 1. I think what an end user would really want is a
> query/data processing engine. I am not sure how easy/relevant the
> algorithms will be in the absence of the engine. For example, most of these
> operators would need to be pipelined, handle memory, distribution, etc. So
> bundling this along with the engine makes a lot more sense; the interfaces
> required might be a bit different for that too.
>
> Thx.
>
> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Liya Fan,
> > Thanks again for writing this up. I think it provides a road-map for
> > intended features. I commented on the document but I wanted to raise a few
> > high-level concerns here as well to get more feedback from the community.
> >
> > 1. It isn't clear to me who the users of this will be. My perception
> > is that in the Java ecosystem there aren't use-cases for the algorithms
> > outside of specific compute engines. I'm not super involved in open-source
> > Java these days so I would love to hear others' opinions.
> > For instance, I'm not sure if Dremio would switch to using these
> > algorithms instead of the ones they've already open-sourced [1], and
> > Apache Spark, I believe, is only using Arrow for interfacing with Python
> > (they similarly have their own compute pipeline). I think you mentioned in
> > the past that these are being used internally on an engine that your
> > company is working on, but if that is the only consumer it makes me wonder
> > if the algorithm development might be better served as part of that engine.
> >
> > 2. If we do move forward with this, we also need a plan for how to
> > optimize the algorithms to avoid virtual calls. There are two high-level
> > approaches: template-based and (byte)code-generation-based. Neither is
> > applicable in all situations, but it would be good to come to consensus on
> > when (and when not) to use each.
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
> >
> > On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > > Hi Micah,
> > >
> > > Thanks for your effort and precious time.
> > > Looking forward to receiving more valuable feedback from you.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi Liya Fan,
> > >> I started reviewing but haven't gotten all the way through it. I will
> > >> try to leave more comments over the next few days.
> > >>
> > >> Thanks again for the write-up; I think it will help frame a productive
> > >> conversation.
> > >>
> > >> -Micah
> > >>
> > >> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >>
> > >>> Hi Micah,
> > >>>
> > >>> Thanks for your kind reminder. Comments are enabled now.
> > >>>
> > >>> Best,
> > >>> Liya Fan
> > >>>
> > >>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <emkornfi...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hi Liya Fan,
> > >>>> Thank you for this writeup, it doesn't look like comments are enabled
> > >>>> on the document. Could you allow for them?
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >>>>
> > >>>> > Dear all,
> > >>>> >
> > >>>> > We have prepared a document for discussing the requirements, design
> > >>>> > and implementation issues for the algorithm module of Java:
> > >>>> >
> > >>>> > https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
> > >>>> >
> > >>>> > So far, we have finished the initial draft for sort, search and
> > >>>> > dictionary encoding algorithms. Discussions for more algorithms may be
> > >>>> > added in the future. This document will keep evolving to reflect the
> > >>>> > latest discussion results in the community and the latest code changes.
> > >>>> >
> > >>>> > Please give your valuable feedback.
> > >>>> >
> > >>>> > Best,
> > >>>> > Liya Fan