Re: Arrow compute/dataset design doc missing

Shawn Yang Tue, 24 May 2022 09:26:19 -0700

Hi Ion, thank you for your reply which recaps  the history of arrow compute. 
Those links are very valuable for me to understand arrow compute internal. I 
took a quick for those documents and will take a deeper into those later. I 
have another question, does arrow compute supports loop fusion, which execute 
multiple vectorized operand in one loop? This is very common in dataframe 
comouting, our engine can extract those expressions into a dag/tree. If arrow 
computer support loop fusion，the performance would be very promising


Sent from my iPhone

> On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
> 
> Hi Shawn,
> 
> In March of 2021, when major work on the C++ query execution machinery
> in Arrow was beginning, Wes sent a message [1] to the dev list and
> linked to a doc [2] with some details about the planned design. A few
> months later Neal sent an update [3] about this work. However those
> documents are now somewhat out of date. More recently, Wes shared
> another update [4] and linked to a doc [5] regarding task execution /
> control flow / scheduling. However I think the best source of
> information is the doc you linked to. The query execution work has
> proceeded organically with many contributors, and efforts to document
> the overall design in sufficient detail have not kept pace.
> 
> Regarding benchmarks: There has been extensive work done using
> Conbench [6] as part of the Arrow CI infrastructure to benchmark
> commits, for purposes of avoiding / identifying performance
> regressions and measuring efforts to improve performance. However I am
> not aware of any efforts to produce and publicly share benchmarks for
> the purpose of comparing performance vs. other query engines.
> 
> There is a proposal [7] to give the name "Acero" to the Arrow C++
> compute engine, so in the future you will likely see it referred to by
> that name. I think that having a clearer name for this will motivate
> more efforts to write and share more about it.
> 
> Ian
> 
> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
> [2] 
> https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
> [5] 
> https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
> [6] https://conbench.ursa.dev
> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
> 
> 
> 
>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang <[email protected]> wrote:
>> 
>> Hi, I'm considering using arrow compute as an execution kernel for our 
>> distributed dataframe framework. I already read the great doc: 
>> https://arrow.apache.org/docs/cpp/compute.html, but it is an usage doc. Is 
>> there any design doc, inside introduction or benchmarks for arrow compute so 
>> I can quickly understand how arrow compute works, what can be done and what 
>> should be done by it? Or I should just read the code in 
>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
>> 
>>

Re: Arrow compute/dataset design doc missing

Reply via email to