I'll put recommendations for the design on the issue. Thanks!

On Fri, Mar 15, 2024 at 2:03 PM Aldrin <octalene....@pm.me.invalid> wrote:
> I created a new issue [1] to track the refactoring. Could you clarify the
> request (here or in the issue)?
>
> My understanding is that the Skyhook file format code [2] should be
> refactored to use a higher-level interface rather than using
> dataset::FileFormat and dataset::FragmentScanOptions directly [3].
>
> I am assuming the reference to Acero and Substrait to be only for context
> and not necessarily a preferred direction. If that is the preferred
> direction, there is something much more general in progress that we can
> perhaps specialize as a replacement for the Skyhook file format, but I'm
> not sure that's what's actually being requested.
>
> Thank you!
>
> [1]: https://github.com/apache/arrow/issues/40583
> [2]: https://github.com/apache/arrow/tree/main/cpp/src/skyhook
> [3]: https://github.com/apache/arrow/blob/main/cpp/src/skyhook/cls/cls_skyhook.cc#L153-L156
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Thursday, March 14th, 2024 at 09:10, Jayjeet Chakraborty <jayjeetchakrabort...@gmail.com> wrote:
>
> > Hi Ben, I am willing to help out with the refactor too!
> >
> > On Wed, Mar 13, 2024 at 9:25 PM Aldrin <octalene....@pm.me.invalid> wrote:
> >
> > > I am interested in helping to refactor!
> > >
> > > -Aldrin
> > >
> > > On Wed, Mar 13, 2024 at 08:54, Benjamin Kietzman <bengil...@gmail.com> wrote:
> > >
> > > > Skyhook [1] enables efficient predicate and projection pushdown from
> > > > Arrow Dataset to a Ceph storage cluster. This is very cool
> > > > functionality, but it's tightly coupled to the Arrow C++ Dataset
> > > > implementation in a way which blocks refactoring. In the Arrow C++
> > > > codebase today, Acero is designed specifically to handle projection
> > > > and filtration in a more modular fashion, and to accept configuration
> > > > from standardized plan/expression formats like Substrait. In light of
> > > > improvements to Dataset which are not possible while maintaining
> > > > Skyhook in its current form, we need volunteers to update Skyhook.
> > > > Please reply to let us know if you are actively using Skyhook or if
> > > > you are interested in helping to refactor Skyhook.
> > > >
> > > > Sincerely,
> > > > Ben Kietzman
> > > >
> > > > [1] https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/
> >
> > --
> > Jayjeet Chakraborty
> > CS PhD student
> > UC Santa Cruz
> > California, USA