Re: Remote datasets

Adam Lippai Tue, 12 Apr 2022 13:52:21 -0700

Hi David,

This is a perfect answer. I was looking for the Fragment concept and the
issues you linked make it easy to follow.
I understand this is a really hard field with a ton of work, getting
chunking, prefetch and backpressure correctly + adding filter predicate and
other computation pushdown is an infinitely complex task.


Thank you for making this clear. You make so great progress it's hard to
keep up even with the big picture :D

Best regards,
Adam Lippai

On Tue, Apr 12, 2022 at 4:46 PM David Li <lidav...@apache.org> wrote:

> TL;DR yes, if and when all is said and done.
>
> Breaking this down…
>
> Substrait isn't really relevant here. It's a way to serialize a query in a
> way that's agnostic to whatever's actually generating or executing the
> query.
>
> But if you have a Substrait plan, that can get converted by the Arrow C++
> Query Engine into its internal "ExecPlan" for execution, which is what's
> actually implementing the joins, aggregations, etc. This engine operates in
> a streaming fashion, so your application can take the data you get out and
> use it with a Flight service/client, yes.
>
> The query engine pulls input from the Arrow Datasets library. (Though
> while I speak of them separately, really, they are intertwined.) Datasets
> is also a streaming interface to read Arrow data from various underlying
> datasources, implementing things like projection pushdown and partitioning
> where possible. This is agnostic to whether the data is local or remote,
> i.e. there's no explicit concept of "remote dataset". It's all datasets
> whether it's in memory, on local disk, or across the network.
>
> So if/when a Flight datasource ("Fragment") is implemented for Arrow
> Datasets, this will be consumed in a streaming fashion, by a query engine
> which itself is streaming, which can be fed into a streaming interface like
> Flight. There's a good amount of work to do to ensure this all works well
> together (e.g. ensuring backpressure gets reflected across all these
> layers), but what you are asking for is in principle doable, if not quite
> yet implemented.
>
> -David
>
> On Tue, Apr 12, 2022, at 16:21, Adam Lippai wrote:
> > Hi James,
> >
> > Your answer helps, yes.
> > My question is whether I will be able to join two datasets (producing a
> new
> > dataset) in a streaming way or do I have to fetch the whole response and
> > keep it in memory?
> > So if my local node has memory constraints, will it be able to stream
> data
> > from an Apache Flight datasource and stream it back to a different Apache
> > Flight target?
> > If the answer is yes, is it because there will be a Remote Dataset
> concept
> > or will it use "distributed computing" using Substrait?
> >
> > Best regards,
> > Adam Lippai
> >
> > On Tue, Apr 12, 2022 at 4:14 PM James Duong <jam...@bitquilltech.com
> .invalid>
> > wrote:
> >
> >> Hi Adam,
> >>
> >> Arrow Flight can be used to provide an RPC framework that returns
> datasets
> >> (sent over the wire as arrow buffers) and exposes them from a
> FlightClient
> >> as Arrow RecordBatches without serialization. Is this what you mean by
> >> remote datasets?
> >> Arrow Flight SQL is an application layer built on top of Arrow Flight
> that
> >> standardizes remote execution of SQL queries, getting catalog
> information,
> >> getting SQL capabilities, and other access-related concepts. Arrow
> Flight
> >> SQL is intended to provide a universal user-facing front end for
> existing
> >> SQL-capable database engines.
> >>
> >> Neither are really intended for computation, just remote access.
> >>
> >> On Tue, Apr 12, 2022 at 12:51 PM Adam Lippai <a...@rigo.sk> wrote:
> >>
> >> > Hi,
> >> >
> >> > I saw really nice features like groupby and join developed recently.
> >> > I like how Dataset is supported for joins and how streamed processing
> is
> >> > gaining momentum in Arrow.
> >> >
> >> > Does Apache Arrow have the concept of remote datasets eg using Arrow
> >> > Flight? Or will this happen directly using S3 and other protocols
> only? I
> >> > know some work has started in Substrait, but that might be a whole new
> >> > level of integration, hence my question focusing on data first.
> >> >
> >> > I was trying to browse the JIRA issues, but the future picture wasn't
> >> clear
> >> > based on that
> >> >
> >> > Best regards,
> >> > Adam Lippai
> >> >
> >>
> >>
> >> --
> >>
> >> *James Duong*
> >> Lead Software Developer
> >> Bit Quill Technologies Inc.
> >> Direct: +1.604.562.6082 | jam...@bitquilltech.com
> >> https://www.bitquilltech.com
> >>
> >> This email message is for the sole use of the intended recipient(s) and
> may
> >> contain confidential and privileged information.  Any unauthorized
> review,
> >> use, disclosure, or distribution is prohibited.  If you are not the
> >> intended recipient, please contact the sender by reply email and destroy
> >> all copies of the original message.  Thank you.
> >>
>

Re: Remote datasets

Reply via email to