TL;DR yes, if and when all is said and done.

Breaking this down…

Substrait isn't really relevant here. It's a way to serialize a query in a way 
that's agnostic to whatever's actually generating or executing the query.

But if you have a Substrait plan, that can get converted by the Arrow C++ Query 
Engine into its internal "ExecPlan" for execution, which is what's actually 
implementing the joins, aggregations, etc. This engine operates in a streaming 
fashion, so your application can take the data you get out and use it with a 
Flight service/client, yes.

The query engine pulls input from the Arrow Datasets library. (Though while I 
speak of them separately, really, they are intertwined.) Datasets is also a 
streaming interface to read Arrow data from various underlying datasources, 
implementing things like projection pushdown and partitioning where possible. 
This is agnostic to whether the data is local or remote, i.e. there's no 
explicit concept of "remote dataset". It's all datasets whether it's in memory, 
on local disk, or across the network.

So if/when a Flight datasource ("Fragment") is implemented for Arrow Datasets, 
this will be consumed in a streaming fashion, by a query engine which itself is 
streaming, which can be fed into a streaming interface like Flight. There's a 
good amount of work to do to ensure this all works well together (e.g. ensuring 
backpressure gets reflected across all these layers), but what you are asking 
for is in principle doable, if not quite yet implemented.

-David

On Tue, Apr 12, 2022, at 16:21, Adam Lippai wrote:
> Hi James,
>
> Your answer helps, yes.
> My question is whether I will be able to join two datasets (producing a new
> dataset) in a streaming way or do I have to fetch the whole response and
> keep it in memory?
> So if my local node has memory constraints, will it be able to stream data
> from an Apache Flight datasource and stream it back to a different Apache
> Flight target?
> If the answer is yes, is it because there will be a Remote Dataset concept
> or will it use "distributed computing" using Substrait?
>
> Best regards,
> Adam Lippai
>
> On Tue, Apr 12, 2022 at 4:14 PM James Duong <jam...@bitquilltech.com.invalid>
> wrote:
>
>> Hi Adam,
>>
>> Arrow Flight can be used to provide an RPC framework that returns datasets
>> (sent over the wire as arrow buffers) and exposes them from a FlightClient
>> as Arrow RecordBatches without serialization. Is this what you mean by
>> remote datasets?
>> Arrow Flight SQL is an application layer built on top of Arrow Flight that
>> standardizes remote execution of SQL queries, getting catalog information,
>> getting SQL capabilities, and other access-related concepts. Arrow Flight
>> SQL is intended to provide a universal user-facing front end for existing
>> SQL-capable database engines.
>>
>> Neither are really intended for computation, just remote access.
>>
>> On Tue, Apr 12, 2022 at 12:51 PM Adam Lippai <a...@rigo.sk> wrote:
>>
>> > Hi,
>> >
>> > I saw really nice features like groupby and join developed recently.
>> > I like how Dataset is supported for joins and how streamed processing is
>> > gaining momentum in Arrow.
>> >
>> > Does Apache Arrow have the concept of remote datasets eg using Arrow
>> > Flight? Or will this happen directly using S3 and other protocols only? I
>> > know some work has started in Substrait, but that might be a whole new
>> > level of integration, hence my question focusing on data first.
>> >
>> > I was trying to browse the JIRA issues, but the future picture wasn't
>> clear
>> > based on that
>> >
>> > Best regards,
>> > Adam Lippai
>> >
>>
>>
>> --
>>
>> *James Duong*
>> Lead Software Developer
>> Bit Quill Technologies Inc.
>> Direct: +1.604.562.6082 | jam...@bitquilltech.com
>> https://www.bitquilltech.com
>>
>> This email message is for the sole use of the intended recipient(s) and may
>> contain confidential and privileged information.  Any unauthorized review,
>> use, disclosure, or distribution is prohibited.  If you are not the
>> intended recipient, please contact the sender by reply email and destroy
>> all copies of the original message.  Thank you.
>>

Reply via email to