Re: [C++] output field names in Arrow Substrait

Yaron Gvili Wed, 20 Apr 2022 03:08:25 -0700

Hi Weston,

To reiterate, the scenario is not a user writing a Substrait plan by hand, but 
a user writing an Ibis expression and the remaining steps - Substrait 
compilation, serialization to protobuf, deserialization to an Arrow plan, and 
execution of the plan - are done systematically. This issue is related to Ibis, 
Ibis-Substrait, and Arrow together, so multiple persons may need to bear on 
this issue.
Index based field references should still work here.  For example, if
I have tables:


Key | LeftPayload    --    Key | RightPayload

The join expression would be field(0) == field(2)
Right, but the point in this use case is that it would not be convenient for a 
user writing an Ibis expression to specify each key of each table, especially 
when plenty of (say, 100) tables are being joined. It would be much more 
convenient for the user to specify once the name of the key that exists in all 
tables. This is a use case where the key should (not must) be specified by name 
for convenience.
In both of those cases the field names are not part of the plan itself.
In the second use case, the user writing the ibis expression is specifying a 
string-name as a parameter to a relation that would later get passed to an 
Arrow execution node in an options instance and used to dynamically set up 
field-names. This string-name does in fact appear explicitly in the Substrait 
plan, not for convenience but of necessity. What does not appear in the 
Substrait plan are output field-names of intermediate relations; this in itself 
is not a problem, because these field-names can (in principle) be recomputed to 
allow them to be matched to the dynamically set-up field-names. But the current 
implementation of Arrow (in its Substrait module) does not do this because the 
schemata are not available during deserialization, when the option instances 
(such as ProjectNodeOptions) that require the schemata are created. Instead, 
Arrow (in its Substrait module) ends up with non-natural field-names like 
"FieldPath(1)" that fail to be matched.

That's why this is not a Substrait specific problem, but one that is related to 
Ibis, Ibis-Substrait, and Arrow together. We think it can be resolved either by 
changing Substrait to assist Arrow by specifying the field-names of 
intermediate relations in the plan, so that they are readily available during 
deserialization, or by changing Arrow (in its Substrait module) to reproduce 
the field-names during deserialization, like by computing the schemata before 
options and node instances are created. We might come up with other solutions 
in this discussion, of course. Either way, for this second use case, the 
field-names of intermediate relations must be natural; they cannot be left as 
something like "FieldPath(1)".


Yaron.
________________________________
From: Weston Pace <weston.p...@gmail.com>
Sent: Tuesday, April 19, 2022 7:12 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Cc: Li Jin <ice.xell...@gmail.com>
Subject: Re: [C++] output field names in Arrow Substrait

> However, the problem is there are natural cases in
> which an execution node should or must take in a string-name

If we can come up with such a case then I agree it would be a problem
for Substrait's current definition.  I don't think we can come up with
such a case.  Every column that can be referenced by name has a unique
index we could use instead.

> One use case is an execution node that is joining N
> input tables on a column-name that exists in all of them.

Index based field references should still work here.  For example, if
I have tables:

Key | LeftPayload    --    Key | RightPayload

The join expression would be field(0) == field(2)

> The execution node would make itself more convenient
> for the user by allowing specifying a string-name than by
> specifying N FieldRef instances (each with the same
> name but on a different input table) like the above
> requirement would force

Substrait plans aren't normally created by humans.  I'm not sure
convenience is a factor here.

> Another use case for a string-name is when it is used within
> the execution node to dynamically set up field-names, e.g., a
> node that operates on the input's columns whose name starts
> with the given string-name or a node that operates on an input
> column whose name is given as data in another input column.

In both of those cases the field names are not part of the plan itself.

On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi Weston,
>
> Thanks for the quick response.
> I think you might have forgotten the links for [1][2][3]
> Sorry about the confusion; I use these not as references to links but as 
> markers of points I make in the beginning that I elaborate on later, in the 
> places where I reuse the markers.
> Are you going from Substrait to an Arrow execution plan?
> Yes.
> For Substrait -> Arrow most of our execution nodes should take in a FieldRef 
> which can be a name but can also be index-based. So I wouldn't expect 
> Substrait's exclusive use of index-based references to be an issue.
> Yes, I agree this is a design feature of Substrait and that if the execution 
> node takes in a FieldRef then there won't be a problem with Substrait use of 
> an index-based reference for it. If any to-be-developed execution node must 
> indeed use a FieldRef when taking in a parameter that refers to columns, then 
> this requirement should be documented for developers of execution nodes.
>
> However, the problem is there are natural cases in which an execution node 
> should or must take in a string-name; these suffer from the above design 
> feature due to the non-natural field names that it leads to. One use case is 
> an execution node that is joining N input tables on a column-name that exists 
> in all of them. The execution node would make itself more convenient for the 
> user by allowing specifying a string-name than by specifying N FieldRef 
> instances (each with the same name but on a different input table) like the 
> above requirement would force. Another use case for a string-name is when it 
> is used within the execution node to dynamically set up field-names, e.g., a 
> node that operates on the input's columns whose name starts with the given 
> string-name or a node that operates on an input column whose name is given as 
> data in another input column.
>
> This is the main reason we think there should be appropriate (natural) field 
> names defined for each relation/node.
> Also keep in mind that Substrait isn't exactly a query language for user's to 
> be typing by hand.
> Yes, I took that into consideration. Namely, the above discussion refers to 
> the scenario of a user writing an Ibis expression and the remaining steps - 
> Substrait compilation, serialization to protobuf,, deserialization to an 
> Arrow plan, and execution of the plan - are done systematically.
>
>
> Yaron.
> ________________________________
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Tuesday, April 19, 2022 1:01 PM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Cc: Li Jin <ice.xell...@gmail.com>
> Subject: Re: [C++] output field names in Arrow Substrait
>
> Hi Yaron, I think you might have forgotten the links for [1][2][3] so
> I'm not entirely sure of the context.  Are you going from Substrait to
> an Arrow execution plan?  Or are you going from an Arrow execution
> plan to Substrait?
>
> For Substrait -> Arrow most of our execution nodes should take in a
> FieldRef which can be a name but can also be index-based.  So I
> wouldn't expect Substrait's exclusive use of index-based references to
> be an issue.  Also keep in mind that Substrait isn't exactly a query
> language for user's to be typing by hand.  So, for example, if a user
> wants a join and is using Ibis they could type:
>
> table.semi_join(s, table.x == table.y)
>
> Ibis is then responsible for converting "x" and "y" to the appropriate
> indices (and it should have all the information needed to do so).  The
> Substrait plan will refer to these nodes by index and the
> corresponding Arrow execution plan will use integer-based FieldRef.
>
> For Arrow -> Substrait then I agree that this could become a problem.
> Right now the Arrow -> Substrait path is mainly used for internal
> testing with the hope that Substrait -> Arrow -> Substrait should
> generally be possible (with zero loss of information).  This does not
> mean that every Arrow plan will be convertible into Substrait.  That
> is certainly a potential goal, and PRs to add that capability would be
> welcome, but I don't know if anyone working on the Arrow/Substrait
> integration has that goal in mind.  If that is your goal I might be
> curious to learn more about your use cases.
>
> On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > Hi,
> >
> >
> > We ran into an issue due to the fact that, for intermediate relations, 
> > Substrait does not automatically compute output field names nor allows one 
> > to explicitly name output fields [1]. This leads to trouble when one needs 
> > to refer to these output fields by name [2]. We run into this trouble when 
> > deserializing in Arrow [3] (based on commit 8e13c2dd).
> >
> >
> >
> > One use case where [1] occurs is:
> >
> > import ibis
> >
> > import pyarrow as pa
> >
> > loaded_table = ibis.local_table(...)
> >
> > cast_table = ibis.mutate(loaded_table, 
> > time=loaded_table.time.cast(pa.int64())
> >
> > Even if loaded_table has nicely named fields, like "time" and "value1" and 
> > "value2", the resulting cast_table (when deserialized in Arrow) has 
> > inconvenient field names like "FieldPath(1)" and "FieldPath(2)" for all but 
> > the "time" field. I believe that Substrait has all the information to 
> > automatically compute the field name "value*" for cast_table, but it 
> > doesn't do this. Moreover, the caller has to know the schema of 
> > loaded_table in order to set the output field names for cast_table, and 
> > this is not convenient. When cast_table is an intermediate relation (i.e., 
> > not the root relation of the plan), Substrait also doesn't allow the caller 
> > to explicitly name the output fields, and there is no place for these field 
> > names in the Substrait protobuf.
> >
> >
> >
> > One use case where [2] occurs is in a join-type operation over cast_table. 
> > The caller would normally like to use a field name (and not an index) to 
> > refer to the join-key. Even if the caller knows the field name for the 
> > join-key in loaded_table, many field names in cast_table (when deserialized 
> > in Arrow) are different (and each includes an index in it) than those of 
> > loaded_table.
> >
> >
> >
> > The case of [3] occurs because Arrow deserializes ExecNodeOption instances 
> > before it has Schema instances at hand. At this stage, without schemata, 
> > Arrow cannot compute field names that should be placed in a ExecNodeOption 
> > that needs them, in particular ProjectNodeOptions. Currently, Arrow creates 
> > an expression vector for the ProjectNodeOptions instance but leaves the 
> > names vector empty, and later defaults each name to the ToString of each 
> > expression.
> >
> >
> >
> > We'd like your input on this issue. Currently, we see two viable ways to 
> > resolve this issue: (1) add output field name for intermediate relations in 
> > the Substrait plan so that Arrow can directly access them, or (2) reproduce 
> > the field names in Arrow during deserialization by creating Schema 
> > instances before ExecNodeOptions instances. In any case, we think there 
> > should be appropriate field names defined for each relation/node.
> >
> >
> >
> > Note that due to this issue, any Arrow execution node that prompts the user 
> > to refer to fields by name might not play nicely with Substrait.
> >
> >
> > Cheers,
> > Yaron.

Re: [C++] output field names in Arrow Substrait

Reply via email to