[GitHub] [arrow] jvanstraten commented on a change in pull request #12279: ARROW-15238: [C++] ARROW_ENGINE module with substrait consumer

GitBox Thu, 03 Feb 2022 04:57:17 -0800


jvanstraten commented on a change in pull request #12279:
URL: https://github.com/apache/arrow/pull/12279#discussion_r798535741




##########
File path: format/substrait/extension_types.yaml
##########
@@ -0,0 +1,121 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# substrait::{ExtensionTypeVariation, ExtensionType}s
+# for wrapping types which appear in the arrow type system but
+# are not first-class in substrait. These include:
+# - null
+# - unsigned integers
+# - half-precision floating point numbers
+# - 32 bit times and dates
+# - timestamps with units other than microseconds
+# - timestamps with timezones other than UTC
+# - 256 bit decimals
+# - sparse and dense unions
+# - dictionary encoded types
+# - durations
+# - string and binary with 64 bit offsets
+# - list with 64 bit offsets
+# - interval<months: i32>
+# - interval<days: i32, millis: i32>
+# - interval<months: i32, days: i32, nanos: i64>
+# - arrow::ExtensionTypes
+
+# FIXME these extension types are not parameterizable, which means among
+# other things that we can't declare dictionary type here at all since
+# we'd have to declare a different dictionary type for all encoded types
+# (but that is an infinite space). Similarly, do we need to declare a
+# timestamp variation for all possible timezone strings?
+#
+# Ultimately these declarations are a promise which needs to be backed by
+# equivalent serde in c++. For example, consider u8: when serializing to
+# substrait, we need to wrap instances of arrow::uint8 into the type
+# variation listed below. It would be ideal if we could SinglePointOfTruth
+# this correspondence; either generating c++ from the YAML or YAML from the
+# c++.
+#
+# At present (AFAICT) it's not valid to make this user extensible because
+# even if a user adds their custom scalar function to the registry *and*
+# defines the mapping from that scalar function to a 
substrait::ExtensionFunction
+# the corresponding YAML doesn't exist at any URI and so it can't be used in
+# substrait. Perhaps we could still help that case by providing a tool to
+# generate YAML from functions; that'd simplify the lives of people trying to
+# write arrow::compute::Functions to "define the function and if you want to
+# reference it from substrait generate this YAML and put it at some URI".
+#
+# In any case for the foreseeable future generation would be far too brittle;
+# URIs will not be accessed by anything but humans and the YAML is effectively
+# structured documentation. Thus extenders should pass the URI in the same way
+# they pass a description string; it's opaque to anything in arrow.

Review comment:
       > I'm assuming you meant "emit functions/types that a consumer can 
handle"
   
   Oops, my bad.
   
   > I think it will always be a valid path that the YAML either does not exist 
or is not publicly hosted.
   
   I suppose you're right in the sense that if a particular producer/consumer 
pair already agree on which functions and types exist and how they are named, 
neither needs to read the YAML, and therefore it doesn't need to be accessible 
or exist at all for that pair to be able to communicate plans. This is only 
true because Substrait (rather unconventionally IMO, though I understand the 
reasoning) explicitly and redundantly specifies type information at (almost) 
every point in an expression tree (notably, for whatever reason it doesn't do 
this for relations, so figuring out the type of a fieldref still requires type 
annotation at the consumer end). And, indeed, consumers will probably never 
have to read the YAMLs, at least not for functions they can map to an 
implementation by URI, name, and argument/return type information alone (i.e. 
they don't rely on the implementation map field in the YAML). I agree that it 
makes the most sense if a consumer advertises what it can do via YAML file
 s (contrary to what the implementation map seems to be trying to accomplish), 
and therefore it doesn't have to read them.
   
   It's the "always" part of that sentence which I didn't (and still don't) 
necessarily agree with. For example, as you then add yourself, a validator 
would need it. Kind of by definition therefore, a plan is technically not valid 
Substrait unless there is an accessible YAML for all extensions used. Whether a 
particular producer/consumer pair can talk just fine regardless because they 
don't use that part of the spec is beyond the point for spec compliance. 
However, I'm a bit of a purist when it comes to specifications, and I should 
probably be more pragmatic in this context, so I'm happy to leave it at this 
and yield to the idea that they don't need to exist in practice.
   
   > I do believe that automatic YAML generation is our best long term solution.
   
   I agree, but I think you're only referring to generating the YAML file 
that's currently hand-written and checked into the repo with this PR. The other 
thing Ben was talking about is providing a public API to users of Arrow to also 
generate these YAML files for custom functions they may add to the function 
registry at runtime.
   
   > I don't think there is much benefit in doing automatic YAML generation 
right now
   
   +1
   
   > These comments do not belong in the YAML doc. [...]
   
   I'm not sure I agree. A user who wants to use a type or function that Arrow 
supports (or they made themselves) will probably end up in the YAML file when 
looking for resources, not on a JIRA issue. Once the issues are resolved 
however and all this stuff actually works, the notes should be removed. I *do* 
think they could be structured better, because I'm still having a hard time 
piecing together what all of it means and what is or isn't implemented even 
after spending lots of time with the relevant code. They should also link to 
the relevant JIRAs. I'm also only just now realizing that the file is named 
`extension_types.yaml`, yet it is also talking about function extensions. I'll 
see if I can clean it up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jvanstraten commented on a change in pull request #12279: ARROW-15238: [C++] ARROW_ENGINE module with substrait consumer

Reply via email to