jvanstraten commented on a change in pull request #12279: URL: https://github.com/apache/arrow/pull/12279#discussion_r798535741
########## File path: format/substrait/extension_types.yaml ########## @@ -0,0 +1,121 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# substrait::{ExtensionTypeVariation, ExtensionType}s +# for wrapping types which appear in the arrow type system but +# are not first-class in substrait. These include: +# - null +# - unsigned integers +# - half-precision floating point numbers +# - 32 bit times and dates +# - timestamps with units other than microseconds +# - timestamps with timezones other than UTC +# - 256 bit decimals +# - sparse and dense unions +# - dictionary encoded types +# - durations +# - string and binary with 64 bit offsets +# - list with 64 bit offsets +# - interval<months: i32> +# - interval<days: i32, millis: i32> +# - interval<months: i32, days: i32, nanos: i64> +# - arrow::ExtensionTypes + +# FIXME these extension types are not parameterizable, which means among +# other things that we can't declare dictionary type here at all since +# we'd have to declare a different dictionary type for all encoded types +# (but that is an infinite space). Similarly, do we need to declare a +# timestamp variation for all possible timezone strings? +# +# Ultimately these declarations are a promise which needs to be backed by +# equivalent serde in c++. For example, consider u8: when serializing to +# substrait, we need to wrap instances of arrow::uint8 into the type +# variation listed below. It would be ideal if we could SinglePointOfTruth +# this correspondence; either generating c++ from the YAML or YAML from the +# c++. +# +# At present (AFAICT) it's not valid to make this user extensible because +# even if a user adds their custom scalar function to the registry *and* +# defines the mapping from that scalar function to a substrait::ExtensionFunction +# the corresponding YAML doesn't exist at any URI and so it can't be used in +# substrait. Perhaps we could still help that case by providing a tool to +# generate YAML from functions; that'd simplify the lives of people trying to +# write arrow::compute::Functions to "define the function and if you want to +# reference it from substrait generate this YAML and put it at some URI". +# +# In any case for the foreseeable future generation would be far too brittle; +# URIs will not be accessed by anything but humans and the YAML is effectively +# structured documentation. Thus extenders should pass the URI in the same way +# they pass a description string; it's opaque to anything in arrow. Review comment: > I'm assuming you meant "emit functions/types that a consumer can handle" Oops, my bad. > I think it will always be a valid path that the YAML either does not exist or is not publicly hosted. I suppose you're right in the sense that if a particular producer/consumer pair already agree on which functions and types exist and how they are named, neither needs to read the YAML, and therefore it doesn't need to be accessible or exist at all for that pair to be able to communicate plans. This is only true because Substrait (rather unconventionally IMO, though I understand the reasoning) explicitly and redundantly specifies type information at (almost) every point in an expression tree (notably, for whatever reason it doesn't do this for relations, so figuring out the type of a fieldref still requires type annotation at the consumer end). And, indeed, consumers will probably never have to read the YAMLs, at least not for functions they can map to an implementation by URI, name, and argument/return type information alone (i.e. they don't rely on the implementation map field in the YAML). I agree that it makes the most sense if a consumer advertises what it can do via YAML file s (contrary to what the implementation map seems to be trying to accomplish), and therefore it doesn't have to read them. It's the "always" part of that sentence which I didn't (and still don't) necessarily agree with. For example, as you then add yourself, a validator would need it. Kind of by definition therefore, a plan is technically not valid Substrait unless there is an accessible YAML for all extensions used. Whether a particular producer/consumer pair can talk just fine regardless because they don't use that part of the spec is beyond the point for spec compliance. However, I'm a bit of a purist when it comes to specifications, and I should probably be more pragmatic in this context, so I'm happy to leave it at this and yield to the idea that they don't need to exist in practice. > I do believe that automatic YAML generation is our best long term solution. I agree, but I think you're only referring to generating the YAML file that's currently hand-written and checked into the repo with this PR. The other thing Ben was talking about is providing a public API to users of Arrow to also generate these YAML files for custom functions they may add to the function registry at runtime. > I don't think there is much benefit in doing automatic YAML generation right now +1 > These comments do not belong in the YAML doc. [...] I'm not sure I agree. A user who wants to use a type or function that Arrow supports (or they made themselves) will probably end up in the YAML file when looking for resources, not on a JIRA issue. Once the issues are resolved however and all this stuff actually works, the notes should be removed. I *do* think they could be structured better, because I'm still having a hard time piecing together what all of it means and what is or isn't implemented even after spending lots of time with the relevant code. They should also link to the relevant JIRAs. I'm also only just now realizing that the file is named `extension_types.yaml`, yet it is also talking about function extensions. I'll see if I can clean it up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org