Re: [I] Support round tripping extension types in parquet [arrow-rs]
paleolimbot commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2805716506 > extension types automatically "just work" This is definitely true for some operations (e.g., arrow-select), but for others (e.g., cast, parse, print, write to CSV, arithmetic) it is very easy to do the wrong thing because there is no built-in alternative to capture that context: the `DataType` in its current form is not really "how the values of an array should be interpreted", but more like `ArrowStorageType` or `PhysicalTypeButAlsoTimestamps`. You are absolutely right that adding `Extension` to the `DataType` was never the intention of the original `DataType`...I think it is also true that a lot of usage (including DataFusion and the higher-level arrow crates) is not consistent with that intention. We have an internal workaround that we're pursuing for now, but I'll continue to play with/review alternatives both here and in DataFusion...if nothing else, it's a great way to get to know arrow-rs! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
tustvold commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2796253645 > There are also a number of arrow-rs' own crates that make heavy use of the DataType and ArrayRef (e.g., the Parquet internals, casting) Indeed, and the benefit of the current design is extension types automatically "just work". The only logic that needs be concerned with them is logic that seeks to explicitly handle them, I am pretty strongly opposed to changing this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
alamb commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2794816716 A relevant PR from @timsaucer downstream in DataFusion in fact proposes exactly this (use a Field instead of DataType) for scalar functions: - https://github.com/apache/datafusion/pull/15646 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
paleolimbot commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2794478219 > not based on what is expedient for one particular downstream use-case Agreed! There are also a number of arrow-rs' own crates that make heavy use of the `DataType` and `ArrayRef` (e.g., the Parquet internals, casting). Another approach might be an `arrow-extensions` crate that includes an extension-aware `cast()`, CSV reader/writer (i.e., extension authors can choose how an array is represented in a CSV file), Parquet reader/writer (the issue here...how to propagate an extension type to Parquet and vice versa), extension-aware pretty printer, etc. All of these utilities are made use of by DataFusion but there is nothing strictly related to DataFusion about them and the work we do to (e.g.) support Variant, Geometry/Geography, or UUIDs doesn't have to only benefit DataFusion users. > what do you think about filing a ticket in DataFusion like "Better extension type support" where we can start listing the APIs that would need to be updated to Field instead of DataType? Yes, I'll continue my efforts there, too! In addition to requiring a `Field` a number of them also require a `Context` of some kind to look up extension-specific behaviour (e.g., whether a cast should or should not happen). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
alamb commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2792321244 > I'm not familiar with the precise problem being discussed here, but I'm guessing DF is relying on DataType instead of propogating Field/Schema. Such an approach is inherently flawed, as not only does it discard metadata, but also things like nullability, etc... I'm a little surprised if this is the case, as DF at least used to fairly aggressively make use of Schema, etc... but perhaps things have changed since I was very involved... Yes, there are quite a few APIs in DataFusion that use DataType and not Field (for example the type coercion logic). If we were starting again today they would all probably be in terms of Field. On the other hand I don't see huge problems in making it easier to use extension types downstream by making changes to arrow-rs @paleolimbot what do you think about filing a ticket in DataFusion like "Better extension type support" where we can start listing the APIs that would need to be updated to Field instead of DataType? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
tustvold commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2791518611 IMO a design should stand on its own merits, not based on what is expedient for one particular downstream use-case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
paleolimbot commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2791466144 > Right but this regresses functionality that currently "just works", the same is likely true in many places in DataFusion. I had envisioned that the use of Extension would only be opt-in (i.e., never automatically created on import from IPC or FFI), which would perhaps limit changes to things that work today (via field metadata propagation where this is possible). Libraries built on arrow-rs could experiment with creating these types and opt-in to their existence slowly (or never!). > It seems inherently wrong to me that callsites that are agnostic to extension types, and which likely significantly outnumber the callsites that need to be aware of extension types, should be the ones needing to be updated. Yes, the top-level data type approach does involve implementing extension-aware code in some places. Mostly this is to just fall back to storage type behaviour, error, or use it as an opportunity for the extension author to inject behaviour (e.g., type equality). It's absolutely fair if the number of callsites required is inappropriate in scope in arrow-rs...I'm pursuing this as a comparison to the number of callsites required to update the array and/or data type representation in DataFusion to allow for extension type information to be propagated through expressions (which is also very high). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
tustvold commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2791056738 > I mostly found that I was adding an additional "this type isn't supported" to the end of enums with one or more entires that already had other types that weren't supported. Right but this regresses functionality that currently "just works", the same is likely true in many places in DataFusion. There is a big difference between it compiling, and it working as users expect. It seems inherently wrong to me that callsites that are agnostic to extension types, and which likely significantly outnumber the callsites that need to be aware of extension types, should be the ones needing to be updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
paleolimbot commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2790967285 In https://github.com/apache/arrow-rs/pull/7398 and https://github.com/apache/datafusion/pull/15663 I mostly found that I was adding an additional "this type isn't supported" to the end of enums with one or more entires that already had other types that weren't supported. The tradeoff is I think that arrow-rs would take on the responsibility of ensuring the hard-to-implement operations (filter, take, concatenate) propagate the type so that extension authors/frameworks built on arrow don't have to (there's a few PRs in DataFusion that demonstrate that updating the DataType representation has quite large footprint). I would argue that by excluding the option in arrow-rs, we aren't necessarily eliminating any complexity, we're just punting it on to other projects. I've tried building the PR implementing this against DataFusion to check the impact on downstream code ( https://github.com/apache/datafusion/pull/15663 )...are there other projects with a high Arrow surface area that might have a larger disruption? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
tustvold commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2788874068 The consensus reached on https://github.com/apache/arrow-rs/issues/4472 was not to do this. I can appreciate there are tradeoffs to both approaches, but forcing all arrow code to care about extensions seems unfortunate and against the spirit of extension types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
mbrobbel commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2788713660 > > The spec defines extension types as metadata on Fields > > I believe the language in the spec is limited to exchange via FFI and serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as objects with native fields and only serialize when specifically requested). I don't believe it has caused confusing situations in other languages but it could be worth asking others! I will give it a try anyway and see what kind of implementation effort/downstream impacts it may have. The `DataType` approach requires all matches on instances of `DataType` to consider the extension type variant inner datatype, which has major downstream impact. It means that consumers that don't care about extension types now need to deal with them. The spec states that implementations are not required to support canonical extension types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
paleolimbot commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2787455962 > The spec defines extension types as metadata on Fields I believe the language in the spec is limited to exchange via FFI and serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as objects with native fields and only serialize when specifically requested). I don't believe it has caused confusing situations in other languages but it could be worth asking others! I will give it a try anyway and see what kind of implementation effort/downstream impacts it may have. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
mbrobbel commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2787200164 The spec defines extension types as metadata on `Field`s, so I think adding an `Extension` variant to `DataType` is going to cause confusing situations. But maybe I'm misunderstanding what you are suggesting? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
alamb commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2786805463 > > replacing some parquet impls to operate on fields instead of types. > > I'm experimenting with this in DataFusion, too ( [apache/datafusion#15036](https://github.com/apache/datafusion/pull/15036) ) and there it's a fairly disruptive change (particularly for the UDFs). I know the version of `DataType::Extension(name, metadata)` was rejected but I wonder if `DataType::Extension(extension_type)` (never created without opt-in) would help minimize the surface area of supporting these types. (Although here there are probably sufficiently few internals that the change is not that difficult). I would personally be very open to the idea of adding a new enum to DataType. While it may not be strictly necessary from an API point of view, if it reduces churn and makes it easier to roll out extension types across the ecosystem it makes a lot of sense to me We could also potentially feature flag it for a while (`feature=extension`) as a way to roll it out before the next major release 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
paleolimbot commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2786760254 > replacing some parquet impls to operate on fields instead of types. I'm experimenting with this in DataFusion, too ( https://github.com/apache/datafusion/pull/15036 ) and there it's a fairly disruptive change (particularly for the UDFs). I know the version of `DataType::Extension(name, metadata)` was rejected but I wonder if `DataType::Extension(extension_type)` (never created without opt-in) would help minimize the surface area of supporting these types. (Although here there are probably sufficiently few internals that the change is not that difficult). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
mbrobbel commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2786739757 > [@mbrobbel](https://github.com/mbrobbel) and [@kylebarron](https://github.com/kylebarron) -- this item was brought up as something required to implement Varaint support ([#6736](https://github.com/apache/arrow-rs/issues/6736)). Do you by any chance have any additional information on what is required / how it might be implemented you can share (so we can make progress)? > > cc [@PinkCrow007](https://github.com/PinkCrow007) My thinking was: given an Arrow field with an extension type and writing this field to Parquet and then reading it again (using the provied Arrow writers/readers in the `parquet` crate) should retain the extension type information i.e. after reading this field (from Parquet) it should still have the same extension type. From briefly looking at this I think we can implement this by replacing some `parquet` impls to operate on fields instead of types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
paleolimbot commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2783890401 We'll likely need this for Geometry/Geography as well! I can look more closely if I'm the one to start in on that change, but in C++ we handle something like that with an option to the reader properties: https://github.com/apache/arrow/blob/c56ec12a2f38ed6fce2d87ea65345b6edf4234f5/cpp/src/parquet/properties.h#L988-L998 ...although something more like a registry might be more flexible (e.g., so that a type could be registered with, for example, a DataFusion context like a UDT might be registered in Spark or DuckDB). (With apologies if I missed the major API challenge here 😬 ) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] Support round tripping extension types in parquet [arrow-rs]
alamb commented on issue #7063: URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2781350170 @mbrobbel and @kylebarron -- this item was brought up as something required to implement Varaint support (#6736). Do you by any chance have any additional information on what is required / how it might be implemented you can share (so we can make progress)? cc @PinkCrow007 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
