Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-15 Thread via GitHub


paleolimbot commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2805716506

   > extension types automatically "just work"
   
   This is definitely true for some operations (e.g., arrow-select), but for 
others (e.g., cast, parse, print, write to CSV, arithmetic) it is very easy to 
do the wrong thing because there is no built-in alternative to capture that 
context: the `DataType` in its current form is not really "how the values of an 
array should be interpreted", but more like `ArrowStorageType` or 
`PhysicalTypeButAlsoTimestamps`. You are absolutely right that adding 
`Extension` to the `DataType` was never the intention of the original 
`DataType`...I think it is also true that a lot of usage (including DataFusion 
and the higher-level arrow crates) is not consistent with that intention.
   
   We have an internal workaround that we're pursuing for now, but I'll 
continue to play with/review alternatives both here and in DataFusion...if 
nothing else, it's a great way to get to know arrow-rs!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-11 Thread via GitHub


tustvold commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2796253645

   >  There are also a number of arrow-rs' own crates that make heavy use of 
the DataType and ArrayRef (e.g., the Parquet internals, casting)
   
   Indeed, and the benefit of the current design is extension types 
automatically "just work". The only logic that needs be concerned with them is 
logic that seeks to explicitly handle them, I am pretty strongly opposed to 
changing this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-10 Thread via GitHub


alamb commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2794816716

   A relevant PR from @timsaucer downstream in DataFusion in fact proposes 
exactly this (use a Field instead of DataType) for scalar functions:
   - https://github.com/apache/datafusion/pull/15646


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-10 Thread via GitHub


paleolimbot commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2794478219

   > not based on what is expedient for one particular downstream use-case
   
   Agreed! There are also a number of arrow-rs' own crates that make heavy use 
of the `DataType` and `ArrayRef` (e.g., the Parquet internals, casting). 
Another approach might be an `arrow-extensions` crate that includes an 
extension-aware `cast()`, CSV reader/writer (i.e., extension authors can choose 
how an array is represented in a CSV file), Parquet reader/writer (the issue 
here...how to propagate an extension type to Parquet and vice versa), 
extension-aware pretty printer, etc. All of these utilities are made use of by 
DataFusion but there is nothing strictly related to DataFusion about them and 
the work we do to (e.g.) support Variant, Geometry/Geography, or UUIDs doesn't 
have to only benefit DataFusion users.
   
   > what do you think about filing a ticket in DataFusion like "Better 
extension type support" where we can start listing the APIs that would need to 
be updated to Field instead of DataType?
   
   Yes, I'll continue my efforts there, too! In addition to requiring a `Field` 
a number of them also require a `Context` of some kind to look up 
extension-specific behaviour (e.g., whether a cast should or should not happen).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-10 Thread via GitHub


alamb commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2792321244

   > I'm not familiar with the precise problem being discussed here, but I'm 
guessing DF is relying on DataType instead of propogating Field/Schema. Such an 
approach is inherently flawed, as not only does it discard metadata, but also 
things like nullability, etc... I'm a little surprised if this is the case, as 
DF at least used to fairly aggressively make use of Schema, etc... but perhaps 
things have changed since I was very involved...
   
   
   Yes, there are quite a few APIs in DataFusion that use DataType and not 
Field (for example the type coercion logic). If we were starting again today 
they would all probably be in terms of Field. On the other hand I don't see 
huge problems in making it easier to use extension types downstream by making 
changes to arrow-rs
   
   @paleolimbot what do you think about filing a ticket in DataFusion like 
"Better extension type support" where we can start listing the APIs that would 
need to be updated to Field instead of DataType? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-09 Thread via GitHub


tustvold commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2791518611

   IMO a design should stand on its own merits, not based on what is expedient 
for one particular downstream use-case


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-09 Thread via GitHub


paleolimbot commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2791466144

   > Right but this regresses functionality that currently "just works", the 
same is likely true in many places in DataFusion.
   
   I had envisioned that the use of Extension would only be opt-in (i.e., never 
automatically created on import from IPC or FFI), which would perhaps limit 
changes to things that work today (via field metadata propagation where this is 
possible). Libraries built on arrow-rs could experiment with creating these 
types and opt-in to their existence slowly (or never!).
   
   > It seems inherently wrong to me that callsites that are agnostic to 
extension types, and which likely significantly outnumber the callsites that 
need to be aware of extension types, should be the ones needing to be updated.
   
   Yes, the top-level data type approach does involve implementing 
extension-aware code in some places. Mostly this is to just fall back to 
storage type behaviour, error, or use it as an opportunity for the extension 
author to inject behaviour (e.g., type equality). It's absolutely fair if the 
number of callsites required is inappropriate in scope in arrow-rs...I'm 
pursuing this as a comparison to the number of callsites required to update the 
array and/or data type representation in DataFusion to allow for extension type 
information to be propagated through expressions (which is also very high).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-09 Thread via GitHub


tustvold commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2791056738

   > I mostly found that I was adding an additional "this type isn't supported" 
to the end of enums with one or more entires that already had other types that 
weren't supported.
   
   Right but this regresses functionality that currently "just works", the same 
is likely true in many places in DataFusion. There is a big difference between 
it compiling, and it working as users expect. It seems inherently wrong to me 
that callsites that are agnostic to extension types, and which likely 
significantly outnumber the callsites that need to be aware of extension types, 
should be the ones needing to be updated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-09 Thread via GitHub


paleolimbot commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2790967285

   In https://github.com/apache/arrow-rs/pull/7398 and 
https://github.com/apache/datafusion/pull/15663 I mostly found that I was 
adding an additional "this type isn't supported" to the end of enums with one 
or more entires that already had other types that weren't supported.
   
   The tradeoff is I think that arrow-rs would take on the responsibility of 
ensuring the hard-to-implement operations (filter, take, concatenate) propagate 
the type so that extension authors/frameworks built on arrow don't have to 
(there's a few PRs in DataFusion that demonstrate that updating the DataType 
representation has quite large footprint). I would argue that by excluding the 
option in arrow-rs, we aren't necessarily eliminating any complexity, we're 
just punting it on to other projects.
   
   I've tried building the PR implementing this against DataFusion to check the 
impact on downstream code ( https://github.com/apache/datafusion/pull/15663 
)...are there other projects with a high Arrow surface area that might have a 
larger disruption?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-09 Thread via GitHub


tustvold commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2788874068

   The consensus reached on https://github.com/apache/arrow-rs/issues/4472 was 
not to do this. I can appreciate there are tradeoffs to both approaches, but 
forcing all arrow code to care about extensions seems unfortunate and against 
the spirit of extension types.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-09 Thread via GitHub


mbrobbel commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2788713660

   > > The spec defines extension types as metadata on Fields
   > 
   > I believe the language in the spec is limited to exchange via FFI and 
serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as 
objects with native fields and only serialize when specifically requested). I 
don't believe it has caused confusing situations in other languages but it 
could be worth asking others! I will give it a try anyway and see what kind of 
implementation effort/downstream impacts it may have.
   
   The `DataType` approach requires all matches on instances of `DataType` to 
consider the extension type variant inner datatype, which has major downstream 
impact. It means that consumers that don't care about extension types now need 
to deal with them. The spec states that implementations are not required to 
support canonical extension types.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-08 Thread via GitHub


paleolimbot commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2787455962

   > The spec defines extension types as metadata on Fields
   
   I believe the language in the spec is limited to exchange via FFI and 
serialization via IPC (e.g., Java, Go, and Arrow C++ include extension types as 
objects with native fields and only serialize when specifically requested). I 
don't believe it has caused confusing situations in other languages but it 
could be worth asking others! I will give it a try anyway and see what kind of 
implementation effort/downstream impacts it may have.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-08 Thread via GitHub


mbrobbel commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2787200164

   The spec defines extension types as metadata on `Field`s, so I think adding 
an `Extension` variant to `DataType` is going to cause confusing situations. 
But maybe I'm misunderstanding what you are suggesting?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-08 Thread via GitHub


alamb commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2786805463

   > > replacing some parquet impls to operate on fields instead of types.
   > 
   > I'm experimenting with this in DataFusion, too ( 
[apache/datafusion#15036](https://github.com/apache/datafusion/pull/15036) ) 
and there it's a fairly disruptive change (particularly for the UDFs). I know 
the version of `DataType::Extension(name, metadata)` was rejected but I wonder 
if `DataType::Extension(extension_type)` (never created without opt-in) would 
help minimize the surface area of supporting these types. (Although here there 
are probably sufficiently few internals that the change is not that difficult).
   
   I would personally be very open to the idea of adding a new enum to 
DataType. While it may not be strictly necessary from an API point of view, if 
it reduces churn and makes it easier to roll out extension types across the 
ecosystem it makes a lot of sense to me
   
   We could also potentially feature flag it for a while (`feature=extension`) 
as a way to roll it out before the next major release 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-08 Thread via GitHub


paleolimbot commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2786760254

   > replacing some parquet impls to operate on fields instead of types.
   
   I'm experimenting with this in DataFusion, too ( 
https://github.com/apache/datafusion/pull/15036 ) and there it's a fairly 
disruptive change (particularly for the UDFs). I know the version of 
`DataType::Extension(name, metadata)` was rejected but I wonder if 
`DataType::Extension(extension_type)` (never created without opt-in) would help 
minimize the surface area of supporting these types. (Although here there are 
probably sufficiently few internals that the change is not that difficult).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-08 Thread via GitHub


mbrobbel commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2786739757

   > [@mbrobbel](https://github.com/mbrobbel) and 
[@kylebarron](https://github.com/kylebarron) -- this item was brought up as 
something required to implement Varaint support 
([#6736](https://github.com/apache/arrow-rs/issues/6736)). Do you by any chance 
have any additional information on what is required / how it might be 
implemented you can share (so we can make progress)?
   > 
   > cc [@PinkCrow007](https://github.com/PinkCrow007)
   
   My thinking was: given an Arrow field with an extension type and writing 
this field to Parquet and then reading it again (using the provied Arrow 
writers/readers in the `parquet` crate) should retain the extension type 
information i.e. after reading this field (from Parquet) it should still have 
the same extension type.
   
   From briefly looking at this I think we can implement this by replacing some 
`parquet` impls to operate on fields instead of types.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-07 Thread via GitHub


paleolimbot commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2783890401

   We'll likely need this for Geometry/Geography as well! I can look more 
closely if I'm the one to start in on that change, but in C++ we handle 
something like that with an option to the reader properties:
   
   
https://github.com/apache/arrow/blob/c56ec12a2f38ed6fce2d87ea65345b6edf4234f5/cpp/src/parquet/properties.h#L988-L998
   
   ...although something more like a registry might be more flexible (e.g., so 
that a type could be registered with, for example, a DataFusion context like a 
UDT might be registered in Spark or DuckDB). (With apologies if I missed the 
major API challenge here 😬 )
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] Support round tripping extension types in parquet [arrow-rs]

2025-04-06 Thread via GitHub


alamb commented on issue #7063:
URL: https://github.com/apache/arrow-rs/issues/7063#issuecomment-2781350170

   @mbrobbel and @kylebarron  -- this item was brought up as something required 
to implement Varaint support (#6736). Do you by any chance have any additional 
information on what is required / how it might be implemented you can share (so 
we can make progress)?
   
   cc @PinkCrow007
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]