Re: [PR] Metadata handling announcement [datafusion-site]
2010YOUY01 commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2367314150

## content/blog/2025-09-21-custom-types-using-metadata.md: ##
@@ -0,0 +1,296 @@
+---
+layout: post
+title: Custom types in DataFusion using Metadata
+date: 2025-09-21
+author: Tim Saucer, Dewey Dunnington, Andrew Lamb
+categories: [core]
+---
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access metadata on
+the input columns to functions and produce metadata in the output.
+
+Metadata is specified as a map of key-value pairs of strings. This extra metadata is used
+by Arrow implementations to support [extension types] and can also be used to add
+use case-specific context to a column of values where the formality of an extension type
+is not required. In previous versions of DataFusion, field metadata was propagated through
+certain operations (e.g., renaming or selecting a column) but was not accessible to others
+(e.g., scalar, window, or aggregate function calls). In the new implementation, during
+processing of all user defined functions we pass the input field information and allow
+user defined function implementations to return field information to the caller.
+
+[Extension types] are user defined data types where the data is stored using one of the
+existing [Arrow data types] but the metadata specifies how we are to interpret the
+stored data. The use of extension types was one of the primary motivations for adding
+metadata to the function processing, but arbitrary metadata can be put on the input and
+output fields. This allows for a range of other interesting use cases.
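As a rough illustration of the key-value metadata described above, the sketch below models a field with a plain `HashMap` of strings. `SketchField` and its methods are hypothetical names invented for this example and are not part of arrow-rs or DataFusion. The only detail taken from the Arrow columnar format is the reserved `ARROW:extension:name` key, which carries the extension type name.

```rust
use std::collections::HashMap;

/// Metadata key the Arrow columnar format reserves for an extension type's name.
pub const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical, simplified stand-in for an Arrow field (not the arrow-rs type):
/// a name, a description of the storage type, and string key-value metadata.
#[derive(Debug, Clone)]
pub struct SketchField {
    pub name: String,
    pub storage_type: String,
    pub metadata: HashMap<String, String>,
}

impl SketchField {
    /// Creates a field with no metadata.
    pub fn new(name: &str, storage_type: &str) -> Self {
        Self {
            name: name.to_string(),
            storage_type: storage_type.to_string(),
            metadata: HashMap::new(),
        }
    }

    /// Builder-style helper that attaches one metadata entry.
    pub fn with_metadata(mut self, key: &str, value: &str) -> Self {
        self.metadata.insert(key.to_string(), value.to_string());
        self
    }

    /// Returns the extension type name if the field declares one.
    pub fn extension_name(&self) -> Option<&str> {
        self.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
    }
}
```

Under these assumptions, `SketchField::new("id", "FixedSizeBinary(16)").with_metadata(EXTENSION_NAME_KEY, "arrow.uuid")` describes a column stored as 16-byte binary values but interpreted as UUIDs, while a field carrying only an application-specific key (for instance a made-up `encoding` entry) has no extension name at all.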
+
+[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
+[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
+[Arrow data types]: https://arrow.apache.org/docs/format/Columnar.html#data-types
+
+## Why metadata handling is important
+
+Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose we wish to write a function that takes in a UUID and returns a string
+of the [variant] of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. The Arrow specification does not
+contain an unsigned 128-bit type, so it is common to encode a UUID as a fixed-size binary
+array where each element is 16 bytes long. With the metadata handling in [DataFusion 48.0.0]
+we can validate during planning that the input data not only has the correct underlying
+data type, but that it also represents the right *kind* of data. The UUID example is a
+common one, and it is included in the [canonical extension types] that are now
+supported in DataFusion.
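The UUID case described above can be sketched with the standard library alone. This is a minimal illustration, not DataFusion's API: `is_uuid_field` and `uuid_variant` are hypothetical helper names, and the metadata map stands in for the metadata of a real Arrow field. The `ARROW:extension:name` key and the `arrow.uuid` name come from the Arrow format and its canonical extension types; the variant bit patterns follow RFC 9562, section 4.1.

```rust
use std::collections::HashMap;

/// Metadata key the Arrow columnar format reserves for an extension type's name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
/// Name of the canonical Arrow UUID extension type.
const UUID_EXTENSION_NAME: &str = "arrow.uuid";

/// Hypothetical planning-time check: the storage must be 16 bytes wide AND the
/// metadata must declare the UUID extension type. Checking only the width would
/// accept any 16-byte binary column.
pub fn is_uuid_field(byte_width: usize, metadata: &HashMap<String, String>) -> bool {
    byte_width == 16
        && metadata.get(EXTENSION_NAME_KEY).map(String::as_str) == Some(UUID_EXTENSION_NAME)
}

/// Hypothetical helper returning the UUID variant as a string, read from the
/// most significant bits of octet 8 (RFC 9562, section 4.1).
pub fn uuid_variant(bytes: &[u8; 16]) -> &'static str {
    let octet = bytes[8];
    if octet & 0x80 == 0x00 {
        "NCS (reserved)" // bit pattern 0xxx
    } else if octet & 0xC0 == 0x80 {
        "RFC 9562" // bit pattern 10xx
    } else if octet & 0xE0 == 0xC0 {
        "Microsoft (reserved)" // bit pattern 110x
    } else {
        "Future (reserved)" // bit pattern 111x
    }
}
```

The point of the first helper is that the data type alone (16-byte fixed-size binary) is not enough to say that a column holds UUIDs; the metadata check is what distinguishes the right *kind* of data.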
+
+Another common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is stored as
+an array of `u8` data. Without knowing a priori what the encoding of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may work around
+this by adding another column to your data source indicating the encoding, but this can be
+wasteful for systems where the encoding never changes. Instead, you could use metadata to
+specify the encoding for the entire column.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+[variant]: https://www.ietf.org/rfc/rfc9562.html#section-4.1
+[canonical extension types]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
+
+## How to use metadata in user defined functions
+
+When working with metadata for user defined scalar functions, there are typically two
+places in the function definition that require implementation:
+
+- Computing the return field from the arguments
+- Invocation
+
+During planning, we will attempt to call the function `return_field_from_args()`. This will
+provide a list of input fields to the function and return the output field. To evaluate
Re: [PR] Metadata handling announcement [datafusion-site]
alamb commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2372759527

## content/blog/2025-09-21-custom-types-using-metadata.md: ##
@@ -0,0 +1,309 @@
+---
+layout: post
+title: Custom types in DataFusion using Metadata
+date: 2025-09-21
+author: Tim Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)
+categories: [core]
+---
+
+[TOC]
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access metadata on
+the input columns to functions and produce metadata in the output.
+
+Metadata is specified as a map of key-value pairs of strings. This extra metadata is used
+by Arrow implementations to support [extension types] and can also be used to add
+use case-specific context to a column of values where the formality of an extension type
+is not required. In previous versions of DataFusion, field metadata was propagated through
+certain operations (e.g., renaming or selecting a column) but was not accessible to others
+(e.g., scalar, window, or aggregate function calls). In the new implementation, during
+processing of all user defined functions we pass the input field information and allow
+user defined function implementations to return field information to the caller.
+
+[Extension types] are user defined data types where the data is stored using one of the
+existing [Arrow data types] but the metadata specifies how we are to interpret the
+stored data. The use of extension types was one of the primary motivations for adding
+metadata to the function processing, but arbitrary metadata can be put on the input and
+output fields. This allows for a range of other interesting use cases.
+
+[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
+[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
+[Arrow data types]: https://arrow.apache.org/docs/format/Columnar.html#data-types
+
+## Why metadata handling is important
+
+Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose we wish to write a function that takes in a UUID and returns a string
+of the [variant] of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. The Arrow specification does not
+contain an unsigned 128-bit type, so it is common to encode a UUID as a fixed-size binary
+array where each element is 16 bytes long. With the metadata handling in [DataFusion 48.0.0]
+we can validate during planning that the input data not only has the correct underlying
+data type, but that it also represents the right *kind* of data. The UUID example is a
+common one, and it is included in the [canonical extension types] that are now
+supported in DataFusion.
+
+Another common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is stored as
+an array of `u8` data. Without knowing a priori what the encoding of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may work around
+this by adding another column to your data source indicating the encoding, but this can be
+wasteful for systems where the encoding never changes. Instead, you could use metadata to
+specify the encoding for the entire column.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+[variant]: https://www.ietf.org/rfc/rfc9562.html#section-4.1
+[canonical extension types]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
+
+## How to use metadata in user defined functions
+
+When working with metadata for [user defined scalar functions], there are typically two
+places in the function definition that require implementation:
+
+- Computing the return field from the arguments
+- Invocation
+
+During planning, we will attempt to call the function [return_field_from_args()]. This will
+provide a list of input fields to the function an
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer merged PR #73: URL: https://github.com/apache/datafusion-site/pull/73 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3324891884 Thank you @paleolimbot @alamb and @2010YOUY01 !
Re: [PR] Metadata handling announcement [datafusion-site]
paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2372959896

## content/blog/2025-09-21-custom-types-using-metadata.md: ##
@@ -0,0 +1,296 @@
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2369551345

## content/blog/2025-09-21-custom-types-using-metadata.md: ##
@@ -0,0 +1,309 @@
Re: [PR] Metadata handling announcement [datafusion-site]
alamb commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2369438675

## content/blog/2025-09-21-custom-types-using-metadata.md: ##
@@ -0,0 +1,298 @@
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2366208700

## content/blog/2025-07-29-metadata-handling.md: ##
@@ -0,0 +1,285 @@
+---
+layout: post
+title: Field metadata and extension type support in user defined functions
+date: 2025-07-29
+author: Tim Saucer, Dewey Dunnington, Andrew Lamb
+categories: [core]
+---
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access metadata on
+the input columns to functions and produce metadata in the output.
+
+Metadata is specified as a map of key-value pairs of strings. This extra metadata is used
+by Arrow implementations to support [extension types] and can also be used to add
+use case-specific context to a column of values where the formality of an extension type
+is not required. In previous versions of DataFusion, field metadata was propagated through
+certain operations (e.g., renaming or selecting a column) but was not accessible to others
+(e.g., scalar, window, or aggregate function calls). In the new implementation, during
+processing of all user defined functions we pass the input field information and allow
+user defined function implementations to return field information to the caller.
+
+[Extension types] are user defined data types where the data is stored using one of the
+existing [Arrow data types] but the metadata specifies how we are to interpret the
+stored data. The use of extension types was one of the primary motivations for adding
+metadata to the function processing, but arbitrary metadata can be put on the input and
+output fields. This allows for a range of other interesting use cases.
+
+[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
+[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
+[Arrow data types]: https://arrow.apache.org/docs/format/Columnar.html#data-types
+
+## Why metadata handling is important
+
+Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose we wish to write a function that takes in a UUID and returns a string
+of the [variant] of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. The Arrow specification does not
+contain an unsigned 128-bit type, so it is common to encode a UUID as a fixed-size binary
+array where each element is 16 bytes long. With the metadata handling in [DataFusion 48.0.0]
+we can validate during planning that the input data not only has the correct underlying
+data type, but that it also represents the right *kind* of data. The UUID example is a
+common one, and it is included in the [canonical extension types] that are now
+supported in DataFusion.
+
+Another common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is stored as
+an array of `u8` data. Without knowing a priori what the encoding of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may work around
+this by adding another column to your data source indicating the encoding, but this can be
+wasteful for systems where the encoding never changes. Instead, you could use metadata to
+specify the encoding for the entire column.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+[variant]: https://www.ietf.org/rfc/rfc9562.html#section-4.1
+[canonical extension types]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
+
+## How to use metadata in user defined functions
+
+When working with metadata for user defined scalar functions, there are typically two
+places in the function definition that require implementation:
+
+- Computing the return field from the arguments
+- Invocation
+
+During planning, we will attempt to call the function `return_field_from_args()`. This will
+provide a list of input fields to the function and return the output field
Re: [PR] Metadata handling announcement [datafusion-site]
paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2243202260

## content/blog/2025-07-29-metadata-handling.md: ##
@@ -0,0 +1,285 @@
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3136224877 Should we add some kind of call out to the work underway for the geoarrow udfs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3134288811 Ok, I think this is ready for review now. I'm open to all suggestions! https://datafusion.staged.apache.org/blog/2025/07/29/metadata-handling/
Re: [PR] Metadata handling announcement [datafusion-site]
paleolimbot commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3132737146 This looks great! Thank you for doing the Rust side (and apologies for not circling back to it 😬 ). I don't think the pyarrow UDF fix is particularly difficult and I'm happy to give it a go (no need to block this, obviously!)
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3132515212
@paleolimbot I've pushed a [repository here](https://github.com/timsaucer/datafusion_extension_type_examples) that demonstrates using scalar UDFs for working with UUIDs. Can you take a look at [this python code](https://github.com/timsaucer/datafusion_extension_type_examples/blob/main/python/examples/example_scalar_udf.py) to see if it matches your expectations? This is what it generates for me:
```
DataFrame()
+-----+--------------------------------------+----------------------------------+--------------------------------------+--------------+
| idx | uuid_string                          | uuid                             | uuid_string_round_trip               | uuid_version |
+-----+--------------------------------------+----------------------------------+--------------------------------------+--------------+
| 0   | ab021fa5-66dc-4b26-a959-19fa1a786777 | ab021fa566dc4b26a95919fa1a786777 | ab021fa5-66dc-4b26-a959-19fa1a786777 | 4            |
| 1   | 73faefe0-86dc-4a28-ae95-9c57a97b20ea | 73faefe086dc4a28ae959c57a97b20ea | 73faefe0-86dc-4a28-ae95-9c57a97b20ea | 4            |
| 2   | 4b7e9127-4eff-499d-bb5c-fc903076c676 | 4b7e91274eff499dbb5cfc903076c676 | 4b7e9127-4eff-499d-bb5c-fc903076c676 | 4            |
| 3   | 7d80a455-10a1-4329-9498-cf8124ca57cf | 7d80a45510a143299498cf8124ca57cf | 7d80a455-10a1-4329-9498-cf8124ca57cf | 4            |
| 4   | efa5e79e-b57f-4b9e-9d20-7cefabe5db05 | efa5e79eb57f4b9e9d207cefabe5db05 | efa5e79e-b57f-4b9e-9d20-7cefabe5db05 | 4            |
| 5   | 3760dcd8-8513-4164-a55f-652069ce0d5b | 3760dcd885134164a55f652069ce0d5b | 3760dcd8-8513-4164-a55f-652069ce0d5b | 4            |
| 6   | e0dfef4c-1463-4983-ad17-daa84202c513 | e0dfef4c14634983ad17daa84202c513 | e0dfef4c-1463-4983-ad17-daa84202c513 | 4            |
| 7   | f052994d-b55b-4c49-9004-64d2e23507b6 | f052994db55b4c49900464d2e23507b6 | f052994d-b55b-4c49-9004-64d2e23507b6 | 4            |
| 8   | 4fe043ec-9899-4b38-9c55-eda9dcc7562a | 4fe043ec98994b389c55eda9dcc7562a | 4fe043ec-9899-4b38-9c55-eda9dcc7562a | 4            |
| 9   | fadaee8d-5fa5-4b0c-a6de-2ac09f7a3354 | fadaee8d5fa54b0ca6de2ac09f7a3354 | fadaee8d-5fa5-4b0c-a6de-2ac09f7a3354 | 4            |
+-----+--------------------------------------+----------------------------------+--------------------------------------+--------------+
Field names and data types:
idx int64
uuid_string string
uuid extension
uuid_string_round_trip string_view
uuid_version uint32
```
I did need to do the work in the rust side, because I'm not yet sure how I would add metadata to the python style UDFs. I will think more about that.
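For readers following along, the relationship between the four UUID columns in that output can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the linked repository: it assumes the `uuid` column stores each value as the 16 raw bytes of the UUID (shown as hex in the table), and the variable names are made up for the example.

```python
from uuid import UUID

# First row of the output above.
uuid_string = "ab021fa5-66dc-4b26-a959-19fa1a786777"

# String -> 16-byte storage value (assumed to be what the `uuid` column holds).
storage = UUID(uuid_string).bytes

# Storage value -> string again, and the version embedded in the UUID.
uuid_string_round_trip = str(UUID(bytes=storage))
uuid_version = UUID(bytes=storage).version

print(storage.hex(), uuid_string_round_trip, uuid_version)
```

The version is read from the UUID's own bits, so no extra column is needed to carry it; what the metadata adds is the knowledge that these 16 bytes are a UUID at all.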
Re: [PR] Metadata handling announcement [datafusion-site]
alamb commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3073138372 By way of inspiration, someone was asking about user defined types (aka what this blog post is about) on hacker news: https://news.ycombinator.com/item?id=44562036#44566613
Re: [PR] Metadata handling announcement [datafusion-site]
paleolimbot commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2996949961
Example one!
```python
from uuid import UUID

import datafusion
import pyarrow as pa


@datafusion.udf([pa.string()], pa.uuid(), "stable")
def uuid_from_string(uuid_string):
    return pa.array((UUID(s).bytes for s in uuid_string.to_pylist()), pa.uuid())


@datafusion.udf([pa.uuid()], pa.string(), "stable")
def uuid_to_string(uuid):
    return pa.array(str(s) for s in uuid.to_pylist())


@datafusion.udf([pa.uuid()], pa.int64(), "stable")
def uuid_version(uuid):
    return pa.array(s.version for s in uuid.to_pylist())


def main():
    ctx = datafusion.SessionContext()
    batch = pa.record_batch({"idx": pa.array(range(100))})
    tab = (
        ctx.create_dataframe([[batch]])
        .with_column("uuid_string", datafusion.functions.uuid())
        .with_column("uuid", uuid_from_string(datafusion.col("uuid_string")))
        .with_column("uuid_string2", uuid_to_string(datafusion.col("uuid")))
        .with_column("uuid_version", uuid_version(datafusion.col("uuid")))
    )
    #> AttributeError("'bytes' object has no attribute 'version'"), since
    #> metadata doesn't make it through
    print(tab)


if __name__ == "__main__":
    main()
```
...this currently fails since the metadata doesn't make it through (I
installed datafusion-python/main)...I can take a look at that if there isn't
already a PR in the works.
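
The error message suggests that, once the extension metadata is dropped, `to_pylist()` hands the UDF plain `bytes` rather than `uuid.UUID` objects. As a stopgap illustration only (it does not fix the metadata propagation itself), a helper can accept either form. The name `uuid_version_of` is hypothetical and the sketch uses only the standard library:

```python
from uuid import UUID, uuid4


def uuid_version_of(value):
    """Return the UUID version for a uuid.UUID or its raw 16-byte storage value."""
    if isinstance(value, UUID):
        # Extension type survived: the value is already a UUID object.
        return value.version
    if isinstance(value, (bytes, bytearray)) and len(value) == 16:
        # Metadata was lost: rebuild the UUID from its storage bytes.
        return UUID(bytes=bytes(value)).version
    raise TypeError(f"expected a UUID or 16 raw bytes, got {type(value).__name__}")


sample = uuid4()
print(uuid_version_of(sample), uuid_version_of(sample.bytes))
```

This only papers over the symptom inside one function; a UDF written this way still cannot tell a real UUID column from any other 16-byte binary column, which is the gap the field metadata is meant to close.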
Re: [PR] Metadata handling announcement [datafusion-site]
paleolimbot commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2980714250 I'll try to have the examples I promised to come up with by then!
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2980469864 This week I'm busy heads down on a deliverable, but hope to pick this up early next week.
Re: [PR] Metadata handling announcement [datafusion-site]
timsaucer commented on PR #73: URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2956377850 > I have some optional example suggestions...I'm happy to help put some together! Is there a target date you're hoping to have this completed by? I don't think there's any rush. Personally, I would love it to release at the same time as datafusion-python 48 which I expect to be a couple of weeks.
Re: [PR] Metadata handling announcement [datafusion-site]
paleolimbot commented on code in PR #73: URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2135925472 ## content/blog/2025-06-09-metadata-handling.md: ## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + + + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. + +TODO: UPDATE LINKS + +[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3 + +# Why metadata handling is important + +Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each +[Field] in this `Schema` contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information. Review Comment: ```suggestion metadata is specified as a map of key-value pairs of strings. This extra metadata is used by Arrow implementations implement [extension types] and can also be used to add use case-specific context to a column of values where the formality of an extension type is not required. In previous versions of DataFusion field metadata was propagated through certain operations (e.g., renaming or selecting a column) but was not accessible to others (e.g., scalar, window, or aggregate function calls). In the new implementation, during processing of all user defined functions we pass the input field information and allow user defined function implementations to return field information to the caller. 
[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types ``` ## content/blog/2025-06-09-metadata-handling.md: ## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + + + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. + +TODO: UPDATE LINKS + +[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3 + +# Why metadata handling is important + +Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each +[Field] in this `Schema` contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information. + +It is often desirable to write a generic function for reuse. With the prior version of +user defined functions, we only had access to the `DataType` of the input columns. This +works well for some features that only rely on the types of data. Other use cases may +need additional information that describes the data. + +For example, suppose I write a function that computes the force of gravity on an object +based on it's mass. The general equation is `F = m * g` where `g = 9.8 m/s`. Suppose +our documentation for the function specifies the output will be in Newtons. This is only +valid if the input unit is in kilograms. With our metadata enhancement, we could update +this function to now evaluate the input units, perform any kind of required +transformation, and give consistent output every time. 
We could also have the function +return an error if an invalid input was given, such as providing an input where the +metadata says the units are in `meters` instead of a unit of mass. Review Comment: I wonder if we could turn this into a code example with DataFusion(Python?) UDFs to make it more concrete (I can help). Maybe a UDF called `uuid_version` or `uuid_timestamp` that extracts the embedded version or timestamp off of a UUID type (and a `uuid()` generating function)? (pyarrow and DuckDB both understand the arrow.uuid extension type out of the box which facilitates a nice interchange example where the uuid-ness isn't lost at the edges). The arbitrary key/value metadata use case is cool too (and I get that it's the use case that motivated this whole thing from your end!) but it's harder to find an in-the-wild example where a user can leverage this out of the box. The places I have run into this in the wild are basically data sources that write things there (like perhaps rerun) whose provider didn't
Re: [PR] Metadata handling announcement [datafusion-site]
alamb commented on code in PR #73: URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2134828346 ## content/blog/2025-06-09-metadata-handling.md: ## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions Review Comment: I think we could make the title a bit more specific. Maybe something like ```suggestion title: Custom types in DataFusion using Metadata ``` ## content/blog/2025-06-09-metadata-handling.md: ## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + + + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. + +TODO: UPDATE LINKS + +[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3 + +# Why metadata handling is important + +Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each +[Field] in this `Schema` contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information. + +It is often desirable to write a generic function for reuse. With the prior version of +user defined functions, we only had access to the `DataType` of the input columns. This +works well for some features that only rely on the types of data. Other use cases may +need additional information that describes the data. + +For example, suppose I write a function that computes the force of gravity on an object +based on it's mass. The general equation is `F = m * g` where `g = 9.8 m/s`. 
Suppose +our documentation for the function specifies the output will be in Newtons. This is only +valid if the input unit is in kilograms. With our metadata enhancement, we could update +this function to now evaluate the input units, perform any kind of required +transformation, and give consistent output every time. We could also have the function +return an error if an invalid input was given, such as providing an input where the +metadata says the units are in `meters` instead of a unit of mass. + +One common application of metadata handling is understanding encoding of a blob of data. +Suppose you have a column that contains image data. You could use metadata to specify +the encoding of the image data so you could use the appropriate decoder. + +[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field + +# How to use metadata in user defined functions + +Using input metadata occurs in two different phases of a user defined function. Both during +the planning phase and execution, we have access to these field information. This allows +the user to determine the appropriate output fields during planning and to validate the +input. For other use cases, it may only be necessary to access these fields during execution. +We leave this open to the user. + +For all types of user defined functions we now evaluate the output [Field] as well. You can +specify this to create your own metadata from your functions or to pass through metadata from +one or more of your inputs. + +In addition to metadata the input field information carries nullability. With these you can +create more expressive nullability of your output data instead of having a single output. +For example, you could write a function to convert a string to uppercase. If we know the +input field is non-nullable, then we can set the output field to non-nullable as well. 
+ +# Extension types + +TODO + +# Working with literals + +TODO + +# Thanks to our sponsor + +We would like to thank [Rerun.io] for sponsoring the development of this work. [Rerun.io] +is building a data visualization system for Physical AI and uses metadata to specify +context about columns in Arrow record batches. + +[Rerun.io]: https://rerun.io + +# Conclusion Review Comment: I recommend ending with a 🎣 expedition as always "This feature is still evolving and we would love you to come test it out, help us implement improvements, and document it. We are a welcoming community, etc" Basically the standard plea for help :) ## content/blog/2025-06-09-metadata-handling.md: ## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + + + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, su
