Re: [PR] Metadata handling announcement [datafusion-site]

2025-10-18 Thread via GitHub


2010YOUY01 commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2367314150


##
content/blog/2025-09-21-custom-types-using-metadata.md:
##
@@ -0,0 +1,296 @@
+---
+layout: post
+title: Custom types in DataFusion using Metadata
+date: 2025-09-21
+author: Tim Saucer, Dewey Dunnington, Andrew Lamb
+categories: [core]
+---
+
+
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access metadata on
+the input columns to functions and produce metadata in the output.
+
+Metadata is specified as a map of key-value pairs of strings. This extra metadata is used
+by Arrow implementations to support [extension types] and can also be used to add
+use case-specific context to a column of values where the formality of an extension type
+is not required. In previous versions of DataFusion, field metadata was propagated through
+certain operations (e.g., renaming or selecting a column) but was not accessible to others
+(e.g., scalar, window, or aggregate function calls). In the new implementation, during
+processing of all user defined functions we pass the input field information and allow
+user defined function implementations to return field information to the caller.
+
+[Extension types] are user defined data types where the data is stored using one of the
+existing [Arrow data types] but the metadata specifies how we are to interpret the
+stored data. The use of extension types was one of the primary motivations for adding
+metadata to the function processing, but arbitrary metadata can be put on the input and
+output fields. This allows for a range of other interesting use cases.
+
+[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
+[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
+[Arrow data types]: https://arrow.apache.org/docs/format/Columnar.html#data-types
+
+## Why metadata handling is important
+
+Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
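
To make the shape of a field concrete, here is a minimal, dependency-free sketch. The `Field` struct and its `with_metadata` helper are simplified stand-ins invented for illustration, not the types from the `arrow` crate. The key `ARROW:extension:name` and the value `arrow.uuid` are the ones Arrow uses to mark the canonical UUID extension type.

```rust
use std::collections::HashMap;

// Simplified stand-in for an Arrow field (NOT the real `arrow` crate type).
#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: String, // storage type, kept as text to stay dependency-free
    nullable: bool,
    metadata: HashMap<String, String>, // string keys mapped to string values
}

impl Field {
    fn new(name: &str, data_type: &str, nullable: bool) -> Self {
        Field {
            name: name.to_string(),
            data_type: data_type.to_string(),
            nullable,
            metadata: HashMap::new(),
        }
    }

    // Hypothetical builder-style helper that attaches one key-value pair.
    fn with_metadata(mut self, key: &str, value: &str) -> Self {
        self.metadata.insert(key.to_string(), value.to_string());
        self
    }
}

fn main() {
    // A 16-byte binary column tagged as the canonical UUID extension type.
    let field = Field::new("id", "FixedSizeBinary(16)", false)
        .with_metadata("ARROW:extension:name", "arrow.uuid");

    println!(
        "{} ({}, nullable: {}) -> {:?}",
        field.name, field.data_type, field.nullable, field.metadata
    );
}
```

The storage type alone says "16 bytes per value"; only the metadata entry says those bytes are a UUID.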
+
+
+Figure: Relationship between a Record Batch, its schema, and the underlying arrays. There is
+a one-to-one relationship between each Field in the Schema and each Array entry in the Columns.
+
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I wish to write a function that takes in a UUID and returns a string
+of the [variant] of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. The Arrow specification does not
+contain an unsigned 128-bit value, so it is common to encode a UUID as a fixed size binary
+array where each element is 16 bytes long. With the metadata handling in [DataFusion 48.0.0]
+we can validate during planning that the input data not only has the correct underlying
+data type, but that it also represents the right *kind* of data. The UUID example is a
+common one, and it is included in the [canonical extension types] that are now
+supported in DataFusion.
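
The planning-time check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DataFusion's API: `DataType`, `Field`, and `check_uuid_input` are hypothetical, std-only stand-ins. Only the metadata key `ARROW:extension:name` and the extension name `arrow.uuid` come from the Arrow format itself.

```rust
use std::collections::HashMap;

// Tiny stand-in for Arrow's data types; only the variants this sketch needs.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
    FixedSizeBinary(i32),
    Int64,
}

// Simplified stand-in for an Arrow field (NOT the real `arrow` crate type).
struct Field {
    data_type: DataType,
    metadata: HashMap<String, String>,
}

// Hypothetical check: accept any string type, or 16-byte binary tagged as a UUID.
fn check_uuid_input(field: &Field) -> Result<(), String> {
    match &field.data_type {
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => Ok(()),
        DataType::FixedSizeBinary(16) => {
            match field.metadata.get("ARROW:extension:name").map(String::as_str) {
                Some("arrow.uuid") => Ok(()),
                Some(other) => Err(format!("expected arrow.uuid, found extension {other}")),
                None => Err("16-byte binary column is not tagged as a UUID".to_string()),
            }
        }
        other => Err(format!("unsupported storage type {other:?}")),
    }
}

fn main() {
    let uuid_tag = HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.uuid".to_string(),
    )]);
    let candidates = [
        ("tagged binary", Field { data_type: DataType::FixedSizeBinary(16), metadata: uuid_tag }),
        ("untagged binary", Field { data_type: DataType::FixedSizeBinary(16), metadata: HashMap::new() }),
        ("string", Field { data_type: DataType::Utf8, metadata: HashMap::new() }),
        ("integer", Field { data_type: DataType::Int64, metadata: HashMap::new() }),
    ];
    for (label, field) in &candidates {
        println!("{label}: {:?}", check_uuid_input(field));
    }
}
```

The untagged 16-byte column is rejected even though its storage type matches, which is the distinction a `DataType`-only interface could not express.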
+
+Another common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is stored as
+an array of `u8` data. Without knowing a priori what the encoding of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may work around
+this by adding another column to your data source indicating the encoding, but this can be
+wasteful for systems where the encoding never changes. Instead, you could use metadata to
+specify the encoding for the entire column.
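
A column-level encoding tag could be read along these lines. The metadata key `image.encoding`, the `ImageEncoding` enum, and `encoding_from_metadata` are all made up for this sketch; the post does not prescribe a key name or a set of supported encodings.

```rust
use std::collections::HashMap;

// Encodings this sketch knows about; a real system would likely have more.
#[derive(Debug, PartialEq)]
enum ImageEncoding {
    Png,
    Jpeg,
}

// Reads the (hypothetical) "image.encoding" key from a field's metadata map.
fn encoding_from_metadata(metadata: &HashMap<String, String>) -> Result<ImageEncoding, String> {
    match metadata.get("image.encoding").map(String::as_str) {
        Some("png") => Ok(ImageEncoding::Png),
        Some("jpeg") => Ok(ImageEncoding::Jpeg),
        Some(other) => Err(format!("unknown image encoding: {other}")),
        None => Err("column has no image.encoding metadata".to_string()),
    }
}

fn main() {
    // One entry on the field describes every value in the column.
    let metadata = HashMap::from([("image.encoding".to_string(), "png".to_string())]);

    match encoding_from_metadata(&metadata) {
        Ok(encoding) => println!("decode every value in the column as {encoding:?}"),
        Err(message) => println!("cannot decode column: {message}"),
    }
}
```

Because the tag lives on the field rather than in the data, it is stored once per column instead of once per row.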
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+[variant]: https://www.ietf.org/rfc/rfc9562.html#section-4.1
+[canonical extension types]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
+
+## How to use metadata in user defined functions
+
+When working with metadata for user defined scalar functions, there are typically two
+places in the function definition that require implementation.
+
+- Computing the return field from the arguments
+- Invocation
+
+During planning, we will attempt to call the function `return_field_from_args()`. This will
+provide a list of input fields to the function and return the output field. To evaluate

Re: [PR] Metadata handling announcement [datafusion-site]

2025-10-18 Thread via GitHub


alamb commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2372759527


##
content/blog/2025-09-21-custom-types-using-metadata.md:
##
@@ -0,0 +1,309 @@
+---
+layout: post
+title: Custom types in DataFusion using Metadata
+date: 2025-09-21
+author: Tim Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)
+categories: [core]
+---
+
+
+
+[TOC]
+
+## How to use metadata in user defined functions
+
+When working with metadata for [user defined scalar functions], there are typically two
+places in the function definition that require implementation.
+
+- Computing the return field from the arguments
+- Invocation
+
+During planning, we will attempt to call the function [return_field_from_args()]. This will
+provide a list of input fields to the function an

Re: [PR] Metadata handling announcement [datafusion-site]

2025-10-17 Thread via GitHub


timsaucer merged PR #73:
URL: https://github.com/apache/datafusion-site/pull/73


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Metadata handling announcement [datafusion-site]

2025-09-23 Thread via GitHub


timsaucer commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3324891884

   Thank you @paleolimbot @alamb and @2010YOUY01 !





Re: [PR] Metadata handling announcement [datafusion-site]

2025-09-23 Thread via GitHub


paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2372959896


##
content/blog/2025-09-21-custom-types-using-metadata.md:
##
@@ -0,0 +1,296 @@
+---
+layout: post
+title: Custom types in DataFusion using Metadata
+date: 2025-09-21
+author: Tim Saucer, Dewey Dunnington, Andrew Lamb
+categories: [core]
+---
+
+
+
Re: [PR] Metadata handling announcement [datafusion-site]

2025-09-22 Thread via GitHub


timsaucer commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2369551345


##
content/blog/2025-09-21-custom-types-using-metadata.md:
##
@@ -0,0 +1,309 @@
+---
+layout: post
+title: Custom types in DataFusion using Metadata
+date: 2025-09-21
+author: Tim Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)
+categories: [core]
+---
+
+
+
+[TOC]
+

Re: [PR] Metadata handling announcement [datafusion-site]

2025-09-22 Thread via GitHub


alamb commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2369438675


##
content/blog/2025-09-21-custom-types-using-metadata.md:
##
@@ -0,0 +1,298 @@
+---
+layout: post
+title: Custom types in DataFusion using Metadata
+date: 2025-09-21
+author: Tim Saucer, Dewey Dunnington, Andrew Lamb
+categories: [core]
+---
+
+
+
+## How to use metadata in user defined functions
+
+When working with metadata for user defined scalar functions, there are typically two
+places in the function definition that require implementation.
+
+- Computing the return field from the arguments
+- Invocation
+
+During planning, we will attempt to call the function `return_field_from_args()`. This will
+provide a list of input fields to the function and return the output field. To evaluate
+meta

Re: [PR] Metadata handling announcement [datafusion-site]

2025-09-21 Thread via GitHub


timsaucer commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2366208700


##
content/blog/2025-07-29-metadata-handling.md:
##
@@ -0,0 +1,285 @@
+---
+layout: post
+title: Field metadata and extension type support in user defined functions
+date: 2025-07-29
+author: Tim Saucer, Dewey Dunnington, Andrew Lamb
+categories: [core]
+---
+
+
+

Re: [PR] Metadata handling announcement [datafusion-site]

2025-07-30 Thread via GitHub


paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2243202260


##
content/blog/2025-07-29-metadata-handling.md:
##
@@ -0,0 +1,285 @@
+---
+layout: post
+title: Field metadata and extension type support in user defined functions
+date: 2025-07-29
+author: Tim Saucer, Dewey Dunnington, Andrew Lamb
+categories: [core]
+---
+
+
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom 
functions
+which enables a variety of interesting improvements. Now users can access 
metadata on
+the input columns to functions and produce metadata in the output.
+
+Metadata is specified as a map of key-value pairs of strings. This extra 
metadata is used
+by Arrow implementations to support [extension types] and can also be used to 
add
+use case-specific context to a column of values where the formality of an 
extension type
+is not required. In previous versions of DataFusion field metadata was 
propagated through
+certain operations (e.g., renaming or selecting a column) but was not 
accessible to others
+(e.g., scalar, window, or aggregate function calls). In the new 
implementation, during
+processing of all user defined functions we pass the input field information 
and allow
+user defined function implementations to return field information to the 
caller.
+
+[Extension types] are user defined data types where the data is stored using 
one of the
+existing [Arrow data types] but the metadata specifies how we are to interpret 
the
+stored data. The use of extension types was one of the primary motivations for 
adding
+metadata to the function processing, but arbitrary metadata can be put on the 
input and
+output fields. This allows for a range of other interesting use cases.
+
+[DataFusion 48.0.0]: 
https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/
+[extension types]: 
https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
+[Arrow data types]: 
https://arrow.apache.org/docs/format/Columnar.html#data-types
+
+## Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. 
Each
+[Field] in this `Schema` contains a name, data type, nullability, and 
metadata. The
+metadata is specified as a map of key-value pairs of strings.  In the new
+implementation, during processing of all user defined functions we pass the 
input
+field information.
+
+
+  
+  
+Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and the corresponding Array entry in the Columns.
+  
+
+
+It is often desirable to write a generic function for reuse. With the prior 
version of
+user defined functions, we only had access to the `DataType` of the input 
columns. This
+works well for some features that only rely on the types of data. Other use 
cases may
+need additional information that describes the data.
+
+For example, suppose I wish to write a function that takes in a UUID and 
returns a string
+of the [variant] of the input field. We would want this function to be able to 
handle
+all of the string types and also a binary encoded UUID. Because the Arrow specification
+does not contain an unsigned 128-bit type, it is common to encode a UUID as a fixed-size
+binary array where each element is 16 bytes long. With the metadata handling in 
[DataFusion 48.0.0]
+we can validate during planning that the input data not only has the correct 
underlying
+data type, but that it also represents the right *kind* of data. The UUID 
example is a
+common one, and it is included in the [canonical extension types] that are now
+supported in DataFusion.
+
+Another common application of metadata handling is understanding encoding of a 
blob of data.
+Suppose you have a column that contains image data. Most likely this data is 
stored as
+an array of `u8` data. Without knowing a priori what the encoding of that blob 
of data is,
+you cannot ensure you are using the correct methods for decoding it. You may 
work around
+this by adding another column to your data source indicating the encoding, but 
this can be
+wasteful for systems where the encoding never changes. Instead, you could use 
metadata to
+specify the encoding for the entire column.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+[variant]: https://www.ietf.org/rfc/rfc9562.html#section-4.1
+[canonical extension types]: 
https://arrow.apache.org/docs/format/CanonicalExtensions.html
+
+## How to use metadata in user defined functions
+
+When working with metadata for user defined scalar functions, there are 
typically two
+places in the function definition that require implementation.
+
+- Computing the return field from the arguments
+- Invocation
+
+During planning, we will attempt to call the function 
`return_field_from_args()`. This will
+provide a list of input fields to the function and return the output fie

Re: [PR] Metadata handling announcement [datafusion-site]

2025-07-30 Thread via GitHub


timsaucer commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3136224877

   Should we add some kind of call out to the work underway for the geoarrow 
udfs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Metadata handling announcement [datafusion-site]

2025-07-29 Thread via GitHub


timsaucer commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3134288811

   Ok, I think this is ready for review now. I'm open to all suggestions!
   
   https://datafusion.staged.apache.org/blog/2025/07/29/metadata-handling/





Re: [PR] Metadata handling announcement [datafusion-site]

2025-07-29 Thread via GitHub


paleolimbot commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3132737146

   This looks great! Thank you for doing the Rust side (and apologies for not 
circling back to it 😬 ). I don't think the pyarrow UDF fix is particularly 
difficult and I'm happy to give it a go (no need to block this, obviously!)





Re: [PR] Metadata handling announcement [datafusion-site]

2025-07-29 Thread via GitHub


timsaucer commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3132515212

   @paleolimbot I've pushed a [repository 
here](https://github.com/timsaucer/datafusion_extension_type_examples) that 
demonstrates using scalar UDFs for working with UUIDs. Can you take a look at 
[this python 
code](https://github.com/timsaucer/datafusion_extension_type_examples/blob/main/python/examples/example_scalar_udf.py)
 to see if it matches your expectations? 
   
   This is what it generates for me:
   
   ```
   DataFrame()
   +-----+--------------------------------------+----------------------------------+--------------------------------------+--------------+
   | idx | uuid_string                          | uuid                             | uuid_string_round_trip               | uuid_version |
   +-----+--------------------------------------+----------------------------------+--------------------------------------+--------------+
   | 0   | ab021fa5-66dc-4b26-a959-19fa1a786777 | ab021fa566dc4b26a95919fa1a786777 | ab021fa5-66dc-4b26-a959-19fa1a786777 | 4            |
   | 1   | 73faefe0-86dc-4a28-ae95-9c57a97b20ea | 73faefe086dc4a28ae959c57a97b20ea | 73faefe0-86dc-4a28-ae95-9c57a97b20ea | 4            |
   | 2   | 4b7e9127-4eff-499d-bb5c-fc903076c676 | 4b7e91274eff499dbb5cfc903076c676 | 4b7e9127-4eff-499d-bb5c-fc903076c676 | 4            |
   | 3   | 7d80a455-10a1-4329-9498-cf8124ca57cf | 7d80a45510a143299498cf8124ca57cf | 7d80a455-10a1-4329-9498-cf8124ca57cf | 4            |
   | 4   | efa5e79e-b57f-4b9e-9d20-7cefabe5db05 | efa5e79eb57f4b9e9d207cefabe5db05 | efa5e79e-b57f-4b9e-9d20-7cefabe5db05 | 4            |
   | 5   | 3760dcd8-8513-4164-a55f-652069ce0d5b | 3760dcd885134164a55f652069ce0d5b | 3760dcd8-8513-4164-a55f-652069ce0d5b | 4            |
   | 6   | e0dfef4c-1463-4983-ad17-daa84202c513 | e0dfef4c14634983ad17daa84202c513 | e0dfef4c-1463-4983-ad17-daa84202c513 | 4            |
   | 7   | f052994d-b55b-4c49-9004-64d2e23507b6 | f052994db55b4c49900464d2e23507b6 | f052994d-b55b-4c49-9004-64d2e23507b6 | 4            |
   | 8   | 4fe043ec-9899-4b38-9c55-eda9dcc7562a | 4fe043ec98994b389c55eda9dcc7562a | 4fe043ec-9899-4b38-9c55-eda9dcc7562a | 4            |
   | 9   | fadaee8d-5fa5-4b0c-a6de-2ac09f7a3354 | fadaee8d5fa54b0ca6de2ac09f7a3354 | fadaee8d-5fa5-4b0c-a6de-2ac09f7a3354 | 4            |
   +-----+--------------------------------------+----------------------------------+--------------------------------------+--------------+

   Field names and data types:
   idx int64
   uuid_string string
   uuid extension
   uuid_string_round_trip string_view
   uuid_version uint32
   ```
   
   I did need to do the work on the Rust side, because I'm not yet sure how I would add metadata to the Python-style UDFs. I will think more about that.
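
The relationship between the columns in the output above can be reproduced with Python's standard `uuid` module alone. This is a minimal sketch, not the code from the linked repository: the helper name `describe_uuid` is hypothetical, and the input is the first row of the table.

```python
from uuid import UUID


def describe_uuid(uuid_string: str) -> dict:
    """Derive the other table columns from a UUID string (illustrative helper)."""
    parsed = UUID(uuid_string)
    return {
        "uuid_string": uuid_string,
        # The 16-byte storage value, which the table displays as 32 hex digits.
        "uuid": parsed.bytes.hex(),
        # Rebuilding a UUID from the storage bytes yields the original string.
        "uuid_string_round_trip": str(UUID(bytes=parsed.bytes)),
        # The version is encoded in the UUID itself.
        "uuid_version": parsed.version,
    }


row = describe_uuid("ab021fa5-66dc-4b26-a959-19fa1a786777")
print(row)
```

The point of the sketch is that the `uuid` column carries no information beyond its 16 bytes; what marks it as a UUID rather than an arbitrary binary value is the metadata on the field.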





Re: [PR] Metadata handling announcement [datafusion-site]

2025-07-15 Thread via GitHub


alamb commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-3073138372

   By way of inspiration, someone was asking about user defined types (aka what 
this blog post is about) on hacker news:
   
   https://news.ycombinator.com/item?id=44562036#44566613





Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-23 Thread via GitHub


paleolimbot commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2996949961

   Example one!
   
   ```python
   from uuid import UUID

   import datafusion
   import pyarrow as pa


   @datafusion.udf([pa.string()], pa.uuid(), "stable")
   def uuid_from_string(uuid_string):
       return pa.array((UUID(s).bytes for s in uuid_string.to_pylist()), pa.uuid())


   @datafusion.udf([pa.uuid()], pa.string(), "stable")
   def uuid_to_string(uuid):
       return pa.array(str(s) for s in uuid.to_pylist())


   @datafusion.udf([pa.uuid()], pa.int64(), "stable")
   def uuid_version(uuid):
       return pa.array(s.version for s in uuid.to_pylist())


   def main():
       ctx = datafusion.SessionContext()

       batch = pa.record_batch({"idx": pa.array(range(100))})
       tab = (
           ctx.create_dataframe([[batch]])
           .with_column("uuid_string", datafusion.functions.uuid())
           .with_column("uuid", uuid_from_string(datafusion.col("uuid_string")))
           .with_column("uuid_string2", uuid_to_string(datafusion.col("uuid")))
           .with_column("uuid_version", uuid_version(datafusion.col("uuid")))
       )
       #> AttributeError("'bytes' object has no attribute 'version'"), since metadata doesn't make it through

       print(tab)


   if __name__ == "__main__":
       main()
   ```
   
   ...this currently fails since the metadata doesn't make it through (I 
installed datafusion-python/main)...I can take a look at that if there isn't 
already a PR in the works.
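
Until the extension metadata reaches Python UDFs, one possible workaround is to treat each incoming value as the 16-byte storage of `arrow.uuid` and rebuild the `UUID` object by hand. The sketch below uses only the standard library. `uuid_version_of` is a hypothetical helper, and it assumes each element arrives as either a `UUID` or raw `bytes`, as the `AttributeError` above suggests.

```python
from typing import Optional, Union
from uuid import UUID


def uuid_version_of(value: Union[UUID, bytes, None]) -> Optional[int]:
    """Return the UUID version for a UUID object or its 16-byte storage value."""
    if value is None:
        # Preserve nulls instead of failing on them.
        return None
    if isinstance(value, UUID):
        # The extension type survived, so the value is already a UUID.
        return value.version
    if len(value) != 16:
        raise ValueError(f"expected 16 bytes of UUID storage, got {len(value)}")
    # Only the storage bytes arrived; reconstruct the UUID explicitly.
    return UUID(bytes=value).version


raw = UUID("73faefe0-86dc-4a28-ae95-9c57a97b20ea").bytes
versions = [uuid_version_of(v) for v in (raw, UUID(bytes=raw), None)]
print(versions)
```

With a helper like this, the body of `uuid_version` could map `uuid_version_of` over `uuid.to_pylist()`, so the same UDF would work whether or not the metadata makes it through.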





Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-17 Thread via GitHub


paleolimbot commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2980714250

   I'll try to have the examples I promised to come up with by then!





Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-17 Thread via GitHub


timsaucer commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2980469864

   This week I'm busy heads down on a deliverable, but hope to pick this up 
early next week.





Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-09 Thread via GitHub


timsaucer commented on PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#issuecomment-2956377850

   > I have some optional example suggestions...I'm happy to help put some 
together! Is there a target date you're hoping to have this completed by?
   
   I don't think there's any rush. Personally, I would love it to release at 
the same time as datafusion-python 48 which I expect to be a couple of weeks.





Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-09 Thread via GitHub


paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2135925472


##
content/blog/2025-06-09-metadata-handling.md:
##
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom 
functions
+which enables a variety of interesting improvements. Now users can access 
additional
+data about the input columns to functions, such as their nullability and 
metadata. This
+enables processing of extension types as well as a wide variety of other use 
cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. 
Each
+[Field] in this `Schema` contains a name, data type, nullability, and 
metadata. The
+metadata is specified as a map of key-value pairs of strings.  In the new
+implementation, during processing of all user defined functions we pass the 
input
+field information.

Review Comment:
   ```suggestion
   metadata is specified as a map of key-value pairs of strings. This extra 
metadata is used
   by Arrow implementations to implement [extension types] and can also be used to 
add
   use case-specific context to a column of values where the formality of an 
extension type
   is not required. In previous versions of DataFusion field metadata was 
propagated through
   certain operations (e.g., renaming or selecting a column) but was not 
accessible to others
   (e.g., scalar, window, or aggregate function calls). In the new 
implementation, during
   processing of all user defined functions we pass the input field information 
and allow
   user defined function implementations to return field information to the 
caller.
   
   [extension types]: 
https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
   ```
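
To make the two uses of field metadata in this suggestion concrete, here is a minimal standard-library sketch. `FieldSketch` and `extension_name` are hypothetical stand-ins, not Arrow or DataFusion classes. The `ARROW:extension:name` and `ARROW:extension:metadata` keys are the reserved keys from the Arrow columnar format, `arrow.uuid` is the canonical UUID extension name, and the `encoding` key is an assumed, application-defined key for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Reserved metadata keys that the Arrow format uses to mark extension types.
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"


@dataclass(frozen=True)
class FieldSketch:
    """Hypothetical stand-in for an Arrow Field: name, type, nullability, metadata."""

    name: str
    storage_type: str
    nullable: bool = True
    metadata: Dict[str, str] = field(default_factory=dict)


def extension_name(f: FieldSketch) -> Optional[str]:
    """Return the extension type name, or None for a plain field."""
    return f.metadata.get(EXTENSION_NAME_KEY)


# A formal extension type: 16-byte storage plus the reserved name key.
uuid_field = FieldSketch(
    "id",
    "fixed_size_binary[16]",
    metadata={EXTENSION_NAME_KEY: "arrow.uuid", EXTENSION_METADATA_KEY: ""},
)

# Use-case-specific context without the formality of an extension type.
image_field = FieldSketch("image", "binary", metadata={"encoding": "png"})

print(extension_name(uuid_field), extension_name(image_field))
```

Both fields carry the same kind of string-to-string map; the difference is only whether the keys are the reserved extension keys or ones the application chooses for itself.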



##
content/blog/2025-06-09-metadata-handling.md:
##
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom 
functions
+which enables a variety of interesting improvements. Now users can access 
additional
+data about the input columns to functions, such as their nullability and 
metadata. This
+enables processing of extension types as well as a wide variety of other use 
cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. 
Each
+[Field] in this `Schema` contains a name, data type, nullability, and 
metadata. The
+metadata is specified as a map of key-value pairs of strings.  In the new
+implementation, during processing of all user defined functions we pass the 
input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior 
version of
+user defined functions, we only had access to the `DataType` of the input 
columns. This
+works well for some features that only rely on the types of data. Other use 
cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on 
an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. 
Suppose
+our documentation for the function specifies the output will be in Newtons. 
This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we 
could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the 
function
+return an error if an invalid input was given, such as providing an input 
where the
+metadata says the units are in `meters` instead of a unit of mass.

Review Comment:
   I wonder if we could turn this into a code example with DataFusion(Python?) 
UDFs to make it more concrete (I can help). Maybe a UDF called `uuid_version` 
or `uuid_timestamp` that extracts the embedded version or timestamp off of a 
UUID type (and a `uuid()` generating function)? (pyarrow and DuckDB both 
understand the arrow.uuid extension type out of the box which facilitates a 
nice interchange example where the uuid-ness isn't lost at the edges).
   
   The arbitrary key/value metadata use case is cool too (and I get that it's 
the use case that motivated this whole thing from your end!) but it's harder to 
find an in-the-wild example where a user can leverage this out of the box. The 
places I have run into this in the wild are basically data sources that write 
things there (like perhaps rerun) whose provider didn't 

Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-08 Thread via GitHub


alamb commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2134828346


##
content/blog/2025-06-09-metadata-handling.md:
##
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions

Review Comment:
   I think we could make the title a bit more specific. Maybe something like
   
   ```suggestion
   title: Custom types in DataFusion using Metadata
   ```



##
content/blog/2025-06-09-metadata-handling.md:
##
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom 
functions
+which enables a variety of interesting improvements. Now users can access 
additional
+data about the input columns to functions, such as their nullability and 
metadata. This
+enables processing of extension types as well as a wide variety of other use 
cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. 
Each
+[Field] in this `Schema` contains a name, data type, nullability, and 
metadata. The
+metadata is specified as a map of key-value pairs of strings.  In the new
+implementation, during processing of all user defined functions we pass the 
input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior 
version of
+user defined functions, we only had access to the `DataType` of the input 
columns. This
+works well for some features that only rely on the types of data. Other use 
cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on 
an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. 
Suppose
+our documentation for the function specifies the output will be in Newtons. 
This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we 
could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the 
function
+return an error if an invalid input was given, such as providing an input 
where the
+metadata says the units are in `meters` instead of a unit of mass.
+
+One common application of metadata handling is understanding encoding of a 
blob of data.
+Suppose you have a column that contains image data. You could use metadata to 
specify
+the encoding of the image data so you could use the appropriate decoder.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+
+# How to use metadata in user defined functions
+
+Using input metadata occurs in two different phases of a user defined 
function. Both during
+the planning phase and execution, we have access to this field information. 
This allows
+the user to determine the appropriate output fields during planning and to 
validate the
+input. For other use cases, it may only be necessary to access these fields 
during execution.
+We leave this open to the user.
+
+For all types of user defined functions we now evaluate the output [Field] as 
well. You can
+specify this to create your own metadata from your functions or to pass 
through metadata from
+one or more of your inputs.
+
+In addition to metadata, the input field information carries nullability. With this you can
+derive the nullability of your output from the inputs instead of declaring one fixed value.
+For example, you could write a function to convert a string to uppercase. If 
we know the
+input field is non-nullable, then we can set the output field to non-nullable 
as well.
+
+# Extension types
+
+TODO
+
+# Working with literals
+
+TODO
+
+# Thanks to our sponsor
+
+We would like to thank [Rerun.io] for sponsoring the development of this work. 
[Rerun.io]
+is building a data visualization system for Physical AI and uses metadata to 
specify 
+context about columns in Arrow record batches.
+
+[Rerun.io]: https://rerun.io
+
+# Conclusion

Review Comment:
   I recommend ending with a 🎣  expedition as always
   
   "This feature is still evolving and we would love you to come test it out, 
help us implement improvements, and document it. We are a welcoming community, 
etc"
   
   Basically the standard plea for help :)



##
content/blog/2025-06-09-metadata-handling.md:
##
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom 
functions
+which enables a variety of interesting improvements. Now users can access 
additional
+data about the input columns to functions, su