Re: [PR] Support metadata on scalar values [datafusion]

2025-05-23 Thread via GitHub


timsaucer commented on code in PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#discussion_r2105076604


##
datafusion/physical-expr/src/expressions/literal.rs:
##
@@ -34,15 +36,37 @@ use datafusion_expr_common::interval_arithmetic::Interval;
 use datafusion_expr_common::sort_properties::{ExprProperties, SortProperties};
 
 /// Represents a literal value
-#[derive(Debug, PartialEq, Eq, Hash)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct Literal {
 value: ScalarValue,
+metadata: Option>,

Review Comment:
   I have a slightly deeper cut in this PR: 
https://github.com/apache/datafusion/pull/16170
   
   In addition to the work in the current PR, that above one adds the metadata 
directly to the `Literal` variant of the logical plan rather than only the 
physical plan. I think that one is a better version long term, but it is more 
of a breaking change. If we're doing these other breaking changes for metadata 
in user defined functions in DF48, then maybe we can add it in as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support metadata on scalar values [datafusion]

2025-05-22 Thread via GitHub


paleolimbot commented on code in PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#discussion_r2103559755


##
datafusion/physical-expr/src/expressions/literal.rs:
##
@@ -34,15 +36,37 @@ use datafusion_expr_common::interval_arithmetic::Interval;
 use datafusion_expr_common::sort_properties::{ExprProperties, SortProperties};
 
 /// Represents a literal value
-#[derive(Debug, PartialEq, Eq, Hash)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct Literal {
 value: ScalarValue,
+metadata: Option>,

Review Comment:
   Keeping with the literial is OK as long as we are also prepared to make any 
other breaking changes required to pipe metadata through to ensure that it is 
not dropped/it is possible to implement extension logic in other places without 
rewriting large chunks of DataFusion code. A few places I've noted in passing 
are the Statistics, the physical and logical literals, casting logic, and some 
conversions like protobuf and to/from Python via ffi. Perhaps those places 
should be using a `Literal` instead of a `ScalarValue`?
   
   This is possibly a good illustration of the difference between "metadata" 
and extension types...an extension type scalar should not be blindly/implicitly 
cast to its storage type; however, for an integer that happened to have some 
field metadata attached to it, it would be surprising if that metadata caused 
an error somewhere in the engine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support metadata on scalar values [datafusion]

2025-05-22 Thread via GitHub


timsaucer commented on code in PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#discussion_r2102451109


##
datafusion/physical-expr/src/expressions/literal.rs:
##
@@ -34,15 +36,37 @@ use datafusion_expr_common::interval_arithmetic::Interval;
 use datafusion_expr_common::sort_properties::{ExprProperties, SortProperties};
 
 /// Represents a literal value
-#[derive(Debug, PartialEq, Eq, Hash)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct Literal {
 value: ScalarValue,
+metadata: Option>,

Review Comment:
   I think there are two compelling arguments for keeping the metadata on the 
`Literal` rather than the `ScalarValue`
   
   - Does not require a change to the protobuf definitions. I haven't tested 
this thoroughly yet, but I think if we put the metadata on the scalar value we 
would need more breaking changes on the conversion between protobuf and 
internal types, basically needing to pass the schema around. This isn't a 
deal-breaker but I could see pain points for some users.
   - The match statements like @tobixdev points out. I've also seen a growing 
usage of things like `let ScalarValue::Utf8(my_string) = val else { return 
exec_error!("invalid scalar value") }` which would not survive a case where 
they really do have a valid value.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support metadata on scalar values [datafusion]

2025-05-20 Thread via GitHub


tobixdev commented on code in PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#discussion_r2098428747


##
datafusion/physical-expr/src/expressions/literal.rs:
##
@@ -34,15 +36,37 @@ use datafusion_expr_common::interval_arithmetic::Interval;
 use datafusion_expr_common::sort_properties::{ExprProperties, SortProperties};
 
 /// Represents a literal value
-#[derive(Debug, PartialEq, Eq, Hash)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct Literal {
 value: ScalarValue,
+metadata: Option>,

Review Comment:
   Thank you for working on that! That's an interesting idea! When i 
experimented with logical scalars, I did not go down that path as I thought 
that it is very easy to miss handling metadata that only adds little 
information to a scalar.
   
   To elaborate a bit on that: Let's assume I have a `DataType::Utf8` field and 
a `len()` function that returns the length of a value. For strings, that could 
be the number of unicode characters.
   
   Now, given the following "implementation" of `len`:
   ```rust
   match value {
   ScalarValue::Utf8(value) => count_chars(value),
   _ => panic!("bye bye")
   }
   ```
   
   Now I'd like to call this function with 
`ScalarValue::WithMetadata(my_string, LANGUAGE_INFO)`. This call will panic, as 
the actual scalar is hidden behind the metadata wrapper. Initially I though 
this is a deal-breaker, as many functions just want to work on the physical 
data. After all, that's how it is with arrays, as the Metadata can be simply 
ignored.
   
   However, in a scenario where the metadata indicates an extension type, it 
may be a good thing that Metadata cannot be ignored that easily. For example, 
the `len()` function should probably return the number of keys/entries in the 
top-level JSON object/array. Carrying the metadata in a separate `ScalarValue` 
variant then may be a good thing! Especially if there is a function that 
unwraps the metadata so that some functions can operate on the scalar without 
metadata.
   
   Hope this does make some sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support metadata on scalar values [datafusion]

2025-05-20 Thread via GitHub


tobixdev commented on code in PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#discussion_r2098428747


##
datafusion/physical-expr/src/expressions/literal.rs:
##
@@ -34,15 +36,37 @@ use datafusion_expr_common::interval_arithmetic::Interval;
 use datafusion_expr_common::sort_properties::{ExprProperties, SortProperties};
 
 /// Represents a literal value
-#[derive(Debug, PartialEq, Eq, Hash)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct Literal {
 value: ScalarValue,
+metadata: Option>,

Review Comment:
   Thanks! That's an interesting idea! When i experimented with logical 
scalars, I did not go down that path as I thought that it is very easy to miss 
handling metadata that only adds little information to a scalar.
   
   To elaborate a bit on that: Let's assume I have a `DataType::Utf8` field and 
a `len()` function that returns the length of a value. For strings, that could 
be the number of unicode characters.
   
   Now, given the following "implementation" of `len`:
   ```rust
   match value {
   ScalarValue::Utf8(value) => count_chars(value),
   _ => panic!("bye bye")
   }
   ```
   
   Now I'd like to call this function with 
`ScalarValue::WithMetadata(my_string, LANGUAGE_INFO)`. This call will panic, as 
the actual scalar is hidden behind the metadata wrapper. Initially I though 
this is a deal-breaker, as many functions just want to work on the physical 
data. After all, that's how it is with arrays, as the Metadata can be simply 
ignored.
   
   However, in a scenario where the metadata indicates an extension type, it 
may be a good thing that Metadata cannot be ignored that easily. For example, 
the `len()` function should probably return the number of keys/entries in the 
top-level JSON object/array. Carrying the metadata in a separate `ScalarValue` 
variant then may be a good thing! Especially if there is a function that 
unwraps the metadata so that some functions can operate on the scalar without 
metadata.
   
   Hope this does make some sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support metadata on scalar values [datafusion]

2025-05-18 Thread via GitHub


timsaucer commented on code in PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#discussion_r2094500696


##
datafusion/physical-expr/src/expressions/literal.rs:
##
@@ -34,15 +36,37 @@ use datafusion_expr_common::interval_arithmetic::Interval;
 use datafusion_expr_common::sort_properties::{ExprProperties, SortProperties};
 
 /// Represents a literal value
-#[derive(Debug, PartialEq, Eq, Hash)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct Literal {
 value: ScalarValue,
+metadata: Option>,

Review Comment:
   Oh, I hadn't thought about that. That's a very interesting way to solve one 
of the problems I was coming up against. I will try a proof of concept of that 
on a separate branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support metadata on scalar values [datafusion]

2025-05-17 Thread via GitHub


paleolimbot commented on code in PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#discussion_r2094213318


##
datafusion/physical-expr/src/expressions/literal.rs:
##
@@ -34,15 +36,37 @@ use datafusion_expr_common::interval_arithmetic::Interval;
 use datafusion_expr_common::sort_properties::{ExprProperties, SortProperties};
 
 /// Represents a literal value
-#[derive(Debug, PartialEq, Eq, Hash)]
+#[derive(Debug, PartialEq, Eq)]
 pub struct Literal {
 value: ScalarValue,
+metadata: Option>,

Review Comment:
   Interesting - this puts the responsibility for carrying metadata on the 
`Literal` instead of the `ScalarValue`. I wonder if there are other places 
(e.g., the `Statistics`) where we'll need a scalar paired with metadata (I was 
wondering if `ScalarValue::WithMetadata(Box, HashMap>)` might work as well).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support metadata on scalar values [datafusion]

2025-05-17 Thread via GitHub


timsaucer commented on PR #16053:
URL: https://github.com/apache/datafusion/pull/16053#issuecomment-2888458903

   @kylebarron @paleolimbot I tested this latest push against the 
`test_st_point` including the additional parts that were commented out. Do you 
have other examples of problems you've run into?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]