Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-05-01 Thread via GitHub


alamb closed pull request #10221: fix: preserve more dictionaries when coercing 
types
URL: https://github.com/apache/datafusion/pull/10221


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-05-01 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2089086256

   I believe this is superceded by 
https://github.com/apache/datafusion/pull/10323


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-29 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2083654007

   I had a chat with @erratic-pattern  the plan:
   
   We'll put this PR into draft mode
   
   1. He is going to make a ticket + PR will improve unwrap cast comparison to 
handle dictionaries
   2. I will review https://github.com/apache/datafusion/pull/10268 from 
@jayzhan211  and hopefully get that sorted out
   3. Once that is complete, we'll revisit the approach in this PR and decide 
how to proceed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-26 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-207731

   rebased onto main


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-26 Thread via GitHub


erratic-pattern commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1581447540


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   I agree with @jayzhan211 that we probably don't want to cast everything to 
dictionaries in the way that we are currently doing it, and what we really want 
is a way for the optimizer to avoid expensive casts of dictionary columns, and 
more generally to just avoid column casts in favor of casting constants and 
scalar expressions. 
   
   I think what we have works fine for now and fixes performance issues we're 
seeing on dictionary columns, but should be improved for the general case in 
subsequent PRs that redesign the type coercion logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-26 Thread via GitHub


alamb commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1581375417


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   FWIW I tried the example from @jayzhan211  on main and it doesn't put a cast 
on the filter -- thus I agree this PR would be a regression if merged as is. I 
will dismiss my review
   
   ```
   DataFusion CLI v37.1.0
   > create table test as values
 ('row1', arrow_cast(1, 'Utf8')),
 ('row2', arrow_cast(2, 'Utf8')),
 ('row3', arrow_cast(3, 'Utf8'))
   ;
   0 row(s) fetched.
   Elapsed 0.050 seconds.
   
   > explain SELECT * from test where column2 = arrow_cast(2, 
'Dictionary(Int32, Int64)');
   
   +---+---+
   | plan_type | plan  |
   +---+---+
   | logical_plan  | Filter: test.column2 = Utf8("2")  |
   |   |   TableScan: test projection=[column1, column2]   |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192   |
   |   |   FilterExec: column2@1 = 2   |
   |   | MemoryExec: partitions=1, partition_sizes=[1] |
   |   |   |
   +---+---+
   2 row(s) fetched.
   Elapsed 0.053 seconds.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-26 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2079894304

   > I'm wondering why coalesce has the signature VariadicEqual, since it takes 
a first non-null value, ideally, coercion is not necessary. I think 
`VariadicAny` is more suitable.
   > 
   > In this PR, coalesce do the coercion internally, I forgot why we do 
coercion but not returning first non-null value with the type it has
   
   I can't remember either but I remember it was tricky -- I suggest we open a 
new ticket / discussion about it rather than trying to change the behavior in 
this PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-26 Thread via GitHub


alamb commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1581371173


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   > 02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
   
   Yes I agree that looks bad. It should be unwrapped. 
   
   Thank you for that example 💯 
   
   Maybe we could  extend 
https://github.com/apache/datafusion/blob/a5ce56831a9ec61e634d1c285c1b28d8c3891503/datafusion/optimizer/src/unwrap_cast_in_comparison.rs#L160
  which handles `CAST(col, ..) = const` for other datatypes 🤔 
   
   I can try to do so later this weekend or



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-26 Thread via GitHub


alamb commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1581371173


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   > 02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
   
   Yes I agree that looks bad. It should be unwrapped. 
   
   Thank you for that example 💯 
   
   Maybe we could  extend 
https://github.com/apache/datafusion/blob/a5ce56831a9ec61e634d1c285c1b28d8c3891503/datafusion/optimizer/src/unwrap_cast_in_comparison.rs#L160
  which handles `CAST(col, ..) = const` for other datatypes 🤔 
   
   I can try to do so later this weekend. Or would you like to try it 
@erratic-pattern ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1580522622


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   I think what we need to solve the issue is avoiding `casting from value to 
dict for column`, because casting for column is costly compare with casting 
constant.
   
   Given the example, if we preserve dict, we still ends up casting column 
(utf8) to Dict (i32,utf8), but in this case, we can cast the const from i64 to 
utf8 and it is enough.
   
   ```
   statement ok
   create table test as values
 ('row1', arrow_cast(1, 'Utf8')),
 ('row2', arrow_cast(2, 'Utf8')),
 ('row3', arrow_cast(3, 'Utf8'))
   ;
   
   # query using an string '1' which must be coerced into a dictionary string
   query TT
   SELECT * from test where column2 =  arrow_cast(2, 'Dictionary(Int32, 
Int64)');
   
   row2 2
   
   query TT
   explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, 
Int64)');
   
   logical_plan
   01)Filter: CAST(test.column2 AS Dictionary(Int32, Utf8)) = Dictionary(Int32, 
Utf8("2"))
   02)--TableScan: test projection=[column1, column2]
   physical_plan
   01)CoalesceBatchesExec: target_batch_size=8192
   02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
   03)MemoryExec: partitions=1, partition_sizes=[1]
   
   statement ok
   drop table test;
   ```
   
   expect plan
   ```
   01)Filter: test.column2 = Utf8("2")
   02)--TableScan: test projection=[column1, column2]
   physical_plan
   01)CoalesceBatchesExec: target_batch_size=8192
   02)--FilterExec: column2@1 = 2
   03)MemoryExec: partitions=1, partition_sizes=[1]
   ```
   
   
   I think we should not preserve dict, but need specialization on  comparing 
dict vs non-dict case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1580522622


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   I think what we need to solve the issue is avoiding `casting from value to 
dict for column`, because casting for column is costly compare with casting 
constant.
   
   Given the example, if we preserve dict, we still ends up casting column 
(utf8) to Dict (i32,utf8), but in this case, we can cast the const from i64 to 
utf8 and it is enough.
   
   ```
   statement ok
   create table test as values
 ('row1', arrow_cast(1, 'Utf8')),
 ('row2', arrow_cast(2, 'Utf8')),
 ('row3', arrow_cast(3, 'Utf8'))
   ;
   
   # query using an string '1' which must be coerced into a dictionary string
   query TT
   SELECT * from test where column2 =  arrow_cast(2, 'Dictionary(Int32, 
Int64)');
   
   row2 2
   
   query TT
   explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, 
Int64)');
   
   logical_plan
   01)Filter: CAST(test.column2 AS Dictionary(Int32, Utf8)) = Dictionary(Int32, 
Utf8("2"))
   02)--TableScan: test projection=[column1, column2]
   physical_plan
   01)CoalesceBatchesExec: target_batch_size=8192
   02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
   03)MemoryExec: partitions=1, partition_sizes=[1]
   
   statement ok
   drop table test;
   ```
   
   
   I think we should not preserve dict, but need specialization on  comparing 
dict vs non-dict case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1580522622


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   I think we should avoid `casting from value to dict for column`
   
   Given the example, if we preserve dict, we still ends up casting column 
(utf8) to Dict (i32,utf8), but in this case, we can cast the const from i64 to 
utf8 and it is enough.
   
   ```
   statement ok
   create table test as values
 ('row1', arrow_cast(1, 'Utf8')),
 ('row2', arrow_cast(2, 'Utf8')),
 ('row3', arrow_cast(3, 'Utf8'))
   ;
   
   # query using an string '1' which must be coerced into a dictionary string
   query TT
   SELECT * from test where column2 =  arrow_cast(2, 'Dictionary(Int32, 
Int64)');
   
   row2 2
   
   query TT
   explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, 
Int64)');
   
   logical_plan
   01)Filter: CAST(test.column2 AS Dictionary(Int32, Utf8)) = Dictionary(Int32, 
Utf8("2"))
   02)--TableScan: test projection=[column1, column2]
   physical_plan
   01)CoalesceBatchesExec: target_batch_size=8192
   02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
   03)MemoryExec: partitions=1, partition_sizes=[1]
   
   statement ok
   drop table test;
   ```
   
   
   I think we should not preserve dict, but need specialization on  comparing 
dict vs non-dict case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1580504830


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   Maybe we should not preserve dict for comparison overall 🤔 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1580496844


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   Probably because value type is enough for comparison



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1580491542


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   I guess for comparison logic, preserving dict is not necessary, we just care 
about the value type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1580491542


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   I guess for comparison logic, preserving dict is not necessary, we just care 
about the value type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


jayzhan211 commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2078386508

   > coalesce
   
   I'm wondering why coalesce has the signature VariadicEqual, since it takes a 
first non-null value, ideally, coercion is not necessary. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


alamb commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1579870509


##
datafusion/sqllogictest/test_files/dictionary.slt:
##
@@ -386,3 +386,53 @@ drop table m3;
 
 statement ok
 drop table m3_source;
+
+
+## Test that filtering on dictionary columns coerces the filter value to the 
dictionary type
+statement ok
+create table test as values
+  ('row1', arrow_cast('1', 'Dictionary(Int32, Utf8)')),
+  ('row2', arrow_cast('2', 'Dictionary(Int32, Utf8)')),
+  ('row3', arrow_cast('3', 'Dictionary(Int32, Utf8)'))
+;
+
+# query using an string '1' which must be coerced into a dictionary string
+query T?
+SELECT * from test where column2 = '1';
+
+row1 1
+
+# filter should not have a cast on column2
+query TT
+explain SELECT * from test where column2 = '1';
+
+logical_plan
+01)Filter: test.column2 = Dictionary(Int32, Utf8("1"))
+02)--TableScan: test projection=[column1, column2]
+physical_plan
+01)CoalesceBatchesExec: target_batch_size=8192
+02)--FilterExec: column2@1 = 1
+03)MemoryExec: partitions=1, partition_sizes=[1]
+
+
+# Now query using an integer which must be coerced into a dictionary string
+query T?
+SELECT * from test where column2 = 1;
+
+row1 1
+
+# filter should not have a cast on column2
+query TT
+explain SELECT * from test where column2 = 1;
+
+logical_plan
+01)Filter: test.column2 = Dictionary(Int32, Utf8("1"))

Review Comment:
   the key is that there is no CAST on `column2` here



##
datafusion/sqllogictest/test_files/scalar.slt:
##
@@ -1768,52 +1768,61 @@ SELECT make_array(1, 2, 3);
 [1, 2, 3]
 
 # coalesce static empty value
-query T
-SELECT COALESCE('', 'test')
+query TT
+SELECT   COALESCE('', 'test'),
+arrow_typeof(COALESCE('', 'test'))
 
-(empty)
+(empty) Utf8
 
 # coalesce static value with null
-query T
-SELECT COALESCE(NULL, 'test')
+query TT
+SELECT   COALESCE(NULL, 'test'),
+arrow_typeof(COALESCE(NULL, 'test'))
 
-test
-
+test Utf8
 
+# Create table with a dictionary value
 statement ok
 create table test1 as values (arrow_cast('foo', 'Dictionary(Int32, Utf8)')), 
(null);
 
-# test coercion string
-query ?
-select  coalesce(column1, 'none_set') from test1;
+# test coercion string (should preserve the dictionary type)
+query ?T
+select   coalesce(column1, 'none_set'),
+arrow_typeof(coalesce(column1, 'none_set'))
+from test1;
 
-foo
-none_set
+foo Dictionary(Int32, Utf8)
+none_set Dictionary(Int32, Utf8)
 
-# test coercion Int
-query I
-select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
+# test coercion Int and Dictionary
+query ?T

Review Comment:
   As @erratic-pattern  noted the difference here is that now that the output 
type of cealesce is `Dictionary` rather than Int. 



##
datafusion/sqllogictest/test_files/scalar.slt:
##
@@ -1768,52 +1768,61 @@ SELECT make_array(1, 2, 3);
 [1, 2, 3]
 
 # coalesce static empty value
-query T
-SELECT COALESCE('', 'test')
+query TT

Review Comment:
   I updated these tests to show what the coerced type was as well 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


alamb commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1579832028


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   Not that I know of



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-203700

   For this PR I suggest:
   1. We add tests showing the desired behavior (I will push a commit or two)
   2. Update the coercion tests to show the resulting types
   
   Then as a follow on, if you want to take on the refactoring of type coecion 
we can start doing that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-201173

   > I updated the tests, but I'm still unsure if it's okay to ignore this 
change in behavior. It seems like this could be a regression in other cases 
where we might be converting things to dictionaries for no reason.
   
   I think the reason to convert to a dictonary is that the other side of the 
comparsion is already a dictionary, which thus avoids convering *both* 
arguments (though I may be missing something)
   
   Maybe @viirya  has some thoughts given he worked on coercion as part of 
https://github.com/apache/datafusion/pull/9459 recently
   
   > It would be nice to have some issue tracking this, or maybe there is one 
already.
   
   Perhaps https://github.com/apache/datafusion/issues/5928 ? 
   
   There appear to be a bunch of open tickets about coercion 
https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+coercion
   
   > A possibly easy change to make in the short-term is to refactor away from 
using `comparison_coercion` for non-comparisons and instead use coercion logic 
specifically tailored to those expressions.
   
   This seems reasonable to me (aka have special rules for `coerce`). Other 
functions like `BETWEEN` I think can be thought of as syntactic sugar for 
comparisons (namely `x > low AND x < high`) 
   
   > I just feels like we're playing Jenga with type coercion logic right now. 
The coercion logic should prioritize avoiding expensive casts, I think.
   
   The key to ensuring we don't break existing code is to rely on the tests. 
   
   I agree the type coercion logic could be improved -- specifically I think it 
needs to have some encapsulation rather than being spread out through a bunch 
of free functions.
   
   Is this something you are interested in helping out with? I think the first 
thing to do would be to try and explain how the current logic works (and in 
that process we will likely uncover ways to make it better). Then,  would 
improve the code the structure to encapsulate the logic (into structs / enums 
perhaps -- like I did recently in 
https://github.com/apache/datafusion/pull/10216).  Once the logic is more 
encapsulated, then I think we'll be able to reason about it and feel good about 
how it fits into the overall picture
   
   I think the difference in casting a scalar integer to a scalar dictionary is 
neglible. The difference casting a column to a different type is likely 
subtantial (though casting string --> dictionary doesn't require any data 
copying)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077734372

   > > I think it's now complaining that the type is no longer an integer and 
is instead Dictionary
   > 
   > That seems like you could just update the test. Perhaps via 
https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#updating-tests-completion-mode
   
   I updated the tests, but I'm still unsure if it's okay to ignore this change 
in behavior. It seems like this could be a regression in other cases where we 
might be converting things to dictionaries for no reason.
   
   It would be nice to have some issue tracking this, or maybe there is one 
already.
   
   A possibly easy change to make in the short-term is to refactor away from 
using `comparison_coercion` for non-comparisons and instead use coercion logic 
specifically tailored to those expressions.
   
   I just feels like we're playing Jenga with type coercion logic right now. 
The coercion logic should prioritize avoiding expensive casts, I think.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077684900

   This diff shows a list of types that coalesce can accept. There might be 
others but this would serve as a good starting point:
   
https://github.com/apache/datafusion/pull/9459/files#diff-9b644c479dfb999609b40e8da6d3b2a40c7adadf88296eeefd2f803201e7ab6dL25-L43


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077679196

   this comment also seems to hint at a possible refactor:
   
https://github.com/apache/datafusion/blob/f47e74d6d52000eed6f9472bc2cc43a16b8814a8/datafusion/expr/src/type_coercion/functions.rs#L190-L194
   
   we could potentially rewrite `coalesce`  coercion to use `maybe_data_type` 
which allows restricting coercion to a list of output types
   
https://github.com/apache/datafusion/blob/f47e74d6d52000eed6f9472bc2cc43a16b8814a8/datafusion/expr/src/type_coercion/functions.rs#L262-L271


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077669921

   > > I may try to approach a solution from a different angle here, to avoid 
needing to refactor the type coercion logic too much without any clear 
understanding of what the impacts could be.
   > 
   > FWIW I think the coercion is the right approach here. Let me take a look 
at this PR and see if I can find something
   
   I think it is the right approach for this one particular case, but the 
`coalesce` example in the tests is a good example of where you probably do want 
to convert to a simple integer type rather than converting everything to 
dictionaries. 
   
   Maybe we need to decouple the non-comparison type coercions from using the 
`comparison_coercion` function and have their own coercion function(s). For 
example `VariadicEqual` functions can use a new `operand_coercion` function 
which does the same thing as the previous `comparison_coercion` function, but 
with `preserve_dictionaries` flag set to false.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077662992

   Or even more generally, you could ignore whether the operands are column 
references or constants and just have:
   
   `ValueType( :: Dictionary)  
ValueType( :: )`
   becomes:
   `( :: Dictionary)  CAST( as 
Dictionary)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077656268

   > I may try to approach a solution from a different angle here, to avoid 
needing to refactor the type coercion logic too much without any clear 
understanding of what the impacts could be.
   
   
   
   FWIW I think the coercion is the right approach here. Let me take a look at 
this PR and see if I can find something


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077614253

   I may try to approach a solution from a different angle here, to avoid 
needing to refactor the type coercion logic too much without any clear 
understanding of what the impacts could be.
   
   Think a more specific optimization might look something like
   `ValueType(column)  CAST( as ValueType)`
   to:
   `column  CAST( as Dictionary)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2077604564

   > I think it's now complaining that the type is no longer an integer and is 
instead Dictionary
   
   
   That seems like you could just update the test. Perhaps via 
https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#updating-tests-completion-mode


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-25 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2076826036

   Marking as draft as I think this PR is still in progress and the CI is not 
yet passing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2076328245

   It looks like the failing test case is no longer throwing an error after I 
added @jayzhan211 's suggestion:
   
   ```
   DataFusion CLI v37.1.0
   > select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
   
++
   | coalesce(Int64(34),arrow_cast(Int64(123),Utf8("Dictionary(Int32, Int8)"))) 
|
   
++
   | 34 
|
   
++
   1 row(s) fetched.
   Elapsed 0.071 seconds.
   ```
   
I think it's now complaining that the type is no longer an integer and is 
instead `Dictionary`
   
   ```
   External error: query columns mismatch:
   [SQL] select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
   [Expected] [I]
   [Actual  ] [?]
   at test_files/scalar.slt:1794
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


erratic-pattern commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1578809051


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -638,22 +638,26 @@ fn dictionary_coercion(
 preserve_dictionaries: bool,
 ) -> Option {
 use arrow::datatypes::DataType::*;
+let maybe_preserve_dictionaries =
+|index_type: &Box, value_type: DataType| -> DataType {
+if preserve_dictionaries {
+Dictionary(index_type.clone(), Box::new(value_type))
+} else {
+value_type
+}
+};
 match (lhs_type, rhs_type) {
 (
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),

Review Comment:
   Is there any particular reason why this case doesn't  also preserve the 
dictionary type? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


erratic-pattern commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1578804434


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -643,9 +643,8 @@ fn dictionary_coercion(
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),
-(d @ Dictionary(_, value_type), other_type)
-| (other_type, d @ Dictionary(_, value_type))
-if preserve_dictionaries && value_type.as_ref() == other_type =>
+(d @ Dictionary(_, _), other_type) | (other_type, d @ Dictionary(_, _))
+if preserve_dictionaries && can_cast_types(other_type, d) =>

Review Comment:
   @jayzhan211 Thanks for the tip.  I initially avoiding calling 
`comparison_coercion` because the way I was doing it would cause an infinite 
recursion, but calling `comparison_coercion` on the inner value type and then 
wrapping in a new `Dictionary` as you suggested solves that issue. 
   
   I was hoping this would fix the test failure, but unfortunately it does not 
so I will need to look into reworking the type coercion logic for 
`VariadicEqual` functions



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1578683165


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -643,9 +643,8 @@ fn dictionary_coercion(
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),
-(d @ Dictionary(_, value_type), other_type)
-| (other_type, d @ Dictionary(_, value_type))
-if preserve_dictionaries && value_type.as_ref() == other_type =>
+(d @ Dictionary(_, _), other_type) | (other_type, d @ Dictionary(_, _))
+if preserve_dictionaries && can_cast_types(other_type, d) =>

Review Comment:
   If I understand correctly, we should check whether `value_type` and 
`other_type` are coercible. If yes return the new dict with new value type
   
   `Dict(k, i8) and i64 -> Dict(k, i64)`
   
   Instead of `Some(d.clone())` we should create a new dict with new value type.
   
   And, I think we should use the coerce logic in `comparison_coercion` not 
`can_cast_types`, because the logic is not totally the same.
   
   ```rust
   | (other_type, d @ Dictionary(key_type, value_type))
   if preserve_dictionaries && 
comparison_coercion(value_type.as_ref(), other_type).is_some() =>
   {
   let new_value = comparison_coercion(&value_type, 
other_type).unwrap();
   Some(DataType::Dictionary(key_type.clone(), Box::new(new_value)))
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


jayzhan211 commented on code in PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#discussion_r1578683165


##
datafusion/expr/src/type_coercion/binary.rs:
##
@@ -643,9 +643,8 @@ fn dictionary_coercion(
 Dictionary(_lhs_index_type, lhs_value_type),
 Dictionary(_rhs_index_type, rhs_value_type),
 ) => comparison_coercion(lhs_value_type, rhs_value_type),
-(d @ Dictionary(_, value_type), other_type)
-| (other_type, d @ Dictionary(_, value_type))
-if preserve_dictionaries && value_type.as_ref() == other_type =>
+(d @ Dictionary(_, _), other_type) | (other_type, d @ Dictionary(_, _))
+if preserve_dictionaries && can_cast_types(other_type, d) =>

Review Comment:
   If I understand correctly, we should check whether `value_type` and 
`other_type` are coercible. If yes return the new dict with new value type
   
   `Dict(k, i8) and i64 -> Dict(k, i64)`
   `Dict(k, i64) and i8 -> Dict(k, i64)`
   
   Instead of `Some(d.clone())` we should create a new dict with new value type.
   
   And, I think we should use the coerce logic in `comparison_coercion` not 
`can_cast_types`, because the logic is not totally the same.
   
   ```rust
   | (other_type, d @ Dictionary(key_type, value_type))
   if preserve_dictionaries && 
comparison_coercion(value_type.as_ref(), other_type).is_some() =>
   {
   let new_value = comparison_coercion(&value_type, 
other_type).unwrap();
   Some(DataType::Dictionary(key_type.clone(), Box::new(new_value)))
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2075857365

   There are other places `comparison_coercion` is used so we may want to look 
for additional unintended side effects of this change.
   
   array prepend/append:
   
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/functions.rs#L138
   
   list type coercions:
   
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/other.rs#L25-L34
 
   
   `CASE WHEN ...` : 
   
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/other.rs#L39-L54
   
   `IN` subquery:
   
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/optimizer/src/analyzer/type_coercion.rs#L151-L172
   
   `BETWEEN`: 
   
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/optimizer/src/analyzer/type_coercion.rs#L236-L270
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2075845410

   The test failure is from `comparison_coercion` being used in function 
argument type coercion for the `coalesce` function and other 
`Signature::VariadicEqual` functions. 
   
   
https://github.com/apache/datafusion/blob/7fe8eeb698802776744cbba6d71abd30cc850968/datafusion/expr/src/type_coercion/functions.rs#L186-L201


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


alamb commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2075827245

   Can we please add a test to 
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt
 with the explain plan showing the correct filter being added?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]

2024-04-24 Thread via GitHub


erratic-pattern commented on PR #10221:
URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2075809276

   There is a test failure in sqllogictests
   ```
   External error: query failed: DataFusion error: Error during planning: No 
function matches the given name and argument types 'coalesce(Int64, 
Dictionary(Int32, Int8))'. You might need to add explicit type casts.
   Candidate functions:
   coalesce(CoercibleT, .., CoercibleT)
   [SQL] select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
   at test_files/scalar.slt:1794
   ```
   
   here is the test case:
   ```
   
https://github.com/erratic-pattern/arrow-datafusion/blob/3a6c04ec8f989ddcf3b13a871a782c0bf56b9f8e/datafusion/sqllogictest/test_files/scalar.slt#L1795
   ```
   
   There could be unintended consequences to this change for casting so I will 
look into it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org