Re: [PR] feat: varchar default mapping to utf8view [datafusion]

2025-05-27 Thread via GitHub


alamb commented on code in PR #16142:
URL: https://github.com/apache/datafusion/pull/16142#discussion_r2109637172


##
parquet-testing:
##


Review Comment:
   There appears to be an update of parquet-testing with unrelated content, but 
I think it is ok to update it too



##
datafusion/core/tests/user_defined/user_defined_plan.rs:
##
@@ -796,22 +798,30 @@ fn accumulate_batch(
 k: &usize,
 ) -> BTreeMap {
 let num_rows = input_batch.num_rows();
+
 // Assuming the input columns are
-// column[0]: customer_id / UTF8
+// column[0]: customer_id / UTF8 or UTF8View

Review Comment:
   I feel like the example would be simpler if it just used `Utf8View`. Could 
we do that and avoid adding the logic to handle both types?



##
datafusion-examples/examples/dataframe.rs:
##
@@ -182,7 +182,11 @@ async fn write_out(ctx: &SessionContext) -> 
std::result::Result<(), DataFusionEr
 let mut df = ctx.sql("values ('a'), ('b'), ('c')").await.unwrap();
 
 // Ensure the column names and types match the target table
-df = df.with_column_renamed("column1", "tablecol1").unwrap();
+df = df

Review Comment:
   I think it would be better if we could change the example to maybe just use 
a `StringViewArray` in the first place (rather than have to cast it here)



##
datafusion/sqllogictest/test_files/avro.slt:
##
@@ -15,6 +15,10 @@
 # specific language governing permissions and limitations
 # under the License.
 
+# Currently, the avro not support Utf8View type, so we disable the 
map_varchar_to_utf8view

Review Comment:
   πŸ‘ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: varchar default mapping to utf8view [datafusion]

2025-05-27 Thread via GitHub


zhuqi-lucas commented on PR #16142:
URL: https://github.com/apache/datafusion/pull/16142#issuecomment-2912064010

   Thank you @alamb , it seems no regression from the benchmark result!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: varchar default mapping to utf8view [datafusion]

2025-05-27 Thread via GitHub


alamb commented on PR #16142:
URL: https://github.com/apache/datafusion/pull/16142#issuecomment-2912049125

   πŸ€–: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and varchar_default_ut8view
   
   Benchmark clickbench_extended.json
   
   ┏━━┳┳━┳━━━┓
   ┃ Query┃   HEAD ┃ varchar_default_ut8view ┃Change ┃
   ┑━━╇╇━╇━━━┩
   β”‚ QQuery 0 β”‚  1898.26ms β”‚   1810.38ms β”‚ no change β”‚
   β”‚ QQuery 1 β”‚   658.97ms β”‚670.82ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚  1406.85ms β”‚   1396.48ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚   709.94ms β”‚713.90ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  1429.02ms β”‚   1398.58ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚ 15171.65ms β”‚  15018.76ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  2057.97ms β”‚   1992.46ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚  2336.53ms β”‚   2061.27ms β”‚ +1.13x faster β”‚
   β”‚ QQuery 8 β”‚   839.18ms β”‚837.10ms β”‚ no change β”‚
   β””β”€β”€β”΄β”΄β”€β”΄β”€β”€β”€β”˜
   ┏┳┓
   ┃ Benchmark Summary  ┃┃
   ┑╇┩
   β”‚ Total Time (HEAD)  β”‚ 26508.38ms β”‚
   β”‚ Total Time (varchar_default_ut8view)   β”‚ 25899.76ms β”‚
   β”‚ Average Time (HEAD)β”‚  2945.38ms β”‚
   β”‚ Average Time (varchar_default_ut8view) β”‚  2877.75ms β”‚
   β”‚ Queries Faster β”‚  1 β”‚
   β”‚ Queries Slower β”‚  0 β”‚
   β”‚ Queries with No Change β”‚  8 β”‚
   β””β”΄β”˜
   
   Benchmark clickbench_partitioned.json
   
   ┏━━┳┳━┳━━━┓
   ┃ Query┃   HEAD ┃ varchar_default_ut8view ┃Change ┃
   ┑━━╇╇━╇━━━┩
   β”‚ QQuery 0 β”‚15.20ms β”‚ 15.35ms β”‚ no change β”‚
   β”‚ QQuery 1 β”‚33.35ms β”‚ 33.29ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚83.06ms β”‚ 79.39ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚96.03ms β”‚ 95.15ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚   603.39ms β”‚590.48ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚   851.82ms β”‚855.46ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚23.71ms β”‚ 22.55ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚36.83ms β”‚ 37.07ms β”‚ no change β”‚
   β”‚ QQuery 8 β”‚   933.66ms β”‚916.79ms β”‚ no change β”‚
   β”‚ QQuery 9 β”‚  1192.79ms β”‚   1193.28ms β”‚ no change β”‚
   β”‚ QQuery 10β”‚   260.86ms β”‚262.38ms β”‚ no change β”‚
   β”‚ QQuery 11β”‚   299.90ms β”‚293.48ms β”‚ no change β”‚
   β”‚ QQuery 12β”‚   906.00ms β”‚898.14ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚  1344.05ms β”‚   1220.93ms β”‚ +1.10x faster β”‚
   β”‚ QQuery 14β”‚   842.98ms β”‚823.52ms β”‚ no change β”‚
   β”‚ QQuery 15β”‚   842.44ms β”‚835.95ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚  1725.19ms β”‚   1718.98ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚  1598.00ms β”‚   1614.71ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚  3092.64ms β”‚   3093.01ms β”‚ no change β”‚
   β”‚ QQuery 19β”‚81.09ms β”‚ 81.79ms β”‚ no change β”‚
   β”‚ QQuery 20β”‚  .88ms β”‚   1115.69ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚  1298.42ms β”‚   1296.34ms β”‚ no change β”‚
   β”‚ QQuery 22β”‚  2165.31ms β”‚   2174.91ms β”‚ no change β”‚
   β”‚ QQuery 23β”‚  8034.33ms β”‚   7921.79ms β”‚ no change β”‚
   β”‚ QQuery 24β”‚   458.46ms β”‚457.26ms β”‚ no change β”‚
   β”‚ QQuery 25β”‚   391.76ms β”‚372.64ms β”‚ no change β”‚
   β”‚ QQuery 26β”‚   528.56ms β”‚519.61ms β”‚ no change β”‚
   β”‚ QQuery 27β”‚  1595.08ms β”‚   1586.78ms β”‚ no change β”‚
   β”‚ QQuery 28β”‚ 12566.00ms β”‚  12405.74ms β”‚ no change β”‚
   β”‚ QQuery 29β”‚   535.24ms β”‚532.68ms β”‚ no change β”‚
   β”‚ QQuery 30β”‚   796.06ms β”‚807.39ms β”‚ no change β”‚
   β”‚ QQuery 31β”‚   863.55ms β”‚857.96ms β”‚ no change β”‚
   β”‚ QQuery 32β”‚  2633.13ms β”‚   2621.41ms β”‚ no change β”‚
   β”‚ QQuery 33β”‚  3300.20ms β”‚   3347.80ms β”‚ no change β”‚
   β”‚ QQuery 34β”‚  3429.42ms β”‚   3335.38ms β”‚ no change β”‚
   β”‚ QQuery 35β”‚  1338.57ms β”‚   1340.32ms β”‚ no change β”‚
   β”‚ QQuery 36β”‚   123.48ms β”‚118.34ms β”‚ 

Re: [PR] feat: varchar default mapping to utf8view [datafusion]

2025-05-27 Thread via GitHub


alamb commented on PR #16142:
URL: https://github.com/apache/datafusion/pull/16142#issuecomment-2911938227

   > Resolved all problems now including the slt testing.
   > 
   > I think the PR is ready for benchmark progress to see if switching to 
Utf8View by default slows some queries down. cc @alamb Thanks a lot.
   
   THanks @zhuqi-lucas  -- I have started it up


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: varchar default mapping to utf8view [datafusion]

2025-05-27 Thread via GitHub


alamb commented on PR #16142:
URL: https://github.com/apache/datafusion/pull/16142#issuecomment-2911937640

   πŸ€– `./gh_compare_branch.sh` [Benchmark 
Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr  2 16:34:16 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing varchar_default_ut8view (5d9c537c446073c433bb4e3299bf9aa775fa0b60) 
to f3aed4a9bf988f5c0fdc3be845d57c0fe253cd45 
[diff](https://github.com/apache/datafusion/compare/f3aed4a9bf988f5c0fdc3be845d57c0fe253cd45..5d9c537c446073c433bb4e3299bf9aa775fa0b60)
   Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: varchar default mapping to utf8view [datafusion]

2025-05-26 Thread via GitHub


zhuqi-lucas commented on PR #16142:
URL: https://github.com/apache/datafusion/pull/16142#issuecomment-2909555117

   Resolved all problems now including the slt testing.
   
   I think the PR is ready for benchmark progress to see if switching to 
Utf8View by default slows some queries down.  cc @alamb Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]