Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3201899310 Thanks again for helping get this over the line @Omega359 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb merged PR #15361: URL: https://github.com/apache/datafusion/pull/15361 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3180132963 > > 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) > > BTW, that script looks like it could be interesting but I don't see it in your repo. Any chance you could push that one up? Soryr -- I think the link is wrong. The script is here: https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch_bench.sh I will update the checkout -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3180078538 > 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) BTW, that script looks like it could be interesting but I don't see it in your repo. Any chance you could push that one up? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2270340912
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -248,11 +250,17 @@ fn _to_char_scalar(
))
}
} else {
+// if the data type was a Date32, formatting could have failed because
the format string
+// contained datetime specifiers, so we'll retry by casting the date
array as a timestamp array
+if data_type == &Date32 {
+return to_char_scalar(expression.clone().cast_to(&Date64, None)?,
format);
Review Comment:
this clone seems unecessary. I'll make a PR to try and avoid some clones as
a follow up
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3179915293 🤖: Benchmark completed Details ``` group friendlymatthew_date32-with-ts-formatmain - - to_char_array_array_1000 1.00240.9±0.78µs? ?/sec to_char_array_date_only_patterns_10001.00242.8±1.12µs? ?/sec to_char_array_datetime_patterns_1000 1.00853.6±5.00µs? ?/sec to_char_array_mixed_patterns_10001.00 579.5±11.75µs? ?/sec to_char_array_scalar_1000 1.00218.7±0.48µs? ?/sec to_char_scalar_1000 1.00 716.5±21.37ns? ?/sec to_char_scalar_date_only_pattern_10001.00 210.4±20.22µs? ?/sec to_char_scalar_datetime_pattern_1000 1.00 384.3±66.98µs? ?/sec to_char_scalar_scalar_1000 1.00877.8±1.81ns? ?/sec ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3179882287 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing friendlymatthew/date32-with-ts-format (9ab910b9dadf5a90552d7e54eaf270612bddc687) to 7d5214512740b4dfb742b6b3d91ed9affcc2c9d0 [diff](https://github.com/apache/datafusion/compare/7d5214512740b4dfb742b6b3d91ed9affcc2c9d0..9ab910b9dadf5a90552d7e54eaf270612bddc687) BENCH_NAME=to_char BENCH_COMMAND=cargo bench --bench to_char BENCH_FILTER= BENCH_BRANCH_NAME=friendlymatthew_date32-with-ts-format Results will be posted here when complete -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3172980200 I've taken the liberty to revert https://github.com/apache/datafusion/commit/b379fe1b2927fc5c275670c70721df947a433c13 and to update the code to reflect changes in main. This should be ready to review now @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3160193541 > @friendlymatthew - Are you be able to update this with the previous version of the code or would you mind if I made the changes either against your branch or a new branch and PR (giving you credit for it of course since you did the work!) ? Hi, please feel free to take over. You should have write access to my fork; you can also make a new PR if you'd like. I'm a bit swamped with work-- I need to close out some projects and go back to work related to arrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3156148877 @friendlymatthew - Are you be able to update this with the previous version of the code or would you mind if I made the changes either against your branch or a new branch and PR (giving you credit for it of course since you did the work!) ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3025499081 Thanks @Omega359 and @friendlymatthew -- when you are happy with it, please ping me and I will give it a final review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3024951447 @friendlymatthew - I think this actually should be reopened and just be updated to the previous version of the code then merged in - I think it's a worthy addition to DF. I think this issue just unfortunately dropped off the radar at the end :( -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-3025056915 > @friendlymatthew - I think this actually should be reopened and just be updated to the previous version of the code then merged in - I think it's a worthy addition to DF. I think this issue just unfortunately dropped off the radar at the end :( Sure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2810508066 To put my notes here I prefer the version before the addition of the consecutive failures check - simpler and faster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2797704916 _Note: I'm sorry about the super long write ups. I'm not trying to bike shed._ I was thinking about https://github.com/apache/datafusion/pull/15361/commits/f906af55df1b9c9bbfa15513da2e8ca17580e783 and realized we could do more. The selective retry approach will extract every failed `Date32` and cast it to a `Date64`. If we have M failed `Date32`'s, we'll still recast M times. It's just that every cast will have a single element in the array. We can use the cast even more sparingly, by keeping track of consecutive failed `Date32`s. Suppose you had a 1000 format attempts, and a contiguous block of those format attempts erred because they contained datetime specifiers: ```sh |--|x|---| valid invalid valid ``` For every invalid `Date32`, we'll create an array of size 1 and recast that array to a `Date64` array. https://github.com/apache/datafusion/pull/15361/commits/b379fe1b2927fc5c275670c70721df947a433c13 will keep track of consecutive failed `Date32` values. By batching the failed `Date32`'s, we can call cast _once_. And of course, we preserve the order. This approach also helps with readability and saves LOC, `to_char_inner` gets recursively invoked with the newly casted timestamp array. ## Benchmarks She leaves the existing code paths undisturbed like usual. While performance has slightly regressed, I really like how this approach looks and would prefer to call `cast` sparingly and make use of the batching. ```sh # I ran the initial selective retry approach first. # This run includes the new logic mentioned above. to_char_array_date_only_patterns_1000 time: [136.92 µs 137.15 µs 137.39 µs] change: [-3.1659% -2.3990% -1.6897%] (p = 0.00 < 0.05) Performance has improved. Found 19 outliers among 100 measurements (19.00%) 7 (7.00%) high mild 12 (12.00%) high severe to_char_array_datetime_patterns_1000 time: [543.24 µs 543.84 µs 544.47 µs] change: [+4.4460% +6.0837% +7.4847%] (p = 0.00 < 0.05) Performance has regressed. to_char_array_mixed_patterns_1000 time: [339.68 µs 342.60 µs 345.66 µs] change: [-5.3101% -2.9305% -0.8615%] (p = 0.01 < 0.05) Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe to_char_scalar_date_only_pattern_1000 time: [95.107 µs 98.972 µs 102.90 µs] change: [-5.9031% -1.2243% +3.4337%] (p = 0.60 > 0.05) No change in performance detected. to_char_scalar_datetime_pattern_1000 time: [171.09 µs 179.74 µs 188.39 µs] change: [-4.0895% +1.3179% +7.1836%] (p = 0.64 > 0.05) No change in performance detected. to_char_scalar_1000 time: [315.63 ns 319.69 ns 323.50 ns] change: [-7.9240% -6.5685% -5.1341%] (p = 0.00 < 0.05) Performance has improved. ``` cc @Omega359 @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2797640563 Thanks for the additional work on this @friendlymatthew ! I think this approach is solid - the overhead for the casting is limited to only the cases where the format string includes time type specifiers and otherwise the performance is unchanged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2794878604 f906af55df1b9c9bbfa15513da2e8ca17580e783 is a proof of concept that involves selective retry on err. _There's benchmark results below!_ ## Quick overview The `to_char_scalar` is quite trivial-- we deal with 1 format string for N Date32s, the first format err will trigger a retry, treating the `Date32` array as a `Date64` array. The `to_char_array` is trickier since we deal with N format strings for N `Date32`s. _When_ a format error occurs at the ith `Date32`, we'll retry that specific date as a `Date64`. And rather than recast the entire input `Date32` array as a `Date64` array, we slice into the array and only retrieve the faulty `Date32`. Another method I considered was casting the input `Date32` array at _most_ once. When the first occurrence of a format err occurs, we'll just recast the entire input array as a `Date64` array. So that subsequent format strings with time-specifiers will format without err. I don't really like this, because I'd like to cast from `Date32` to `Date64` very sparingly. Also, if we have a 1000 `Date32` and a 1000 format strings, and it just happens that the last format string contains time-specifiers, we'd endure the pain of casting the entire input array as a `Date64` array just to format the last element. ## Benchmarks ### Selective retry vs. main The selective retry doesn't disturb the code path. When benchmarking with main, we see little to no variance. ```sh # The selective retry code was benched first. This is the benchmark result when running on main. to_char_array_date_only_patterns_1000 time: [136.87 µs 137.13 µs 137.49 µs] change: [-0.9784% -0.7511% -0.4721%] (p = 0.00 < 0.05) Change within noise threshold. Found 10 outliers among 100 measurements (10.00%) 1 (1.00%) low mild 2 (2.00%) high mild 7 (7.00%) high severe to_char_scalar_date_only_pattern_1000 time: [101.26 µs 104.97 µs 108.40 µs] change: [-1.5505% +2.7529% +7.4144%] (p = 0.21 > 0.05) No change in performance detected. ``` ### Eager cast vs. selective retry Since selective retry doesn't disturb the existing paths, those cases also run much faster when compared to the eager cast approach. Plus, the selective retry approach sees a 3% improvement in the `to_char_scalar` than the eager cast approach. The tradeoff here is that the new features we want to have suffer and is slower than the eager cast approach. ```sh # The eager cast code was benched first. This is the benchmark result when running the selective retry code. to_char_array_date_only_patterns_1000 time: [137.99 µs 138.20 µs 138.44 µs] change: [-21.564% -20.959% -20.538%] (p = 0.00 < 0.05) Performance has improved. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild to_char_array_datetime_patterns_1000 time: [503.48 µs 506.55 µs 509.87 µs] change: [+110.95% +112.69% +114.12%] (p = 0.00 < 0.05) Performance has regressed. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild to_char_array_mixed_patterns_1000 time: [323.05 µs 325.66 µs 328.11 µs] change: [+65.744% +66.705% +67.701%] (p = 0.00 < 0.05) Performance has regressed. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild to_char_scalar_date_only_pattern_1000 time: [96.153 µs 100.02 µs 103.88 µs] change: [-12.058% -8.2654% -4.3595%] (p = 0.00 < 0.05) Performance has improved. to_char_scalar_datetime_pattern_1000 time: [177.38 µs 186.04 µs 194.64 µs] change: [-11.383% -7.3528% -3.2635%] (p = 0.00 < 0.05) Performance has improved. Found 20 outliers among 100 measurements (20.00%) 14 (14.00%) low mild 6 (6.00%) high mild to_char_scalar_1000 time: [312.03 ns 316.56 ns 320.91 ns] change: [-4.3628% -3.1215% -1.7750%] (p = 0.00 < 0.05) Performance has improved. ``` If you want to run these benchmarks yourself, you can run: ```sh # use this to compare the alternate approaches (eager cast vs. selective retry) cargo bench --package datafusion-functions "to_char" ``` ```sh # use this to compare approaches to main cargo be
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2783424002
It may be beneficial to leave the existing code paths undisturbed and decide
whether to retry on err.
It avoids the overhead of custom `chrono::Strftime::parse` validation. Plus,
the retry logic for `to_char_scalar` is simple to implement. Something like:
to_char_scalar
```rust
let formatter = ArrayFormatter::try_new(array.as_ref(), &format_options)?;
let formatted: Result>, ArrowError> = (0..array.len())
.map(|i| {
if array.is_null(i) {
Ok(None)
} else {
formatter.value(i).try_to_string().map(Some)
}
})
.collect(); /*
will stop iterating upon the first err and
since we're dealing with one format string,
it'll halt after the first format attempt
*/
if let Ok(formatted) = formatted {
if is_scalar_expression {
Ok(ColumnarValue::Scalar(ScalarValue::Utf8(
formatted.first().unwrap().clone(),
)))
} else {
Ok(ColumnarValue::Array(
Arc::new(StringArray::from(formatted)) as ArrayRef
))
}
} else {
// if the format attempt is with a Date32, it's possible the attempt
failed because
// the format string contained time-specifiers, so we'll retry as
Date64s
if data_type == &Date32 {
return to_char_scalar(expression.clone().cast_to(&Date64, None)?,
format);
}
exec_err!("{}", formatted.unwrap_err())
}
```
`to_char_array` is a bit more complex (see
https://github.com/apache/datafusion/pull/15361#discussion_r2015455219), but
I'll think it through.
FWIW, it's worth implementing the retry logic and benchmarking the
performance. I'm happy to do the work. The decision between eager casting and
selective retry likely comes down to whether we're willing to slightly slow
down existing behavior in exchange for better performance in the new case. But
I'm just spitballing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2783327108 It's only slower when date32 is the input type. It's going to be hard to have this functionality for formats with timestamp specifiers but not have some sort of impact on those with just date specifiers. The only way I can think of to handle that is via catching the error when the parsing fails and retrying with date64. To put some context on this: - The 25% is for the case where the use has an array of 1000 dates *and* 1000 formats (the table case) - The 7% is for the far more common case of 1000 dates and a single format. That is the overhead of chrono parsing a format string and doing matches on that. - This does not affect any other input types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2782998319 > (TLDR) This branch will make the existing to_char paths slower, at the cost of adding the ability to format time-specifiers. Slowing down existing functionality for current users for some new functionality doesn't sound like a great tradeoff to me -- if users want this behavior they can always override the to_char implementation If we can retain the same performance for existing to_char but add the new feature I think that is much more reasonable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2781505801 Some benchmark results when comparing main with this branch: _(TLDR) This branch will make the existing `to_char` paths slower, at the cost of adding the ability to format time-specifiers._ I compared two benchmark functions: `to_char_array_date_only_patterns_1000` and `to_char_scalar_date_only_pattern_1000`. These functions are concerned with formatting dates with date-only specifiers. From main: ```sh to_char_array_date_only_patterns_1000 time: [136.39 µs 136.68 µs 137.03 µs] to_char_scalar_date_only_pattern_1000 time: [92.084 µs 95.999 µs 100.12 µs] ``` And when compared to this branch: ```sh to_char_array_date_only_patterns_1000 time: [173.46 µs 174.21 µs 175.10 µs] change: [+25.159% +25.881% +26.651%] (p = 0.00 < 0.05) Performance has regressed. Found 6 outliers among 100 measurements (6.00%) 6 (6.00%) high mild to_char_scalar_date_only_pattern_1000 time: [105.04 µs 109.46 µs 113.80 µs] change: [+6.6648% +11.410% +16.429%] (p = 0.00 < 0.05) Performance has regressed. ``` ### To reproduce: You can run the two benchmark functions by: ```sh cargo bench --package datafusion-functions "to_char_array_date_only_patterns_1000|to_char_scalar_date_only_pattern_1000" ``` Since the benchmark function names clash between main and this branch, you can consider [this branch](https://github.com/apache/datafusion/compare/main...friendlymatthew:datafusion:friendlymatthew/date32-with-ts-format-main-reference) representing main. (This branch only contains the refactor benchmark logic). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2781078642 > I messed up and pushed directly to your branch vs a PR, sorry about that @friendlymatthew Not at all, thank you and happy to collaborate on this together! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2781048171 I messed up and pushed directly to your branch vs a PR, sorry about that @friendlymatthew -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2773811330 > I went ahead and did a quick poc using format parsing, you can see it @ https://github.com/Omega359/arrow-datafusion/tree/to_char_date32_with_time_formats ([main...Omega359:arrow-datafusion:to_char_date32_with_time_formats](https://github.com/apache/datafusion/compare/main...Omega359:arrow-datafusion:to_char_date32_with_time_formats)) > > I haven't done a comparison with the benchmark results from main yet. I was planning on getting to [this](https://github.com/apache/datafusion/pull/15361#discussion_r2020139946) during the weekend, but wow, this is super awesome @Omega359! I love how `has_time_specifier` will early return `true` when encountering the first time-related specifier. Let me know how you want to proceed, I'd be happy to help out with benchmarking. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2009208756
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
> I don't understand why this is handling errors -- shouldn't we just
dispach on type when building the format options?
Sorry.
I assume you mean detect whether the format options contain time specifiers
ahead of time and if so cast the `Date32` into a `Date64`.
Or are you saying avoid the cast altogether? A
`FormatOption.with_datetime_format()` won't work with `Date32`, as Arrow
doesn't consider the `datetime_format` field when formatting a `Date32`.
```rs
#[test]
fn unaffected_format() {
let date_format = "%Y-%m %H";
let cast_options = CastOptions {
safe: false,
format_options:
FormatOptions::default().with_datetime_format(Some(date_format)),
};
let array = Date32Array::from(vec![1, 17890]);
let res = cast_with_options(&array, &DataType::Utf8, &cast_options);
dbg!(&res); // StringArray(["1997-05-19", "2018-12-25"])
}
```
Initially, I figured if the initial format attempt fails with a `Date32`,
it'll retry as a `Date64`. This is similar to how [Postgres tries to parse an
interval string under the normal standard, and if that errs, it tries to parse
the string again using the ISO8601
standard.](https://github.com/postgres/postgres/blob/58fdca2204de5f683f025df37553e5e69cb6adb1/src/backend/utils/adt/timestamp.c#L920-L929)
Regardless, I'll be happy to refactor this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2779327000 > Your branch you mean? I can try and push a PR to your branch once I figure out how :) I just sent an invite to collaborate on my fork. From there you should be able to directly push a commit to this branch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2779311964 Your branch you mean? I can try and push a PR to your branch once I figure out how :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2779265295 > I went ahead and did a quick poc using format parsing, you can see it @ https://github.com/Omega359/arrow-datafusion/tree/to_char_date32_with_time_formats ([main...Omega359:arrow-datafusion:to_char_date32_with_time_formats](https://github.com/apache/datafusion/compare/main...Omega359:arrow-datafusion:to_char_date32_with_time_formats)) > > I haven't done a comparison with the benchmark results from main yet. @Omega359 would you be open to committing to this branch? This way, I could base the benchmarking work on top of your work. If we decide to on a different approach than the experimented one, we could just roll back to a prior commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2776384329 > @friendlymatthew - Running the benchmarks between main and the sidebranch would be very helpful - it should give us an idea as to the overhead of the format parsing for date32 data. Sounds great. I'll have some free time tomorrow to work on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2775861390 @friendlymatthew - Running the benchmarks between main and the sidebranch would be very helpful - it should give us an idea as to the overhead of the format parsing for date32 data. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2773790617 I went ahead and did a quick poc using format parsing, you can see it @ https://github.com/Omega359/arrow-datafusion/tree/to_char_date32_with_time_formats (https://github.com/apache/datafusion/compare/main...Omega359:arrow-datafusion:to_char_date32_with_time_formats) I haven't done a comparison with the benchmark results from main yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2020191523
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -220,6 +221,13 @@ fn _to_char_scalar(
}
}
+// eagerly cast Date32 values to Date64 to support date formatting with
time-related specifiers
+// without error.
+if data_type == &Date32 {
Review Comment:
Some data from a quick little benchmark I wrote
```
test_date32_to_date64_cast_array_1000
time: [311.06 ns 314.48 ns 318.16 ns]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
test_parse_performance_with_time_specifiers_array_1000
time: [343.79 ns 351.64 ns 359.98 ns]
Found 14 outliers among 100 measurements (14.00%)
10 (10.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
test_parse_performance_without_time_specifiers_array_1000
time: [196.59 µs 201.06 µs 206.45 µs]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
```
test_date32_to_date64_cast_array_1000: just casts from date32 to date64
test_parse_performance_with_time_specifiers_array_1000: parses the formats
looking for time specifiers and when found does the cast from date32 to date64
test_parse_performance_without_time_specifiers_array_1000: parses the
formats looking for time specifiers (doesn't find anything), no cast
Note that results on my machine will vary +-10% between runs. The check for
time specifiers in the format is simple and could be improved but I think is
enough to show general performance. Note that the list of time specifiers is
not complete
```
const TIME_SPECIFIERS: &[&str] = &[
"%H", "%M", "%S", "%c", "%f", "%k", "%I", "%l", "%p", "%R", "%T", "%x",
"%r", "%Z",
];
fn has_time_specifiers(str: &str) -> bool {
for specifier in TIME_SPECIFIERS {
if str.contains(specifier) {
return true;
}
}
false
}
```
A couple of takeaways from this. casting is not as cheap as I thought
however parsing is seems to be more expensive than that but perhaps with some
really good optimizations it could be reduced.
I don't see a good way to make this feature have no cost with Date32 && no
time specifiers in the format.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2020191523
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -220,6 +221,13 @@ fn _to_char_scalar(
}
}
+// eagerly cast Date32 values to Date64 to support date formatting with
time-related specifiers
+// without error.
+if data_type == &Date32 {
Review Comment:
Some data from a quick little benchmark I wrote
```
test_date32_to_date64_cast_array_1000
time: [311.06 ns 314.48 ns 318.16 ns]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
test_parse_performance_with_time_specifiers_array_1000
time: [343.79 ns 351.64 ns 359.98 ns]
Found 14 outliers among 100 measurements (14.00%)
10 (10.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
test_parse_performance_without_time_specifiers_array_1000
time: [196.59 µs 201.06 µs 206.45 µs]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
```
Note that results on my machine will vary +-10% between runs. The check for
time specifiers in the format is simple and could be improved but I think is
enough to show general performance. Note that the list of time specifiers is
not complete
```
const TIME_SPECIFIERS: &[&str] = &[
"%H", "%M", "%S", "%c", "%f", "%k", "%I", "%l", "%p", "%R", "%T", "%x",
"%r", "%Z",
];
fn has_time_specifiers(str: &str) -> bool {
for specifier in TIME_SPECIFIERS {
if str.contains(specifier) {
return true;
}
}
false
}
```
A couple of takeaways from this. casting is not as cheap as I thought
however parsing is seems to be more expensive than that but perhaps with some
really good optimizations it could be reduced.
I don't see a good way to make this feature have no cost with Date32 && no
time specifiers in the format.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2020170477
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -220,6 +221,13 @@ fn _to_char_scalar(
}
}
+// eagerly cast Date32 values to Date64 to support date formatting with
time-related specifiers
+// without error.
+if data_type == &Date32 {
Review Comment:
It is possible to check for time formats in the format string however I'm
not sure that cost would be any less than just doing the conversion up front. I
think we need a benchmark to have quantified data vs assumptions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2020139946
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -220,6 +221,13 @@ fn _to_char_scalar(
}
}
+// eagerly cast Date32 values to Date64 to support date formatting with
time-related specifiers
+// without error.
+if data_type == &Date32 {
Review Comment:
as @Omega359 says, this will now penalize performance for all existing
Date32 columns
Is there any way we can check if the format string contains any time related
specifiers before doing this conversion?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019547635
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
> > Yeah, this is a limitation of arrow-rs currently. I never got around to
pushing a PR to arrow-rs to change that.
>
> I'd be happy to work on this next, if you'd like.
It would clean things up but I'm not sure if it would be a measurable
performance improvement or not. Leaning to it not being all that noticeable but
I'm often wrong :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019270254
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
> Casting Date32 to Date64 should be incredibly cheap I think
I agree, converting involves a cast between primitives and a mult.
https://github.com/apache/arrow-rs/blob/206fe024cdd9b9de770714baa27b58a1442fcc75/arrow-cast/src/cast/mod.rs#L1423-L1427
> Yeah, this is a limitation of arrow-rs currently. I never got around to
pushing a PR to arrow-rs to change that.
I'd be happy to work on this next, if you'd like.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019195514
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
I thought about this a bit and was mentally toying with the idea of testing
the format string(s) for time formats vs just casting always ... and generally
I don't think it would be worth it at all. Casting Date32 to Date64 should be
incredibly cheap I think
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019198451
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
> _*The entire `Date32` array because for every format string, we create a
new `ArrayFormatter` by supplying the entire array and specifying the index for
the formatter to format:_
>
>
https://github.com/apache/datafusion/blob/fdb4e848b65c001dd3f65b477296e07cbe8e0b07/datafusion/functions/src/datetime/to_char.rs#L274-L277
Yeah, this is a limitation of arrow-rs currently. I never got around to
pushing a PR to arrow-rs to change that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2015455219
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
I took another look and reached a different conclusion from above. I fear
[1e57ed1](https://github.com/apache/datafusion/pull/15361/commits/1e57ed10f03b352fd2588fb8adf011e2f15766bd)
is performing way too many casts in the retry logic, and it is better to
implicitly treat `Date32` as timestamps ahead of time.
`to_char` invokes either `to_char_scalar` or `to_char_array` based on the
number of formats provided in the arguments (i.e. N dates with 1 format string
or N dates with N format strings). If any of the dates err when formatting,
we'd cast all N dates as `Date64`s and redo the body of work.
While this would be fine with `to_char_scalar` because a format string with
a time-specifier would err for any `Date32`, so at worst it'd err, cast, and
retry after the first format attempt. But in `to_char_array` where we're
iterating through N format strings, at worst we'd need to recast the _entire_
`Date32` array N times*.
Since we can't avoid the `Date32` to `Date64` cast to support this feature,
it's much simpler to check ahead of time and cast the input `Date32` array into
a `Date64` array. Plus, all date-specifiers are valid in timestamp formatting,
so existing features work as expected.
I pushed
[ec2373a](https://github.com/apache/datafusion/pull/15361/commits/ec2373a4d58d6e373c6982ac85c6396c5c25d9ea)
as a proof of concept. But I'd be happy to rework it as you see fit.
_*The entire `Date32` array because for every format string, we create a new
`ArrayFormatter` by supplying the entire array and specifying the index for the
formatter to format:_
https://github.com/apache/datafusion/blob/fdb4e848b65c001dd3f65b477296e07cbe8e0b07/datafusion/functions/src/datetime/to_char.rs#L274-L277
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2015534358
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
> But in to_char_array where we're iterating through N format strings, at
worst we'd need to recast the entire Date32 array N times.
We _could_ make the `Date32` array `mut` and keep it outside of the loop.
Then when we encounter the first failed format attempt, we'd cast the date
array to a `Date64` array and treat the subsequent dates as `Date64`s. This
way, we limit the cast to only once per batch.
But this adds a lot of unnecessary code complexity.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2015455219
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
I took another look and reached a different conclusion from above. I fear
[1e57ed1](https://github.com/apache/datafusion/pull/15361/commits/1e57ed10f03b352fd2588fb8adf011e2f15766bd)
is performing way too many casts in the retry logic, and it is better to
implicitly treat `Date32` as timestamps ahead of time.
`to_char` invokes either `to_char_scalar` or `to_char_array` based on the
number of formats provided in the arguments (i.e. N dates with 1 format string
or N dates with N format strings). If any of the dates err when formatting,
we'd cast all N dates as `Date64`s and redo the body of work.
While this would be fine with `to_char_scalar` because a format string with
a time-specifier would err for any `Date32`, so at worst it'd err, cast, and
retry after the first format attempt. But in `to_char_array` where we're
iterating through N format strings, at worst we'd need to recast the _entire_
`Date32` array N times*.
Since we can't avoid the `Date32` to `Date64` cast to support this feature,
it's much simpler to check ahead of time and cast the input `Date32` array into
a `Date64` array. Plus, all date-specifiers are valid in timestamp formatting,
so existing features work as expected.
I pushed
[12e2314](https://github.com/apache/datafusion/pull/15361/commits/12e2314edcaa6ae0b7378a1986560b7a123bf3f2)
as a proof of concept. But I'd be happy to rework it as you see fit.
_*The entire `Date32` array because for every format string, we create a new
`ArrayFormatter` by supplying the entire array and specifying the index for the
formatter to format:_
https://github.com/apache/datafusion/blob/fdb4e848b65c001dd3f65b477296e07cbe8e0b07/datafusion/functions/src/datetime/to_char.rs#L274-L277
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2009208756
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
> I don't understand why this is handling errors -- shouldn't we just
dispach on type when building the format options?
Sorry.
I assume you mean detect whether the format options contain time specifiers
ahead of time and if so cast the `Date32` into a `Date64`.
Or are you saying avoid the cast altogether? A
`FormatOption.with_datetime_format()` won't work with `Date32`, as Arrow
doesn't consider the `datetime_format` field when formatting a `Date32`.
```rs
#[test]
fn unaffected_format() {
let date_format = "%Y-%m %H";
let cast_options = CastOptions {
safe: false,
format_options:
FormatOptions::default().with_datetime_format(Some(date_format)),
};
let array = Date32Array::from(vec![1, 17890]);
let res = cast_with_options(&array, &DataType::Utf8, &cast_options);
dbg!(&res); // StringArray(["1997-05-19", "2018-12-25"])
}
```
Initially, I figured if the initial format attempt fails with a `Date32`,
it'll retry as a `Date64`. This is similar to how [Postgres tries to cast a
`Date` into an `Interval` and if it fails, it reattempts with a different
format.](https://github.com/postgres/postgres/blob/58fdca2204de5f683f025df37553e5e69cb6adb1/src/backend/utils/adt/timestamp.c#L920-L929)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2010730547
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
Right now I do find it very hard to understand all the nested error handling
Perhaps you could pull the retry logic into a function that handles the
retry and is well documented (I think your description above makes a lot of
sense, but it is not at all clear from the code in my mind)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r2009208756
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
> I don't understand why this is handling errors -- shouldn't we just
dispach on type when building the format options?
Sorry.
I assume you mean detect whether the format options contain time specifiers
ahead of time and if so cast the `Date32` into a `Date64`.
Initially, I figured if the initial format attempt fails with a `Date32`,
it'll retry as a `Date64`. This is similar to how [Postgres tries to cast a
`Date` into an `Interval` and if it fails, it reattempts with a different
format.](https://github.com/postgres/postgres/blob/58fdca2204de5f683f025df37553e5e69cb6adb1/src/backend/utils/adt/timestamp.c#L920-L929)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
alamb commented on code in PR #15361:
URL: https://github.com/apache/datafusion/pull/15361#discussion_r200920
##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) ->
Result {
let result = formatter.value(idx).try_to_string();
match result {
Ok(value) => results.push(Some(value)),
-Err(e) => return exec_err!("{}", e),
+Err(e) => {
+if data_type == &Date32 {
Review Comment:
I don't understand why this is handling errors -- shouldn't we just dispach
on type when building the format options?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
