dragosmg commented on PR #13196: URL: https://github.com/apache/arrow/pull/13196#issuecomment-1161550157
It looks like combining the separator and non-separator formats into a single vector (my original implementation) is faster than choosing between them based on whether the data contains a separator. On 2 million strings, the combined approach has a median of 3.37s versus 6.03s for the separate approach, so it is roughly 1.8 times faster.

<details>
<summary>Results table</summary>

```r
> results
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory     time            gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   <list>     <list>          <list>
1 separate         6s    6.03s     0.165    15.5MB   0.0292    17     3      1.72m <tibble> <Rprofmem> <bench_tm [20]> <tibble>
2 combined      3.36s    3.37s     0.297    15.5MB   0.0330    18     2      1.01m <tibble> <Rprofmem> <bench_tm [20]> <tibble>
```

</details>

<details>
<summary>Code</summary>

```r
library(dplyr)
library(lubridate)
library(ggplot2)
library(hrbrthemes)

# Load the development version of the package from the source directory
devtools::load_all()

# 2 million strings, alternating between a format without and with a separator
test_df <- tibble::tibble(
  a = rep(c("20220614", "2022-06-14"), 1e6)
)

results <- bench::mark(
  # formats with and without a separator are tried separately
  separate = test_df %>%
    arrow_table() %>%
    mutate(b = parse_date_time(a, orders = "ymd")) %>%
    collect(),
  # all formats are combined in a single vector and passed to coalesce()
  combined = test_df %>%
    arrow_table() %>%
    mutate(b = parse_date_time_combined(a, orders = "ymd")) %>%
    collect(),
  min_iterations = 20
)

results

ggplot2::autoplot(results) +
  theme_ipsum_rc(grid = "XxY") +
  labs(
    title = "Comparison of format parsing",
    subtitle = "separate = formats with or without separator are tried separately\n combined = formats are combined in a single vector and all are passed to `coalesce()`"
  )
```

</details>

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org