Re: [I] [Benchmarking][R] conbench is failing [arrow]

via GitHub Tue, 17 Jun 2025 05:49:41 -0700


jonkeane commented on issue #46716:
URL: https://github.com/apache/arrow/issues/46716#issuecomment-2980258909


   I haven't dug into this too closely, but I was able to replicate it with that 
particular query. This is failing on every TPC-H query, yeah? If so, it's 
likely something to do with [the data generation 
code](https://github.com/voltrondata-labs/arrowbench/blob/deacdebc64bb5c04f8976138c45db96710e56e77/R/ensure-tpch-source.R#L22-L42).
 
   
   You can run this locally using the value that's in the `script` section of 
the benchmark details. I've copied and pasted it here after removing the JSON 
cruft of `[`, ` , `, and `]` (I also formatted it with Air, but it's runnable 
without that):
   
   
   ```
   library(arrowbench)
   out <- run_bm(
       format = "parquet",
       scale_factor = 10,
       engine = "arrow",
       memory_map = FALSE,
       query_id = 5,
       bm = structure(
           list(
               name = "tpch",
               setup = function(
                   engine = "arrow",
                   query_id = 1:22,
                   format = c("native", "parquet"),
                   scale_factor = c(1, 10),
                   memory_map = FALSE,
                   output = "data_frame",
                   chunk_size = NULL
               ) {
                   engine <- match.arg(
                       engine,
                       c("arrow", "duckdb", "duckdb_sql", "dplyr")
                   )
                   format <- match.arg(format, c("parquet", "feather", "native"))
                   stopifnot(
                       "query_id must be an int" = query_id %% 1 == 0,
                       "query_id must 1-22" = query_id >= 1 & query_id <= 22
                   )
                   output <- match.arg(output, c("arrow_table", "data_frame"))
                   library("dplyr", warn.conflicts = FALSE)
                   collect_func <- collect
                   if (output == "data_frame") {
                       collect_func <- collect
                   } else if (output == "arrow_table") {
                       collect_func <- compute
                   }
                   con <- NULL
                   if (engine %in% c("duckdb", "duckdb_sql")) {
                       con <- DBI::dbConnect(duckdb::duckdb())
                       DBI::dbExecute(
                           con,
                           paste0("PRAGMA threads=", getOption("Ncpus"))
                       )
                   }
                   BenchEnvironment(
                       input_func = get_input_func(
                           engine = engine,
                           scale_factor = scale_factor,
                           query_id = query_id,
                           format = format,
                           con = con,
                           memory_map = memory_map,
                           chunk_size = chunk_size
                       ),
                       query = get_query_func(query_id, engine),
                       engine = engine,
                       con = con,
                       scale_factor = scale_factor,
                       query_id = query_id,
                       collect_func = collect_func
                   )
               },
               before_each = quote({
                   result <- NULL
               }),
               run = quote({
                   result <- query(input_func, collect_func, con)
               }),
               after_each = quote({
                   if (scale_factor %in% c(0.01, 0.10000000000000001, 1, 10)) {
                       answer <- tpch_answer(scale_factor, query_id)
                       result <- dplyr::as_tibble(result)
                       all_equal_out <- waldo::compare(
                           result,
                           answer,
                           tolerance = 0.01
                       )
                       if (length(all_equal_out) != 0) {
                           warning(paste0("\n", all_equal_out, "\n"))
                           stop("The answer does not match")
                       }
                   } else {
                       warning(
                           "There is no validation for scale_factors other than 0.01, 0.1, 1, and 10. Be careful with these results!"
                       )
                   }
                   result <- NULL
               }),
               teardown = quote({
                   if (!is.null(con)) {
                       DBI::dbDisconnect(con, shutdown = TRUE)
                   }
               }),
               valid_params = function(params) {
                   drop <- (params$engine != "arrow" &
                       params$format == "feather") |
                       (params$engine != "arrow" &
                           params$output == "arrow_table") |
                       (params$engine != "arrow" &
                           params$memory_map == TRUE) |
                       (params$engine == "dplyr" & params$format == "native")
                   params[!drop, ]
               },
               case_version = function(params) {
                   NULL
               },
               batch_id_fun = function(params) {
                   batch_id <- uuid()
                   paste0(
                       batch_id,
                       "-",
                       params$scale_factor,
                       substr(params$format, 1, 1)
                   )
               },
               tags_fun = function(params) {
                   params$query_id <- sprintf("TPCH-%02d", params$query_id)
                   if (!is.null(params$output) && params$output == "data_frame") {
                       params$output <- NULL
                   }
                   params
               },
               packages_used = function(params) {
                   c(params$engine, "dplyr", "lubridate")
               }
           ),
           class = "Benchmark"
       ),
       n_iter = 1,
       batch_id = NULL,
       profiling = FALSE,
       global_params = list(cpu_count = NULL, lib_path = "latest"),
       run_id = NULL,
       run_name = NULL,
       run_reason = NULL
   )
   cat(" ##### RESULTS FOLLOW ")
   cat(out$json)
   cat(" ##### RESULTS END ")
   ```
   
   I got: `Error: NotImplemented: Unhandled type for Arrow to Parquet schema 
conversion: decimal64(15, 2)`
   
   Before running it, I needed to install arrow (I used our released CRAN 
version) along with `arrowbench` 
(`remotes::install_github("voltrondata-labs/arrowbench")`). Otherwise the 
script above is self-contained and will do whatever it needs to construct the 
dataset and run the query.
   
   As extra evidence that this is in the data generation code: when I ran this, 
the `data` directory in my working directory (which arrowbench uses to keep 
data) contained a single parquet file, `customer`, which is also 0 bytes (so it 
failed to write). I would have expected 8 files, one for each TPC-H table.
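   
   If it helps anyone else check the same thing, here is a minimal sketch for 
inspecting that directory. `check_data_dir()` is a hypothetical helper, not 
part of arrowbench, and it assumes the data lives in `data` under the current 
working directory as described above. It only lists files and flags the ones 
that are 0 bytes:
   
   ```r
   # Hypothetical helper (not part of arrowbench): summarise the files in a
   # data directory and flag any that are zero bytes.
   check_data_dir <- function(dir = "data") {
       files <- list.files(dir, full.names = TRUE, recursive = TRUE)
       info <- data.frame(
           file = basename(files),
           bytes = file.size(files),
           stringsAsFactors = FALSE
       )
       # A zero-byte file suggests the write started but did not complete.
       info$empty <- info$bytes == 0
       info
   }
   
   # Assumption: arrowbench wrote its data to "data" in the working directory.
   info <- check_data_dir("data")
   print(info)
   cat(nrow(info), "file(s) found,", sum(info$empty), "empty\n")
   ```
   
   A healthy run should presumably show 8 non-empty files here, so a single 
empty `customer` file would point at the first table write failing.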


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
