Weston Pace created ARROW-16148:
-----------------------------------

             Summary: [C++] TPC-H generator cleanup
                 Key: ARROW-16148
                 URL: https://issues.apache.org/jira/browse/ARROW-16148
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Weston Pace

An umbrella issue for a number of issues I've run into with our TPC-H generator.

h2. We emit fixed_size_binary fields with nuls padding the strings.

Ideally we would either emit these as utf8 strings like the others, or we would have a toggle to emit them as such (though see below about needing to strip nuls).

When I try and run these through the I get a number of seg faults or hangs when running a number of the TPC-H queries.

Additionally, even after converting these to utf8|string types, I also need to strip out the nuls in order to actually query against them:

{code}
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
library(dplyr, warn.conflicts = FALSE)
options(arrow.skip_nul = TRUE)

tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
tab
#> Table
#> 25 rows x 4 columns
#> $N_NATIONKEY <int32>
#> $N_NAME <fixed_size_binary[25]>
#> $N_REGIONKEY <int32>
#> $N_COMMENT <string>

# This will not work (though this is how the TPC-H queries are structured)
tab %>% filter(N_NAME == "JAPAN") %>% collect()
#> # A tibble: 0 × 4
#> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary<25>>,
#> #   N_REGIONKEY <int>, N_COMMENT <chr>

# Instead, we need to create the nul-padded string to do the comparison
japan_raw <- as.raw(
  c(0x4a, 0x41, 0x50, 0x41, 0x4e,
    0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00)
)

# Confirming this is the same thing as in the data
japan_raw == as.vector(tab$N_NAME)[[13]]
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

tab %>%
  filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>%
  collect()
#> # A tibble: 1 × 4
#>   N_NATIONKEY
#>         <int>
#> 1          12
#> # … with 3 more variables: N_NAME <fixed_size_binary<25>>, N_REGIONKEY <int>,
#> #   N_COMMENT <chr>
{code}

Here is the code I've been using to cast + strip these out after the fact:

{code}
library(arrow, warn.conflicts = FALSE)

options(arrow.skip_nul = TRUE)
options(arrow.use_altrep = FALSE)

tables <- arrowbench:::tpch_tables

for (table_name in tables) {
  message("Working on ", table_name)
  tab <- read_parquet(glue::glue("./data_arrow_raw/{table_name}_1.parquet"), as_data_frame = FALSE)

  for (col in tab$schema$fields) {
    if (inherits(col$type, "FixedSizeBinary")) {
      message("Rewriting ", col$name)
      # Cast to string, then round-trip through an R vector; with
      # arrow.skip_nul = TRUE the nuls are dropped on conversion
      tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
    }
  }

  tab <- write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
}
{code}
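
For anyone looking at this from the C++ side, here is a minimal standard-library sketch of the padding behaviour described above. {{PadWithNuls}} and {{StripNulPadding}} are hypothetical helper names used only for illustration; they are not Arrow or TPC-H generator APIs, and this is not necessarily how the generator should fix the problem. It shows only why a 25-byte nul-padded value does not compare equal to "JAPAN", and that trimming the trailing nuls recovers the text.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not an Arrow API): pad `value` with trailing NUL bytes
// up to `width`, mimicking how a fixed_size_binary[width] slot is filled.
// Values longer than `width` are truncated.
inline std::string PadWithNuls(std::string_view value, std::size_t width) {
  std::string out(value.substr(0, width));
  out.resize(width, '\0');
  return out;
}

// Hypothetical helper (not an Arrow API): return a view of `fixed` without its
// trailing NUL padding. Interior NULs, if any, are left untouched.
inline std::string_view StripNulPadding(std::string_view fixed) {
  const std::size_t last = fixed.find_last_not_of('\0');
  if (last == std::string_view::npos) {
    return std::string_view{};  // the value was empty or all padding
  }
  return fixed.substr(0, last + 1);
}
```

With these, {{PadWithNuls("JAPAN", 25)}} is a 25-byte string that differs from "JAPAN", while {{StripNulPadding}} applied to it compares equal to "JAPAN". The R workaround above gets the same effect by casting and converting.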