r2evans commented on issue #44309:
URL: https://github.com/apache/arrow/issues/44309#issuecomment-2405773105
Yes, my issue here appears to be a dupe of #38724 and #41146. I'm glad to
discover I'm not the only (or first) one to have this question, though I'm
dismayed to see no obvious progress since the first was opened 11 months ago.
For the record, I am using the following to get what I "need":
```r
library(dplyr)  # mutate(), select(), collect() are used unqualified below

arrow::register_scalar_function(
  "extract_keyfield_from_path",
  function(context, filenames, key) {
    uniqfn <- unique(filenames)
    val <- setNames(
      sub(paste0(".*/", key, "=([^/]*)/.*"), "\\1", uniqfn, ignore.case = TRUE),
      uniqfn
    )
    val[val == uniqfn] <- NA  # sub() returns its input unchanged on no match
    unname(val[filenames])
  },
  in_type = arrow::schema(filenames = arrow::string(), key = arrow::string()),
  out_type = arrow::string(),
  auto_convert = TRUE
)

arr <- arrow::open_dataset(file.path(td, "cyl=4"))
# keys listed in the R metadata but missing from the data; can be 0+
need_keys <- setdiff(names(arr$schema$metadata$r$columns), names(arr))
arr$schema$metadata$r$columns <- arr$schema$metadata$r$columns[names(arr)]
arr <- mutate(arr, fn = add_filename())
arr <- Reduce(
  function(ar, ky) mutate(ar, !!as.symbol(ky) := extract_keyfield_from_path(fn, ky)),
  need_keys,
  init = arr
)
arr |>
  select(-fn) |>
  head(n = 2) |>
  collect()
#      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb    cyl
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <char>
# 1:  22.8 108.0    93  3.85  2.32 18.61     1     1     4     1      4
# 2:  24.4 146.7    62  3.69  3.19 20.00     1     0     4     2      4
```
While far from herculean, the use of regex is overkill and fragile (it
assumes hive-style, for instance).
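To make the fragility concrete, here is a minimal base-R sketch of the same regex logic run outside of Arrow. The helper name `extract_key` and the paths are made up for illustration; only the body mirrors the function registered above.

```r
# Same logic as the body of extract_keyfield_from_path(), without Arrow.
# `extract_key` and the sample paths are hypothetical, for illustration only.
extract_key <- function(filenames, key) {
  uniqfn <- unique(filenames)
  val <- setNames(
    sub(paste0(".*/", key, "=([^/]*)/.*"), "\\1", uniqfn, ignore.case = TRUE),
    uniqfn
  )
  val[val == uniqfn] <- NA  # no match: sub() handed back the input unchanged
  unname(val[filenames])
}

paths <- c(
  "/tmp/ds/cyl=4/part-0.parquet",
  "/tmp/ds/cyl=4/part-0.parquet",
  "/tmp/ds/cyl=6/part-0.parquet",
  "/tmp/ds/4/part-0.parquet"      # directory-style (non-hive) layout
)
res <- extract_key(paths, "cyl")
print(res)
# [1] "4" "4" "6" NA
```

The last path shows the hive-style assumption: without a `cyl=` segment the value is silently lost as `NA`.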
Since the hive-partitioning is known at write time, and the key columns are
already stored in the schema/metadata, would it make sense to store the actual
value in the schema if not in the data? Looking at the schema metadata, while
`$r` is the only thing in `arr$schema$metadata`, what if instead the metadata
looked like this:
```r
str(arr$schema$metadata)
# List of 2
# $ partitions:List of 1
# ..$ columns:List of 1
# .. ..$ cyl: chr "4"
# $ r :List of 2
# ..$ attributes:List of 1
# .. ..$ class: chr [1:2] "data.table" "data.frame"
# ..$ columns :List of 11
# .. ..$ mpg : NULL
# .. ..$ cyl : NULL
# .. ..$ disp: NULL
# .. ..$ hp : NULL
# ...
```
or in the case of non-hive partitioning (which I don't use, so I'm
speculating here):
```r
str(arr$schema$metadata)
# List of 2
# $ partitions:List of 1
# ..$ columns:List of 1
# .. ..$ : chr "4"
# $ r :List of 2
# ..$ attributes:List of 1
# .. ..$ class: chr [1:2] "data.table" "data.frame"
# ..$ columns :List of 11
# .. ..$ mpg : NULL
# .. ..$ cyl : NULL
# .. ..$ disp: NULL
# ...
```
In this way, I'd think it'd be feasible to use this in other languages as
well. Since it would be a breaking change, I suggest one could "opt-in" to this
feature. While "how" that is done is left up to the architects, I don't think
it's hard at all. And it knocks out three issues for the price of one ;-)
I know it's not quite that simple: I don't know whether other languages write
attributes the same way as `$r$columns`.
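As a rough sketch of how a reader could consume metadata shaped like the hive-style proposal above: the `partitions` entry is hypothetical (Arrow does not write it today), and `restore_partition_columns` is a made-up helper that works on a plain data.frame rather than an Arrow dataset.

```r
# Hypothetical: `meta$partitions$columns` follows the proposed layout and is
# not something Arrow currently stores. Covers the named (hive-style) case only.
restore_partition_columns <- function(df, meta) {
  parts <- meta$partitions$columns
  for (key in setdiff(names(parts), names(df))) {
    df[[key]] <- parts[[key]]  # length-1 value is recycled to every row
  }
  df
}

meta <- list(partitions = list(columns = list(cyl = "4")))
df <- data.frame(mpg = c(22.8, 24.4), disp = c(108.0, 146.7))
out <- restore_partition_columns(df, meta)
print(out)
```

No path parsing is involved, which is the point of the suggestion; the unnamed non-hive variant would still need the caller to supply the column names.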