r2evans commented on issue #44309:
URL: https://github.com/apache/arrow/issues/44309#issuecomment-2405773105

   Yes, my issue here appears to be a dupe of #38724 and #41146. I'm glad to 
discover I'm not the only (or first) one to have this question, though I'm 
dismayed to see no obvious progress since the first was opened 11 months ago.
   
   For the record, I'm using the following to get what I "need":
   
   ```r
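   # Register a scalar function that Arrow can call inside dplyr verbs.
   arrow::register_scalar_function(
     "extract_keyfield_from_path",
     function(context, filenames, key) {
       # Run the regex once per distinct path, then map back to every row.
       uniqfn <- unique(filenames)
       val <- setNames(
         sub(paste0(".*/", key, "=([^/]*)/.*"), "\\1", uniqfn, ignore.case = TRUE),
         uniqfn
       )
       # sub() returns its input unchanged when nothing matches: treat as missing.
       val[val == uniqfn] <- NA
       unname(val[filenames])
     },
     in_type = arrow::schema(filenames = arrow::string(), key = arrow::string()),
     out_type = arrow::string(),
     auto_convert = TRUE
   )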
   
   arr <- arrow::open_dataset(file.path(td, "cyl=4"))
   need_keys <- setdiff(names(arr$schema$metadata$r$columns), names(arr)) # can be 0+
   arr$schema$metadata$r$columns <- arr$schema$metadata$r$columns[ names(arr) ]
   arr <- mutate(arr, fn = add_filename())
   arr <- Reduce(
     function(ar, ky) mutate(ar, !!as.symbol(ky) := extract_keyfield_from_path(fn, ky)),
     need_keys, init = arr
   )
   arr |>
     select(-fn) |>
     head(n=2) |>
     collect()
   #      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb    cyl
   #    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <char>
   # 1:  22.8 108.0    93  3.85  2.32 18.61     1     1     4     1      4
   # 2:  24.4 146.7    62  3.69  3.19 20.00     1     0     4     2      4
   ```
   
   While far from herculean, the use of regex is overkill and fragile (it 
assumes hive-style, for instance).
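   
   For comparison, a regex-free variant of the same idea is sketched below in 
base R. `parse_hive_path()` is a hypothetical helper, not part of arrow, and it 
still assumes hive-style `key=value` directory names (and does no URL-decoding 
of the values), so it only removes the regex, not the fragility.
   
   ```r
   # Hypothetical helper (not an arrow function): collect the key=value
   # directory segments of a single path into a named character vector.
   parse_hive_path <- function(path) {
     segs <- strsplit(path, "/", fixed = TRUE)[[1]]
     segs <- segs[-length(segs)]  # drop the file name itself
     # Position of the first "=" in each segment, -1 when there is none.
     eq <- as.integer(regexpr("=", segs, fixed = TRUE))
     keep <- eq > 1               # require a non-empty key before "="
     setNames(
       substring(segs[keep], eq[keep] + 1L),
       substring(segs[keep], 1L, eq[keep] - 1L)
     )
   }
   
   parse_hive_path("/tmp/ds/cyl=4/gear=3/part-0.parquet")
   ```
   
   The call returns a named character vector with one element per partition 
directory found in the path.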
   
   Since the hive-partitioning is known at write time, and the key columns are 
already stored in the schema/metadata, would it make sense to store the actual 
value in the schema if not in the data? Looking at the schema metadata, while 
`$r` is the only thing in `arr$schema$metadata`, what if instead the metadata 
looked like this:
   
   ```r
   str(arr$schema$metadata)
   # List of 2
   #  $ partitions:List of 1
   #   ..$ columns:List of 1
   #   .. ..$ cyl: chr "4"
   #  $ r         :List of 2
   #   ..$ attributes:List of 1
   #   .. ..$ class: chr [1:2] "data.table" "data.frame"
   #   ..$ columns   :List of 11
   #   .. ..$ mpg : NULL
   #   .. ..$ cyl : NULL
   #   .. ..$ disp: NULL
   #   .. ..$ hp  : NULL
   # ...
   ```
   
   or in the case of non-hive partitioning (which I don't use, so I'm 
speculating here):
   
   ```r
   str(arr$schema$metadata)
   # List of 2
   #  $ partitions:List of 1
   #   ..$ columns:List of 1
   #   .. ..$ : chr "4"
   #  $ r         :List of 2
   #   ..$ attributes:List of 1
   #   .. ..$ class: chr [1:2] "data.table" "data.frame"
   #   ..$ columns   :List of 11
   #   .. ..$ mpg : NULL
   #   .. ..$ cyl : NULL
   #   .. ..$ disp: NULL
   # ...
   ```
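   
   To make the hive-style proposal concrete, here is a minimal sketch of how a 
reader might consume such metadata. The `partitions$columns` layout is only the 
suggestion above (Arrow does not write it today), the metadata is mocked up as 
a plain R list, and `add_partition_columns()` is a hypothetical helper.
   
   ```r
   # Mock-up of the *proposed* metadata layout; this is an assumption,
   # not something arrow currently produces.
   meta <- list(
     partitions = list(columns = list(cyl = "4")),
     r = list(columns = list(mpg = NULL, cyl = NULL, disp = NULL))
   )
   
   # Hypothetical helper: add every partition key that is absent from the
   # data as a constant column, using the value stored in the metadata.
   add_partition_columns <- function(df, meta) {
     parts <- meta$partitions$columns
     for (k in setdiff(names(parts), names(df))) {
       df[[k]] <- rep(parts[[k]], nrow(df))
     }
     df
   }
   
   # Two rows taken from the output shown earlier, without the cyl column.
   df <- data.frame(mpg = c(22.8, 24.4), disp = c(108.0, 146.7))
   out <- add_partition_columns(df, meta)
   ```
   
   With the value available in the metadata, no path parsing is needed at all, 
which is the point of the suggestion.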
   
   In this way, I'd think it'd be feasible to use this in other languages as 
well. Since it would be a breaking change, I suggest one could "opt-in" to this 
feature. While "how" that is done is left up to the architects, I don't think 
it's hard at all. And it knocks out three issues for the price of one ;-)
   
   I know it's not that simple: I don't know whether other languages write 
attributes the same way as `$r$columns`.
   

