Neal Richardson created ARROW-12620:
---------------------------------------

             Summary: [C++] Dataset writing can only include projected columns if input columns are also included
                 Key: ARROW-12620
                 URL: https://issues.apache.org/jira/browse/ARROW-12620
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 4.0.0
            Reporter: Neal Richardson


I discovered this while working on https://github.com/apache/arrow/pull/10191. You can project new columns when writing a dataset, but only if the input columns they are derived from are also included in the output. Otherwise the projected column is written as all nulls.

Here's an R-based example:

{code}
# Assumed setup (not in the original report): load the packages and
# pick a scratch directory for ds_dir
library(arrow)
library(dplyr)
ds_dir <- tempfile()

# Simple function to write and re-open the new dataset
write_then_open <- function(ds, path, ...) {
  write_dataset(ds, path, ...)
  open_dataset(path)
}

tab <- Table$create(a = 1:5)

tab %>%
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 1
#       a
#   <int>
# 1     1
# 2     2
# 3     3
# 4     4
# 5     5

# If you rename a column, it's all nulls

tab %>%
  select(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA

# If you derive a new column and keep the original, it works

tab %>%
  mutate(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 2
#       a     b
#   <int> <int>
# 1     1     1
# 2     2     2
# 3     3     3
# 4     4     4
# 5     5     5

# transmute() only keeps the added columns, so it also illustrates the failure

tab %>%
  transmute(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)