Neal Richardson created ARROW-12620:
---------------------------------------

             Summary: [C++] Dataset writing can only include projected columns if input columns are also included
                 Key: ARROW-12620
                 URL: https://issues.apache.org/jira/browse/ARROW-12620
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 4.0.0
            Reporter: Neal Richardson


I discovered this while working on https://github.com/apache/arrow/pull/10191. 
You can project new columns when writing a dataset, but only if the columns 
they are derived from are also included in the output. Otherwise, the projected 
columns are written as all nulls. Here's an R-based example:

{code}
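library(arrow)
library(dplyr)

# Simple function to write and re-open the new dataset
write_then_open <- function(ds, path, ...) {
  write_dataset(ds, path, ...)
  open_dataset(path)
}

# Directory to write the dataset to (not defined in the original report;
# a temporary path is assumed here)
ds_dir <- tempfile()

tab <- Table$create(a = 1:5)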

tab %>% 
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 1
#       a
#   <int>
# 1     1
# 2     2
# 3     3
# 4     4
# 5     5

# If you rename a column, it's all nulls
tab %>%
  select(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA

# If you derive a new column and keep the original, it works
tab %>%
  mutate(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 2
#       a     b
#   <int> <int>
# 1     1     1
# 2     2     2
# 3     3     3
# 4     4     4
# 5     5     5

# transmute() only keeps the added columns, so it also illustrates the failure
tab %>%
  transmute(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()

# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
