Hey folks,
In the process of trying to work out if something is a bug or not, I'm
trying to work out the PyArrow equivalent for some R code (see [1]),
which opens a dataset, projects a new column, and then writes the
dataset to disk using the new column as a partitioning variable.
All of the examples I have found so far in docs involve converting the
dataset to a table in an intermediate step before projecting the new
column, which I don't want to do, as the dataset is larger than
memory.
Is it possible to do this without converting the dataset into a table,
and if so, how?
Thanks,
Nic
[1]
```
open_dataset("data/pums/person") |>
mutate(
age_group = case_when(
AGEP < 25 ~ "Under 25",
AGEP < 35 ~ "25-34",
AGEP < 45 ~ "35-44",
AGEP < 55 ~ "45-54",
AGEP < 65 ~ "55-64",
TRUE ~ "65+"
)
)|>
write_dataset(
path = "./data/pums/person-age-partitions",
partitioning = c("year", "location", "age_group")
)
```