[Python] Read dataset -> project -> write dataset, without intermediate table?

Nic Crane Mon, 11 Mar 2024 14:09:59 -0700

Hey folks,

While trying to work out whether something is a bug, I'm looking for
the PyArrow equivalent of some R code (see [1]) that opens a dataset,
projects a new column, and then writes the dataset to disk using the
new column as a partitioning variable.


All of the examples I have found in the docs so far convert the
dataset to a table as an intermediate step before projecting the new
column, which I don't want to do because the dataset is larger than
memory.

Is it possible to do this without converting the dataset into a table,
and if so, how?

Thanks,

Nic


[1]
```
open_dataset("data/pums/person") |>
  mutate(
    age_group = case_when(
      AGEP < 25 ~ "Under 25",
      AGEP < 35 ~ "25-34",
      AGEP < 45 ~ "35-44",
      AGEP < 55 ~ "45-54",
      AGEP < 65 ~ "55-64",
      TRUE ~ "65+"
    )
  ) |>
  write_dataset(
    path = "./data/pums/person-age-partitions",
    partitioning = c("year", "location", "age_group")
  )
```
