thisisnic commented on a change in pull request #1:
URL: https://github.com/apache/arrow-cookbook/pull/1#discussion_r674992429
########## File path: r/content/reading_and_writing_data.Rmd ##########
@@ -0,0 +1,255 @@

# Reading and Writing Data

This chapter contains recipes related to reading and writing data from disk using Apache Arrow.

## Reading and Writing Parquet Files

### Writing a Parquet file

You can write Parquet files to disk using `arrow::write_parquet()`.
```{r, write_parquet}
# Create table
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
# Write to Parquet
write_parquet(my_table, "my_table.parquet")
```
```{r, test_write_parquet, opts.label = "test"}
test_that("write_parquet chunk works as expected", {
  expect_true(file.exists("my_table.parquet"))
})
```

### Reading a Parquet file

Given a Parquet file, you can read it back in using `arrow::read_parquet()`.

```{r, read_parquet}
parquet_tbl <- read_parquet("my_table.parquet")
head(parquet_tbl)
```
```{r, test_read_parquet, opts.label = "test"}
test_that("read_parquet works as expected", {
  expect_equivalent(dplyr::collect(parquet_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
})
```

As the argument `as_data_frame` was left at its default value of `TRUE`, the file was read in as a `data.frame` object.

```{r, read_parquet_2}
class(parquet_tbl)
```
```{r, test_read_parquet_2, opts.label = "test"}
test_that("read_parquet_2 works as expected", {
  expect_s3_class(parquet_tbl, "data.frame")
})
```
If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.

```{r, read_parquet_table}
my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
head(my_table_arrow_table)
```

```{r, read_parquet_table_class}
class(my_table_arrow_table)
```
```{r, test_read_parquet_table_class, opts.label = "test"}
test_that("read_parquet_table_class works as expected", {
  expect_s3_class(my_table_arrow_table, "Table")
})
```

## How to read a (partitioned) Parquet file from S3

You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.

```{r, read_parquet_s3, eval = FALSE}
df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
```
For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html.
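As a rough sketch of the authenticated case (the bucket name, region, and credentials below are placeholders, not a real dataset), you could create an `S3FileSystem` object explicitly and read through it:

```{r, read_parquet_s3_auth, eval = FALSE}
# Placeholder credentials and bucket - substitute your own values
s3 <- S3FileSystem$create(
  access_key = "YOUR_ACCESS_KEY",
  secret_key = "YOUR_SECRET_KEY",
  region = "us-east-2"
)

# Read a Parquet file from the private bucket via the filesystem object
df <- read_parquet(s3$path("my-private-bucket/path/to/data.parquet"))
```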
## How to filter rows or columns while reading a Parquet file

When reading in a Parquet file, you can specify which columns to read in via the `col_select` argument.

```{r, read_parquet_filter}
# Create table to read back in
dist_time <- Table$create(tibble::tibble(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
# Write to Parquet
write_parquet(dist_time, "dist_time.parquet")

# Read in only the "time" column
time_only <- read_parquet("dist_time.parquet", col_select = "time")
head(time_only)
```
```{r, test_read_parquet_filter, opts.label = "test"}
test_that("read_parquet_filter works as expected", {
  expect_identical(time_only, tibble::tibble(time = c(43, 44, 40)))
})
```

## Reading and Writing CSV files

You can use `write_csv_arrow()` to save an Arrow Table or data frame to disk as a CSV file.

```{r, write_csv_arrow}
write_csv_arrow(cars, "cars.csv")
```
```{r, test_write_csv_arrow, opts.label = "test"}
test_that("write_csv_arrow chunk works as expected", {
  expect_true(file.exists("cars.csv"))
})
```

You can use `read_csv_arrow()` to read a CSV file back into R.

```{r, read_csv_arrow}
my_csv <- read_csv_arrow("cars.csv")
```

```{r, test_read_csv_arrow, opts.label = "test"}
test_that("read_csv_arrow chunk works as expected", {
  expect_equivalent(dplyr::collect(my_csv), cars)
})
```

## Reading and Writing Partitioned Data

### Writing Partitioned Data

You can use `write_dataset()` to save data to disk in partitions based on columns in the data.

```{r, write_dataset}
write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
list.files("airquality_partitioned")
```
```{r, test_write_dataset, opts.label = "test"}
test_that("write_dataset chunk works as expected", {
  # Partition by month
  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
  # We have enough files
  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 153)
})
```
As you can see, this has created folders based on the first partition variable supplied, `Month`.

If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.

```{r}
list.files("airquality_partitioned/Month=5")
```

Each of these folders contains one or more Parquet files holding the relevant partition of the data.

```{r}
list.files("airquality_partitioned/Month=5/Day=10")
```

### Reading Partitioned Data

You can use `open_dataset()` to read partitioned data.

```{r, open_dataset}
# Write some partitioned data to disk to read back in
write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))

# Read data from directory
air_data <- open_dataset("airquality_partitioned")

# View data
air_data
```
```{r, test_open_dataset, opts.label = "test"}
test_that("open_dataset chunk works as expected", {
  expect_equal(nrow(air_data), 153)
  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
})
```

## Reading and Writing Feather files

### Write an IPC/Feather V2 file

The Arrow IPC file format is identical to the Feather version 2 format. If you call `write_arrow()`, you will get a warning telling you to use `write_feather()` instead.

```{r, write_arrow}
# Create table
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_arrow(my_table, "my_table.arrow")
```
```{r, test_write_arrow, opts.label = "test"}
test_that("write_arrow chunk works as expected", {
  expect_true(file.exists("my_table.arrow"))
  expect_warning(
    write_arrow(iris, "my_table.arrow"),
    regexp = "Use 'write_ipc_stream' or 'write_feather' instead."
  )
})
```

Instead, you can use `write_feather()`.

```{r, write_feather}
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_feather(my_table, "my_table.arrow")
```
```{r, test_write_feather, opts.label = "test"}
test_that("write_feather chunk works as expected", {
  expect_true(file.exists("my_table.arrow"))
})
```

### Write a Feather (version 1) file

You can write data in the original Feather format by setting the `version` parameter to `1`.
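A minimal sketch of such a call (the chunk label and output file name here are illustrative only, not taken from the diff above):

```{r, write_feather_v1, eval = FALSE}
# Feather version 1 is a legacy format; version 2 (the default) is preferred for new files
write_feather(my_table, "my_table_v1.feather", version = 1)
```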
Review comment:
Yeah, I'm only thinking of the case where, for some obscure reason, someone does need to use it - I think the clarification that it's a legacy format is helpful.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org