thisisnic commented on a change in pull request #1:
URL: https://github.com/apache/arrow-cookbook/pull/1#discussion_r674992429
########## File path: r/content/reading_and_writing_data.Rmd ##########
@@ -0,0 +1,255 @@

# Reading and Writing Data

This chapter contains recipes related to reading and writing data from disk using Apache Arrow.

## Reading and Writing Parquet Files

### Writing a Parquet file

You can write Parquet files to disk using `arrow::write_parquet()`.
```{r, write_parquet}
# Create table
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
# Write to Parquet
write_parquet(my_table, "my_table.parquet")
```
```{r, test_write_parquet, opts.label = "test"}
test_that("write_parquet chunk works as expected", {
  expect_true(file.exists("my_table.parquet"))
})
```

### Reading a Parquet file

Given a Parquet file, you can read it back in using `arrow::read_parquet()`.

```{r, read_parquet}
parquet_tbl <- read_parquet("my_table.parquet")
head(parquet_tbl)
```
```{r, test_read_parquet, opts.label = "test"}
test_that("read_parquet works as expected", {
  expect_equivalent(dplyr::collect(parquet_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
})
```

As the argument `as_data_frame` was left at its default value of `TRUE`, the file was read in as a `data.frame` object.

```{r, read_parquet_2}
class(parquet_tbl)
```
```{r, test_read_parquet_2, opts.label = "test"}
test_that("read_parquet_2 works as expected", {
  expect_s3_class(parquet_tbl, "data.frame")
})
```
If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.

```{r, read_parquet_table}
my_table_arrow_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
head(my_table_arrow_table)
```

```{r, read_parquet_table_class}
class(my_table_arrow_table)
```
```{r, test_read_parquet_table_class, opts.label = "test"}
test_that("read_parquet_table_class works as expected", {
  expect_s3_class(my_table_arrow_table, "Table")
})
```

## How to read a (partitioned) Parquet file from S3

You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.

```{r, read_parquet_s3, eval = FALSE}
df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
```
For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html.
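As a rough sketch of the authenticated case (the bucket name, region, and credentials below are placeholders, not a real dataset), you could create an `S3FileSystem` object explicitly and read through it:

```{r, read_parquet_s3_auth, eval = FALSE}
# Placeholder credentials and bucket - substitute your own values
s3 <- S3FileSystem$create(
  access_key = "YOUR_ACCESS_KEY",
  secret_key = "YOUR_SECRET_KEY",
  region = "us-east-2"
)

# Read a Parquet file from the private bucket via the filesystem object
df <- read_parquet(s3$path("my-private-bucket/path/to/data.parquet"))
```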
## How to filter rows or columns while reading a Parquet file

When reading in a Parquet file, you can specify which columns to read in via the `col_select` argument.

```{r, read_parquet_filter}
# Create table to read back in
dist_time <- Table$create(tibble::tibble(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
# Write to Parquet
write_parquet(dist_time, "dist_time.parquet")

# Read in only the "time" column
time_only <- read_parquet("dist_time.parquet", col_select = "time")
head(time_only)
```
```{r, test_read_parquet_filter, opts.label = "test"}
test_that("read_parquet_filter works as expected", {
  expect_identical(time_only, tibble::tibble(time = c(43, 44, 40)))
})
```

## Reading and Writing CSV files

You can use `write_csv_arrow()` to save an Arrow Table or data frame to disk as a CSV file.

```{r, write_csv_arrow}
write_csv_arrow(cars, "cars.csv")
```
```{r, test_write_csv_arrow, opts.label = "test"}
test_that("write_csv_arrow chunk works as expected", {
  expect_true(file.exists("cars.csv"))
})
```

You can use `read_csv_arrow()` to read a CSV file back into R.

```{r, read_csv_arrow}
my_csv <- read_csv_arrow("cars.csv")
```

```{r, test_read_csv_arrow, opts.label = "test"}
test_that("read_csv_arrow chunk works as expected", {
  expect_equivalent(dplyr::collect(my_csv), cars)
})
```

## Reading and Writing Partitioned Data

### Writing Partitioned Data

You can use `write_dataset()` to save data to disk in partitions based on columns in the data.

```{r, write_dataset}
write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
list.files("airquality_partitioned")
```
```{r, test_write_dataset, opts.label = "test"}
test_that("write_dataset chunk works as expected", {
  # Partition by month
  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
  # We have enough files
  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 153)
})
```
As you can see, this has created folders based on the first partition variable supplied, `Month`.

If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.

```{r}
list.files("airquality_partitioned/Month=5")
```

Each of these folders contains one or more Parquet files holding the relevant partition of the data.

```{r}
list.files("airquality_partitioned/Month=5/Day=10")
```

### Reading Partitioned Data

You can use `open_dataset()` to read partitioned data.

```{r, open_dataset}
# Write some partitioned data to disk to read back in
write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))

# Read data from directory
air_data <- open_dataset("airquality_partitioned")

# View data
air_data
```
```{r, test_open_dataset, opts.label = "test"}
test_that("open_dataset chunk works as expected", {
  expect_equal(nrow(air_data), 153)
  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
})
```

## Reading and Writing Feather files

### Write an IPC/Feather V2 file

The Arrow IPC file format is identical to the Feather version 2 format. If you call `write_arrow()`, you will get a warning telling you to use `write_feather()` instead.

```{r, write_arrow}
# Create table
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_arrow(my_table, "my_table.arrow")
```
```{r, test_write_arrow, opts.label = "test"}
test_that("write_arrow chunk works as expected", {
  expect_true(file.exists("my_table.arrow"))
  expect_warning(
    write_arrow(iris, "my_table.arrow"),
    regexp = "Use 'write_ipc_stream' or 'write_feather' instead."
  )
})
```

Instead, you can use `write_feather()`.

```{r, write_feather}
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_feather(my_table, "my_table.arrow")
```
```{r, test_write_feather, opts.label = "test"}
test_that("write_feather chunk works as expected", {
  expect_true(file.exists("my_table.arrow"))
})
```

### Write a Feather (version 1) file

You can write data in the original Feather format by setting the `version` parameter to `1`.
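A minimal sketch of such a call (the chunk label and output file name here are illustrative only, not taken from the diff above):

```{r, write_feather_v1, eval = FALSE}
# Feather version 1 is a legacy format; version 2 (the default) is preferred for new files
write_feather(my_table, "my_table_v1.feather", version = 1)
```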
Review comment:
Yeah, I'm only thinking of the case where, for some obscure reason, someone does need to use it - I think the clarification that it's a legacy format is helpful.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org