This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
     new 7df8c28  [R] Initial datasets content (#159)
7df8c28 is described below

commit 7df8c28e31067f8b10b54e018efd588b4cbfce8b
Author: Nic Crane <[email protected]>
AuthorDate: Tue Nov 8 12:49:04 2022 +0000

    [R] Initial datasets content (#159)
---
 r/content/_bookdown.yml                |   1 +
 r/content/datasets.Rmd                 | 399 +++++++++++++++++++++++++++++++++
 r/content/reading_and_writing_data.Rmd | 195 +++++-----------
 r/content/tables.Rmd                   |  38 ++--
 4 files changed, 474 insertions(+), 159 deletions(-)

diff --git a/r/content/_bookdown.yml b/r/content/_bookdown.yml
index a76108b..06a5f3e 100644
--- a/r/content/_bookdown.yml
+++ b/r/content/_bookdown.yml
@@ -25,6 +25,7 @@ edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s
 rmd_files: [
   "index.Rmd",
   "reading_and_writing_data.Rmd",
+  "datasets.Rmd",
   "creating_arrow_objects.Rmd",
   "specify_data_types_and_schemas.Rmd",
   "arrays.Rmd",
diff --git a/r/content/datasets.Rmd b/r/content/datasets.Rmd
new file mode 100644
index 0000000..c9baf02
--- /dev/null
+++ b/r/content/datasets.Rmd
@@ -0,0 +1,399 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Reading and Writing Data - Multiple Files
+
+## Introduction
+
+When reading files into R using Apache Arrow, you can read:
+
+* a single file into memory as a data frame or an Arrow Table
+* a single file that is too large to fit in memory as an Arrow Dataset
+* multiple and partitioned files as an Arrow Dataset
+
+This chapter contains recipes related to using Apache Arrow to read and 
+write files too large for memory and multiple or partitioned files as an 
+Arrow Dataset. There are a number of 
+circumstances in which you may want to read in the data as an Arrow Dataset:
+
+* your single data file is too large to load into memory
+* your data are partitioned among numerous files
+* you want faster performance from your `dplyr` queries
+* you want to be able to take advantage of Arrow's compute functions
+
+It is possible to read in partitioned data in Parquet, Feather (also known as Arrow IPC), and CSV or 
+other text-delimited formats.  If you are choosing a format for partitioned, multi-file data, we 
+recommend Parquet or Feather (Arrow IPC), both of which can have improved performance 
+when compared to CSVs due to their capabilities around metadata and compression.
+
+## Write data to disk - Parquet
+
+You want to write data to disk in a single Parquet file.
+
+### Solution
+
+```{r, write_dataset_basic}
+write_dataset(dataset = airquality, path = "airquality_data")
+```
+
+```{r, test_write_dataset_basic, opts.label = "test"}
+test_that("write_dataset_basic works as expected", {
+  expect_true(file.exists("airquality_data"))
+  expect_length(list.files("airquality_data"), 1)
+})
+```
+
+### Discussion
+
+The default format for `open_dataset()` and `write_dataset()` is Parquet. 
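+
+As a quick sketch of this default (the chunk label below is an illustrative addition), the "airquality_data" directory written above can be opened without specifying a format:
+
+```{r, open_dataset_default_format}
+# No format argument needed here - open_dataset() assumes Parquet by default
+open_dataset("airquality_data")
+```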
+
+## Write partitioned data - Parquet
+
+You want to save multiple Parquet data files to disk in partitions based on columns in the data.
+
+### Solution
+
+```{r, write_dataset}
+write_dataset(airquality, "airquality_partitioned", partitioning = c("Month"))
+```
+
+```{r, test_write_dataset, opts.label = "test"}
+test_that("write_dataset chunk works as expected", {
+  # Partition by month
+  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
+  # We have enough files
+  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 5)
+})
+```
+
+As you can see, this has created folders based on the supplied partition variable `Month`.
+
+```{r}
+list.files("airquality_partitioned")
+```
+
+### Discussion
+
+The data is written to separate folders based on the values in the `Month` 
+column.  The default behaviour is to use Hive-style (i.e. "col_name=value" folder names) 
+partitions.
+
+```{r}
+# Take a look at the files in this directory
+list.files("airquality_partitioned", recursive = TRUE)
+```
+
+You can specify multiple partitioning variables to add extra levels of partitioning.
+
+```{r, write_dataset_partitioned_deeper}
+write_dataset(airquality, "airquality_partitioned_deeper", partitioning = c("Month", "Day"))
+list.files("airquality_partitioned_deeper")
+```
+
+```{r, test_write_dataset_partitioned_deeper, opts.label = "test"}
+test_that("write_dataset_partitioned_deeper works as expected", {
+  expect_true(file.exists("airquality_partitioned_deeper"))
+  expect_length(list.files("airquality_partitioned_deeper", recursive = TRUE), 153)
+})
+```
+
+If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.
+
+```{r}
+# Take a look at the files in this directory
+list.files("airquality_partitioned_deeper/Month=5", recursive = TRUE)
+```
+
+There are two different ways to specify variables to use for partitioning - 
+either via the `partitioning` argument as above, or by using `dplyr::group_by()` on your data - the group variables will form the partitions.
+
+```{r, write_dataset_partitioned_groupby}
+write_dataset(dataset = group_by(airquality, Month, Day),
+  path = "airquality_groupby")
+```
+
+```{r, test_write_dataset_partitioned_groupby, opts.label = "test"}
+test_that("write_dataset_partitioned_groupby works as expected", {
+  expect_true(file.exists("airquality_groupby"))
+  expect_length(list.files("airquality_groupby", recursive = TRUE), 153)
+})
+```
+
+```{r}
+# Take a look at the files in this directory
+list.files("airquality_groupby", recursive = TRUE)
+```
+
+Each of these folders contains 1 or more Parquet files containing the relevant partition of the data.
+
+```{r}
+list.files("airquality_groupby/Month=5/Day=10")
+```
+
+Note that rows with an `NA` value in the partition column are 
+written to the `col_name=__HIVE_DEFAULT_PARTITION__`
+directory.
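+
+A minimal sketch of this behaviour (the modified copy of `airquality` and the "airquality_na" path are illustrative additions):
+
+```{r, write_dataset_na_partition}
+airquality_na <- airquality
+airquality_na$Month[1] <- NA
+
+write_dataset(airquality_na, "airquality_na", partitioning = "Month")
+
+# Rows with a missing Month are written to Month=__HIVE_DEFAULT_PARTITION__
+list.files("airquality_na")
+
+unlink("airquality_na", recursive = TRUE)
+```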
+
+
+## Read partitioned data
+
+You want to read partitioned data files as an Arrow Dataset.
+
+### Solution
+
+```{r, open_dataset}
+# Read data from directory
+air_data <- open_dataset("airquality_partitioned_deeper")
+
+# View data
+air_data
+```
+```{r, test_open_dataset, opts.label = "test"}
+test_that("open_dataset chunk works as expected", {
+  expect_equal(nrow(air_data), 153)
+  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
+})
+```
+
+### Discussion
+
+Partitioning allows you to split data across 
+multiple files and folders, avoiding problems associated with storing all your data 
+in a single file.  This can provide further advantages when using Arrow, as Arrow will only 
+read in the partitioned files needed for any given analysis.
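+
+For example, filtering on a partition column before collecting means Arrow only has to scan the matching folders (a sketch reusing the `air_data` dataset opened above; the chunk label is an illustrative addition):
+
+```{r, open_dataset_filter_partition}
+# Only the Month=5 folder needs to be read to answer this query
+air_data %>%
+  filter(Month == 5) %>%
+  collect()
+```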
+
+## Write data to disk - Feather/Arrow IPC format
+
+You want to write data to disk in a single Feather/Arrow IPC file.
+
+### Solution
+
+```{r, write_dataset_feather}
+write_dataset(dataset = airquality,
+  path = "airquality_data_feather",
+  format = "feather")
+```
+```{r, test_write_dataset_feather, opts.label = "test"}
+test_that("write_dataset_feather works as expected", {
+  expect_true(file.exists("airquality_data_feather"))
+  expect_length(list.files("airquality_data_feather"), 1)
+})
+```
+
+## Read in Feather/Arrow IPC data as an Arrow Dataset
+
+You want to read in Feather/Arrow IPC data as an Arrow Dataset.
+
+### Solution
+
+```{r, read_arrow_dataset}
+# write Arrow file to use in this example
+write_dataset(dataset = airquality,
+  path = "airquality_data_arrow",
+  format = "arrow")
+
+# read into R
+open_dataset("airquality_data_arrow", format = "arrow")
+```
+
+```{r, test_read_arrow_dataset, opts.label = "test"}
+test_that("read_arrow_dataset works as expected", {
+  dataset <- open_dataset("airquality_data_arrow", format = "arrow")
+  expect_s3_class(dataset, "FileSystemDataset")
+  expect_identical(dim(dataset), c(153L, 6L))
+})
+```
+
+## Write data to disk - CSV format
+
+You want to write data to disk in a single CSV file.
+
+### Solution
+
+```{r, write_dataset_csv}
+write_dataset(dataset = airquality,
+  path = "airquality_data_csv",
+  format = "csv")
+```
+
+```{r, test_write_dataset_csv, opts.label = "test"}
+test_that("write_dataset_csv works as expected", {
+  expect_true(file.exists("airquality_data_csv"))
+  expect_length(list.files("airquality_data_csv"), 1)
+})
+```
+
+
+## Read in CSV data as an Arrow Dataset
+
+You want to read in CSV data as an Arrow Dataset.
+
+### Solution
+
+```{r, read_csv_dataset}
+# write CSV file to use in this example
+write_dataset(dataset = airquality,
+  path = "airquality_data_csv",
+  format = "csv")
+
+# read into R
+open_dataset("airquality_data_csv", format = "csv")
+```
+
+```{r, test_read_csv_dataset, opts.label = "test"}
+test_that("read_csv_dataset works as expected", {
+  dataset <- open_dataset("airquality_data_csv", format = "csv")
+  expect_s3_class(dataset, "FileSystemDataset")
+  expect_identical(dim(dataset), c(153L, 6L))
+})
+```
+
+## Read in a CSV dataset (no headers)
+
+You want to read in a dataset containing CSVs with no headers.
+
+### Solution
+
+```{r, read_headerless_csv_dataset}
+# write CSV file to use in this example
+dataset_1 <- airquality[1:40, c("Month", "Day", "Temp")]
+dataset_2 <- airquality[41:80, c("Month", "Day", "Temp")]
+
+dir.create("airquality")
+write.table(dataset_1, "airquality/part-1.csv", sep = ",", row.names = FALSE, col.names = FALSE)
+write.table(dataset_2, "airquality/part-2.csv", sep = ",", row.names = FALSE, col.names = FALSE)
+
+# read into R
+open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
+```
+
+```{r, test_read_headerless_csv_dataset, opts.label = "test"}
+test_that("read_headerless_csv_dataset works as expected", {
+  data_in <- open_dataset("airquality", format = "csv", column_names = c("Month", "Day", "Temp"))
+  expect_s3_class(data_in, "FileSystemDataset")
+  expect_identical(dim(data_in), c(80L, 3L))
+  expect_named(data_in, c("Month", "Day", "Temp"))
+})
+```
+
+### Discussion
+
+If your dataset is made up of headerless CSV files, you must supply the names of
+each column.  You can do this in multiple ways - either via the `column_names` 
+parameter (as shown above) or via a schema:
+
+```{r, read_headerless_csv_dataset_schema}
+open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
+```
+
+```{r, test_read_headerless_csv_dataset_schema, opts.label = "test"}
+test_that("read_headerless_csv_dataset_schema works as expected", {
+  data_in <- open_dataset("airquality", format = "csv", schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
+  expect_s3_class(data_in, "FileSystemDataset")
+  expect_identical(dim(data_in), c(80L, 3L))
+  expect_named(data_in, c("Month", "Day", "Temp"))
+  expect_equal(data_in$schema, schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
+})
+```
+
+One additional advantage of using a schema is that you also have control of the 
+data types of the columns. If you provide both column names and a schema, the values 
+in `column_names` must match the `schema` field names.
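+
+As a sketch of this rule (an illustrative addition, assuming your version of arrow accepts both arguments together), the call below is valid only because the supplied names line up with the schema fields:
+
+```{r, read_headerless_csv_names_and_schema}
+# column_names and the schema field names match, so this is accepted
+open_dataset("airquality", format = "csv",
+  column_names = c("Month", "Day", "Temp"),
+  schema = schema("Month" = int32(), "Day" = int32(), "Temp" = int32()))
+```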
+
+
+## Write compressed partitioned data
+
+You want to save partitioned files, compressed with a specified compression algorithm.
+
+### Solution
+
+```{r, dataset_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, compression = "gzip")
+```
+
+```{r}
+# View files in the directory
+list.files(td, recursive = TRUE)
+```
+```{r, test_dataset_gzip, opts.label = "test"}
+test_that("dataset_gzip", {
+  expect_true(file.exists(file.path(td, "part-0.parquet")))
+})
+```
+
+### Discussion
+
+You can supply the `compression` argument to `write_dataset()` as long as 
+the compression algorithm is compatible with the chosen format. See `?write_dataset()` 
+for more information on supported compression algorithms and default settings.
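+
+The same argument accepts other codecs where your build of Arrow supports them; for example, this sketch (not evaluated here, and assuming Zstandard support was compiled in) writes Zstandard-compressed Parquet files:
+
+```{r, dataset_zstd, eval = FALSE}
+# Requires an Arrow build with Zstandard support
+write_dataset(iris, path = tempfile(), format = "parquet", compression = "zstd")
+```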
+
+## Read compressed data
+
+You want to read in data which has been compressed.
+
+### Solution
+
+```{r, opendataset_compressed}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, compression = "gzip")
+
+# Read in data
+ds <- open_dataset(td) %>%
+  collect()
+
+ds
+```
+
+```{r, test_opendataset_compressed, opts.label = "test"}
+test_that("opendataset_compressed", {
+  expect_s3_class(ds, "data.frame")
+  expect_named(
+    ds,
+    c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
+  )
+})
+```
+
+### Discussion
+
+Note that Arrow automatically detects the compression and you do not have to 
+supply it in the call to `open_dataset()` or the `read_*()` functions.
+
+
+```{r cleanup_multifile, include = FALSE}
+#cleanup
+unlink("airquality", recursive = TRUE)
+unlink("airquality_data_csv", recursive = TRUE)
+unlink("airquality_data", recursive = TRUE)
+unlink("airquality_data_arrow", recursive = TRUE)
+unlink("airquality_data_feather", recursive = TRUE)
+unlink("airquality_partitioned", recursive = TRUE)
+unlink("airquality_groupby", recursive = TRUE)
+unlink("airquality_partitioned_deeper", recursive = TRUE)
+```
\ No newline at end of file
diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index ef097b3..a089eb8 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -17,22 +17,29 @@
   under the License.
 -->
 
-# Reading and Writing Data
+# Reading and Writing Data - Single Files
 
 ## Introduction
 
-This chapter contains recipes related to reading and writing data using Apache 
-Arrow.  When reading files into R using Apache Arrow, you can choose to read in 
-your file as either a data frame or as an Arrow Table object.
+When reading files into R using Apache Arrow, you can read:
 
+* a single file into memory as a data frame or an Arrow Table
+* a single file that is too large to fit in memory as an Arrow Dataset
+* multiple and partitioned files as an Arrow Dataset
 
-There are a number of circumstances in which you may want to read in the data as an Arrow Table:
+This chapter contains recipes related to using Apache Arrow to read and 
+write single data files. There are a number of circumstances in
+which you may want to read single-file data into memory as an Arrow Table:
 
-* your dataset is large and if you load it into memory, it may lead to performance issues
+* your data file is large and causes performance issues when loaded into memory
 * you want faster performance from your `dplyr` queries
 * you want to be able to take advantage of Arrow's compute functions
 
-## Convert from a data frame to an Arrow Table
+If a single data file is too large to load into memory, you can use the Arrow Dataset API. 
+Recipes for using `open_dataset()` and `write_dataset()` are in the Reading and Writing Data - Multiple Files
+chapter.
+
+## Convert data from a data frame to an Arrow Table
 
 You want to convert an existing `data.frame` or `tibble` object into an Arrow Table.
 
@@ -61,7 +68,7 @@ air_df
 ```
 ```{r, test_asdf_table, opts.label = "test"}
 test_that("asdf_table chunk works as expected", {
-  expect_identical(air_df, airquality) 
+  expect_identical(air_df, airquality)
 })
 ```
 
@@ -71,7 +78,7 @@ You can use either `as.data.frame()` or `dplyr::collect()` to do this.
 
 ## Write a Parquet file
 
-You want to write Parquet files to disk.
+You want to write a single Parquet file to disk.
 
 ### Solution
 
@@ -89,7 +96,7 @@ test_that("write_parquet chunk works as expected", {
  
 ## Read a Parquet file
 
-You want to read a Parquet file.
+You want to read a single Parquet file into memory.
 
 ### Solution
 
@@ -123,6 +130,7 @@ my_table_arrow <- read_parquet("my_table.parquet", as_data_frame = FALSE)
 my_table_arrow
 ```
 
+
 ```{r, read_parquet_table_class}
 class(my_table_arrow)
 ```
@@ -134,12 +142,12 @@ test_that("read_parquet_table_class works as expected", {
 
 ## Read a Parquet file from S3 
 
-You want to read a Parquet file from S3.
+You want to read a single Parquet file from S3 into memory.
 
 ### Solution
 
 ```{r, read_parquet_s3, eval = FALSE}
-df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
-df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet")
 ```
 
 ### See also
@@ -148,12 +156,12 @@ For more in-depth instructions, including how to work with S3 buckets which requ
 
 ## Filter columns while reading a Parquet file 
 
-You want to specify which columns to include when reading in a Parquet file.
+You want to specify which columns to include when reading a single Parquet file into memory.
 
 ### Solution
 
 ```{r, read_parquet_filter}
-# Create table to read back in 
+# Create table to read back in
 dist_time <- arrow_table(data.frame(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
 # Write to Parquet
 write_parquet(dist_time, "dist_time.parquet")
@@ -168,9 +176,9 @@ test_that("read_parquet_filter works as expected", {
 })
 ```
 
-## Write an IPC/Feather V2 file
+## Write a Feather V2/Arrow IPC file
 
-You want to read in a Feather file.
+You want to write a single Feather V2 file (also called an Arrow IPC file).
 
 ### Solution
 
@@ -197,13 +205,11 @@ write_feather(mtcars, "my_table.feather", version = 1)
 test_that("write_feather1 chunk works as expected", {
   expect_true(file.exists("my_table.feather"))
 })
-
-unlink("my_table.feather")
 ```
 
-## Read a Feather file
+## Read a Feather/Arrow IPC file
 
-You want to read a Feather file.
+You want to read a single Feather V1 or V2 file (also called an Arrow IPC file) into memory.
 
 ### Solution
 
@@ -217,9 +223,9 @@ test_that("read_feather chunk works as expected", {
 unlink("my_table.arrow")
 ```
 
-## Write streaming IPC files
+## Write streaming Arrow IPC files
 
-You want to write to the IPC stream format.
+You want to write to the Arrow IPC stream format.
 
 ### Solution
 
@@ -240,9 +246,9 @@ test_that("write_ipc_stream chunk works as expected", {
 })
 ```
 
-## Read streaming IPC files
+## Read streaming Arrow IPC files
 
-You want to read from the IPC stream format.
+You want to read from the Arrow IPC stream format.
 
 ### Solution
 ```{r, read_ipc_stream}
@@ -258,9 +264,9 @@ test_that("read_ipc_stream chunk works as expected", {
 unlink("my_table.arrows")
 ```
 
-## Write CSV files  
+## Write a CSV file 
 
-You want to write Arrow data to a CSV file.
+You want to write Arrow data to a single CSV file.
 
 ### Solution
 
@@ -273,9 +279,9 @@ test_that("write_csv_arrow chunk works as expected", {
 })
 ```
 
-## Read CSV files
+## Read a CSV file
 
-You want to read a CSV file.
+You want to read a single CSV file into memory.
 
 ### Solution
 
@@ -290,14 +296,14 @@ test_that("read_csv_arrow chunk works as expected", {
 unlink("cars.csv")
 ```
 
-## Read JSON files 
+## Read a JSON file
 
-You want to read a JSON file.
+You want to read a JSON file into memory.
 
 ### Solution
 
 ```{r, read_json_arrow}
-# Create a file to read back in 
+# Create a file to read back in
 tf <- tempfile()
 writeLines('
     {"country": "United Kingdom", "code": "GB", "long": -3.44, "lat": 55.38}
@@ -323,76 +329,9 @@ test_that("read_json_arrow chunk works as expected", {
 unlink(tf)
 ```
 
-## Write partitioned data
+## Write a compressed single data file
 
-You want to save data to disk in partitions based on columns in the data.
-
-### Solution
-
-```{r, write_dataset}
-write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
-list.files("airquality_partitioned")
-```
-```{r, test_write_dataset, opts.label = "test"}
-test_that("write_dataset chunk works as expected", {
-  # Partition by month
-  expect_identical(list.files("airquality_partitioned"), c("Month=5", "Month=6", "Month=7", "Month=8", "Month=9"))
-  # We have enough files
-  expect_equal(length(list.files("airquality_partitioned", recursive = TRUE)), 153)
-})
-```
-As you can see, this has created folders based on the first partition variable supplied, `Month`.
-
-If you take a look in one of these folders, you will see that the data is then partitioned by the second partition variable, `Day`.
-
-```{r}
-list.files("airquality_partitioned/Month=5")
-```
-
-Each of these folders contains 1 or more Parquet files containing the relevant partition of the data.
-
-```{r}
-list.files("airquality_partitioned/Month=5/Day=10")
-```
-
-## Read partitioned data
-
-You want to read partitioned data.
-
-### Solution
-
-```{r, open_dataset}
-# Read data from directory
-air_data <- open_dataset("airquality_partitioned")
-
-# View data
-air_data
-```
-```{r, test_open_dataset, opts.label = "test"}
-test_that("open_dataset chunk works as expected", {
-  expect_equal(nrow(air_data), 153)
-  expect_equal(arrange(collect(air_data), Month, Day), arrange(airquality, Month, Day), ignore_attr = TRUE)
-})
-```
-
-```{r}
-unlink("airquality_partitioned", recursive = TRUE)
-```
-
-```{r, include = FALSE}
-# cleanup
-unlink("my_table.arrow")
-unlink("my_table.arrows")
-unlink("cars.csv")
-unlink("my_table.feather")
-unlink("my_table.parquet")
-unlink("dist_time.parquet")
-unlink("airquality_partitioned", recursive = TRUE)
-```
-
-## Write compressed data
-
-You want to save a file, compressed with a specified compression algorithm.
+You want to save a single file, compressed with a specified compression algorithm.
 
 ### Solution
 
@@ -407,35 +346,7 @@ write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
 
 ```{r, test_parquet_gzip, opts.label = "test"}
 test_that("parquet_gzip", {
-  file.exists(file.path(td, "iris.parquet"))
-})
-```
-
-### Discussion
-
-Note that `write_parquet()` by default already uses compression.  See 
-`default_parquet_compression()` to see what the default configured on your 
-machine is.
-
-You can also supply the `compression` argument to `write_dataset()`, as long as 
-the compression algorithm is compatible with the chosen format.
-
-```{r, dataset_gzip}
-# Create a temporary directory
-td <- tempfile()
-dir.create(td)
-
-# Write dataset to file
-write_dataset(iris, path = td, compression = "gzip")
-```
-
-```{r}
-# View files in the directory
-list.files(td, recursive = TRUE)
-```
-```{r, test_dataset_gzip, opts.label = "test"}
-test_that("dataset_gzip", {
-  file.exists(file.path(td, "part-0.parquet"))
+  expect_true(file.exists(file.path(td, "iris.parquet")))
 })
 ```
 
@@ -446,11 +357,10 @@ on the supported compression algorithms and default settings, see:
 
 * `?write_parquet()`
 * `?write_feather()`
-* `?write_dataset()`
 
 ## Read compressed data
 
-You want to read in data which has been compressed.
+You want to read in a single data file which has been compressed.
 
 ### Solution
 
@@ -459,13 +369,11 @@ You want to read in data which has been compressed.
 td <- tempfile()
 dir.create(td)
 
-# Write dataset which is to be read back in
+# Write data which is to be read back in
 write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
 
 # Read in data
-ds <- read_parquet(file.path(td, "iris.parquet")) %>%
-  collect()
-
+ds <- read_parquet(file.path(td, "iris.parquet"))
 ds
 ```
 
@@ -482,7 +390,7 @@ test_that("read_parquet_compressed", {
 ### Discussion
 
 Note that Arrow automatically detects the compression and you do not have to 
-supply it in the call to `open_dataset()` or the `read_*()` functions.
+supply it in the call to the `read_*()` or the `open_dataset()` functions.
 
 Although the CSV format does not support compression itself, Arrow supports 
 reading in CSV data which has been compressed, if the file extension is `.gz`.
@@ -492,12 +400,11 @@ reading in CSV data which has been compressed, if the file extension is `.gz`.
 td <- tempfile()
 dir.create(td)
 
-# Write dataset which is to be read back in
+# Write data which is to be read back in
-write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)
 
 # Read in data
-ds <- open_dataset(td, format = "csv") %>%
-  collect()
+ds <- read_csv_arrow(file.path(td, "iris.csv.gz"))
 ds
 ```
 
@@ -511,4 +418,12 @@ test_that("read_compressed_csv", {
 })
 ```
 
-
+```{r cleanup_singlefiles, include = FALSE}
+# cleanup
+unlink("my_table.arrow")
+unlink("my_table.arrows")
+unlink("cars.csv")
+unlink("my_table.feather")
+unlink("my_table.parquet")
+unlink("dist_time.parquet")
+```
\ No newline at end of file
diff --git a/r/content/tables.Rmd b/r/content/tables.Rmd
index 127a5b1..75078c2 100644
--- a/r/content/tables.Rmd
+++ b/r/content/tables.Rmd
@@ -23,13 +23,13 @@
 
 One of the aims of the Arrow project is to reduce duplication between different 
 data frame implementations.  The underlying implementation of a data frame is a 
-conceptually different thing to the code that you run to work with it - the API.
+conceptually different thing to the code - or the application programming interface (API) - that you write to work with it.
 
-You may have seen this before in packages like `dbplyr` which allow you to use 
+You may have seen this before in packages like dbplyr which allow you to use 
 the dplyr API to interact with SQL databases.
 
-The `arrow` package has been written so that the underlying Arrow table-like 
-objects can be manipulated via use of the dplyr API via the dplyr verbs.
+The Arrow R package has been written so that the underlying Arrow Table-like 
+objects can be manipulated using the dplyr API, which allows you to use dplyr verbs.
 
 For example, here's a short pipeline of data manipulation which uses dplyr exclusively:
   
@@ -41,7 +41,7 @@ starwars %>%
   select(name, height_ft)
 ```
 
-And the same results as using arrow with dplyr syntax:
+And the same results as using Arrow with dplyr syntax:
   
 ```{r, dplyr_arrow}
 arrow_table(starwars) %>%
@@ -73,11 +73,11 @@ test_that("dplyr_raw and dplyr_arrow chunk provide the same results", {
 
 
 You'll notice we've used `collect()` in the Arrow pipeline above.  That's because 
-one of the ways in which `arrow` is efficient is that it works out the instructions
+one of the ways in which Arrow is efficient is that it works out the instructions
 for the calculations it needs to perform (_expressions_) and only runs them 
-using arrow once you actually pull the data into your R session.  This means 
+using Arrow once you actually pull the data into your R session.  This means 
 instead of doing lots of separate operations, it does them all at once in a 
-more optimised way, _lazy evaluation_.
+more optimised way. This is called _lazy evaluation_.
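+
+A sketch of this behaviour (the chunk below is an illustrative addition, reusing the `starwars` data from above): the pipeline builds up a query, and nothing is computed until `collect()` is called.
+
+```{r, lazy_eval_sketch}
+query <- arrow_table(starwars) %>%
+  filter(height > 100) %>%
+  mutate(height_ft = height / 30.48)
+
+# `query` is an unevaluated query; collect() runs it and returns a tibble
+collect(query)
+```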
 
 It also means that you are able to manipulate data that is larger than you can 
 fit into memory on the machine you're running your code on, if you only pull 
@@ -86,13 +86,13 @@ which can operate on chunks of data.
 
 You can also have data which is split across multiple files.  For example, you
 might have files which are stored in multiple Parquet or Feather files, 
-partitioned across different directories.  You can open multi-file datasets 
+partitioned across different directories.  You can open partitioned or multi-file datasets 
 using `open_dataset()` as discussed in a previous chapter, and then manipulate 
-this data using arrow before even reading any of it into R.
+this data using Arrow before even reading any of the data into R.
 
-## Use dplyr verbs in arrow
+## Use dplyr verbs in Arrow
 
-You want to use a dplyr verb in arrow.
+You want to use a dplyr verb in Arrow.
 
 ### Solution
 
@@ -120,7 +120,7 @@ test_that("dplyr_verb works as expected", {
 
 ### Discussion
 
-You can use most of the dplyr verbs directly from arrow.  
+You can use most of the dplyr verbs directly from Arrow.  
 
 ### See also
 
@@ -131,9 +131,9 @@ the [pkgdown site](https://dplyr.tidyverse.org/articles/dplyr.html).
 You can see more information about using `arrow_table()` to create Arrow Tables
 and `collect()` to view them as R data frames in [Creating Arrow Objects](creating-arrow-objects.html#creating-arrow-objects).
 
-## Use R functions in dplyr verbs in arrow
+## Use R functions in dplyr verbs in Arrow
 
-You want to use an R function inside a dplyr verb in arrow.
+You want to use an R function inside a dplyr verb in Arrow.
 
 ### Solution
 
@@ -159,10 +159,10 @@ test_that("dplyr_str_detect", {
 
 ### Discussion
 
-The arrow package allows you to use dplyr verbs containing expressions which 
+The Arrow R package allows you to use dplyr verbs containing expressions which 
 include base R and many tidyverse functions, but call Arrow functions under the hood.
 If you find any base R or tidyverse functions which you would like to see a 
-mapping of in arrow, please 
+mapping of in Arrow, please 
 [open an issue on the project JIRA](https://issues.apache.org/jira/projects/ARROW/issues).
 
 The following packages (amongst others) have had many function 
@@ -199,7 +199,7 @@ test_that("dplyr_func_warning", {
 ```
 
 
-## Use arrow functions in dplyr verbs in arrow
+## Use Arrow functions in dplyr verbs in Arrow
 
 You want to use a function which is implemented in Arrow's C++ library but either:
 
@@ -313,7 +313,7 @@ Although not all Arrow C++ compute functions require options to be specified,
 most do.  For these functions to work in R, they must be linked up 
 with the appropriate libarrow options C++ class via the R 
 package's C++ code.  At the time of writing, all compute functions available in
-the development version of the arrow R package had been associated with their options
+the development version of the Arrow R package had been associated with their options
 classes.  However, as the Arrow C++ library's functionality extends, compute 
 functions may be added which do not yet have an R binding.  If you find a C++ 
 compute function which you wish to use from the R package, please [open an issue
