westonpace commented on a change in pull request #91:
URL: https://github.com/apache/arrow-cookbook/pull/91#discussion_r737854751



##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -359,3 +358,121 @@ unlink("my_table.parquet")
 unlink("dist_time.parquet")
 unlink("airquality_partitioned", recursive = TRUE)
 ```
+
+## Write compressed data
+
+You want to save a file, compressed with a specified compression algorithm.
+
+### Solution
+
+```{r, parquet_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write data compressed with the gzip algorithm
+write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
+```
+
+```{r, test_parquet_gzip, opts.label = "test"}
+test_that("parquet_gzip", {
+  file.exists(file.path(td, "iris.parquet"))
+})
+```
+
+### Discussion
+
+You can also supply the `compression` argument to `write_dataset()`, as long as
+the compression algorithm is compatible with the chosen format.
+
+```{r, dataset_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, format = "feather", compression = "gzip")

Review comment:
       Is `gzip` supported with `feather`? I thought it was only `lz4` and `zstd`.
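
   To make the reviewer's point concrete: the Feather (Arrow IPC) writer exposes `lz4` and `zstd` (plus `uncompressed`), not `gzip`. A hedged sketch of what the recipe could use instead, assuming the `arrow` package is attached and the build includes zstd support:

   ```r
   library(arrow)

   # Create a temporary directory
   td <- tempfile()
   dir.create(td)

   # Feather supports lz4 and zstd; gzip would raise an error here
   write_feather(iris, file.path(td, "iris.feather"), compression = "zstd")
   ```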

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -359,3 +358,121 @@ unlink("my_table.parquet")
 unlink("dist_time.parquet")
 unlink("airquality_partitioned", recursive = TRUE)
 ```
+
+## Write compressed data
+
+You want to save a file, compressed with a specified compression algorithm.
+
+### Solution
+
+```{r, parquet_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write data compressed with the gzip algorithm
+write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
+```
+
+```{r, test_parquet_gzip, opts.label = "test"}
+test_that("parquet_gzip", {
+  file.exists(file.path(td, "iris.parquet"))
+})
+```
+
+### Discussion
+
+You can also supply the `compression` argument to `write_dataset()`, as long as
+the compression algorithm is compatible with the chosen format.
+
+```{r, dataset_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, format = "feather", compression = "gzip")
+```
+
+```{r}
+# View files in the directory
+list.files(td, recursive = TRUE)
+```
+```{r, test_dataset_gzip, opts.label = "test"}
+test_that("dataset_gzip", {
+  file.exists(file.path(td, "part-0.parquet"))

Review comment:
       You specified `format = "feather"` above but you are looking for `part-0.parquet`. Something seems off.
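
   If the recipe keeps `format = "feather"`, the test would presumably need to check for a Feather part file rather than `part-0.parquet`. A sketch that sidesteps hard-coding the name by listing what was actually written (assuming the `arrow` package; `zstd` is used since `gzip` is not a Feather codec):

   ```r
   library(arrow)

   # Create a temporary directory
   td <- tempfile()
   dir.create(td)

   # Write a Feather-format dataset with a Feather-compatible codec
   write_dataset(iris, path = td, format = "feather", compression = "zstd")

   # Inspect the actual part-file name instead of assuming "part-0.parquet"
   list.files(td, recursive = TRUE)
   ```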

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -359,3 +358,121 @@ unlink("my_table.parquet")
 unlink("dist_time.parquet")
 unlink("airquality_partitioned", recursive = TRUE)
 ```
+
+## Write compressed data
+
+You want to save a file, compressed with a specified compression algorithm.
+
+### Solution
+
+```{r, parquet_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write data compressed with the gzip algorithm
+write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
+```
+
+```{r, test_parquet_gzip, opts.label = "test"}
+test_that("parquet_gzip", {
+  file.exists(file.path(td, "iris.parquet"))
+})
+```
+
+### Discussion
+
+You can also supply the `compression` argument to `write_dataset()`, as long as

Review comment:
       I know the recipe is "specified compression algorithm" but the default behavior may deserve a link or a callout here. Arrow compresses parquet by default so the above example would only be used if `gzip` was preferred over the default algorithm for some reason.
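
   To illustrate the default-behavior point: `write_parquet()` already compresses its output without any `compression` argument (typically snappy, when the build supports it), so the explicit `gzip` call is only an override. A sketch, assuming the `arrow` package:

   ```r
   library(arrow)

   # Create a temporary directory
   td <- tempfile()
   dir.create(td)

   # Default: parquet output is already compressed (usually snappy)
   write_parquet(iris, file.path(td, "iris_default.parquet"))

   # Explicit override, as in the recipe above
   write_parquet(iris, file.path(td, "iris_gzip.parquet"), compression = "gzip")
   ```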

##########
File path: r/content/reading_and_writing_data.Rmd
##########
@@ -359,3 +358,121 @@ unlink("my_table.parquet")
 unlink("dist_time.parquet")
 unlink("airquality_partitioned", recursive = TRUE)
 ```
+
+## Write compressed data
+
+You want to save a file, compressed with a specified compression algorithm.
+
+### Solution
+
+```{r, parquet_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write data compressed with the gzip algorithm
+write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
+```
+
+```{r, test_parquet_gzip, opts.label = "test"}
+test_that("parquet_gzip", {
+  file.exists(file.path(td, "iris.parquet"))
+})
+```
+
+### Discussion
+
+You can also supply the `compression` argument to `write_dataset()`, as long as
+the compression algorithm is compatible with the chosen format.
+
+```{r, dataset_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, format = "feather", compression = "gzip")
+```
+
+```{r}
+# View files in the directory
+list.files(td, recursive = TRUE)
+```
+```{r, test_dataset_gzip, opts.label = "test"}
+test_that("dataset_gzip", {
+  file.exists(file.path(td, "part-0.parquet"))
+})
+```
+
+### See also
+
+For more information on the supported compression algorithms, see:
+
+* `?write_parquet()`
+* `?write_feather()`
+* `?write_dataset()`
+
+## Read compressed data
+
+You want to read in data which has been compressed.
+
+### Solution
+
+```{r, read_parquet_compressed}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset which is to be read back in
+write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
+
+# Read in data
+ds <- read_parquet(file.path(td, "iris.parquet")) %>%
+  collect()
+
+ds
+```
+
+```{r, test_read_parquet_compressed, opts.label = "test"}
+test_that("read_parquet_compressed", {
+  expect_s3_class(ds, "data.frame")
+  expect_named(
+    ds,
+    c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
+  )
+})
+```
+
+### Discussion
+
+Note that Arrow automatically detects the compression and you do not have to 
+supply it in the call to `open_dataset()` or the `read_*()` functions.
+
+Although the CSV format does not support compression itself, Arrow supports 
+reading in CSV data which has been compressed.
+
+```{r, read_compressed_csv}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset which is to be read back in
+write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)

Review comment:
       This only works if the file extension is `.gz`. We should probably mention that somewhere.
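
   A sketch of the point being made: Arrow infers the codec from the `.gz` extension, so the same bytes saved under a name without that suffix would not be detected automatically. Assuming the `arrow` package:

   ```r
   library(arrow)

   # Create a temporary directory
   td <- tempfile()
   dir.create(td)

   # The .gz extension is what lets Arrow detect the gzip compression
   write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)

   # No compression argument needed: inferred from the file extension
   ds <- read_csv_arrow(file.path(td, "iris.csv.gz"))
   head(ds)
   ```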




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
