djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1023516220
##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---

-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar [`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with `dplyr` and `arrow`. We'll start by ensuring both packages are loaded:

 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```

-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
-directory.
+## Example: NYC taxi data

-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large data sets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow, we host a Parquet-formatted version of this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available.

-If your arrow build has S3 support, you can sync the data locally with:
+This data set comprises 158 distinct Parquet files, each corresponding to a month of data. A single file is typically around 400-500MB in size, and the full data set is about 70GB in size. It is not a small data set -- it is slow to download and does not fit in memory on a typical machine 🙂 -- so we also host a "tiny" version of the NYC taxi data that is formatted in exactly the same way but includes only one out of every thousand entries in the original data set (i.e., individual files are <1MB in size, and the "tiny" data set is only 70MB).

-```{r, eval = FALSE}
-arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-# Alternatively, with GCS:
-arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-```
+If you have Amazon S3 and/or Google Cloud Storage support enabled in `arrow` (true for most users; see links at the end of this article if you need to troubleshoot this), you can connect to the "tiny taxi data" with either of the following commands:

Review Comment:
   updated to clarify that `s3_bucket()` refers to the Amazon S3 copy of the data and `gs_bucket()` refers to the Google Cloud copy
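   The connection commands themselves fall outside the excerpt above, so here is a rough sketch of what they could look like. `s3_bucket()` and `gs_bucket()` are existing `arrow` functions, but the bucket path used for the "tiny" copy below is an assumption for illustration only; the removed `copy_files()` lines in this hunk show only the full-size path `voltrondata-labs-datasets/nyc-taxi`.

   ```r
   library(arrow, warn.conflicts = FALSE)

   # Amazon S3 copy of the tiny taxi data (bucket path assumed for illustration)
   bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi-tiny")

   # Google Cloud Storage copy of the tiny taxi data (bucket path assumed for
   # illustration); public GCS buckets are typically read anonymously
   bucket <- gs_bucket("voltrondata-labs-datasets/nyc-taxi-tiny", anonymous = TRUE)
   ```

   Either object can then be handed to `open_dataset()` to create the Dataset.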
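   To make the "familiar `dplyr` syntax" claim in the new introduction concrete, here is a minimal sketch of the workflow under stated assumptions: it uses the full-size S3 path that appears in the removed `copy_files()` lines, and the column names `year` and `fare_amount` are assumptions based on the standard NYC taxi schema rather than taken from this hunk.

   ```r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)

   # open_dataset() reads only file metadata at this point; no data is loaded
   nyc_taxi <- open_dataset(s3_bucket("voltrondata-labs-datasets/nyc-taxi"))

   # dplyr verbs build a lazy query against the Dataset; nothing is pulled into
   # memory (or over the network) until collect() is called
   nyc_taxi %>%
     filter(year == 2019) %>%
     summarise(
       rides = n(),
       mean_fare = mean(fare_amount, na.rm = TRUE)
     ) %>%
     collect()
   ```

   Running this against the full-size bucket over the network will be slow, which is exactly why the tiny copy exists.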