This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
     new 6a3f9eb  ARROW-9473: [Doc] Polishing for 1.0
6a3f9eb is described below

commit 6a3f9ebb0eb118b9289694c5085f6d9ab2aa575d
Author: Neal Richardson <neal.p.richard...@gmail.com>
AuthorDate: Tue Jul 14 16:06:17 2020 -0700

    ARROW-9473: [Doc] Polishing for 1.0

    Closes #7766 from nealrichardson/docs-1.0

    Authored-by: Neal Richardson <neal.p.richard...@gmail.com>
    Signed-off-by: Neal Richardson <neal.p.richard...@gmail.com>
---
 docs/source/format/Versioning.rst           | 12 +++++-----
 r/NEWS.md                                   | 35 ++++++++++++++++++++++++-----
 r/R/ipc_stream.R                            |  3 +--
 r/R/record-batch.R                          |  5 -----
 r/R/schema.R                                |  2 +-
 r/R/table.R                                 |  5 -----
 r/README.md                                 |  1 -
 r/_pkgdown.yml                              |  2 +-
 r/man/RecordBatch.Rd                        |  5 -----
 r/man/Table.Rd                              |  5 -----
 r/man/read_ipc_stream.Rd                    |  3 +--
 r/man/unify_schemas.Rd                      |  2 +-
 r/tests/testthat/test-read-record-batch.R   |  2 +-
 r/tests/testthat/test-record-batch-reader.R |  7 +++++-
 r/vignettes/install.Rmd                     |  4 ++--
 15 files changed, 50 insertions(+), 43 deletions(-)

diff --git a/docs/source/format/Versioning.rst b/docs/source/format/Versioning.rst
index 2ed6670..b706569 100644
--- a/docs/source/format/Versioning.rst
+++ b/docs/source/format/Versioning.rst
@@ -18,8 +18,8 @@
 Format Versioning and Stability
 ===============================
 
-Starting with version 1.0.0 (not yet released), Apache Arrow utilizes
-**two versions** to describe each release of the project. These are
+Starting with version 1.0.0, Apache Arrow utilizes
+**two versions** to describe each release of the project: the
 **Format Version** and the **Library Version**. Each Library
 Version has a corresponding Format Version, and multiple versions of
 the library may have the same format version. For example, library
@@ -56,15 +56,15 @@ Long-Term Stability
 
 A change in the format major version (e.g. from 1.0.0 to 2.0.0)
 indicates a disruption to these compatibility guarantees in some way.
-We **do not expect** this to be a frequent occurrence starting with
-the 1.0.0 library and format release. This would be an exceptional
+We **do not expect** this to be a frequent occurrence.
+This would be an exceptional
 event and, should this come to pass, we would exercise caution in
 ensuring that production applications are not harmed.
 
 Pre-1.0.0 Versions
 ------------------
 
-We have made no forward or backward compatibility guarantees for
-versions prior to 1.0.0. However, we are making every effort to ensure
+We made no forward or backward compatibility guarantees for
+versions prior to 1.0.0. However, we made every effort to ensure
 that new clients can read serialized data produced by library version
 0.8.0 and onward.
diff --git a/r/NEWS.md b/r/NEWS.md
index 1679e9a..c810487 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,22 +19,47 @@
 
 # arrow 0.17.1.9000
 
+## Arrow format conversion
+
+* `vignette("arrow", package = "arrow")` includes tables that explain how R types are converted to Arrow types and vice versa.
+* Support added for converting to/from more Arrow types: `uint64`, `binary`, `fixed_size_binary`, `large_binary`, `large_utf8`, `large_list`, `list` of `structs`.
+* `character` vectors that exceed 2GB are converted to Arrow `large_utf8` type
+* `POSIXlt` objects can now be converted to Arrow (`struct`)
+* R `attributes()` are preserved in Arrow metadata when converting to Arrow RecordBatch and table and are restored when converting from Arrow. This means that custom subclasses, such as `haven::labelled`, are preserved in round trip through Arrow.
+* Schema metadata is now exposed as a named list, and it can be modified by assignment like `batch$metadata$new_key <- "new value"`
+* Arrow types `int64`, `uint32`, and `uint64` now are converted to R `integer` if all values fit in bounds
+* Arrow `date32` is now converted to R `Date` with `double` underlying storage.
Even though the data values themselves are integers, this provides more strict round-trip fidelity
+* When converting to R `factor`, `dictionary` ChunkedArrays that do not have identical dictionaries are properly unified
+* In the 1.0 release, the Arrow IPC metadata version is increased from V4 to V5. By default, `RecordBatch{File,Stream}Writer` will write V5, but you can specify an alternate `metadata_version`. For convenience, if you know the consumer you're writing to cannot read V5, you can set the environment variable `ARROW_PRE_1_0_METADATA_VERSION=1` to write V4 without changing any other code.
+
 ## Datasets
 
 * CSV and other text-delimited datasets are now supported
-* Read datasets directly on S3 by passing a URL like `ds <- open_dataset("s3://...")`. Note that this currently requires a special C++ library build with additional dependencies; that is, this is not yet available in CRAN releases or in nightly packages.
+* With a custom C++ build, it is possible to read datasets directly on S3 by passing a URL like `ds <- open_dataset("s3://...")`. Note that this currently requires a special C++ library build with additional dependencies--this is not yet available in CRAN releases or in nightly packages.
 * When reading individual CSV and JSON files, compression is automatically detected from the file extension
 
-## Other
+## Other enhancements
 
 * Initial support for C++ aggregation methods: `sum()` and `mean()` are implemented for `Array` and `ChunkedArray`
-* Schema metadata is now exposed as a named list, and it can be modified by assignment like `batch$metadata$new_key <- "new value"`
 * Tables and RecordBatches have additional data.frame-like methods, including `dimnames()` and `as.list()`
-* Linux installation: some tweaks to OS detection for binaries, some updates to known installation issues in the vignette.
-* Various streamlining efforts to reduce library size and compile time.
+* Tables and ChunkedArrays can now be moved to/from Python via `reticulate`
+
+## Bug fixes and deprecations
+
+* Non-UTF-8 strings (common on Windows) are correctly coerced to UTF-8 when passing to Arrow memory and appropriately re-localized when converting to R
+* The `coerce_timestamps` option to `write_parquet()` is now correctly implemented.
+* Creating a Dictionary array respects the `type` definition if provided by the user
 * `read_arrow` and `write_arrow` are now deprecated; use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you're working with the Arrow IPC file or stream format, respectively.
 * Previously deprecated `FileStats`, `read_record_batch`, and `read_table` have been removed.
 
+## Installation and packaging
+
+* For improved performance in memory allocation, macOS and Linux binaries now have `jemalloc` included, and Windows packages use `mimalloc`
+* Linux installation: some tweaks to OS detection for binaries, some updates to known installation issues in the vignette
+* The bundled libarrow is built with the same `CC` and `CXX` values that R uses
+* Failure to build the bundled libarrow yields a clear message
+* Various streamlining efforts to reduce library size and compile time
+
 # arrow 0.17.1
 
 * Updates for compatibility with `dplyr` 1.0
diff --git a/r/R/ipc_stream.R b/r/R/ipc_stream.R
index ebc5b77..0c728b2 100644
--- a/r/R/ipc_stream.R
+++ b/r/R/ipc_stream.R
@@ -82,8 +82,7 @@ write_to_raw <- function(x, format = c("stream", "file")) {
 #' `read_arrow()`, a wrapper around `read_ipc_stream()` and `read_feather()`,
 #' is deprecated. You should explicitly choose
 #' the function that will read the desired IPC format (stream or file) since
-#' a file or `InputStream` may contain either. `read_table()`, a wrapper around
-#' `read_arrow()`, is also deprecated
+#' a file or `InputStream` may contain either.
 #'
 #' @param file A character file name, `raw` vector, or an Arrow input stream.
 #' If a file name, a memory-mapped Arrow [InputStream] will be opened and
diff --git a/r/R/record-batch.R b/r/R/record-batch.R
index 6e4705f..cc68348 100644
--- a/r/R/record-batch.R
+++ b/r/R/record-batch.R
@@ -38,11 +38,6 @@
 #' "Slice" method function even if there were a column in the table called
 #' "Slice".
 #'
-#' A caveat about the `[` method for row operations: only "slicing" is
-#' currently supported. That is, you can select a continuous range of rows
-#' from the table, but you can't filter with a `logical` vector or take an
-#' arbitrary selection of rows by integer indices.
-#'
 #' @section R6 Methods:
 #' In addition to the more R-friendly S3 methods, a `RecordBatch` object has
 #' the following R6 methods that map onto the underlying C++ methods:
diff --git a/r/R/schema.R b/r/R/schema.R
index 839326a..963e5f4 100644
--- a/r/R/schema.R
+++ b/r/R/schema.R
@@ -189,7 +189,7 @@ read_schema <- function(stream, ...) {
 #' \dontrun{
 #' a <- schema(b = double(), c = bool())
 #' z <- schema(b = double(), k = utf8())
-#' unify_schemas(a, z),
+#' unify_schemas(a, z)
 #' }
 unify_schemas <- function(..., schemas = list(...)) {
   shared_ptr(Schema, arrow__UnifySchemas(schemas))
diff --git a/r/R/table.R b/r/R/table.R
index 64095f8..1391eee 100644
--- a/r/R/table.R
+++ b/r/R/table.R
@@ -47,11 +47,6 @@
 #' "Slice" method function even if there were a column in the table called
 #' "Slice".
 #'
-#' A caveat about the `[` method for row operations: only "slicing" is
-#' currently supported. That is, you can select a continuous range of rows
-#' from the table, but you can't filter with a `logical` vector or take an
-#' arbitrary selection of rows by integer indices.
-#'
 #' @section R6 Methods:
 #' In addition to the more R-friendly S3 methods, a `Table` object has
 #' the following R6 methods that map onto the underlying C++ methods:
diff --git a/r/README.md b/r/README.md
index e8972a0..a0e2034 100644
--- a/r/README.md
+++ b/r/README.md
@@ -3,7 +3,6 @@
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
-[![codecov](https://codecov.io/gh/ursa-labs/arrow-r-nightly/branch/master/graph/badge.svg)](https://codecov.io/gh/ursa-labs/arrow-r-nightly)
 
 [Apache Arrow](https://arrow.apache.org/) is a cross-language
 development platform for in-memory data. It specifies a standardized
diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml
index c68d153..ff48eef 100644
--- a/r/_pkgdown.yml
+++ b/r/_pkgdown.yml
@@ -101,6 +101,7 @@ reference:
     - record_batch
     - RecordBatch
    - Table
+    - Scalar
     - read_message
 - title: Arrow data types and schema
   contents:
@@ -130,7 +131,6 @@
     - default_memory_pool
     - FileSystem
     - FileInfo
-    - FileStats
     - FileSelector
 - title: Configuration
   contents:
diff --git a/r/man/RecordBatch.Rd b/r/man/RecordBatch.Rd
index a57bb0c..40c3496 100644
--- a/r/man/RecordBatch.Rd
+++ b/r/man/RecordBatch.Rd
@@ -35,11 +35,6 @@ A caveat about the \code{$} method: because \code{RecordBatch} is an \code{R6} o
 precedence over the table's columns. So, \code{batch$Slice} would return the
 "Slice" method function even if there were a column in the table called
 "Slice".
-
-A caveat about the \code{[} method for row operations: only "slicing" is
-currently supported.
That is, you can select a continuous range of rows
-from the table, but you can't filter with a \code{logical} vector or take an
-arbitrary selection of rows by integer indices.
 }
 
 \section{R6 Methods}{
diff --git a/r/man/Table.Rd b/r/man/Table.Rd
index aebb2b3..2014a30 100644
--- a/r/man/Table.Rd
+++ b/r/man/Table.Rd
@@ -35,11 +35,6 @@ A caveat about the \code{$} method: because \code{Table} is an \code{R6} object,
 precedence over the table's columns. So, \code{tab$Slice} would return the
 "Slice" method function even if there were a column in the table called
 "Slice".
-
-A caveat about the \code{[} method for row operations: only "slicing" is
-currently supported. That is, you can select a continuous range of rows
-from the table, but you can't filter with a \code{logical} vector or take an
-arbitrary selection of rows by integer indices.
 }
 
 \section{R6 Methods}{
diff --git a/r/man/read_ipc_stream.Rd b/r/man/read_ipc_stream.Rd
index 0ea54f6..1cc969b 100644
--- a/r/man/read_ipc_stream.Rd
+++ b/r/man/read_ipc_stream.Rd
@@ -33,8 +33,7 @@ and \code{\link[=read_feather]{read_feather()}} read those formats, respectively
 \code{read_arrow()}, a wrapper around \code{read_ipc_stream()} and \code{read_feather()},
 is deprecated. You should explicitly choose
 the function that will read the desired IPC format (stream or file) since
-a file or \code{InputStream} may contain either. \code{read_table()}, a wrapper around
-\code{read_arrow()}, is also deprecated
+a file or \code{InputStream} may contain either.
 }
 \seealso{
 \code{\link[=read_feather]{read_feather()}} for reading IPC files.
 \link{RecordBatchReader} for a
diff --git a/r/man/unify_schemas.Rd b/r/man/unify_schemas.Rd
index f7d01a1..a6b7ec0 100644
--- a/r/man/unify_schemas.Rd
+++ b/r/man/unify_schemas.Rd
@@ -21,6 +21,6 @@ Combine and harmonize schemas
 \dontrun{
 a <- schema(b = double(), c = bool())
 z <- schema(b = double(), k = utf8())
-unify_schemas(a, z),
+unify_schemas(a, z)
 }
 }
diff --git a/r/tests/testthat/test-read-record-batch.R b/r/tests/testthat/test-read-record-batch.R
index 2412743..8eb196a 100644
--- a/r/tests/testthat/test-read-record-batch.R
+++ b/r/tests/testthat/test-read-record-batch.R
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("read_record_batch()")
+context("reading RecordBatches")
 
 test_that("RecordBatchFileWriter / RecordBatchFileReader roundtrips", {
   tab <- Table$create(
diff --git a/r/tests/testthat/test-record-batch-reader.R b/r/tests/testthat/test-record-batch-reader.R
index 2b621ed..e03664e 100644
--- a/r/tests/testthat/test-record-batch-reader.R
+++ b/r/tests/testthat/test-record-batch-reader.R
@@ -80,8 +80,13 @@ test_that("MetadataFormat", {
   expect_identical(get_ipc_metadata_version("V4"), 3L)
   expect_identical(get_ipc_metadata_version(NULL), 4L)
   Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = 1)
-  on.exit(Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = ""))
   expect_identical(get_ipc_metadata_version(NULL), 3L)
+  Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = "")
+
+  expect_identical(get_ipc_metadata_version(NULL), 4L)
+  Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = 1)
+  expect_identical(get_ipc_metadata_version(NULL), 3L)
+  Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = "")
 
   expect_error(
     get_ipc_metadata_version(99),
diff --git a/r/vignettes/install.Rmd b/r/vignettes/install.Rmd
index 2dad01e..d7b4156 100644
--- a/r/vignettes/install.Rmd
+++ b/r/vignettes/install.Rmd
@@ -264,8 +264,8 @@
 See discussion [here](https://issues.apache.org/jira/browse/ARROW-8586).
 
 * If you have multiple versions of `zstd` installed on your system,
 installation by building the C++ from source may fail with an undefined symbols
-error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary;
-(2) setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
+error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2)
+setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
 the conflicting `zstd`. See discussion
 [here](https://issues.apache.org/jira/browse/ARROW-8556).
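Two of the user-facing changes described in the NEWS entries above (schema-metadata assignment and the V4 metadata fallback) can be sketched in a few lines of R. This is a minimal illustration, not part of the commit; it assumes the arrow R package from this release is installed:

```r
library(arrow)

# Schema metadata is exposed as a named list and can be modified by assignment
batch <- record_batch(x = 1:3)
batch$metadata$new_key <- "new value"
print(batch$metadata$new_key)

# Consumers that cannot read IPC metadata V5 can be sent V4 without any other
# code changes by setting this environment variable before writing:
Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = 1)
tf <- tempfile()
write_ipc_stream(batch, tf)
Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = "")
```

Unsetting the variable afterward restores the default V5 behavior for subsequent writes.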