This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
     new 6a3f9eb  ARROW-9473: [Doc] Polishing for 1.0
6a3f9eb is described below

commit 6a3f9ebb0eb118b9289694c5085f6d9ab2aa575d
Author: Neal Richardson <neal.p.richard...@gmail.com>
AuthorDate: Tue Jul 14 16:06:17 2020 -0700

    ARROW-9473: [Doc] Polishing for 1.0

    Closes #7766 from nealrichardson/docs-1.0

    Authored-by: Neal Richardson <neal.p.richard...@gmail.com>
    Signed-off-by: Neal Richardson <neal.p.richard...@gmail.com>
---
 docs/source/format/Versioning.rst           | 12 +++++-----
 r/NEWS.md                                   | 35 ++++++++++++++++++++++++-----
 r/R/ipc_stream.R                            |  3 +--
 r/R/record-batch.R                          |  5 -----
 r/R/schema.R                                |  2 +-
 r/R/table.R                                 |  5 -----
 r/README.md                                 |  1 -
 r/_pkgdown.yml                              |  2 +-
 r/man/RecordBatch.Rd                        |  5 -----
 r/man/Table.Rd                              |  5 -----
 r/man/read_ipc_stream.Rd                    |  3 +--
 r/man/unify_schemas.Rd                      |  2 +-
 r/tests/testthat/test-read-record-batch.R   |  2 +-
 r/tests/testthat/test-record-batch-reader.R |  7 +++++-
 r/vignettes/install.Rmd                     |  4 ++--
 15 files changed, 50 insertions(+), 43 deletions(-)

diff --git a/docs/source/format/Versioning.rst b/docs/source/format/Versioning.rst
index 2ed6670..b706569 100644
--- a/docs/source/format/Versioning.rst
+++ b/docs/source/format/Versioning.rst
@@ -18,8 +18,8 @@
 Format Versioning and Stability
 ===============================
 
-Starting with version 1.0.0 (not yet released), Apache Arrow utilizes
-**two versions** to describe each release of the project. These are
+Starting with version 1.0.0, Apache Arrow utilizes
+**two versions** to describe each release of the project: the
 **Format Version** and the **Library Version**. Each Library
 Version has a corresponding Format Version, and multiple versions of
 the library may have the same format version. For example, library
@@ -56,15 +56,15 @@ Long-Term Stability
 
 A change in the format major version (e.g. from 1.0.0 to 2.0.0)
 indicates a disruption to these compatibility guarantees in some way.
-We **do not expect** this to be a frequent occurrence starting with
-the 1.0.0 library and format release. This would be an exceptional
+We **do not expect** this to be a frequent occurrence.
+This would be an exceptional
 event and, should this come to pass, we would exercise caution in
 ensuring that production applications are not harmed.
 
 Pre-1.0.0 Versions
 ------------------
 
-We have made no forward or backward compatibility guarantees for
-versions prior to 1.0.0. However, we are making every effort to ensure
+We made no forward or backward compatibility guarantees for
+versions prior to 1.0.0. However, we made every effort to ensure
 that new clients can read serialized data produced by library version
 0.8.0 and onward.
diff --git a/r/NEWS.md b/r/NEWS.md
index 1679e9a..c810487 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,22 +19,47 @@
 
 # arrow 0.17.1.9000
 
+## Arrow format conversion
+
+* `vignette("arrow", package = "arrow")` includes tables that explain how R types are converted to Arrow types and vice versa.
+* Support added for converting to/from more Arrow types: `uint64`, `binary`, `fixed_size_binary`, `large_binary`, `large_utf8`, `large_list`, `list` of `structs`.
+* `character` vectors that exceed 2GB are converted to Arrow `large_utf8` type
+* `POSIXlt` objects can now be converted to Arrow (`struct`)
+* R `attributes()` are preserved in Arrow metadata when converting to Arrow RecordBatch and table and are restored when converting from Arrow. This means that custom subclasses, such as `haven::labelled`, are preserved in round trip through Arrow.
+* Schema metadata is now exposed as a named list, and it can be modified by assignment like `batch$metadata$new_key <- "new value"`
+* Arrow types `int64`, `uint32`, and `uint64` now are converted to R `integer` if all values fit in bounds
+* Arrow `date32` is now converted to R `Date` with `double` underlying storage.
Even though the data values themselves are integers, this provides more strict round-trip fidelity
+* When converting to R `factor`, `dictionary` ChunkedArrays that do not have identical dictionaries are properly unified
+* In the 1.0 release, the Arrow IPC metadata version is increased from V4 to V5. By default, `RecordBatch{File,Stream}Writer` will write V5, but you can specify an alternate `metadata_version`. For convenience, if you know the consumer you're writing to cannot read V5, you can set the environment variable `ARROW_PRE_1_0_METADATA_VERSION=1` to write V4 without changing any other code.
+
 ## Datasets
 
 * CSV and other text-delimited datasets are now supported
-* Read datasets directly on S3 by passing a URL like `ds <- open_dataset("s3://...")`. Note that this currently requires a special C++ library build with additional dependencies; that is, this is not yet available in CRAN releases or in nightly packages.
+* With a custom C++ build, it is possible to read datasets directly on S3 by passing a URL like `ds <- open_dataset("s3://...")`. Note that this currently requires a special C++ library build with additional dependencies--this is not yet available in CRAN releases or in nightly packages.
 * When reading individual CSV and JSON files, compression is automatically detected from the file extension
 
-## Other
+## Other enhancements
 
 * Initial support for C++ aggregation methods: `sum()` and `mean()` are implemented for `Array` and `ChunkedArray`
-* Schema metadata is now exposed as a named list, and it can be modified by assignment like `batch$metadata$new_key <- "new value"`
 * Tables and RecordBatches have additional data.frame-like methods, including `dimnames()` and `as.list()`
-* Linux installation: some tweaks to OS detection for binaries, some updates to known installation issues in the vignette.
-* Various streamlining efforts to reduce library size and compile time.
+* Tables and ChunkedArrays can now be moved to/from Python via `reticulate`
+
+## Bug fixes and deprecations
+
+* Non-UTF-8 strings (common on Windows) are correctly coerced to UTF-8 when passing to Arrow memory and appropriately re-localized when converting to R
+* The `coerce_timestamps` option to `write_parquet()` is now correctly implemented.
+* Creating a Dictionary array respects the `type` definition if provided by the user
 * `read_arrow` and `write_arrow` are now deprecated; use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you're working with the Arrow IPC file or stream format, respectively.
 * Previously deprecated `FileStats`, `read_record_batch`, and `read_table` have been removed.
 
+## Installation and packaging
+
+* For improved performance in memory allocation, macOS and Linux binaries now have `jemalloc` included, and Windows packages use `mimalloc`
+* Linux installation: some tweaks to OS detection for binaries, some updates to known installation issues in the vignette
+* The bundled libarrow is built with the same `CC` and `CXX` values that R uses
+* Failure to build the bundled libarrow yields a clear message
+* Various streamlining efforts to reduce library size and compile time
+
 # arrow 0.17.1
 
 * Updates for compatibility with `dplyr` 1.0
diff --git a/r/R/ipc_stream.R b/r/R/ipc_stream.R
index ebc5b77..0c728b2 100644
--- a/r/R/ipc_stream.R
+++ b/r/R/ipc_stream.R
@@ -82,8 +82,7 @@ write_to_raw <- function(x, format = c("stream", "file")) {
 #' `read_arrow()`, a wrapper around `read_ipc_stream()` and `read_feather()`,
 #' is deprecated. You should explicitly choose
 #' the function that will read the desired IPC format (stream or file) since
-#' a file or `InputStream` may contain either. `read_table()`, a wrapper around
-#' `read_arrow()`, is also deprecated
+#' a file or `InputStream` may contain either.
 #'
 #' @param file A character file name, `raw` vector, or an Arrow input stream.
 #' If a file name, a memory-mapped Arrow [InputStream] will be opened and
diff --git a/r/R/record-batch.R b/r/R/record-batch.R
index 6e4705f..cc68348 100644
--- a/r/R/record-batch.R
+++ b/r/R/record-batch.R
@@ -38,11 +38,6 @@
 #' "Slice" method function even if there were a column in the table called
 #' "Slice".
 #'
-#' A caveat about the `[` method for row operations: only "slicing" is
-#' currently supported. That is, you can select a continuous range of rows
-#' from the table, but you can't filter with a `logical` vector or take an
-#' arbitrary selection of rows by integer indices.
-#'
 #' @section R6 Methods:
 #' In addition to the more R-friendly S3 methods, a `RecordBatch` object has
 #' the following R6 methods that map onto the underlying C++ methods:
diff --git a/r/R/schema.R b/r/R/schema.R
index 839326a..963e5f4 100644
--- a/r/R/schema.R
+++ b/r/R/schema.R
@@ -189,7 +189,7 @@ read_schema <- function(stream, ...) {
 #' \dontrun{
 #' a <- schema(b = double(), c = bool())
 #' z <- schema(b = double(), k = utf8())
-#' unify_schemas(a, z),
+#' unify_schemas(a, z)
 #' }
 unify_schemas <- function(..., schemas = list(...)) {
   shared_ptr(Schema, arrow__UnifySchemas(schemas))
diff --git a/r/R/table.R b/r/R/table.R
index 64095f8..1391eee 100644
--- a/r/R/table.R
+++ b/r/R/table.R
@@ -47,11 +47,6 @@
 #' "Slice" method function even if there were a column in the table called
 #' "Slice".
 #'
-#' A caveat about the `[` method for row operations: only "slicing" is
-#' currently supported. That is, you can select a continuous range of rows
-#' from the table, but you can't filter with a `logical` vector or take an
-#' arbitrary selection of rows by integer indices.
-#'
 #' @section R6 Methods:
 #' In addition to the more R-friendly S3 methods, a `Table` object has
 #' the following R6 methods that map onto the underlying C++ methods:
diff --git a/r/README.md b/r/README.md
index e8972a0..a0e2034 100644
--- a/r/README.md
+++ b/r/README.md
@@ -3,7 +3,6 @@
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
-[![codecov](https://codecov.io/gh/ursa-labs/arrow-r-nightly/branch/master/graph/badge.svg)](https://codecov.io/gh/ursa-labs/arrow-r-nightly)
 
 [Apache Arrow](https://arrow.apache.org/) is a cross-language
 development platform for in-memory data. It specifies a standardized
diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml
index c68d153..ff48eef 100644
--- a/r/_pkgdown.yml
+++ b/r/_pkgdown.yml
@@ -101,6 +101,7 @@ reference:
     - record_batch
     - RecordBatch
    - Table
+    - Scalar
     - read_message
 - title: Arrow data types and schema
   contents:
@@ -130,7 +131,6 @@
     - default_memory_pool
     - FileSystem
     - FileInfo
-    - FileStats
     - FileSelector
 - title: Configuration
   contents:
diff --git a/r/man/RecordBatch.Rd b/r/man/RecordBatch.Rd
index a57bb0c..40c3496 100644
--- a/r/man/RecordBatch.Rd
+++ b/r/man/RecordBatch.Rd
@@ -35,11 +35,6 @@ A caveat about the \code{$} method: because \code{RecordBatch} is an \code{R6} o
 precedence over the table's columns. So, \code{batch$Slice} would return the
 "Slice" method function even if there were a column in the table called
 "Slice".
-
-A caveat about the \code{[} method for row operations: only "slicing" is
-currently supported.
That is, you can select a continuous range of rows
-from the table, but you can't filter with a \code{logical} vector or take an
-arbitrary selection of rows by integer indices.
 }
 
 \section{R6 Methods}{
diff --git a/r/man/Table.Rd b/r/man/Table.Rd
index aebb2b3..2014a30 100644
--- a/r/man/Table.Rd
+++ b/r/man/Table.Rd
@@ -35,11 +35,6 @@ A caveat about the \code{$} method: because \code{Table} is an \code{R6} object,
 precedence over the table's columns. So, \code{tab$Slice} would return the
 "Slice" method function even if there were a column in the table called
 "Slice".
-
-A caveat about the \code{[} method for row operations: only "slicing" is
-currently supported. That is, you can select a continuous range of rows
-from the table, but you can't filter with a \code{logical} vector or take an
-arbitrary selection of rows by integer indices.
 }
 
 \section{R6 Methods}{
diff --git a/r/man/read_ipc_stream.Rd b/r/man/read_ipc_stream.Rd
index 0ea54f6..1cc969b 100644
--- a/r/man/read_ipc_stream.Rd
+++ b/r/man/read_ipc_stream.Rd
@@ -33,8 +33,7 @@ and \code{\link[=read_feather]{read_feather()}} read those formats, respectively
 \code{read_arrow()}, a wrapper around \code{read_ipc_stream()} and \code{read_feather()},
 is deprecated. You should explicitly choose
 the function that will read the desired IPC format (stream or file) since
-a file or \code{InputStream} may contain either. \code{read_table()}, a wrapper around
-\code{read_arrow()}, is also deprecated
+a file or \code{InputStream} may contain either.
 }
 \seealso{
 \code{\link[=read_feather]{read_feather()}} for reading IPC files.
 \link{RecordBatchReader} for a
diff --git a/r/man/unify_schemas.Rd b/r/man/unify_schemas.Rd
index f7d01a1..a6b7ec0 100644
--- a/r/man/unify_schemas.Rd
+++ b/r/man/unify_schemas.Rd
@@ -21,6 +21,6 @@ Combine and harmonize schemas
 \dontrun{
 a <- schema(b = double(), c = bool())
 z <- schema(b = double(), k = utf8())
-unify_schemas(a, z),
+unify_schemas(a, z)
 }
 }
diff --git a/r/tests/testthat/test-read-record-batch.R b/r/tests/testthat/test-read-record-batch.R
index 2412743..8eb196a 100644
--- a/r/tests/testthat/test-read-record-batch.R
+++ b/r/tests/testthat/test-read-record-batch.R
@@ -15,7 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.
 
-context("read_record_batch()")
+context("reading RecordBatches")
 
 test_that("RecordBatchFileWriter / RecordBatchFileReader roundtrips", {
   tab <- Table$create(
diff --git a/r/tests/testthat/test-record-batch-reader.R b/r/tests/testthat/test-record-batch-reader.R
index 2b621ed..e03664e 100644
--- a/r/tests/testthat/test-record-batch-reader.R
+++ b/r/tests/testthat/test-record-batch-reader.R
@@ -80,8 +80,13 @@ test_that("MetadataFormat", {
   expect_identical(get_ipc_metadata_version("V4"), 3L)
   expect_identical(get_ipc_metadata_version(NULL), 4L)
   Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = 1)
-  on.exit(Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = ""))
   expect_identical(get_ipc_metadata_version(NULL), 3L)
+  Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = "")
+
+  expect_identical(get_ipc_metadata_version(NULL), 4L)
+  Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = 1)
+  expect_identical(get_ipc_metadata_version(NULL), 3L)
+  Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = "")
 
   expect_error(
     get_ipc_metadata_version(99),
diff --git a/r/vignettes/install.Rmd b/r/vignettes/install.Rmd
index 2dad01e..d7b4156 100644
--- a/r/vignettes/install.Rmd
+++ b/r/vignettes/install.Rmd
@@ -264,8 +264,8 @@
 See discussion [here](https://issues.apache.org/jira/browse/ARROW-8586).
 
 * If you have multiple versions of `zstd` installed on your system,
 installation by building the C++ from source may fail with an undefined symbols
-error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary;
-(2) setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
+error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2)
+setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
 the conflicting `zstd`. See discussion
 [here](https://issues.apache.org/jira/browse/ARROW-8556).
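Two of the user-facing changes described in the NEWS entries above (schema-metadata assignment and the V4 metadata fallback) can be sketched in a few lines of R. This is a minimal illustration, not part of the commit; it assumes the arrow R package from this release is installed:

```r
library(arrow)

# Schema metadata is exposed as a named list and can be modified by assignment
batch <- record_batch(x = 1:3)
batch$metadata$new_key <- "new value"
print(batch$metadata$new_key)

# Consumers that cannot read IPC metadata V5 can be sent V4 without any other
# code changes by setting this environment variable before writing:
Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = 1)
tf <- tempfile()
write_ipc_stream(batch, tf)
Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = "")
```

Unsetting the variable afterward restores the default V5 behavior for subsequent writes.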