[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

GitBox Tue, 03 May 2022 10:51:57 -0700


wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r864037507



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),

Review Comment:
   I think I'd rather limit these parentheticals to just explaining
abbreviations (tz, dst, epiyear), rather than having them try to function as
docs. We link to the lubridate function docs directly for each bullet, so more
detail is readily available to the reader.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
