Re: [PR] fix(r): Collect array streams in C (not R) before conversion [arrow-nanoarrow]
paleolimbot commented on PR #828: URL: https://github.com/apache/arrow-nanoarrow/pull/828#issuecomment-3843530720 Sorry for being slow to respond here...this fell off my radar! You're completely correct that the ideal situation is accumulating R vectors in place, much like `std::vector()` would do. That ability sort of exists in the internal C code but isn't all the way supported (there exists the ability to "reserve" + "fill", but not "reallocate"). Ideally one could "just" implement the "reallocate" part which would solve something similar. I chose a slightly simpler approach in this PR because it was less error prone but accumulating it all in place is also an improvement worth pursuing. Specifically with respect to #823, when we do implement that we need to do it without an `rbind()` or `c()` because that can be slow. One other improvement will be to make the internal `protect`ion cheaper (can be done using cpp11). Even though we don't need it for collection any longer, it does show up in other cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] fix(r): Collect array streams in C (not R) before conversion [arrow-nanoarrow]
klin333 commented on PR #828: URL: https://github.com/apache/arrow-nanoarrow/pull/828#issuecomment-3758563149 Hi @paleolimbot thank you very much for the work on this. I tried this, and yep this indeed solves the crushingly slow gc() i observed in #822. It is still a little bit slower than what i tried to do in PR #823 though, but that's fine i suppose. My understanding is that we are still restreaming, just in C++ and no longer in R, sounds good. Just curious though, given we are collecting the stream into batches of arrays anyway, why not directly convert the arrays, like what i did in 823? Why the restream? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] fix(r): Collect array streams in C (not R) before conversion [arrow-nanoarrow]
paleolimbot merged PR #828: URL: https://github.com/apache/arrow-nanoarrow/pull/828 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] fix(r): Collect array streams in C (not R) before conversion [arrow-nanoarrow]
Copilot commented on code in PR #828:
URL: https://github.com/apache/arrow-nanoarrow/pull/828#discussion_r2587258551
##
r/tools/make-callentries.R:
##
@@ -21,7 +21,7 @@
library(tidyverse)
-src_files <- list.files("src", "\\.(c|cpp)$", full.names = TRUE) %>%
+src_files <- list.files("src", "\\.(c|cc)$", full.names = TRUE) %>%
Review Comment:
The regex pattern `"\\.(c|cc)$"` will match files ending in `.c` or `.cc`,
but will not match `.cpp` files. If the codebase contains `.cpp` files, they
will be excluded. Consider using `"\\.(c|cc|cpp)$"` to include all C/C++ source
files, or verify that only `.c` and `.cc` files should be processed.
```suggestion
src_files <- list.files("src", "\\.(c|cc|cpp)$", full.names = TRUE) %>%
```
##
r/src/nanoarrow_cpp.cc:
##
@@ -201,3 +204,88 @@ extern "C" void
nanoarrow_preserve_and_release_on_other_thread(SEXP obj) {
std::thread worker([obj] { nanoarrow_release_sexp(obj); });
worker.join();
}
+
+// Collector utility for iterating over and collecting batches
+// Keeping this all in a single object reduces the amount of C++ deletion
+// we need to keep track of.
+struct ArrayVector {
+ nanoarrow::UniqueSchema schema;
+ nanoarrow::UniqueArray batch;
+ std::vector vec;
+};
+
+// Use an external pointer to handle deleting the ArrayVector in
+// the event of a longjmp
+static void release_array_vector_xptr(SEXP array_vector_xptr) {
+ auto ptr =
reinterpret_cast(R_ExternalPtrAddr(array_vector_xptr));
+ if (ptr != NULL) {
Review Comment:
Use `nullptr` instead of `NULL` in C++ code for null pointer comparisons.
This is the modern C++ standard and provides better type safety.
```suggestion
if (ptr != nullptr) {
```
##
r/src/nanoarrow_cpp.cc:
##
@@ -201,3 +204,88 @@ extern "C" void
nanoarrow_preserve_and_release_on_other_thread(SEXP obj) {
std::thread worker([obj] { nanoarrow_release_sexp(obj); });
worker.join();
}
+
+// Collector utility for iterating over and collecting batches
+// Keeping this all in a single object reduces the amount of C++ deletion
+// we need to keep track of.
+struct ArrayVector {
+ nanoarrow::UniqueSchema schema;
+ nanoarrow::UniqueArray batch;
+ std::vector vec;
+};
+
+// Use an external pointer to handle deleting the ArrayVector in
+// the event of a longjmp
+static void release_array_vector_xptr(SEXP array_vector_xptr) {
+ auto ptr =
reinterpret_cast(R_ExternalPtrAddr(array_vector_xptr));
+ if (ptr != NULL) {
+delete ptr;
+ }
+}
+
+// Collects the entire array stream and collects the total number of rows and
+// total number of batches so that the R code on the end of this can decide
+// how best to proceed.
+extern "C" SEXP nanoarrow_c_collect_array_stream(SEXP array_stream_xptr, SEXP
n_sexp) {
+ struct ArrowArrayStream* array_stream =
+ nanoarrow_array_stream_from_xptr(array_stream_xptr);
+
+ double n_real = REAL(n_sexp)[0];
+ int n;
+ if (R_FINITE(n_real)) {
+n = (int)n_real;
Review Comment:
Casting `double` to `int` can result in undefined behavior if `n_real`
exceeds the range of `int` (approximately ±2.1 billion). Add a range check
before casting to ensure `n_real` is within `INT_MIN` and `INT_MAX`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
[PR] fix(r): Collect array streams in C (not R) before conversion [arrow-nanoarrow]
paleolimbot opened a new pull request, #828:
URL: https://github.com/apache/arrow-nanoarrow/pull/828
This PR moves the proecess of collecting an array stream from R (where we
had preserve/protect volume issues that made garbage collection very, very
slow) into C/C++.
Reproducer for generating an IPC file with a lot of strings:
```r
library(nanoarrow)
ascii_bytes <- vapply(letters, charToRaw, raw(1), USE.NAMES = FALSE)
random_string_array <- function(n = 1, n_chars = 16) {
data_buffer <- sample(ascii_bytes, n_chars * n, replace = TRUE)
offsets_buffer <- as.integer(seq(0, n * n_chars, length.out = n + 1))
nanoarrow_array_modify(
nanoarrow_array_init(na_string()),
list(
length = n,
null_count = 0,
buffers = list(NULL, offsets_buffer, data_buffer)
)
)
}
random_string_struct <- function(n_rows = 1024, n_cols = 1, n_chars = 16) {
col_names <- sprintf("col%03d", seq_len(n_cols))
col_types <- rep(list(na_string()), n_cols)
names(col_types) <- col_names
schema <- na_struct(col_types)
columns <- lapply(
col_names,
function(...) random_string_array(n_rows, n_chars = n_chars)
)
nanoarrow_array_modify(
nanoarrow_array_init(schema),
list(
length = n_rows,
null_count = 0,
children = columns
)
)
}
random_string_batches <- function(n_batches = 1, n_rows = 1, n_cols = 1,
n_chars = 16) {
lapply(
seq_len(n_batches),
function(...) random_string_struct(n_rows, n_cols, n_chars)
)
}
batches <- random_string_batches(n_batches = 100, n_cols = 160)
stream <- basic_array_stream(batches)
write_nanoarrow(stream, "many_strings.arrows")
```
...in a separate R session, the issues around taking a long time for the GC
to run seemed to go away (but it would be great to have a check!)
```r
library(nanoarrow)
df <- read_nanoarrow("many_strings.arrows") |>
convert_array_stream()
f
nanoarrow:::preserved_count()
#> [1] 0
system.time(gc(), gcFirst = FALSE)
#> user system elapsed
#> 0.036 0.001 0.037
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
