romainfrancois commented on pull request #9615: URL: https://github.com/apache/arrow/pull/9615#issuecomment-834401547
What is the schema of the Fannie Mae data set? Does it have any missing values? Maybe the code goes through this branch:

```cpp
if (arrow::r::can_reuse_memory(x, options.type)) {
  columns[j] = std::make_shared<arrow::ChunkedArray>(
      arrow::r::vec_to_arrow__reuse_memory(x));
}
```

which for now does not benefit from parallelization, and perhaps should, at least when there are some NAs to deal with:

```cpp
// this is only used on some special cases when the arrow Array can just use the memory of
// the R object, via an RBuffer, hence be zero copy
template <int RTYPE, typename RVector, typename Type>
std::shared_ptr<Array> MakeSimpleArray(SEXP x) {
  using value_type = typename arrow::TypeTraits<Type>::ArrayType::value_type;
  RVector vec(x);
  auto n = vec.size();
  auto p_vec_start = reinterpret_cast<const value_type*>(DATAPTR_RO(vec));
  auto p_vec_end = p_vec_start + n;
  std::vector<std::shared_ptr<Buffer>> buffers{nullptr,
                                               std::make_shared<RBuffer<RVector>>(vec)};

  int null_count = 0;

  auto first_na = std::find_if(p_vec_start, p_vec_end, is_NA<value_type>);
  if (first_na < p_vec_end) {
    auto null_bitmap =
        ValueOrStop(AllocateBuffer(BitUtil::BytesForBits(n), gc_memory_pool()));
    internal::FirstTimeBitmapWriter bitmap_writer(null_bitmap->mutable_data(), 0, n);

    // first loop: mark every value before the first NA as valid (bit set)
    auto j = std::distance(p_vec_start, first_na);
    int i = 0;
    for (; i < j; i++, bitmap_writer.Next()) {
      bitmap_writer.Set();
    }

    auto p_vec = first_na;
    // then finish: clear the bit for each NA, set it otherwise
    for (; i < n; i++, bitmap_writer.Next(), ++p_vec) {
      if (is_NA<value_type>(*p_vec)) {
        bitmap_writer.Clear();
        null_count++;
      } else {
        bitmap_writer.Set();
      }
    }

    bitmap_writer.Finish();
    buffers[0] = std::move(null_bitmap);
  }

  auto data = ArrayData::Make(std::make_shared<Type>(), LENGTH(x), std::move(buffers),
                              null_count, 0 /*offset*/);

  // return the right Array class
  return std::make_shared<typename TypeTraits<Type>::ArrayType>(data);
}
```

The `find_if()` call and the body of the `if (first_na < p_vec_end) {` branch are the only places where this function does real work (a scan for the first NA, then building the null bitmap), but everything is in place for it to benefit from parallelization. I'll look at this in the next few days.
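To make the parallelization idea concrete, here is a minimal standalone sketch of one way the null bitmap could be built by several tasks at once. It is not the Arrow code and not necessarily what the PR will end up doing: `BuildValidityParallel`, `FillRange`, `NullBitmap` and `kNaInt` are hypothetical names, it uses plain `std::async` rather than Arrow's own task machinery, and it assumes an R integer vector where NA is stored as `INT_MIN`. The one real constraint it illustrates is that chunk boundaries fall on multiples of 8 values, so no two tasks ever write to the same bitmap byte.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <limits>
#include <vector>

// Assumption: R stores NA_integer_ as INT_MIN.
constexpr int32_t kNaInt = std::numeric_limits<int32_t>::min();

// Hypothetical result type: LSB-first validity bits (1 = valid) plus null count.
struct NullBitmap {
  std::vector<uint8_t> bits;
  int64_t null_count = 0;
};

// Set the validity bits for values[begin, end) and return the number of NAs.
// `begin` must be a multiple of 8 so this range owns whole bitmap bytes.
static int64_t FillRange(const int32_t* values, size_t begin, size_t end,
                         uint8_t* bits) {
  int64_t nulls = 0;
  for (size_t i = begin; i < end; ++i) {
    if (values[i] == kNaInt) {
      ++nulls;  // bit stays 0 (bitmap is zero-initialized)
    } else {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return nulls;
}

NullBitmap BuildValidityParallel(const std::vector<int32_t>& values,
                                 size_t n_tasks) {
  assert(n_tasks > 0);
  const size_t n = values.size();

  NullBitmap out;
  out.bits.assign((n + 7) / 8, 0);

  // Chunk size rounded up to a multiple of 8 values, i.e. whole bytes.
  size_t chunk = (n + n_tasks - 1) / n_tasks;
  chunk = (chunk + 7) / 8 * 8;
  if (chunk == 0) chunk = 8;  // only happens when n == 0

  std::vector<std::future<int64_t>> tasks;
  for (size_t begin = 0; begin < n; begin += chunk) {
    const size_t end = std::min(n, begin + chunk);
    tasks.push_back(std::async(std::launch::async, FillRange, values.data(),
                               begin, end, out.bits.data()));
  }

  // Each task reports its own count; sum them once all tasks are done.
  for (auto& t : tasks) out.null_count += t.get();
  return out;
}
```

Calling `BuildValidityParallel(v, 4)` on a vector with NAs returns the bitmap and the total null count. Unlike `MakeSimpleArray()` above, this sketch always allocates the bitmap; a real version would presumably keep the cheap "no NA at all" path so that the zero-copy case stays allocation-free.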