romainfrancois commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-834401547


   What is the schema of the Fannie Mae dataset? Does it have any missing 
values? Maybe the code goes through this branch: 
   
   ```cpp
         if (arrow::r::can_reuse_memory(x, options.type)) {
           columns[j] = std::make_shared<arrow::ChunkedArray>(
               arrow::r::vec_to_arrow__reuse_memory(x));
         }
   ```
   
   which is not parallelized for now, and perhaps should be, at least when 
there are NA values to handle: 
   
   ```cpp
   // this is only used on some special cases when the arrow Array can just use
   // the memory of the R object, via an RBuffer, hence be zero copy
   template <int RTYPE, typename RVector, typename Type>
   std::shared_ptr<Array> MakeSimpleArray(SEXP x) {
     using value_type = typename arrow::TypeTraits<Type>::ArrayType::value_type;
     RVector vec(x);
     auto n = vec.size();
     auto p_vec_start = reinterpret_cast<const value_type*>(DATAPTR_RO(vec));
     auto p_vec_end = p_vec_start + n;
     std::vector<std::shared_ptr<Buffer>> buffers{nullptr,
                                                  std::make_shared<RBuffer<RVector>>(vec)};
   
     int null_count = 0;
   
     auto first_na = std::find_if(p_vec_start, p_vec_end, is_NA<value_type>);
     if (first_na < p_vec_end) {
       auto null_bitmap =
           ValueOrStop(AllocateBuffer(BitUtil::BytesForBits(n), gc_memory_pool()));
       internal::FirstTimeBitmapWriter bitmap_writer(null_bitmap->mutable_data(), 0, n);
   
       // first loop to set all the bits (mark as valid) before the first NA
       auto j = std::distance(p_vec_start, first_na);
       int i = 0;
       for (; i < j; i++, bitmap_writer.Next()) {
         bitmap_writer.Set();
       }
   
       auto p_vec = first_na;
       // then finish
       for (; i < n; i++, bitmap_writer.Next(), ++p_vec) {
         if (is_NA<value_type>(*p_vec)) {
           bitmap_writer.Clear();
           null_count++;
         } else {
           bitmap_writer.Set();
         }
       }
   
       bitmap_writer.Finish();
       buffers[0] = std::move(null_bitmap);
     }
   
     auto data = ArrayData::Make(std::make_shared<Type>(), LENGTH(x), std::move(buffers),
                                 null_count, 0 /*offset*/);
   
     // return the right Array class
     return std::make_shared<typename TypeTraits<Type>::ArrayType>(data);
   }
   ```
   
   The `find_if()` call and the body of the `if (first_na < p_vec_end) {` 
branch are where the work happens, but everything is in place for this to 
benefit from parallelization. 
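
To make the idea concrete, here is a rough standalone sketch of how the NA scan and the bitmap fill could be split across threads. It is not Arrow code. `IsNA`, `Validity` and `BuildValidityParallel` are hypothetical names, it uses plain `std::thread` rather than whatever task machinery the R package would actually use, and it assumes integer data where NA is `INT_MIN` (as R encodes `NA_integer_`). Chunks are rounded to a multiple of 8 values so that no two threads ever write to the same bitmap byte.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: NA is encoded as INT_MIN, as R does for integer vectors.
inline bool IsNA(int v) { return v == INT_MIN; }

// Hypothetical result type: validity bitmap plus null count.
struct Validity {
  std::vector<uint8_t> bitmap;  // one bit per value, LSB first, 1 = valid
  int64_t null_count = 0;
};

// Hypothetical helper: fill the validity bitmap using up to n_threads workers.
Validity BuildValidityParallel(const int* values, int64_t n, int n_threads) {
  Validity out;
  out.bitmap.assign(static_cast<size_t>((n + 7) / 8), 0);
  n_threads = std::max(1, n_threads);

  // Round the chunk size up to a multiple of 8 values so that each
  // bitmap byte is written by exactly one thread.
  int64_t chunk = (n + n_threads - 1) / n_threads;
  chunk = std::max<int64_t>(8, (chunk + 7) / 8 * 8);

  std::vector<int64_t> nulls(n_threads, 0);  // per-thread null counts
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; t++) {
    int64_t begin = std::min<int64_t>(n, t * chunk);
    int64_t end = std::min<int64_t>(n, begin + chunk);
    if (begin == end) break;  // nothing left to hand out
    workers.emplace_back([&out, &nulls, values, begin, end, t] {
      int64_t local_nulls = 0;
      for (int64_t i = begin; i < end; i++) {
        if (IsNA(values[i])) {
          local_nulls++;  // bit stays 0: null
        } else {
          out.bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
        }
      }
      nulls[t] = local_nulls;
    });
  }
  for (auto& w : workers) w.join();
  for (int64_t c : nulls) out.null_count += c;
  return out;
}
```

Calling `BuildValidityParallel(vec.data(), vec.size(), 4)` on a `std::vector<int>` gives the same bitmap as a single-threaded pass. Unlike `MakeSimpleArray` above, this sketch always allocates the bitmap. A real version would presumably keep the "no bitmap when `null_count == 0`" shortcut, for example by dropping the buffer after the threads have joined.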
   
   I'll look at this in the next few days. 



