Great, thanks. I need to read the RecordBatch docs to see if it fits my requirements.
On Mon, Aug 3, 2020 at 11:25 AM Micah Kornfield <[email protected]> wrote:
>>
>> I found that the data buffer read in C++ is
>> [nullptr, 500 numbers, nullptr, 500 numbers], and with chunk_size = 10
>> I get [nullptr, 40 numbers, nullptr, 40 numbers, ...], which confuses
>> me: why is there a useless nullptr Buffer before every buffer?
>
> The buffers in ArrayData reflect the Arrow layout. The nullptr elides
> the validity buffer where there are no null values.
>
> Regarding pre-allocation, this has been discussed before but no one has
> contributed an implementation for it. The last conversation was [1]. It
> doesn't mention memory mapping, but I think that could potentially fit
> in with the right abstractions.
>
> [1] https://www.mail-archive.com/[email protected]/msg19862.html
>
> On Sun, Aug 2, 2020 at 6:42 PM comic fans <[email protected]> wrote:
>>
>> Hello everyone, I'm trying to write out a dataframe in feather format
>> from R and read it in C++.
>>
>> My R code looks like this:
>>
>> arrow::write_feather(data.frame(a = 1:1000, b = 1000:1),
>>                      'arrow.data', chunk_size = 500,
>>                      compression = 'uncompressed')
>>
>> and my C++ code looks like this:
>>
>> // read the feather file written above
>> auto infile = arrow::io::ReadableFile::Open("arrow.data").ValueOrDie();
>> auto reader = arrow::ipc::feather::Reader::Open(infile).ValueOrDie();
>> std::shared_ptr<arrow::Table> table;
>> reader->Read(&table);
>>
>> // print each buffer of each chunk of the first column
>> auto column0 = table->column(0);
>> for (int i = 0; i < column0->num_chunks(); ++i) {
>>   auto array = column0->chunk(i);
>>   auto buffers = array->data()->buffers;
>>   for (size_t j = 0; j < buffers.size(); ++j) {
>>     if (!buffers[j]) {
>>       std::cout << j << " null" << std::endl;
>>     } else {
>>       std::cout << j << " " << buffers[j]->size() << std::endl;
>>     }
>>   }
>> }
>>
>> I found that the data buffer read in C++ is
>> [nullptr, 500 numbers, nullptr, 500 numbers], and with chunk_size = 10
>> I get [nullptr, 40 numbers, nullptr, 40 numbers, ...], which confuses
>> me: why is there a useless nullptr Buffer before every buffer?
>>
>> Another question is how to use arrow as a zero-copy TSDB. My
>> requirements:
>>
>> 1. Historic and newly written data must be in contiguous memory and
>>    cannot be chunked (so I can't put the historic read-only part and
>>    the newly writable part in different buffers).
>> 2. Historic data may be very big, so I need it memory mapped.
>> 3. I also want to use a memory map to persist newly written data
>>    (I don't have strict transaction requirements; OS-scheduled
>>    flushing is fine for me).
>> 4. How much new data will be written is known in advance, so a
>>    preallocated memory-mapped file is fine.
>> 5. All components live in the same process; no cross-process
>>    communication is needed (so Apache Plasma is not needed).
>> 6. It should be easy to exchange data with R.
>>
>> At first I thought arrow was a good fit, but after some reading of
>> the docs I realized that arrow buffers can't be modified: if I write
>> a feather file with the array size preallocated, all data becomes
>> read-only when I reload it (through the memory-mapped file
>> interface). I abused arrow by const_cast-ing the data pointer and
>> writing into it; since the file is memory mapped, the modifications
>> do change the file as I intend, but I'd like to know if there is a
>> better way to achieve my goal. Does arrow intend to support such a
>> use case, and have I missed some API? Any advice would be helpful.
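
To make the layout answer concrete: for a primitive array such as int64,
ArrayData::buffers holds the pair {validity bitmap, values}, and the
validity slot is left as nullptr when the array has no nulls, which is
exactly the [nullptr, 500 numbers] pattern above. A minimal sketch
(assuming Arrow C++ >= 1.0) that constructs such an ArrayData directly:

    #include <arrow/api.h>
    #include <iostream>
    #include <memory>

    int main() {
      // Allocate a values buffer with room for 500 int64 slots.
      std::shared_ptr<arrow::Buffer> values =
          arrow::AllocateBuffer(500 * sizeof(int64_t)).ValueOrDie();

      // Primitive layout: buffers = {validity bitmap, values}. Passing
      // nullptr for the bitmap declares the array null-free.
      auto data = arrow::ArrayData::Make(arrow::int64(), /*length=*/500,
                                         {nullptr, values},
                                         /*null_count=*/0);
      auto array = arrow::MakeArray(data);

      std::cout << "buffers[0]: "
                << (data->buffers[0] ? "validity bitmap"
                                     : "nullptr (no nulls)")
                << "\n"
                << "buffers[1]: " << data->buffers[1]->size() << " bytes\n"
                << "null_count: " << array->null_count() << "\n";
      return 0;
    }

So the leading nullptr is not wasted space; it is the reserved slot for
the validity bitmap, materialized only when a chunk actually contains
nulls.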

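On the zero-copy TSDB question at the end of the thread: one way to get
writable, memory-mapped storage without the const_cast is to own the
mapping yourself and wrap it in an arrow::MutableBuffer, which
references external memory without copying or taking ownership. The
sketch below assumes POSIX mmap and Arrow C++ >= 1.0; the file name
"values.bin" and the sizes are illustrative only, and it covers just a
raw values buffer, not a full feather file:

    #include <arrow/api.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>
    #include <iostream>
    #include <memory>

    int main() {
      const int64_t length = 1000;
      const int64_t nbytes = length * sizeof(int64_t);

      // Preallocate the file; requirement 4 says the size is known
      // up front. ("values.bin" is an illustrative name.)
      int fd = open("values.bin", O_RDWR | O_CREAT, 0644);
      if (fd < 0 || ftruncate(fd, nbytes) != 0) return 1;
      void* addr = mmap(nullptr, nbytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
      if (addr == MAP_FAILED) return 1;

      // MutableBuffer wraps externally owned memory without copying
      // it; the mapping must outlive the buffer.
      auto buf = std::make_shared<arrow::MutableBuffer>(
          static_cast<uint8_t*>(addr), nbytes);

      // No validity bitmap: every slot is non-null.
      auto values = std::make_shared<arrow::Int64Array>(length, buf);

      // Writes through the mutable pointer land in the mapped file;
      // the OS flushes dirty pages on its own schedule (requirement 3),
      // or force it with msync.
      auto* raw = reinterpret_cast<int64_t*>(buf->mutable_data());
      for (int64_t i = 0; i < length; ++i) raw[i] = i;

      std::cout << "values->Value(42) = " << values->Value(42) << "\n";

      munmap(addr, nbytes);
      close(fd);
      return 0;
    }

This confines mutation to buffers whose memory you own, while anything
loaded through the feather reader keeps Arrow's convention and should
still be treated as read-only.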