Hi All,

Over in the Apache Drill project, we developed some handy vector reader/writer 
abstractions. I wonder if they might be of interest to Apache Arrow. Key 
contributions of the "RowSet" abstractions:

* Control row batch size: the aggregate memory taken by a set of vectors (and 
all their sub-vectors for structured types).
* Control the maximum per-vector size.
* Simple, highly optimized read/write interface that handles vector offset 
accounting, even for deeply nested types.
* Minimize vector internal fragmentation (wasted space).
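
To make the offset-accounting and per-vector size points a bit more concrete, 
here is a rough sketch of the idea. It is not the actual Drill RowSet code: 
every class and method name below (RowSetSketch, VarCharColumn, RowWriter, 
writeRow) is hypothetical, and it covers only a single variable-width column 
with a fixed byte cap. The caller writes values; the writer maintains the 
offsets and refuses a row that would exceed the cap.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; these names are not Drill or Arrow APIs. */
class RowSetSketch {

    /** Toy variable-width column: one data buffer plus an offsets array. */
    static final class VarCharColumn {
        private final byte[] data;     // fixed-size buffer = per-vector cap
        private final int[] offsets;   // offsets[i]..offsets[i+1] bound value i
        private int count;             // number of values written so far

        VarCharColumn(int maxBytes, int maxRows) {
            data = new byte[maxBytes];
            offsets = new int[maxRows + 1];
        }

        /** Bytes still free in the data buffer. */
        int remaining() { return data.length - offsets[count]; }

        /** True when no more offset slots are available. */
        boolean isFull() { return count == offsets.length - 1; }

        int rowCount() { return count; }

        int bytesUsed() { return offsets[count]; }

        /** Appends one value and records its end offset. */
        void append(byte[] value) {
            int start = offsets[count];
            System.arraycopy(value, 0, data, start, value.length);
            offsets[++count] = start + value.length;
        }

        /** Reads value `row` back using the stored offsets. */
        String get(int row) {
            int start = offsets[row];
            int length = offsets[row + 1] - start;
            return new String(data, start, length, StandardCharsets.UTF_8);
        }
    }

    /** Writer that checks the size cap before touching the column. */
    static final class RowWriter {
        private final VarCharColumn column;

        RowWriter(VarCharColumn column) { this.column = column; }

        /** Returns false, writing nothing, if the value would not fit. */
        boolean writeRow(String value) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            if (column.isFull() || bytes.length > column.remaining()) {
                return false;
            }
            column.append(bytes);
            return true;
        }
    }

    public static void main(String[] args) {
        VarCharColumn column = new VarCharColumn(16, 8);
        RowWriter writer = new RowWriter(column);
        for (String s : new String[] {"drill", "arrow", "rowset", "overflow"}) {
            if (!writer.writeRow(s)) {
                // The caller would close this batch and start a new one here.
                System.out.println("vector full after " + column.rowCount() + " rows");
                break;
            }
        }
        for (int i = 0; i < column.rowCount(); i++) {
            System.out.println(i + ": " + column.get(i));
        }
    }
}
```

The sketch leaves out the aggregate batch-size limit across several vectors 
and nested types. It is meant only to show the caller-facing shape: the 
caller never reads or writes an offset directly.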

More information is available in [1]. Arrow improved and simplified Drill's 
original vector and metadata abstractions. As a result, work would be required 
to port the RowSet code from Drill's version of these classes to the Arrow 
versions.

Does Arrow already have a similar solution? If not, would the above be useful 
for Arrow?

Thanks,
- Paul


Apache Drill PMC member
Co-author of the upcoming O'Reilly book "Learning Apache Drill"
[1] https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow

