Hi Paul,

Thank you for the email. I think this is interesting.

The Arrow Java API currently has no built-in way to automatically limit
the memory size of record batches. In Spark we have a similar need to
limit record batch sizes, and we have talked about implementing some kind
of size estimator for record batches, but we haven't started work on it.
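
For concreteness, here is a toy sketch (plain Java, deliberately not the
Arrow API) of the kind of accounting such a size estimator might do for
fixed-width columns: per-column value buffer plus a one-bit-per-row
validity bitmap, with power-of-two rounding to mirror typical allocator
behavior. The class and method names are hypothetical, and real Arrow
vectors (variable-width, nested) would need offset and child buffers on
top of this.

```java
// Hypothetical sketch: estimate the memory a record batch of fixed-width
// columns would occupy, and derive a row limit that fits a memory budget.
public class BatchSizeEstimator {

    // Round up to the next power of two, mirroring common allocator behavior.
    static long roundToPowerOfTwo(long n) {
        long p = 1;
        while (p < n) {
            p <<= 1;
        }
        return p;
    }

    // Bytes for one column: value buffer plus a 1-bit-per-row validity bitmap.
    static long columnBytes(int rowCount, int valueWidthBytes) {
        long values = roundToPowerOfTwo((long) rowCount * valueWidthBytes);
        long validity = roundToPowerOfTwo((rowCount + 7) / 8);
        return values + validity;
    }

    // Total estimated bytes for a batch of rowCount rows over the given
    // fixed-width columns.
    static long batchBytes(int rowCount, int[] columnWidths) {
        long total = 0;
        for (int w : columnWidths) {
            total += columnBytes(rowCount, w);
        }
        return total;
    }

    // Largest power-of-two row count whose estimated batch size still fits
    // within the budget.
    static int maxRowsForBudget(long budgetBytes, int[] columnWidths) {
        int rows = 1;
        while (batchBytes(rows * 2, columnWidths) <= budgetBytes) {
            rows *= 2;
        }
        return rows;
    }
}
```

A writer could consult such an estimator before appending each row and
start a new batch once the budget is reached, which is roughly the
behavior being discussed here.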

I personally think it makes sense for Arrow to incorporate such
capabilities.



On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> Over in the Apache Drill project, we developed some handy vector
> reader/writer abstractions. I wonder if they might be of interest to Apache
> Arrow. Key contributions of the "RowSet" abstractions:
>
> * Control row batch size: the aggregate memory taken by a set of vectors
> (and all their sub-vectors, for structured types).
> * Control the maximum per-vector size.
> * Simple, highly optimized read/write interface that handles vector offset
> accounting, even for deeply nested types.
> * Minimize vector internal fragmentation (wasted space).
>
> More information is available in [1]. Arrow improved and simplified
> Drill's original vector and metadata abstractions. As a result, work would
> be required to port the RowSet code from Drill's version of these classes
> to the Arrow versions.
>
> Does Arrow already have a similar solution? If not, would the above be
> useful for Arrow?
>
> Thanks,
> - Paul
>
>
> Apache Drill PMC member
> Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> [1]
> https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
>
>
>