> Ideally, we should be able to presize the array to a good enough
> estimate.
You should be able to get a correct estimate, because the Parquet
column metadata contains the uncompressed size. But is there anything
wrong with the idea of mmapping huge "runways" for our larger
allocations?


On Thu, Jun 4, 2020 at 17:58, Antoine Pitrou <solip...@pitrou.net> wrote:

> On Thu, 4 Jun 2020 17:48:16 +0200
> Rémi Dettai <rdet...@gmail.com> wrote:
> > When creating large arrays, Arrow uses realloc quite intensively.
> >
> > I have an example where I read a gzipped Parquet column (strings) that
> > expands from 8 MB to over 100 MB when loaded into Arrow. Of course
> > jemalloc cannot anticipate this, and every reallocation above 1 MB (the
> > most critical ones) ends up being a copy.
>
> Ideally, we should be able to presize the array to a good enough
> estimate. I don't know if Parquet gives us enough information for that,
> though.
>
> Regards
>
> Antoine.
>
