> Ideally, we should be able to presize the array to a good enough
> estimate.

A good estimate should be achievable, because the Parquet column chunk
metadata contains the uncompressed size. But is there anything wrong
with the idea of mmapping huge "runways" for our larger allocations?
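For illustration, a minimal C++ sketch of the presizing idea (not the
actual Arrow read path; the function and the column index are
placeholders), assuming the parquet-cpp metadata API and an
arrow::StringBuilder:

    #include <memory>
    #include <string>
    #include <arrow/builder.h>
    #include <parquet/file_reader.h>
    #include <parquet/metadata.h>

    arrow::Status PresizeFromMetadata(const std::string& path, int column) {
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile(path);
      std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();

      int64_t total_values = 0;
      int64_t total_bytes = 0;
      for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
        std::unique_ptr<parquet::ColumnChunkMetaData> chunk =
            metadata->RowGroup(rg)->ColumnChunk(column);
        total_values += chunk->num_values();
        // Size of the encoded pages before compression: an estimate of
        // the value-buffer size, not an exact figure (page headers,
        // encodings and dictionaries all skew it).
        total_bytes += chunk->total_uncompressed_size();
      }

      arrow::StringBuilder builder;
      // Reserve offsets and value bytes up front so that appending
      // never triggers a realloc-and-copy.
      ARROW_RETURN_NOT_OK(builder.Resize(total_values));
      ARROW_RETURN_NOT_OK(builder.ReserveData(total_bytes));
      // ... decode the column and append values as usual ...
      return arrow::Status::OK();
    }

And a minimal POSIX sketch of the "runway" idea (names are placeholders;
assumes Linux mmap/mprotect semantics, alignment and error paths
simplified): reserve a large virtual address range with PROT_NONE, then
commit pages in place as the buffer grows, so the data never moves:

    #include <cstddef>
    #include <cstdint>
    #include <sys/mman.h>

    struct Runway {
      uint8_t* base = nullptr;
      size_t reserved = 0;   // address space reserved, not yet usable
      size_t committed = 0;  // bytes made readable/writable so far
    };

    // Reserve `reserved` bytes of virtual address space; no physical
    // memory is committed because the pages are PROT_NONE.
    bool RunwayInit(Runway* r, size_t reserved) {
      void* p = mmap(nullptr, reserved, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) return false;
      r->base = static_cast<uint8_t*>(p);
      r->reserved = reserved;
      return true;
    }

    // Grow the usable region in place: unlike realloc, the base address
    // never changes, so there is never a copy.
    bool RunwayGrow(Runway* r, size_t new_size) {
      if (new_size > r->reserved) return false;  // runway exhausted
      if (new_size <= r->committed) return true;
      if (mprotect(r->base, new_size, PROT_READ | PROT_WRITE) != 0)
        return false;
      r->committed = new_size;
      return true;
    }

The cost is mostly address-space consumption, which is cheap on 64-bit:
physical pages are only allocated when the committed range is actually
touched, though the reservation size still has to be guessed up front.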
On Thu, Jun 4, 2020 at 17:58, Antoine Pitrou <solip...@pitrou.net> wrote:

> On Thu, 4 Jun 2020 17:48:16 +0200
> Rémi Dettai <rdet...@gmail.com> wrote:
> > When creating large arrays, Arrow uses realloc quite intensively.
> >
> > I have an example where I read a gzipped Parquet column (strings) that
> > expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> > cannot anticipate this, and every reallocate call above 1MB (the most
> > critical ones) ends up being a copy.
>
> Ideally, we should be able to presize the array to a good enough
> estimate. I don't know if Parquet gives us enough information for that,
> though.
>
> Regards
>
> Antoine.