On 5 June 2020 at 16:25, Uwe L. Korn wrote:
>
> On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote:
>> Hi Antoine!
>>> I would indeed have expected jemalloc to do that (remap the pages)
>> I have no idea about the performance gain this would provide (if any).
>> Could be interesting to explore.
>
> This would actually be the most interesting thing. In general, getting access
> to the pages mapped into RAM would improve a lot more situations, not
> just reallocation. For example, when you take a small slice of a large array
> and only pass this on, but don't keep an explicit reference to the array, you
> will still indirectly hold on to the larger memory block. Having an allocator
> that understands the mapping between pages and memory blocks would allow us
> to free the pages that are not part of the view.
>
> Also, yes: For CSV and JSON, we don't have size estimates beforehand. There
> this would be a great performance improvement.
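
To make the slicing point in the quoted message concrete, here is a minimal
sketch in plain Python. It does not use Arrow or jemalloc: a `bytearray` is
only a stand-in for a large buffer and a `memoryview` for a zero-copy slice,
and the sizes are arbitrary assumptions.

```python
# Stand-in for a large buffer (64 MiB); not an Arrow buffer.
big = bytearray(64 * 1024 * 1024)

# Zero-copy view over 16 bytes in the middle of the buffer.
view = memoryview(big)[1024:1040]

# Drop the explicit reference to the large buffer.
del big

# The view still references the original object, so the whole
# allocation stays alive even though only 16 bytes are visible.
visible = len(view)
pinned = len(view.obj)
print(f"visible bytes: {visible}, bytes kept alive: {pinned}")

# Without allocator support for freeing unused pages, copying the
# slice out and releasing the view is how the rest gets freed.
small = bytes(view)
view.release()
```

An allocator that knew which pages back the view could in principle return
the other pages without this copy.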

For CSV we actually know the size after parsing:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/converter.cc#L177-L178

It would be a shame if this were possible in CSV but not in Parquet, a
storage format dedicated to big columnar data.

Regards

Antoine.