On Sat, Jul 11, 2020 at 4:10 AM Rémi Dettai <rdet...@gmail.com> wrote:
>
> Hi Micah,
>
> Thanks for the answer! But it seems your email got split in half in some way ;-)
>
> My use case mainly focuses on aggregations (with group by), and after fighting quite a bit with the allocators I ended up thinking that it might not be worth materializing the raw data as Arrow tables in that case. I don't know exactly how I can objectively compare the two approaches, and I have trouble getting a good feeling on whether the optimisations achievable on kernels (vectorization...) might or might not compensate for the cost of these allocations. I know a query engine is in the pipes on your side as well; has this kind of short-circuiting of materialization been discussed?
In general, if it's possible to avoid materializing data that isn't used, we would try to do that.

> If I go with direct aggregations without materializing the un-aggregated data as an Arrow table, I still consider using Arrow as an exchange format, but more for the standard than for its performance (I expect the aggregation factor to be at least 1000x, because the whole idea is to force the heavy lifting into these aggregation workers).
>
> PS: I guess nobody answered me on my remarks about the Parquet deserialization because it is quite hard to get into the code to verify what's happening. It's true that Parquet decoding can be messy because different types need different code paths, but on top of that, in this implementation there are different paths between the deserialization to Arrow and the legacy one that are mixed together, which doubles the complexity.

I haven't had a chance to look carefully at this thread. After the 1.0.0 release is out and I have a chance to catch my breath, if you ask the questions again I may be able to take a look.

> Remi
>
> On Fri, Jul 10, 2020 at 08:12, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Sorry for the delay. Clearing through my inbox backlog ...
> >
> > We should double check the code, but one thing that has bitten me in the past with variable-width data is that the binary array builder ReserveData call [1] does not act the same way Reserve works. The former only grows the buffer by the exact size requested. The latter will grow buffers by a power of two.
> >
> > Also, to answer one question:
> >
> > > I'm not sure what the true contract of the memory pool and the associated allocator is. Does this interface guarantee that the memory allocated will not be zeroed out or touched in some similar fashion that would trigger physical memory mapping
> >
> > I don't think we've defined anything specifically by way of this contract. I don't know what the underlying allocators do, but I do not think the MemoryPool code path will actually touch any bytes by itself. So one possibility, at least as a prototype, is that you could potentially provide an intercepting memory pool that handles larger
> >
> > [1] https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
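For illustration, a minimal sketch of the Reserve / ReserveData distinction mentioned above, assuming the arrow::BinaryBuilder API from builder_binary.h; the helper name and input vector are made up for the example:

#include <arrow/api.h>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Pre-size a BinaryBuilder so the appends below do not trigger repeated
// buffer resizes. Reserve() sizes the offsets/validity buffers (and grows by
// powers of two when it has to resize); ReserveData() sizes the value buffer
// and, per the remark above, only grows it by the exact amount requested.
arrow::Status BuildBinaryColumn(const std::vector<std::string>& values,
                                std::shared_ptr<arrow::Array>* out) {
  arrow::BinaryBuilder builder;

  int64_t total_bytes = 0;
  for (const auto& v : values) total_bytes += static_cast<int64_t>(v.size());

  ARROW_RETURN_NOT_OK(builder.Reserve(static_cast<int64_t>(values.size())));
  ARROW_RETURN_NOT_OK(builder.ReserveData(total_bytes));

  for (const auto& v : values) {
    ARROW_RETURN_NOT_OK(builder.Append(v.data(), static_cast<int32_t>(v.size())));
  }
  return builder.Finish(out);
}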
> > On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai <rdet...@gmail.com> wrote:
> > >
> > > Hi Antoine and all!
> > >
> > > Sorry for the delay, I wanted to understand things a bit better before getting back to you. As discussed, I focused on the Parquet case. I've looked into parquet/encoding.cc to see what could be done to have a better memory reservation with ByteArrays.
> > >
> > > On my journey, I noticed a few things:
> > > - The Parquet dictionary is first deserialized into an array of views, then converted to an Arrow-style variable/fixed size binary array (here <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652> / here <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>). This latter conversion is done even if it is not used afterwards; in fact it seems to be only used here <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845> (never used for fixed size binary, and not used either for variable size binary if the target Arrow array is not dict encoded). So if you confirm my "discoveries", we could spare ourselves some dictionary conversions by doing them lazily when *InsertDictionary* is called.
> > > - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030> and DictByteArrayDecoderImpl::DecodeArrow <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974> have been short-circuited inside ByteArrayDictionaryRecordReader::ReadValuesDense <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532> and ByteArrayDictionaryRecordReader::ReadValuesSpaced <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548> (dictionary-encoded data pages are always RLE in Parquet). This would mean that it is dead code now, isn't it? Well, the DecodeArrow methods are kind of public, but it is very confusing as they are not used by Arrow itself ;-)
> > >
> > > Finally, regarding your idea (@antoine) of using a simple heuristic to pre-allocate the array, I'm not totally comfortable (maybe wrongly) with doing the over-allocation at that level, because I'm not sure what the true contract of the memory pool and the associated allocator is. Does this interface guarantee that the memory allocated will not be zeroed out or touched in some similar fashion that would trigger physical memory mapping? With my suggestion of using mmap to do huge allocations, I am positioning myself at a level where I know for sure that the underlying physical memory is not mapped, because we don't touch the memory. But as you noticed, I have less knowledge about the context, so it's very hard to decide when to trigger the "runway" pre-allocation or not.
> > >
> > > Hope this all makes sense. Took me a while to understand how the decoding works ;-)
> > >
> > > Remi
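For illustration, a minimal sketch of the mmap-based reservation idea described in the message above, assuming Linux/POSIX; the function name is hypothetical:

#include <sys/mman.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Reserve a large range of virtual address space. With anonymous mmap the
// kernel only backs pages with physical memory when they are first touched,
// so over-reserving is cheap as long as the unused tail is never written.
// (MAP_ANONYMOUS is the Linux spelling; other platforms may differ.)
void* ReserveVirtualRange(std::size_t nbytes) {
  void* addr = mmap(/*addr=*/nullptr, nbytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, /*fd=*/-1, /*offset=*/0);
  if (addr == MAP_FAILED) {
    throw std::runtime_error("mmap failed: errno=" + std::to_string(errno));
  }
  return addr;
}

// Once the final size is known, the unused tail can be released with munmap().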
> > > On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > On 05/06/2020 at 17:09, Rémi Dettai wrote:
> > > > > I looked into the details of why the decoder could not estimate the target Arrow array size for my Parquet column. It's because I am decoding from Parquet-Dictionary to Arrow-Plain (which is the default when loading Parquet). In this case the size prediction is impossible :-(
> > > >
> > > > But we can probably make up a heuristic. For example:
> > > >
> > > >   avg(dictionary value size) * number of non-null values
> > > >
> > > > It would avoid a number of resizes, even though there may still be a couple of them at the end. It may oversize in some cases, but much less than your proposed strategy of reserving a huge chunk of virtual memory :-)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
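For concreteness, a minimal sketch of the heuristic described in that last message; the function and parameter names are made up for the example:

#include <cstdint>

// Estimate the value-buffer size needed when decoding a dictionary-encoded
// Parquet column to a plain Arrow binary array: average dictionary value
// size multiplied by the number of non-null values to decode. It may over-
// or under-shoot, so a final resize (or a couple) can still be needed.
int64_t EstimateValueBufferSize(int64_t dictionary_total_bytes,
                                int64_t dictionary_num_values,
                                int64_t num_non_null_values) {
  if (dictionary_num_values == 0) return 0;
  const double avg_value_size =
      static_cast<double>(dictionary_total_bytes) / dictionary_num_values;
  return static_cast<int64_t>(avg_value_size * num_non_null_values);
}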