On Sat, Jul 11, 2020 at 4:10 AM Rémi Dettai <rdet...@gmail.com> wrote:
>
> Hi Micah,
>
> Thanks for the answer! But it seems your email got split in half in some
> way ;-)
>
> My use case mainly focuses on aggregations (with group by), and after
> fighting quite a bit with the allocators I ended up thinking that it might
> not be worth materializing the raw data as Arrow tables in that case. I
> don't know exactly how to objectively compare the two approaches, and I
> have trouble getting a good feel for whether the optimisations achievable
> in kernels (vectorization...) might or might not compensate for the cost
> of these allocations. I know a query engine is in the pipeline on your
> side as well; has this kind of short-circuiting of materialization been
> discussed?

In general, if it's possible to avoid materializing data that isn't
used, we would try to do that.

> If I go with direct aggregations without materializing the un-aggregated
> data as an Arrow table, I still consider using Arrow as an exchange
> format, but more for the standard than for its performance (I expect the
> aggregation factor to be at least 1000x, because the whole idea is to
> force the heavy lifting into these aggregation workers).
>
> PS: I guess nobody answered my remarks about the Parquet deserialization
> because it is quite hard to get into the code and verify what's
> happening. It's true that Parquet decoding can be messy because different
> types need different code paths, but on top of that, this implementation
> mixes together separate paths for the deserialization to Arrow and the
> legacy one, which doubles the complexity.

I haven't had a chance to look carefully at this thread. After the
1.0.0 release is out and I have a chance to catch my breath, if you ask
the questions again I may be able to take a look.

> Remi
>
> On Fri, Jul 10, 2020 at 8:12 AM, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Sorry for the delay.  Clearing through my inbox backlog ...
> >
> > We should double-check the code, but one thing that has bitten me in the
> > past with variable-width data is that the binary array builder's
> > ReserveData call [1] does not behave the same way Reserve does. The
> > former only grows the buffer by the exact size requested; the latter
> > grows buffers by powers of two.
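> >
> > To illustrate (untested sketch; the AppendAll helper and its input are
> > invented, but Reserve/ReserveData are the actual builder calls):
> >
> >   #include <string>
> >   #include <vector>
> >
> >   #include "arrow/array/builder_binary.h"
> >   #include "arrow/status.h"
> >
> >   // Reserving once up front avoids the repeated exact-size regrowth
> >   // you would get by calling ReserveData per value.
> >   arrow::Status AppendAll(const std::vector<std::string>& values,
> >                           arrow::BinaryBuilder* builder) {
> >     // Reserve() takes a slot count and grows by powers of two.
> >     ARROW_RETURN_NOT_OK(builder->Reserve(values.size()));
> >     // ReserveData() takes a byte count and grows by exactly that
> >     // amount, so compute the total and reserve it in one call:
> >     int64_t total_bytes = 0;
> >     for (const auto& s : values) total_bytes += s.size();
> >     ARROW_RETURN_NOT_OK(builder->ReserveData(total_bytes));
> >     for (const auto& s : values) {
> >       builder->UnsafeAppend(s);  // no further allocation needed
> >     }
> >     return arrow::Status::OK();
> >   }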
> >
> > Also, to answer one question:
> >
> >
> > > I'm not sure what the true contract of the memory pool and the
> > > associated allocator is. Does this interface guarantee that the memory
> > > allocated will not be zeroed out or touched in some similar fashion
> > > that would trigger physical memory mapping?
> >
> > I don't think we've defined anything specific by way of this contract.
> > I don't know what the underlying allocators do, but I do not think the
> > MemoryPool code path will actually touch any bytes by itself. So one
> > possibility, at least as a prototype, is that you could provide an
> > intercepting memory pool that handles larger allocations differently.
> >
> > [1]
> >
> > https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
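> >
> > A very rough sketch of what such an intercepting pool could look like
> > (untested; the padding policy and threshold are invented, and the
> > virtual method signatures should be double-checked against the
> > MemoryPool version you build against):
> >
> >   #include <string>
> >
> >   #include "arrow/memory_pool.h"
> >   #include "arrow/status.h"
> >
> >   // Delegates to an existing pool but rounds large requests up, so
> >   // later Resize() calls tend to stay inside the first reservation.
> >   class InterceptingPool : public arrow::MemoryPool {
> >    public:
> >     explicit InterceptingPool(arrow::MemoryPool* wrapped)
> >         : wrapped_(wrapped) {}
> >
> >     arrow::Status Allocate(int64_t size, uint8_t** out) override {
> >       return wrapped_->Allocate(Pad(size), out);
> >     }
> >     arrow::Status Reallocate(int64_t old_size, int64_t new_size,
> >                              uint8_t** ptr) override {
> >       return wrapped_->Reallocate(Pad(old_size), Pad(new_size), ptr);
> >     }
> >     void Free(uint8_t* buffer, int64_t size) override {
> >       wrapped_->Free(buffer, Pad(size));
> >     }
> >     int64_t bytes_allocated() const override {
> >       return wrapped_->bytes_allocated();
> >     }
> >     std::string backend_name() const override {
> >       return wrapped_->backend_name();
> >     }
> >
> >    private:
> >     // Arbitrary policy: round requests above 1 MiB up to the next
> >     // power of two. Must be deterministic so Free/Reallocate see the
> >     // same padded sizes that Allocate produced.
> >     static int64_t Pad(int64_t size) {
> >       if (size <= (int64_t(1) << 20)) return size;
> >       int64_t padded = int64_t(1) << 20;
> >       while (padded < size) padded <<= 1;
> >       return padded;
> >     }
> >
> >     arrow::MemoryPool* wrapped_;
> >   };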
> >
> > On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai <rdet...@gmail.com> wrote:
> >
> > > Hi Antoine and all !
> > >
> > > Sorry for the delay, I wanted to understand things a bit better before
> > > getting back to you. As discussed, I focussed on the Parquet case. I've
> > > looked into parquet/encoding.cc to see what could be done to get better
> > > memory reservation with ByteArrays.
> > >
> > > On my journey, I noticed a few things:
> > > - The parquet dictionary is first deserialized into an array of views,
> > > then converted to an Arrow-style variable/fixed size binary array (here:
> > > https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652
> > > and here:
> > > https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670
> > > ). This latter conversion is done even when it is not used afterwards;
> > > in fact it seems to be used only here:
> > > https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845
> > > (it is never used for fixed size binary, and not used for variable size
> > > binary either if the target Arrow array is not dict-encoded). So if you
> > > confirm my "discoveries", we could spare ourselves some dictionary
> > > conversions by doing them lazily when *InsertDictionary* is called (see
> > > the sketch after these two points).
> > > - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
> > > (https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030)
> > > and DictByteArrayDecoderImpl::DecodeArrow
> > > (https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974)
> > > have been short-circuited inside
> > > ByteArrayDictionaryRecordReader::ReadValuesDense
> > > (https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532)
> > > and ByteArrayDictionaryRecordReader::ReadValuesSpaced
> > > (https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548)
> > > (dictionary-encoded data pages are always RLE in Parquet). That would
> > > mean they are dead code now, wouldn't it? Well, the DecodeArrow methods
> > > are kind of public, but it is very confusing that they are not used by
> > > Arrow itself ;-)
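> > >
> > > To make the lazy-conversion idea concrete, a very rough sketch
> > > (invented names, untested; parquet::ByteArray is the raw view type):
> > >
> > >   #include <memory>
> > >   #include <utility>
> > >   #include <vector>
> > >
> > >   #include "arrow/array/builder_binary.h"
> > >   #include "arrow/result.h"
> > >   #include "parquet/types.h"
> > >
> > >   // Keeps the raw dictionary views and builds the Arrow binary
> > >   // array only the first time it is actually requested (e.g. when
> > >   // InsertDictionary needs it).
> > >   class LazyDictionary {
> > >    public:
> > >     explicit LazyDictionary(std::vector<parquet::ByteArray> views)
> > >         : views_(std::move(views)) {}
> > >
> > >     arrow::Result<std::shared_ptr<arrow::Array>> Get() {
> > >       if (converted_ == nullptr) {
> > >         arrow::BinaryBuilder builder;
> > >         ARROW_RETURN_NOT_OK(builder.Reserve(views_.size()));
> > >         for (const auto& v : views_) {
> > >           ARROW_RETURN_NOT_OK(builder.Append(v.ptr, v.len));
> > >         }
> > >         ARROW_RETURN_NOT_OK(builder.Finish(&converted_));
> > >       }
> > >       return converted_;
> > >     }
> > >
> > >    private:
> > >     std::vector<parquet::ByteArray> views_;
> > >     std::shared_ptr<arrow::Array> converted_;
> > >   };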
> > >
> > > Finally, regarding your idea (@antoine) of using a simple heuristic to
> > > pre-allocate the array: I'm not totally comfortable (maybe wrongly)
> > > with doing the over-allocation at that level, because I'm not sure what
> > > the true contract of the memory pool and the associated allocator is.
> > > Does this interface guarantee that the memory allocated will not be
> > > zeroed out or touched in some similar fashion that would trigger
> > > physical memory mapping? With my suggestion of using mmap to do huge
> > > allocations, I am positioning myself at a level where I know for sure
> > > that the underlying physical memory is not mapped, because we don't
> > > touch the memory. But as you noticed, I have less knowledge about the
> > > context there, so it's very hard to decide whether or not to trigger
> > > the "runway" pre-allocation.
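> > >
> > > For reference, the kind of reservation I have in mind (POSIX sketch;
> > > the size is arbitrary):
> > >
> > >   #include <cstddef>
> > >   #include <sys/mman.h>
> > >
> > >   // Reserve 1 GiB of address space. Physical pages are only
> > >   // committed when the memory is first written, so an untouched
> > >   // tail of the reservation costs no RAM.
> > >   constexpr size_t kReserve = size_t(1) << 30;
> > >   void* addr = mmap(nullptr, kReserve, PROT_READ | PROT_WRITE,
> > >                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >   // addr == MAP_FAILED on failure; munmap(addr, kReserve) releases it.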
> > >
> > > Hope this all makes sense. Took me a while to understand how the decoding
> > > works ;-)
> > >
> > > Remi
> > >
> > >
> > >
> > > On Fri, Jun 5, 2020 at 5:20 PM, Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > >
> > > > On 05/06/2020 at 17:09, Rémi Dettai wrote:
> > > > > I looked into the details of why the decoder could not estimate
> > > > > the target Arrow array size for my Parquet column. It's because I
> > > > > am decoding from Parquet-Dictionary to Arrow-Plain (which is the
> > > > > default when loading Parquet). In this case the size prediction is
> > > > > impossible :-(
> > > >
> > > > But we can probably make up a heuristic.  For example:
> > > >    avg(dictionary value size) * number of non-null values
> > > >
> > > > It would avoid a number of resizes, even though there may still be a
> > > > couple of them at the end. It may oversize in some cases, but much
> > > > less than your proposed strategy of reserving a huge chunk of virtual
> > > > memory :-)
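> > > >
> > > > In code, the estimate would be something like (sketch; the helper
> > > > and its inputs are invented):
> > > >
> > > >   #include <cstdint>
> > > >
> > > >   // Estimate the decoded byte size from the dictionary page alone:
> > > >   // average dictionary value size times the non-null value count.
> > > >   int64_t EstimateDecodedBytes(int64_t dict_value_bytes,
> > > >                                int64_t dict_num_values,
> > > >                                int64_t num_non_null) {
> > > >     if (dict_num_values == 0) return 0;
> > > >     const double avg = static_cast<double>(dict_value_bytes) /
> > > >                        static_cast<double>(dict_num_values);
> > > >     return static_cast<int64_t>(avg * static_cast<double>(num_non_null));
> > > >   }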
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > >
> >
