Hi Micah,

Thanks for the answer! But it seems your email got split in half somehow ;-)

My use case mainly focuses on aggregations (with group by), and after
fighting quite a bit with the allocators I ended up thinking that it might
not be worth materializing the raw data as Arrow tables in that case. I
don't know exactly how to compare the two approaches objectively, and I
have trouble getting a good sense of whether the optimisations achievable
in kernels (vectorization...) would compensate for the cost of these
allocations. I know a query engine is in the pipeline on your side as
well; has this kind of short-circuiting of materialization been discussed?

If I go with direct aggregations, without materializing the un-aggregated
data as an Arrow table, I would still consider using Arrow as an exchange
format, but more for the standardization than for the performance (I expect
the aggregation factor to be at least 1000x, because the whole idea is to
force the heavy lifting into these aggregation workers).

PS: I guess nobody answered my remarks about the Parquet deserialization
because it is quite hard to get into the code and verify what's happening.
It's true that Parquet decoding can be messy because different types need
different code paths, but on top of that this implementation mixes together
two different paths, the deserialization to Arrow and the legacy one, which
doubles the complexity.

Remi

On Fri, Jul 10, 2020 at 08:12, Micah Kornfield <emkornfi...@gmail.com> wrote:

> Sorry for the delay.  Clearing through my inbox backlog ...
>
> We should double-check the code, but one thing that has bitten me in the
> past with variable-width data is that the binary array builder's
> ReserveData call [1] does not act the same way Reserve does.  The former
> only grows the buffer by the exact size requested.  The latter will grow
> buffers by a power of two.
>
> Also to answer one question.
>
>
> > I'm not sure what the true
> > contract of the memory pool and the associated allocator is. Does this
> > interface guarantee that the memory allocated will not be zeroed out or
> > touched in some similar fashion that would trigger physical memory
> mapping
>
> I don't think we've defined anything specifically by way of this contract.
> I don't know what the underlying allocators do, but I do not think the
> MemoryPool code path will actually touch any bytes by itself.  So one
> possibility at least as a prototype is you could potentially provide an
> intercepting memory pool that handles larger
>
> [1]
>
> https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
>
> On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai <rdet...@gmail.com> wrote:
>
> > Hi Antoine and all !
> >
> > Sorry for the delay, I wanted to understand things a bit better before
> > getting back to you.  As discussed, I focussed on the Parquet case. I've
> > looked into parquet/encoding.cc to see what could be done to have a
> > better memory reservation with ByteArrays.
> >
> > On my journey, I noticed a few things:
> > - The parquet dictionary is first deserialized into an array of views,
> > then converted to an Arrow-style variable/fixed size binary array (here
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
> > / here
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
> > This latter conversion is done even if it is not used afterwards; in
> > fact, it seems to be only used here
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
> > (never used for fixed size binary, and not used either for variable size
> > binary if the target Arrow array is not dict encoded). So if you confirm
> > my "discoveries", we could spare ourselves some dictionary conversions
> > by doing them lazily when *InsertDictionary* is called.
> > - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
> > and DictByteArrayDecoderImpl::DecodeArrow
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
> > have been short-circuited inside
> > ByteArrayDictionaryRecordReader::ReadValuesDense
> > <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532>
> > and ByteArrayDictionaryRecordReader::ReadValuesSpaced
> > <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548>
> > (dictionary encoded data pages are always RLE in Parquet). This would
> > mean that it is dead code now, wouldn't it? Well, the DecodeArrow
> > methods are kind of public, but it is very confusing that they are not
> > used by Arrow itself ;-)
> >
> > Finally, regarding your idea (@antoine) of using a simple heuristic to
> > pre-allocate the array, I'm not totally comfortable (maybe wrongly) with
> > doing the over-allocation at that level, because I'm not sure what the
> > true contract of the memory pool and the associated allocator is. Does
> > this interface guarantee that the memory allocated will not be zeroed
> > out or touched in some similar fashion that would trigger physical
> > memory mapping? With my suggestion of using mmap to do huge allocations,
> > I am positioning myself at a level where I know for sure that the
> > underlying physical memory is not mapped, because we don't touch the
> > memory. But as you noticed, I have less knowledge about the context at
> > that level, so it's very hard to decide when to trigger the "runway"
> > pre-allocation or not.
> >
> > Hope this all makes sense. Took me a while to understand how the decoding
> > works ;-)
> >
> > Remi
> >
> >
> >
> > On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > On 05/06/2020 17:09, Rémi Dettai wrote:
> > > > I looked into the details of why the decoder could not estimate the
> > > > target Arrow array size for my Parquet column. It's because I am
> > > > decoding from Parquet-Dictionary to Arrow-Plain (which is the
> > > > default when loading Parquet). In this case the size prediction is
> > > > impossible :-(
> > >
> > > But we can probably make up a heuristic.  For example:
> > >    avg(dictionary value size) * number of non-null values
> > >
> > > It would avoid a number of resizes, even though there may still be a
> > > couple of them at the end.  It may oversize in some cases, but much
> > > less than your proposed strategy of reserving a huge chunk of virtual
> > > memory :-)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >
>
