Re: [DISCUSS] [C++] custom allocator for large objects

2020-07-11 Thread Wes McKinney
On Sat, Jul 11, 2020 at 4:10 AM Rémi Dettai  wrote:
>
> Hi Micah,
>
> Thanks for the answer! But it seems your email got split in half in some
> way ;-)
>
> My use case mainly focuses on aggregations (with group by), and after
> fighting quite a bit with the allocators I ended up thinking that it might
> not be worth materializing the raw data as arrow tables in that case. I
> don't know exactly how I can objectively compare the two approaches, and I
> have trouble getting a good feeling on whether the optimisations achievable
> on kernels (vectorization...) might or might not compensate for the cost of
> these allocations. I know a query engine is in the works on your side as
> well; has this kind of short-circuiting of materialization been discussed?

In general, if it's possible to avoid materializing data that isn't
used, we would try to do that.

> If I go with direct aggregations without materializing the un-aggregated
> data as an arrow table, I still consider using Arrow as an exchange format,
> but more for the standard than for its performance (I expect the
> aggregation factor to be at least 1000x, because the whole idea is to force
> the heavy lifting into these aggregation workers).
>
> PS: I guess nobody answered my remarks about the Parquet deserialization
> because it is quite hard to get into the code to verify what's happening.
> It's true that Parquet decoding can be messy because different types need
> different code paths, but on top of that, this implementation mixes the
> paths for deserialization to Arrow with the legacy ones, which doubles the
> complexity.

I haven't had a chance to look carefully at this thread. After the
1.0.0 release is out and I have a chance to catch my breath, if you ask
the questions again I may be able to take a look.

> Remi
>
> On Fri, Jul 10, 2020 at 08:12, Micah Kornfield wrote:
>
> > Sorry for the delay.  Clearing through my inbox backlog ...
> >
> > We should double check the code, but one thing that has bitten me in the
> > past with variable-width data is that the binary array builder's
> > ReserveData call [1] does not behave the same way Reserve does. The
> > former only grows the buffer by the exact size requested; the latter
> > grows buffers by a power of two.
> >
> > Also to answer one question.
> >
> >
> > > I'm not sure what the true
> > > contract of the memory pool and the associated allocator is. Does this
> > > interface guarantee that the memory allocated will not be zeroed out or
> > > touched in some similar fashion that would trigger physical memory
> > > mapping
> >
> > I don't think we've defined anything specifically by way of this contract.
> > I don't know what the underlying allocators do, but I do not think the
> > MemoryPool code path will actually touch any bytes by itself.  So one
> > possibility, at least as a prototype, is that you could provide an
> > intercepting memory pool that handles larger
> >
> > [1]
> >
> > https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
> >
> > On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai  wrote:
> >
> > > Hi Antoine and all!
> > >
> > > Sorry for the delay, I wanted to understand things a bit better before
> > > getting back to you.  As discussed, I focussed on the Parquet case. I've
> > > looked into parquet/encoding.cc to see what could be done to have a
> > > better memory reservation with ByteArrays.
> > >
> > > On my journey, I noticed a few things:
> > > - The parquet dictionary is first deserialized into an array of views,
> > > then converted to an Arrow-style variable/fixed size binary array (here
> > > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
> > > / here
> > > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
> > > This latter conversion is done even if it is not used afterwards; in
> > > fact it seems to be only used here
> > > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
> > > (it is never used for fixed size binary, and not used either for
> > > variable size binary if the target Arrow array is not dict encoded). So
> > > if you confirm my "discoveries", we could spare ourselves some
> > > dictionary conversions by doing them lazily when *InsertDictionary* is
> > > called.
> > > - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
> > > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
> > > and DictByteArrayDecoderImpl::DecodeArrow
> > > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
> > > have been short circuited inside
> > > ByteArrayDictionaryRecordReader::ReadValuesDense and
> > > ByteArrayDictionaryRecordReader::ReadValuesSpaced (dictionary encoded
> > > data pages are always RLE in parquet).

Re: [DISCUSS] [C++] custom allocator for large objects

2020-07-11 Thread Rémi Dettai
Hi Micah,

Thanks for the answer! But it seems your email got split in half in some
way ;-)

My use case mainly focuses on aggregations (with group by), and after
fighting quite a bit with the allocators I ended up thinking that it might
not be worth materializing the raw data as arrow tables in that case. I
don't know exactly how I can objectively compare the two approaches, and I
have trouble getting a good feeling on whether the optimisations achievable
on kernels (vectorization...) might or might not compensate for the cost of
these allocations. I know a query engine is in the works on your side as
well; has this kind of short-circuiting of materialization been discussed?

If I go with direct aggregations without materializing the un-aggregated
data as an arrow table, I still consider using Arrow as an exchange format,
but more for the standard than for its performance (I expect the aggregation
factor to be at least 1000x, because the whole idea is to force the heavy
lifting into these aggregation workers).

PS: I guess nobody answered my remarks about the Parquet deserialization
because it is quite hard to get into the code to verify what's happening.
It's true that Parquet decoding can be messy because different types need
different code paths, but on top of that, this implementation mixes the
paths for deserialization to Arrow with the legacy ones, which doubles the
complexity.

Remi

On Fri, Jul 10, 2020 at 08:12, Micah Kornfield wrote:

> Sorry for the delay.  Clearing through my inbox backlog ...
>
> We should double check the code, but one thing that has bitten me in the
> past with variable-width data is that the binary array builder's
> ReserveData call [1] does not behave the same way Reserve does. The former
> only grows the buffer by the exact size requested; the latter grows
> buffers by a power of two.
>
> Also to answer one question.
>
>
> > I'm not sure what the true
> > contract of the memory pool and the associated allocator is. Does this
> > interface guarantee that the memory allocated will not be zeroed out or
> > touched in some similar fashion that would trigger physical memory
> > mapping
>
> I don't think we've defined anything specifically by way of this contract.
> I don't know what the underlying allocators do, but I do not think the
> MemoryPool code path will actually touch any bytes by itself.  So one
> possibility, at least as a prototype, is that you could provide an
> intercepting memory pool that handles larger
>
> [1]
>
> https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
>
> On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai  wrote:
>
> > Hi Antoine and all!
> >
> > Sorry for the delay, I wanted to understand things a bit better before
> > getting back to you.  As discussed, I focussed on the Parquet case. I've
> > looked into parquet/encoding.cc to see what could be done to have a
> > better memory reservation with ByteArrays.
> >
> > On my journey, I noticed a few things:
> > - The parquet dictionary is first deserialized into an array of views,
> > then converted to an Arrow-style variable/fixed size binary array (here
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
> > / here
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
> > This latter conversion is done even if it is not used afterwards; in fact
> > it seems to be only used here
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
> > (it is never used for fixed size binary, and not used either for variable
> > size binary if the target Arrow array is not dict encoded). So if you
> > confirm my "discoveries", we could spare ourselves some dictionary
> > conversions by doing them lazily when *InsertDictionary* is called.
> > - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
> > and DictByteArrayDecoderImpl::DecodeArrow
> > <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
> > have been short circuited inside
> > ByteArrayDictionaryRecordReader::ReadValuesDense
> > <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532>
> > and ByteArrayDictionaryRecordReader::ReadValuesSpaced
> > <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548>
> > (dictionary encoded data pages are always RLE in parquet). This would
> > mean that it is dead code now, isn't it? Well, DecodeArrow methods are
> > kind of public, but it is very confusing as they are not used by Arrow
> > itself ;-)

Re: [DISCUSS] [C++] custom allocator for large objects

2020-07-10 Thread Micah Kornfield
Sorry for the delay.  Clearing through my inbox backlog ...

We should double check the code, but one thing that has bitten me in the
past with variable-width data is that the binary array builder's ReserveData
call [1] does not behave the same way Reserve does. The former only grows
the buffer by the exact size requested; the latter grows buffers by a power
of two.
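
A minimal sketch of that asymmetry, assuming the Arrow 1.x C++ builder API
(Reserve/ReserveData/Append as declared in builder_binary.h; treat as an
untested illustration): pre-sizing both buffers lets appends proceed without
incremental reallocations.

#include <arrow/api.h>

#include <memory>
#include <string>
#include <vector>

arrow::Status BuildPresized(const std::vector<std::string>& values,
                            std::shared_ptr<arrow::Array>* out) {
  arrow::BinaryBuilder builder;
  int64_t total_bytes = 0;
  for (const auto& v : values) total_bytes += static_cast<int64_t>(v.size());
  // Reserve sizes the offsets/validity buffers (power-of-two growth)...
  ARROW_RETURN_NOT_OK(builder.Reserve(static_cast<int64_t>(values.size())));
  // ...while ReserveData grows the value buffer by exactly what is asked,
  // which is the difference described above.
  ARROW_RETURN_NOT_OK(builder.ReserveData(total_bytes));
  for (const auto& v : values) {
    ARROW_RETURN_NOT_OK(builder.Append(
        reinterpret_cast<const uint8_t*>(v.data()),
        static_cast<int32_t>(v.size())));
  }
  return builder.Finish(out);
}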

Also to answer one question.


> I'm not sure what the true
> contract of the memory pool and the associated allocator is. Does this
> interface guarantee that the memory allocated will not be zeroed out or
> touched in some similar fashion that would trigger physical memory mapping

I don't think we've defined anything specifically by way of this contract.
I don't know what the underlying allocators do, but I do not think the
MemoryPool code path will actually touch any bytes by itself.  So one
possibility, at least as a prototype, is that you could provide an
intercepting memory pool that handles larger

[1]
https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
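
For reference, a skeleton of such an intercepting pool, assuming the Arrow
1.x MemoryPool virtual interface (Allocate/Reallocate/Free as in
memory_pool.h); the large-object branch is deliberately left as a stub.

#include <arrow/memory_pool.h>
#include <arrow/status.h>

#include <cstdint>
#include <string>

class InterceptingPool : public arrow::MemoryPool {
 public:
  explicit InterceptingPool(arrow::MemoryPool* wrapped) : wrapped_(wrapped) {}

  arrow::Status Allocate(int64_t size, uint8_t** out) override {
    // Hook point: (re)allocations above some threshold could take a custom
    // mmap-based path here; everything else is delegated unchanged.
    return wrapped_->Allocate(size, out);
  }

  arrow::Status Reallocate(int64_t old_size, int64_t new_size,
                           uint8_t** ptr) override {
    return wrapped_->Reallocate(old_size, new_size, ptr);
  }

  void Free(uint8_t* buffer, int64_t size) override {
    wrapped_->Free(buffer, size);
  }

  int64_t bytes_allocated() const override {
    return wrapped_->bytes_allocated();
  }

  std::string backend_name() const override { return "intercepting"; }

 private:
  arrow::MemoryPool* wrapped_;
};

// Usage: InterceptingPool pool(arrow::default_memory_pool()); then pass
// &pool wherever a MemoryPool* is accepted, instead of the default pool.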

On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai  wrote:

> Hi Antoine and all!
>
> Sorry for the delay, I wanted to understand things a bit better before
> getting back to you.  As discussed, I focussed on the Parquet case. I've
> looked into parquet/encoding.cc to see what could be done to have a better
> memory reservation with ByteArrays.
>
> On my journey, I noticed a few things:
> - The parquet dictionary is first deserialized into an array of views,
> then converted to an Arrow-style variable/fixed size binary array (here
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
> / here
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
> This latter conversion is done even if it is not used afterwards; in fact
> it seems to be only used here
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
> (it is never used for fixed size binary, and not used either for variable
> size binary if the target Arrow array is not dict encoded). So if you
> confirm my "discoveries", we could spare ourselves some dictionary
> conversions by doing them lazily when *InsertDictionary* is called.
> - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
> and DictByteArrayDecoderImpl::DecodeArrow
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
> have been short circuited inside
> ByteArrayDictionaryRecordReader::ReadValuesDense
> <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532>
> and ByteArrayDictionaryRecordReader::ReadValuesSpaced
> <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548>
> (dictionary encoded data pages are always RLE in parquet). This would mean
> that it is dead code now, isn't it? Well, DecodeArrow methods are kind of
> public, but it is very confusing as they are not used by Arrow itself ;-)
>
> Finally, regarding your idea (@antoine) of using a simple heuristic to
> pre-allocate the array, I'm not totally comfortable (maybe wrongly) with
> doing the over-allocation at that level, because I'm not sure what the true
> contract of the memory pool and the associated allocator is. Does this
> interface guarantee that the memory allocated will not be zeroed out or
> touched in some similar fashion that would trigger physical memory mapping?
> With my suggestion of using mmap to do huge allocations, I am positioning
> myself at a level where I know for sure that the underlying physical memory
> is not mapped, because we don't touch the memory. But as you noticed, I
> have less knowledge about the context, so it's very hard to decide when to
> trigger the "runway" pre-allocation or not.
>
> Hope this all makes sense. Took me a while to understand how the decoding
> works ;-)
>
> Remi
>
>
>
> On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou wrote:
>
> >
> > On 05/06/2020 17:09, Rémi Dettai wrote:
> > > I looked into the details of why the decoder could not estimate the
> > > target Arrow array size for my Parquet column. It's because I am
> > > decoding from Parquet-Dictionary to Arrow-Plain (which is the default
> > > when loading Parquet). In this case the size prediction is impossible :-(
> >
> > But we can probably make up a heuristic.  For example:
> >    avg(dictionary value size) * number of non-null values
> >
> > It would avoid a number of resizes, even though there may still be a
> > couple of them at the end.  It may oversize in some cases, but much less
> > than your proposed strategy of reserving a huge chunk of virtual memory :-)

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-16 Thread Rémi Dettai
Hi Antoine and all!

Sorry for the delay, I wanted to understand things a bit better before
getting back to you.  As discussed, I focussed on the Parquet case. I've
looked into parquet/encoding.cc to see what could be done to have a better
memory reservation with ByteArrays.

On my journey, I noticed a few things:
- The parquet dictionary is first deserialized into an array of views, then
converted to an Arrow-style variable/fixed size binary array (here
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
/ here
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
This latter conversion is done even if it is not used afterwards; in fact it
seems to be only used here
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
(it is never used for fixed size binary, and not used either for variable
size binary if the target Arrow array is not dict encoded). So if you
confirm my "discoveries", we could spare ourselves some dictionary
conversions by doing them lazily when *InsertDictionary* is called.
- It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
and DictByteArrayDecoderImpl::DecodeArrow
<https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
have been short circuited inside
ByteArrayDictionaryRecordReader::ReadValuesDense
<https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532>
and ByteArrayDictionaryRecordReader::ReadValuesSpaced
<https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548>
(dictionary encoded data pages are always RLE in parquet). This would mean
that it is dead code now, isn't it? Well, DecodeArrow methods are kind of
public, but it is very confusing as they are not used by Arrow itself ;-)
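
To make the lazy-conversion idea concrete, a small sketch (illustrative
only: InsertDictionary is the call mentioned above, everything else here is
hypothetical and not the parquet-cpp code):

#include <arrow/api.h>

#include <functional>
#include <memory>
#include <utility>

// Holds the cheap view representation and converts to an Arrow binary
// array only on first use, so readers that never hit the InsertDictionary
// path never pay for the conversion.
class LazyDictionary {
 public:
  explicit LazyDictionary(
      std::function<std::shared_ptr<arrow::Array>()> convert)
      : convert_(std::move(convert)) {}

  const std::shared_ptr<arrow::Array>& Get() {
    if (converted_ == nullptr) converted_ = convert_();  // convert once
    return converted_;
  }

 private:
  std::function<std::shared_ptr<arrow::Array>()> convert_;
  std::shared_ptr<arrow::Array> converted_;
};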

Finally, regarding your idea (@antoine) of using a simple heuristic to
pre-allocate the array, I'm not totally comfortable (maybe wrongly) with
doing the over-allocation at that level, because I'm not sure what the true
contract of the memory pool and the associated allocator is. Does this
interface guarantee that the memory allocated will not be zeroed out or
touched in some similar fashion that would trigger physical memory mapping?
With my suggestion of using mmap to do huge allocations, I am positioning
myself at a level where I know for sure that the underlying physical memory
is not mapped, because we don't touch the memory. But as you noticed, I have
less knowledge about the context, so it's very hard to decide when to
trigger the "runway" pre-allocation or not.

Hope this all makes sense. Took me a while to understand how the decoding
works ;-)

Remi



On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou wrote:

>
> On 05/06/2020 17:09, Rémi Dettai wrote:
> > I looked into the details of why the decoder could not estimate the
> > target Arrow array size for my Parquet column. It's because I am decoding
> > from Parquet-Dictionary to Arrow-Plain (which is the default when loading
> > Parquet). In this case the size prediction is impossible :-(
>
> But we can probably make up a heuristic.  For example:
>    avg(dictionary value size) * number of non-null values
>
> It would avoid a number of resizes, even though there may still be a
> couple of them at the end.  It may oversize in some cases, but much less
> than your proposed strategy of reserving a huge chunk of virtual memory :-)
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Antoine Pitrou


On 05/06/2020 17:09, Rémi Dettai wrote:
> I looked into the details of why the decoder could not estimate the target
> Arrow array size for my Parquet column. It's because I am decoding from
> Parquet-Dictionary to Arrow-Plain (which is the default when loading
> Parquet). In this case the size prediction is impossible :-(

But we can probably make up a heuristic.  For example:
   avg(dictionary value size) * number of non-null values

It would avoid a number of resizes, even though there may still be a
couple of them at the end.  It may oversize in some cases, but much less
than your proposed strategy of reserving a huge chunk of virtual memory :-)
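
In code, the heuristic amounts to something like this (illustrative sketch;
the statistics would come from the decoded dictionary page):

#include <cstdint>

// Estimated value-buffer size for decoding a dictionary-encoded column to
// a plain binary array: average dictionary value size times the number of
// non-null values. A couple of resizes may still happen at the end.
int64_t EstimateValueBytes(int64_t dict_value_bytes, int64_t dict_num_values,
                           int64_t num_non_null) {
  if (dict_num_values <= 0) return 0;
  const double avg = static_cast<double>(dict_value_bytes) / dict_num_values;
  return static_cast<int64_t>(avg * static_cast<double>(num_non_null));
}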

Regards

Antoine.


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Antoine Pitrou


On 05/06/2020 16:25, Uwe L. Korn wrote:
> 
> On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote:
>> Hi Antoine!
>>> I would indeed have expected jemalloc to do that (remap the pages)
>> I have no idea about the performance gain this would provide (if any).
>> Could be interesting to explore.
> 
> This would actually be the most interesting thing. In general, getting
> access to the pages mapped into RAM would help in a lot more situations,
> not just reallocation. For example, when you take a small slice of a large
> array and only pass this on, without keeping an explicit reference to the
> array, you will still indirectly hold on to the larger memory block. Having
> an allocator that understands the mapping between pages and memory blocks
> would allow us to free the pages that are not part of the view.
> 
> Also, yes: for CSV and JSON we don't have size estimates beforehand, so
> this would be a great performance improvement there.

For CSV we actually know the size after parsing:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/converter.cc#L177-L178

It would be a shame if this were possible in CSV but not in Parquet, a
storage format dedicated to big columnar data.

Regards

Antoine.


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
I looked into the details of why the decoder could not estimate the target
Arrow array size for my Parquet column. It's because I am decoding from
Parquet-Dictionary to Arrow-Plain (which is the default when loading
Parquet). In this case the size prediction is impossible :-(

> This would actually be the most interesting thing. In general, getting
> access to the pages mapped into RAM would help in a lot more situations,
> not just reallocation. For example, when you take a small slice of a large
> array and only pass this on, without keeping an explicit reference to the
> array, you will still indirectly hold on to the larger memory block. Having
> an allocator that understands the mapping between pages and memory blocks
> would allow us to free the pages that are not part of the view.

Not sure I'm following you on this one. From my understanding, the subject
here is mremap, which allows you to keep your physical memory but change the
virtual address range that points to it. According to this thread
(https://stackoverflow.com/questions/11621606/faster-way-to-move-memory-page-than-mremap),
mremap is mainly efficient for growing large allocations.
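
For the record, a minimal Linux-only sketch of growing a mapping with
mremap: the kernel moves page-table entries rather than copying bytes.

// Compile with g++ on Linux; mremap is a GNU extension from <sys/mman.h>.
#include <sys/mman.h>

#include <cstdio>
#include <cstring>

int main() {
  const size_t old_size = 64UL << 20;   // 64 MiB
  const size_t new_size = 256UL << 20;  // 256 MiB
  void* p = mmap(nullptr, old_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return 1;
  std::memset(p, 42, old_size);  // touch: fault the physical pages in
  // MREMAP_MAYMOVE lets the kernel pick a new virtual range if the old one
  // cannot be extended in place; either way the physical pages are
  // remapped, not copied.
  void* q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
  if (q == MAP_FAILED) return 1;
  std::printf("grown %s\n", q == p ? "in place" : "by remapping");
  munmap(q, new_size);
  return 0;
}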

On Fri, Jun 5, 2020 at 16:25, Uwe L. Korn wrote:

>
> On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote:
> > Hi Antoine!
> > > I would indeed have expected jemalloc to do that (remap the pages)
> > I have no idea about the performance gain this would provide (if any).
> > Could be interesting to explore.
>
> This would actually be the most interesting thing. In general, getting
> access to the pages mapped into RAM would help in a lot more situations,
> not just reallocation. For example, when you take a small slice of a large
> array and only pass this on, without keeping an explicit reference to the
> array, you will still indirectly hold on to the larger memory block. Having
> an allocator that understands the mapping between pages and memory blocks
> would allow us to free the pages that are not part of the view.
>
> Also, yes: for CSV and JSON we don't have size estimates beforehand, so
> this would be a great performance improvement there.
>
> Best
> Uwe
>


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Uwe L. Korn


On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote:
> Hi Antoine!
> > I would indeed have expected jemalloc to do that (remap the pages)
> I have no idea about the performance gain this would provide (if any).
> Could be interesting to explore.

This would actually be the most interesting thing. In general, getting access
to the pages mapped into RAM would help in a lot more situations, not just
reallocation. For example, when you take a small slice of a large array and
only pass this on, without keeping an explicit reference to the array, you
will still indirectly hold on to the larger memory block. Having an allocator
that understands the mapping between pages and memory blocks would allow us
to free the pages that are not part of the view.
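
A sketch of that idea, assuming the buffer is page-aligned (i.e.
mmap-backed) and Linux madvise semantics. Note MADV_DONTNEED is destructive
on anonymous memory, so this is only safe once the slice is all that is
still referenced.

#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

// Return to the OS the physical pages of base[0, buffer_size) that lie
// entirely outside the retained view [view_offset, view_offset + view_len).
void ReleaseOutsideView(uint8_t* base, size_t buffer_size, size_t view_offset,
                        size_t view_len) {
  const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  const size_t head_end = view_offset / page * page;  // round down
  if (head_end > 0) {
    madvise(base, head_end, MADV_DONTNEED);  // pages fully before the view
  }
  const size_t view_end = view_offset + view_len;
  const size_t tail_start = (view_end + page - 1) / page * page;  // round up
  if (tail_start < buffer_size) {
    madvise(base + tail_start, buffer_size - tail_start,
            MADV_DONTNEED);  // pages fully after the view
  }
}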

Also, yes: for CSV and JSON we don't have size estimates beforehand, so
this would be a great performance improvement there.

Best
Uwe


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
Hi Antoine!
> I would indeed have expected jemalloc to do that (remap the pages)
I have no idea about the performance gain this would provide (if any).
Could be interesting to explore.

> do you know that Arrow also supports integration with another allocator,
> mimalloc
I only tried Jemalloc and the system allocator because I assumed mimalloc
was specific to Windows. I'll try to look into this as well ;-) I liked the
idea of looking deeper into jemalloc, as it is also the default allocator
for Rust.

Thanks for your insights!

Regards,

Remi



On Fri, Jun 5, 2020 at 14:32, Antoine Pitrou wrote:

>
> On 05/06/2020 14:25, Rémi Dettai wrote:
> > Hi Uwe!
> >
> > > As your suggestions don't seem to be specific to Arrow, why not
> > > contribute them directly to jemalloc? They are much better at reviewing
> > > allocator code than we are.
> > I mentioned this idea in the jemalloc gitter. The first response was
> > that it should work, but workloads with realloc aren't very common and
> > these huge allocations can have some impact during coredumps. It's true
> > that jemalloc is supposed to be a general purpose allocator, so it has to
> > make compromises in that sense. Let's wait and see if other answers come
> > in. You can follow the conversation here if you are interested:
> > https://gitter.im/jemalloc/jemalloc. It seemed to me while investigating
> > jemalloc's internals that most of the design effort is concentrated on
> > small/medium allocations. This makes sense, as they probably represent
> > 99% of workloads.
>
> Thanks for the pointer.  The reply is interesting:
> """
> Another trick that’s possible but we don’t do is to just remap the pages
> into some new space, rather than alloc-and-copy
> """
> I would indeed have expected jemalloc to do that (remap the pages) when
> reallocating large allocations, so I'm a bit surprised they don't.  I
> guess it's one more special case to worry about.
>
> By the way, before going further with jemalloc, do you know that Arrow
> also supports integration with another allocator, mimalloc
> (https://github.com/microsoft/mimalloc/)?  Perhaps you should try it out
> and see whether that improves performance or not.
>
> (and also the system allocator, by the way)
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Antoine Pitrou


On 05/06/2020 14:25, Rémi Dettai wrote:
> Hi Uwe!
> 
> > As your suggestions don't seem to be specific to Arrow, why not
> > contribute them directly to jemalloc? They are much better at reviewing
> > allocator code than we are.
> I mentioned this idea in the jemalloc gitter. The first response was that
> it should work, but workloads with realloc aren't very common and these
> huge allocations can have some impact during coredumps. It's true that
> jemalloc is supposed to be a general purpose allocator, so it has to make
> compromises in that sense. Let's wait and see if other answers come in. You
> can follow the conversation here if you are interested:
> https://gitter.im/jemalloc/jemalloc. It seemed to me while investigating
> jemalloc's internals that most of the design effort is concentrated on
> small/medium allocations. This makes sense, as they probably represent 99%
> of workloads.

Thanks for the pointer.  The reply is interesting:
"""
Another trick that’s possible but we don’t do is to just remap the pages
into some new space, rather than alloc-and-copy
"""
I would indeed have expected jemalloc to do that (remap the pages) when
reallocating large allocations, so I'm a bit surprised they don't.  I
guess it's one more special case to worry about.

By the way, before going further with jemalloc, do you know that Arrow
also supports integration with another allocator, mimalloc
(https://github.com/microsoft/mimalloc/)?  Perhaps you should try it out
and see whether that improves performance or not.

(and also the system allocator, by the way)

Regards

Antoine.


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Rémi Dettai
Hi Uwe!

> As your suggestions don't seem to be specific to Arrow, why not
> contribute them directly to jemalloc? They are much better at reviewing
> allocator code than we are.
I mentioned this idea in the jemalloc gitter. The first response was that
it should work, but workloads with realloc aren't very common and these huge
allocations can have some impact during coredumps. It's true that jemalloc
is supposed to be a general purpose allocator, so it has to make compromises
in that sense. Let's wait and see if other answers come in. You can follow
the conversation here if you are interested:
https://gitter.im/jemalloc/jemalloc. It seemed to me while investigating
jemalloc's internals that most of the design effort is concentrated on
small/medium allocations. This makes sense, as they probably represent 99%
of workloads. But Arrow is really about structuring larger chunks of data in
memory. I agree with Antoine that in normal circumstances, if you start to
blame the allocator, it likely means that you got things wrong elsewhere.
But maybe the Arrow workload is specific enough to earn a seat in the realm
of exceptions (right next to video game engines)! ;-)

> Still, when we read a column, we should be able to determine its final
> size from the Parquet metadata. Maybe we're not passing some information
> along there?
I'm going to take a look at this. But even if you can get a good estimation
for Parquet, what about CSV or JSON? How can you estimate the size of a
column of strings beforehand, especially with a compressed payload where you
don't even know the number of lines!

On Fri, Jun 5, 2020 at 12:24, Uwe L. Korn wrote:

> Hello Rémi,
>
> under the hood jemalloc does quite similar things to what you describe.
> I'm not sure what the cutoff is in the current version, but in earlier
> releases it used a different allocation strategy for objects above 4MB.
> For the initial large allocation, you will see quite some copies, as mmap
> returns a new base address and isn't able to reuse an existing space. This
> could probably be circumvented by a single large allocation which is
> free'd again.
>
> As your suggestions don't seem to be specific to Arrow, why not contribute
> them directly to jemalloc? They are much better at reviewing allocator code
> than we are.
>
> Still, when we read a column, we should be able to determine its final
> size from the Parquet metadata. Maybe we're not passing some information
> along there?
>
> Best,
> Uwe
>
> On Thu, Jun 4, 2020, at 5:48 PM, Rémi Dettai wrote:
> > When creating large arrays, Arrow uses realloc quite intensively.
> >
> > I have an example where I read a gzipped parquet column (strings) that
> > expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> > cannot anticipate this and every reallocate call above 1MB (the most
> > critical ones) ends up being a copy.
> >
> > I think that, knowing that we like using realloc in Arrow, we could come
> > up with an allocator for large objects that would behave a lot better
> > than Jemalloc. For smaller objects, this allocator could just let the
> > memory request be handled by Jemalloc. Not trying to outsmart the
> > brilliant guys from Facebook and co ;-) But for larger objects, we could
> > adopt a custom strategy:
> > - if an allocation or a re-allocation larger than 1MB (or maybe even
> > 512K) is made on our memory pool, call mmap with size XGB (X being
> > slightly smaller than the total physical memory on the system). This is
> > ok because mmap will not physically allocate this memory as long as it
> > is not touched.
> > - we keep track of all allocations that we created this way, by storing
> > the pointer + the actual used size inside our XGB alloc in a map.
> > - when growing an alloc mmaped this way we will always have contiguous
> > memory available (otherwise we would already have OOMed, because X is
> > the physical memory size).
> > - when reducing the alloc size we can free with madvise (optional: if
> > the alloc becomes small enough, we might copy it back into a Jemalloc
> > allocation).
> >
> > I am not an expert in these matters, and I just learned what an
> > allocator really is, so my approach might be naive. In that case feel
> > free to enlighten me!
> >
> > Please note that I'm not sure about the level of portability of this
> > solution.
> >
> > Have a nice day!
> >
> > Remi
> >
>


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Uwe L. Korn
Hello Rémi,

under the hood jemalloc does quite similar things to what you describe. I'm
not sure what the cutoff is in the current version, but in earlier releases
it used a different allocation strategy for objects above 4MB. For the
initial large allocation, you will see quite some copies, as mmap returns a
new base address and isn't able to reuse an existing space. This could
probably be circumvented by a single large allocation which is free'd again.

As your suggestions don't seem to be specific to Arrow, why not contribute
them directly to jemalloc? They are much better at reviewing allocator code
than we are.

Still, when we read a column, we should be able to determine its final size
from the Parquet metadata. Maybe we're not passing some information along
there?

Best,
Uwe

On Thu, Jun 4, 2020, at 5:48 PM, Rémi Dettai wrote:
> When creating large arrays, Arrow uses realloc quite intensively.
> 
> I have an example where I read a gzipped parquet column (strings) that
> expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> cannot anticipate this and every reallocate call above 1MB (the most
> critical ones) ends up being a copy.
> 
> I think that, knowing that we like using realloc in Arrow, we could come up
> with an allocator for large objects that would behave a lot better than
> Jemalloc. For smaller objects, this allocator could just let the memory
> request be handled by Jemalloc. Not trying to outsmart the brilliant guys
> from Facebook and co ;-) But for larger objects, we could adopt a custom
> strategy:
> - if an allocation or a re-allocation larger than 1MB (or maybe even 512K)
> is made on our memory pool, call mmap with size XGB (X being slightly
> smaller than the total physical memory on the system). This is ok because
> mmap will not physically allocate this memory as long as it is not touched.
> - we keep track of all allocations that we created this way, by storing the
> pointer + the actual used size inside our XGB alloc in a map.
> - when growing an alloc mmaped this way we will always have contiguous
> memory available (otherwise we would already have OOMed, because X is the
> physical memory size).
> - when reducing the alloc size we can free with madvise (optional: if the
> alloc becomes small enough, we might copy it back into a Jemalloc
> allocation).
>
> I am not an expert in these matters, and I just learned what an allocator
> really is, so my approach might be naive. In that case feel free to
> enlighten me!
> 
> Please note that I'm not sure about the level of portability of this
> solution.
> 
> Have a nice day!
> 
> Remi
>
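
Putting the quoted strategy into code, a compact Linux-only sketch (all
names and sizes are illustrative, not Arrow code; kRunway stands in for the
"XGB" reservation, which the proposal would size near physical RAM):

#include <sys/mman.h>

#include <cstddef>
#include <cstdlib>
#include <unordered_map>

class RunwayAllocator {
 public:
  void* Allocate(size_t size) {
    if (size < kThreshold) return std::malloc(size);  // delegate small allocs
    // MAP_NORESERVE reserves virtual address space only: physical pages are
    // mapped on first touch, so the multi-GB "runway" is essentially free.
    void* base = mmap(nullptr, kRunway, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) return nullptr;
    runways_[base] = size;
    return base;
  }

  // Growing never moves the data: the tail of the runway is already mapped
  // at contiguous virtual addresses.
  bool Grow(void* ptr, size_t new_size) {
    auto it = runways_.find(ptr);
    if (it == runways_.end() || new_size > kRunway) return false;
    it->second = new_size;
    return true;
  }

  // Shrinking returns the unused tail pages to the OS (the madvise step).
  void Shrink(void* ptr, size_t new_size) {
    auto it = runways_.find(ptr);
    if (it == runways_.end()) return;
    const size_t tail = (new_size + kPage - 1) & ~(kPage - 1);  // round up
    if (tail < it->second) {
      madvise(static_cast<char*>(ptr) + tail, it->second - tail,
              MADV_DONTNEED);
    }
    it->second = new_size;
  }

  void Free(void* ptr) {
    auto it = runways_.find(ptr);
    if (it == runways_.end()) {
      std::free(ptr);  // was a small, malloc-backed allocation
      return;
    }
    munmap(ptr, kRunway);
    runways_.erase(it);
  }

 private:
  static constexpr size_t kThreshold = 1UL << 20;  // 1MB, as in the proposal
  static constexpr size_t kRunway = 16UL << 30;    // the "X" GB runway
  static constexpr size_t kPage = 4096;            // assumed page size
  std::unordered_map<void*, size_t> runways_;
};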


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Antoine Pitrou


On 04/06/2020 18:11, Rémi Dettai wrote:
> > Ideally, we should be able to presize the array to a good enough
> > estimate.
>
> You should be able to get away with a correct estimation because parquet
> column metadata contains the uncompressed size. But is there anything wrong
> with this idea of mmaping huge "runways" for our larger allocations?

I don't know, but I suspect that if common allocators aren't doing it,
there must be some downsides.

I think trying to avoid most (re)allocations should be our preferred
course of action here.

Regards

Antoine.


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Rémi Dettai
> Ideally, we should be able to presize the array to a good enough
> estimate.

You should be able to get away with a correct estimation because parquet
column metadata contains the uncompressed size. But is there anything wrong
with this idea of mmaping huge "runways" for our larger allocations?


On Thu, Jun 4, 2020 at 17:58, Antoine Pitrou wrote:

> On Thu, 4 Jun 2020 17:48:16 +0200
> Rémi Dettai  wrote:
> > When creating large arrays, Arrow uses realloc quite intensively.
> >
> > I have an example where I read a gzipped parquet column (strings) that
> > expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> > cannot anticipate this and every reallocate call above 1MB (the most
> > critical ones) ends up being a copy.
>
> Ideally, we should be able to presize the array to a good enough
> estimate. I don't know if Parquet gives us enough information for that,
> though.
>
> Regards
>
> Antoine.
>
>
>


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Antoine Pitrou
On Thu, 4 Jun 2020 17:48:16 +0200
Rémi Dettai  wrote:
> When creating large arrays, Arrow uses realloc quite intensively.
> 
> I have an example where I read a gzipped parquet column (strings) that
> expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> cannot anticipate this and every reallocate call above 1MB (the most
> critical ones) ends up being a copy.

Ideally, we should be able to presize the array to a good enough
estimate. I don't know if Parquet gives us enough information for that,
though.

Regards

Antoine.