Re: [DISCUSS] Add kernel integer overflow handling

2020-06-04 Thread Rémi Dettai
It makes sense to me that the default behaviour of such a low-level API as
a kernel does not do any automagic promotion, but shouldn't this kind of
promotion still be requestable by the so-called "system developer" user?
Otherwise they would need to materialize a promoted version of each original
array before the kernel operation, wouldn't they?

On Wed, Jun 3, 2020 at 18:27, Wes McKinney wrote:

> On Wed, Jun 3, 2020 at 11:25 AM Krisztián Szűcs
>  wrote:
> >
> > On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs
> >  wrote:
> > >
> > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney 
> wrote:
> > > >
> > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > From the user perspective I find the following pretty confusing:
> > > > >
> > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2
> > > > > Out[1]: array([ 0, -2], dtype=int8)
> > > > >
> > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2
> > > > > Out[2]: array([-256,  254], dtype=int16)
> > > > >
> > > > > In my opinion somewhere (on a higher level maybe) we should provide
> > > > > the correct results promoted to a wider type implicitly.
> > > >
> > > > Yes, I agree with you, but I agree that the best place to address
> this
> > > > is at a higher level rather than having this logic implemented at the
> > > > lowest level (kernels) -- I think database systems handle this during
> > > > logical->physical planning.
> > >
> > > It raises another question: where to incorporate the implicit
> promotions?
> > > // correct me if I'm wrong but these implicit promotions are operation
> > > // dependent and distinct from kernel dispatching issue [1]
> > >
> > > The numpy example above can be roughly translated to:
> > > >>> a = pa.array([-128, 127])
> > > >>> pa.compute.add(a, a)
> > > array([ 0, -2]
> > >
> > > Which is rather surprising from the user's perspective.
> >
> > Would it be enough to document the exact behavior and advise the user
> > to place casts until we have logical -> physical machinery?
>
> I think it's enough to clearly document the behavior and assume that
> the "user" will act according to what semantics are desired for their
> use cases. Per my comments in my last e-mail I don't think the users
> of these functions need to be handled with "kid's gloves".
>
> > I'm updating my PR as discussed.
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-8919
> > > >
> > > > > Clickhouse for example does the type promotion.
> > > > >
> > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou 
> wrote:
> > > > > >
> > > > > > On Wed, 3 Jun 2020 10:47:38 -0400
> > > > > > Ben Kietzman  wrote:
> > > > > > >
> https://github.com/apache/arrow/pull/7341#issuecomment-638241193
> > > > > > >
> > > > > > > How should arithmetic kernels handle integer overflow?
> > > > > > >
> > > > > > > The approach currently taken in the linked PR is to promote
> such that
> > > > > > > overflow will not occur, for example `(int8, int8)->int16` and
> `(uint16,
> > > > > > > uint16)->uint32`.
> > > > > > >
> > > > > > > I'm not sure that's desirable. For one thing this leads to
> inconsistent
> > > > > > > handling of 64 bit integer types, which are currently allowed
> to overflow
> > > > > > > since we cannot promote further (NB: that means this kernel
> includes
> > > > > > > undefined behavior for int64).
> > > > > >
> > > > > > I agree with you.  I would strongly advise against implicit
> promotion
> > > > > > accross arithmetic operations.  We initially did that in Numba
> and it
> > > > > > quickly became a can of worms.
> > > > > >
> > > > > > The most desirable behaviour IMHO is to keep the original type,
> so:
> > > > > > - (int8, int8) -> int8
> > > > > > - (uint16, uint16) -> uint16
> > > > > >
> > > > > > Then the question is what happens when the actual overflow
> occurs.  I
> > > > > > think this should be directed by a kernel option.  By default an
> error
> > > > > > should probably be raised (letting errors pass and silently
> produce
> > > > > > erroneous data is wrong), but we might want to allow people to
> bypass
> > > > > > overflow checks for speed.
> > > > > >
> > > > > > Note that even if overflow detection is enabled, it *should* be
> possible
> > > > > > to enable vectorization, e.g. by making overflow detection a
> separate
> > > > > > pass (itself vectorizable).
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
>
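Antoine's idea of keeping the input type and running overflow detection as a separate, vectorizable pass can be sketched in numpy. This is an illustrative helper, not Arrow's API; the sign-bit trick below only covers same-width signed addition:

```python
import numpy as np

def checked_add(a, b):
    """Add two same-dtype signed integer arrays, raising on overflow.

    The result is computed with ordinary wraparound arithmetic, then a
    second vectorized pass flags lanes where signed overflow occurred:
    for addition this happens exactly when both inputs share a sign
    that the wrapped result does not.
    """
    out = a + b  # numpy wraps silently for same-dtype integer arrays
    overflow = ((a ^ out) & (b ^ out)) < 0
    if overflow.any():
        raise OverflowError(f"overflow at indices {np.flatnonzero(overflow)}")
    return out

a = np.array([-128, 127], dtype=np.int8)
# checked_add(a, a) raises: both lanes overflow (the wrapped result is [0, -2])
```

The detection pass touches only the inputs and the wrapped output, so it stays branch-free and vectorizable, and can be skipped entirely when the user opts out of checking for speed.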


Re: [DISCUSS] Add kernel integer overflow handling

2020-06-04 Thread Krisztián Szűcs
On Thu, Jun 4, 2020 at 11:09 AM Rémi Dettai  wrote:
>
> It makes sense to me that the default behaviour of such a low level api as
> kernel does not do any automagic promotion, but shouldn't this kind of
> promotion still be requestable by the so called "system developer" user ?
> Otherwise he would need to materialize a promoted version of each original
> array before the kernel operation, wouldn't he ?

I assume yes, for 1 million elements:

Promotion inside kernel:
u32 + u32 = u64
4MB + 4MB -> 8MB
New allocation: 8MB

Promotion outside kernel:
(u32 -> u64) + (u32 -> u64) = u64
(4MB -> 8MB) + (4MB -> 8MB) -> 8MB
New allocation: 24MB
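A minimal numpy sketch of the "promotion outside the kernel" case, with sizes matching the estimate above (numpy stands in for the caller doing explicit promotion; Arrow's kernels are not involved):

```python
import numpy as np

# One million uint32 elements: 4 MB per input array.
a = np.ones(1_000_000, dtype=np.uint32)
b = np.ones(1_000_000, dtype=np.uint32)

# Promotion outside the kernel: the caller materializes widened copies
# first (two extra 8 MB temporaries), then adds them.
a64 = a.astype(np.uint64)   # 8 MB copy
b64 = b.astype(np.uint64)   # 8 MB copy
out = a64 + b64             # 8 MB result -> 24 MB of new allocations total

# Promotion inside the kernel would only allocate the 8 MB result.
assert a.nbytes == 4_000_000
assert out.dtype == np.uint64 and out.nbytes == 8_000_000
```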

>
> Le mer. 3 juin 2020 à 18:27, Wes McKinney  a écrit :
>
> > On Wed, Jun 3, 2020 at 11:25 AM Krisztián Szűcs
> >  wrote:
> > >
> > > On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs
> > >  wrote:
> > > >
> > > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney 
> > wrote:
> > > > >
> > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs
> > > > >  wrote:
> > > > > >
> > > > > > From the user perspective I find the following pretty confusing:
> > > > > >
> > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2
> > > > > > Out[1]: array([ 0, -2], dtype=int8)
> > > > > >
> > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2
> > > > > > Out[2]: array([-256,  254], dtype=int16)
> > > > > >
> > > > > > In my opinion somewhere (on a higher level maybe) we should provide
> > > > > > the correct results promoted to a wider type implicitly.
> > > > >
> > > > > Yes, I agree with you, but I agree that the best place to address
> > this
> > > > > is at a higher level rather than having this logic implemented at the
> > > > > lowest level (kernels) -- I think database systems handle this during
> > > > > logical->physical planning.
> > > >
> > > > It raises another question: where to incorporate the implicit
> > promotions?
> > > > // correct me if I'm wrong but these implicit promotions are operation
> > > > // dependent and distinct from kernel dispatching issue [1]
> > > >
> > > > The numpy example above can be roughly translated to:
> > > > >>> a = pa.array([-128, 127])
> > > > >>> pa.compute.add(a, a)
> > > > array([ 0, -2]
> > > >
> > > > Which is rather surprising from the user's perspective.
> > >
> > > Would it be enough to document the exact behavior  and advice the user
> > > to place casts until we have logical -> phisycal machinery?
> >
> > I think it's enough to clearly document the behavior and assume that
> > the "user" will act according to what semantics are desired for their
> > use cases. Per my comments in my last e-mail I don't think the users
> > of these functions need to be handled with "kid's gloves".
> >
> > > I'm updating my PR as discussed.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/ARROW-8919
> > > > >
> > > > > > Clickhouse for example does the type promotion.
> > > > > >
> > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou 
> > wrote:
> > > > > > >
> > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400
> > > > > > > Ben Kietzman  wrote:
> > > > > > > >
> > https://github.com/apache/arrow/pull/7341#issuecomment-638241193
> > > > > > > >
> > > > > > > > How should arithmetic kernels handle integer overflow?
> > > > > > > >
> > > > > > > > The approach currently taken in the linked PR is to promote
> > such that
> > > > > > > > overflow will not occur, for example `(int8, int8)->int16` and
> > `(uint16,
> > > > > > > > uint16)->uint32`.
> > > > > > > >
> > > > > > > > I'm not sure that's desirable. For one thing this leads to
> > inconsistent
> > > > > > > > handling of 64 bit integer types, which are currently allowed
> > to overflow
> > > > > > > > since we cannot promote further (NB: that means this kernel
> > includes
> > > > > > > > undefined behavior for int64).
> > > > > > >
> > > > > > > I agree with you.  I would strongly advise against implicit
> > promotion
> > > > > > > accross arithmetic operations.  We initially did that in Numba
> > and it
> > > > > > > quickly became a can of worms.
> > > > > > >
> > > > > > > The most desirable behaviour IMHO is to keep the original type,
> > so:
> > > > > > > - (int8, int8) -> int8
> > > > > > > - (uint16, uint16) -> uint16
> > > > > > >
> > > > > > > Then the question is what happens when the actual overflow
> > occurs.  I
> > > > > > > think this should be directed by a kernel option.  By default an
> > error
> > > > > > > should probably be raised (letting errors pass and silently
> > produce
> > > > > > > erroneous data is wrong), but we might want to allow people to
> > bypass
> > > > > > > overflow checks for speed.
> > > > > > >
> > > > > > > Note that even if overflow detection is enabled, it *should* be
> > possible
> > > > > > > to enable vectorization, e.g. by making overflow detection a
> > separate
> > > > > > > pass (itself vectorizable).
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> >


[NIGHTLY] Arrow Build Report for Job nightly-2020-06-04-0

2020-06-04 Thread Crossbow


Arrow Build Report for Job nightly-2020-06-04-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0

Failed Tasks:
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-centos-7-aarch64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-centos-8-aarch64
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-python-3.8-jpype
- test-r-linux-as-cran:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-r-linux-as-cran

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-centos-6-amd64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-clean
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-debian-stretch-arm64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-travis-gandiva-jar-xenial
- nuget:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-nuget
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-cpp
- test-conda-python-3.6-pandas-0.23:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-06-04-0-github-test-conda-python-3.6-pandas-0.23
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=n

Re: [DISCUSS] Add kernel integer overflow handling

2020-06-04 Thread Wes McKinney
On Thu, Jun 4, 2020 at 4:57 AM Krisztián Szűcs
 wrote:
>
> On Thu, Jun 4, 2020 at 11:09 AM Rémi Dettai  wrote:
> >
> > It makes sense to me that the default behaviour of such a low level api as
> > kernel does not do any automagic promotion, but shouldn't this kind of
> > promotion still be requestable by the so called "system developer" user ?
> > Otherwise he would need to materialize a promoted version of each original
> > array before the kernel operation, wouldn't he ?

This will all be handled by the expression execution system. Please, I
beg your collective patience as I will be developing this in the near
future.

> I assume yes, for 1 million elements:
>
> Promotion inside kernel:
> u32 + u32 = u64
> 4MB + 4MB -> 8MB
> New allocation: 8MB
>
> Promotion outside kernel:
> (u32 -> u64) + (u32 -> u64) = u64
> (4MB -> 8MB) + (4MB -> 8MB) -> 8MB
> New allocation: 24MB

This isn't what the execution engine will do, though. The processed
chunks will be much smaller, like 64K elements or less, so at no point
will such large temporary allocations exist.

> >
> > Le mer. 3 juin 2020 à 18:27, Wes McKinney  a écrit :
> >
> > > On Wed, Jun 3, 2020 at 11:25 AM Krisztián Szűcs
> > >  wrote:
> > > >
> > > > On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney 
> > > wrote:
> > > > > >
> > > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs
> > > > > >  wrote:
> > > > > > >
> > > > > > > From the user perspective I find the following pretty confusing:
> > > > > > >
> > > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2
> > > > > > > Out[1]: array([ 0, -2], dtype=int8)
> > > > > > >
> > > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2
> > > > > > > Out[2]: array([-256,  254], dtype=int16)
> > > > > > >
> > > > > > > In my opinion somewhere (on a higher level maybe) we should 
> > > > > > > provide
> > > > > > > the correct results promoted to a wider type implicitly.
> > > > > >
> > > > > > Yes, I agree with you, but I agree that the best place to address
> > > this
> > > > > > is at a higher level rather than having this logic implemented at 
> > > > > > the
> > > > > > lowest level (kernels) -- I think database systems handle this 
> > > > > > during
> > > > > > logical->physical planning.
> > > > >
> > > > > It raises another question: where to incorporate the implicit
> > > promotions?
> > > > > // correct me if I'm wrong but these implicit promotions are operation
> > > > > // dependent and distinct from kernel dispatching issue [1]
> > > > >
> > > > > The numpy example above can be roughly translated to:
> > > > > >>> a = pa.array([-128, 127])
> > > > > >>> pa.compute.add(a, a)
> > > > > array([ 0, -2]
> > > > >
> > > > > Which is rather surprising from the user's perspective.
> > > >
> > > > Would it be enough to document the exact behavior  and advice the user
> > > > to place casts until we have logical -> phisycal machinery?
> > >
> > > I think it's enough to clearly document the behavior and assume that
> > > the "user" will act according to what semantics are desired for their
> > > use cases. Per my comments in my last e-mail I don't think the users
> > > of these functions need to be handled with "kid's gloves".
> > >
> > > > I'm updating my PR as discussed.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/ARROW-8919
> > > > > >
> > > > > > > Clickhouse for example does the type promotion.
> > > > > > >
> > > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou 
> > > > > > > 
> > > wrote:
> > > > > > > >
> > > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400
> > > > > > > > Ben Kietzman  wrote:
> > > > > > > > >
> > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193
> > > > > > > > >
> > > > > > > > > How should arithmetic kernels handle integer overflow?
> > > > > > > > >
> > > > > > > > > The approach currently taken in the linked PR is to promote
> > > such that
> > > > > > > > > overflow will not occur, for example `(int8, int8)->int16` and
> > > `(uint16,
> > > > > > > > > uint16)->uint32`.
> > > > > > > > >
> > > > > > > > > I'm not sure that's desirable. For one thing this leads to
> > > inconsistent
> > > > > > > > > handling of 64 bit integer types, which are currently allowed
> > > to overflow
> > > > > > > > > since we cannot promote further (NB: that means this kernel
> > > includes
> > > > > > > > > undefined behavior for int64).
> > > > > > > >
> > > > > > > > I agree with you.  I would strongly advise against implicit
> > > promotion
> > > > > > > > accross arithmetic operations.  We initially did that in Numba
> > > and it
> > > > > > > > quickly became a can of worms.
> > > > > > > >
> > > > > > > > The most desirable behaviour IMHO is to keep the original type,
> > > so:
> > > > > > > > - (int8, int8) -> int8
> > > > > > > > - (uint16, uint16) -> uint16
> > > > > > > >
> > > > > > > > Then the question is what happens when the actual overflow
> > > occurs.  I

[jira] [Created] (ARROW-9034) [C++] Implement binary (two bitmap) version of BitBlockCounter

2020-06-04 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9034:
---

 Summary: [C++] Implement binary (two bitmap) version of 
BitBlockCounter
 Key: ARROW-9034
 URL: https://issues.apache.org/jira/browse/ARROW-9034
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


The current BitBlockCounter from ARROW-9029 is useful for unary operations. 
Some operations involve multiple bitmaps, so it's useful to be able to 
determine the block popcounts of the AND of the respective words in the 
bitmaps. Each returned block would then contain the number of bits that are 
set in both bitmaps at the same locations.
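A rough Python sketch of the idea (illustrative only, not the C++ API; the word and block sizes are assumptions):

```python
def and_block_popcounts(bitmap_a: bytes, bitmap_b: bytes, word_bytes: int = 8):
    """Popcount of the AND of corresponding words of two equal-length
    bitmaps, one count per 64-bit block (a trailing partial block is
    counted too)."""
    assert len(bitmap_a) == len(bitmap_b)
    counts = []
    for i in range(0, len(bitmap_a), word_bytes):
        wa = int.from_bytes(bitmap_a[i:i + word_bytes], "little")
        wb = int.from_bytes(bitmap_b[i:i + word_bytes], "little")
        counts.append(bin(wa & wb).count("1"))  # set in *both* bitmaps
    return counts
```

Applied to the two validity bitmaps of a binary kernel, each count is the number of positions in that block where both inputs are valid.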



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9035) 8 vs 64 byte alignment

2020-06-04 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-9035:


 Summary: 8 vs 64 byte alignment
 Key: ARROW-9035
 URL: https://issues.apache.org/jira/browse/ARROW-9035
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation
Affects Versions: 0.17.0
Reporter: Anthony Abate


I used the C++ library to create a very small Arrow file (1 field of 5 int32) 
and was surprised that the buffers are not aligned to 64 bytes as per the 
documentation section "Buffer Alignment and Padding". Based on the examples 
there, the 20 bytes of int32 data should be padded to 64 bytes, but the body 
is only 24 bytes (see below).

Extracted message metadata - see bodyLength = 24:
{code:java}
{
  version: "V4",
  header_type: "RecordBatch",
  header: {
nodes: [
  {
length: 5,
null_count: 0
  }
],
buffers: [
  {
offset: 0,
length: 0
  },
  {
offset: 0,
length: 20
  }
]
  },
  bodyLength: 24
} {code}
Reading further down, the documentation section "Encapsulated message format" 
says serialization should use 8-byte alignment.

These two statements seem at odds with each other and some clarification is 
needed.

Is the documentation wrong?

Or

Should 8-byte alignment be used for the file format and 64-byte for IPC?
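The observed bodyLength is consistent with 8-byte padding of the 20-byte buffer; a quick check:

```python
def pad(nbytes: int, alignment: int) -> int:
    """Round nbytes up to the next multiple of alignment."""
    return -(-nbytes // alignment) * alignment

# 5 int32 values = 20 bytes of data:
print(pad(20, 8))   # 24 -> matches the bodyLength observed above
print(pad(20, 64))  # 64 -> what the "Buffer Alignment and Padding" examples suggest
```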



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Rémi Dettai
When creating large arrays, Arrow uses realloc quite intensively.

I have an example where I read a gzipped Parquet column (strings) that
expands from 8MB to 100+MB when loaded into Arrow. Of course jemalloc
cannot anticipate this, and every reallocate call above 1MB (the most
critical ones) ends up being a copy.

Knowing that we like using realloc in Arrow, I think we could come up
with an allocator for large objects that would behave a lot better than
jemalloc. For smaller objects, this allocator could just let the memory
request be handled by jemalloc. Not trying to outsmart the brilliant
guys from Facebook and co ;-) But for larger objects, we could adopt a
custom strategy:
- if an allocation or a re-allocation larger than 1MB (or maybe even 512K)
is made on our memory pool, call mmap with size XGB (X being slightly
smaller than the total physical memory on the system). This is ok because
mmap will not physically allocate this memory as long as it is not touched.
- we keep track of all allocations that we created this way, by storing the
pointer + the actual used size inside our XGB alloc in a map.
- when growing an alloc mmaped this way we will always have contiguous
memory available, (otherwise we would already have OOMed because X is the
physical memory size).
- when reducing the alloc size we can free with madvise (optional: if the
alloc becomes small enough, we might copy it back into a jemalloc
allocation).
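A toy, Linux-oriented sketch of this "runway" idea in Python (the class name and sizes are made up; a real implementation would live in C++ inside the memory pool, and the reservation would be sized close to physical RAM as described above):

```python
import mmap

class RunwayAllocation:
    """Reserve a large anonymous mapping up front; growing never copies
    because the virtual address range is already ours, and the OS only
    commits pages that are actually touched."""

    def __init__(self, runway_size: int = 1 << 30):  # 1 GiB of address space
        self.buf = mmap.mmap(-1, runway_size)  # lazily backed anonymous map
        self.used = 0

    def grow(self, new_size: int) -> None:
        # No copy and no new mapping: just use more of the reservation.
        assert new_size <= len(self.buf), "runway exhausted"
        self.used = new_size

    def shrink(self, new_size: int) -> None:
        # Hand the tail pages back to the OS where madvise is available
        # (Python 3.8+ on Linux).
        page = mmap.PAGESIZE
        start = -(-new_size // page) * page  # first whole page past new_size
        if hasattr(mmap, "MADV_DONTNEED") and start < self.used:
            self.buf.madvise(mmap.MADV_DONTNEED, start, self.used - start)
        self.used = new_size
```

Under this scheme a realloc-style grow is O(1) regardless of size, at the cost of reserving (but not committing) a large slice of virtual address space per allocation.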

I am not an expert in these matters, and I just learned what an allocator
really is, so my approach might be naive. In that case, feel free to
enlighten me!

Please note that I'm not sure about the level of portability of this
solution.

Have a nice day!

Remi


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Antoine Pitrou
On Thu, 4 Jun 2020 17:48:16 +0200
Rémi Dettai  wrote:
> When creating large arrays, Arrow uses realloc quite intensively.
> 
> I have an example where y read a gzipped parquet column (strings) that
> expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> cannot anticipate this and every reallocate call above 1MB (the most
> critical ones) ends up being a copy.

Ideally, we should be able to presize the array to a good enough
estimate. I don't know if Parquet gives us enough information for that,
though.

Regards

Antoine.




Re: [DISCUSS] Add kernel integer overflow handling

2020-06-04 Thread Francois Saint-Jacques
I documented [1] the behaviors by experimentation or by reading the
documentation. My experiments were mostly about checking INT64_MAX +
1. My preference would be to use the platform-defined behavior by
default and provide a safety option that errors.

Feel free to add more databases/systems.

François

[1] 
https://docs.google.com/spreadsheets/d/1DTFER2arYyNkjEHd1jCLdhQ90kRHbNrVI-UvHvQiP6Y/edit?usp=sharing

On Thu, Jun 4, 2020 at 9:22 AM Wes McKinney  wrote:
>
> On Thu, Jun 4, 2020 at 4:57 AM Krisztián Szűcs
>  wrote:
> >
> > On Thu, Jun 4, 2020 at 11:09 AM Rémi Dettai  wrote:
> > >
> > > It makes sense to me that the default behaviour of such a low level api as
> > > kernel does not do any automagic promotion, but shouldn't this kind of
> > > promotion still be requestable by the so called "system developer" user ?
> > > Otherwise he would need to materialize a promoted version of each original
> > > array before the kernel operation, wouldn't he ?
>
> This will all be handled by the expression execution system. Please, I
> beg your collective patience as I will be developing this in the near
> future.
>
> > I assume yes, for 1 million elements:
> >
> > Promotion inside kernel:
> > u32 + u32 = u64
> > 4MB + 4MB -> 8MB
> > New allocation: 8MB
> >
> > Promotion outside kernel:
> > (u32 -> u64) + (u32 -> u64) = u64
> > (4MB -> 8MB) + (4MB -> 8MB) -> 8MB
> > New allocation: 24MB
>
> This isn't what the execution engine will do, though. The processed
> chunks will be much smaller, like 64K elements or less, so at no point
> will such large temporary allocations exist.
>
> > >
> > > Le mer. 3 juin 2020 à 18:27, Wes McKinney  a écrit :
> > >
> > > > On Wed, Jun 3, 2020 at 11:25 AM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney 
> > > > wrote:
> > > > > > >
> > > > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > From the user perspective I find the following pretty confusing:
> > > > > > > >
> > > > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2
> > > > > > > > Out[1]: array([ 0, -2], dtype=int8)
> > > > > > > >
> > > > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2
> > > > > > > > Out[2]: array([-256,  254], dtype=int16)
> > > > > > > >
> > > > > > > > In my opinion somewhere (on a higher level maybe) we should 
> > > > > > > > provide
> > > > > > > > the correct results promoted to a wider type implicitly.
> > > > > > >
> > > > > > > Yes, I agree with you, but I agree that the best place to address
> > > > this
> > > > > > > is at a higher level rather than having this logic implemented at 
> > > > > > > the
> > > > > > > lowest level (kernels) -- I think database systems handle this 
> > > > > > > during
> > > > > > > logical->physical planning.
> > > > > >
> > > > > > It raises another question: where to incorporate the implicit
> > > > promotions?
> > > > > > // correct me if I'm wrong but these implicit promotions are 
> > > > > > operation
> > > > > > // dependent and distinct from kernel dispatching issue [1]
> > > > > >
> > > > > > The numpy example above can be roughly translated to:
> > > > > > >>> a = pa.array([-128, 127])
> > > > > > >>> pa.compute.add(a, a)
> > > > > > array([ 0, -2]
> > > > > >
> > > > > > Which is rather surprising from the user's perspective.
> > > > >
> > > > > Would it be enough to document the exact behavior  and advice the user
> > > > > to place casts until we have logical -> phisycal machinery?
> > > >
> > > > I think it's enough to clearly document the behavior and assume that
> > > > the "user" will act according to what semantics are desired for their
> > > > use cases. Per my comments in my last e-mail I don't think the users
> > > > of these functions need to be handled with "kid's gloves".
> > > >
> > > > > I'm updating my PR as discussed.
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/ARROW-8919
> > > > > > >
> > > > > > > > Clickhouse for example does the type promotion.
> > > > > > > >
> > > > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou 
> > > > > > > > 
> > > > wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400
> > > > > > > > > Ben Kietzman  wrote:
> > > > > > > > > >
> > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193
> > > > > > > > > >
> > > > > > > > > > How should arithmetic kernels handle integer overflow?
> > > > > > > > > >
> > > > > > > > > > The approach currently taken in the linked PR is to promote
> > > > such that
> > > > > > > > > > overflow will not occur, for example `(int8, int8)->int16` 
> > > > > > > > > > and
> > > > `(uint16,
> > > > > > > > > > uint16)->uint32`.
> > > > > > > > > >
> > > > > > > > > > I'm not sure that's desirable. For one thing this leads to
> > > > inconsistent
> > > > > > > > > > handling of 64 bit integer types, which are cu

Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Rémi Dettai
> Ideally, we should be able to presize the array to a good enough
> estimate.

You should be able to get away with a correct estimation, because Parquet
column metadata contains the uncompressed size. But is there anything wrong
with this idea of mmapping huge "runways" for our larger allocations?


On Thu, Jun 4, 2020 at 17:58, Antoine Pitrou wrote:

> On Thu, 4 Jun 2020 17:48:16 +0200
> Rémi Dettai  wrote:
> > When creating large arrays, Arrow uses realloc quite intensively.
> >
> > I have an example where y read a gzipped parquet column (strings) that
> > expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> > cannot anticipate this and every reallocate call above 1MB (the most
> > critical ones) ends up being a copy.
>
> Ideally, we should be able to presize the array to a good enough
> estimate. I don't know if Parquet gives us enough information for that,
> though.
>
> Regards
>
> Antoine.
>
>
>


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-04 Thread Antoine Pitrou


On 04/06/2020 at 18:11, Rémi Dettai wrote:
>  > Ideally, we should be able to presize the array to a good enough
> estimate.
> You should be able to get away with a correct estimation because parquet
> column metadata contains the uncompressed size. But is their anything wrong
> with this idea of mmaping huge "runways" for our larger allocations ?

I don't know, but I suspect that if common allocators aren't doing it,
there must be some downsides.

I think trying to avoid most (re)allocations should be our preferred
course of action here.

Regards

Antoine.


[jira] [Created] (ARROW-9036) Null pointer exception when caching data frames)

2020-06-04 Thread Gaurangi Saxena (Jira)
Gaurangi Saxena created ARROW-9036:
--

 Summary: Null pointer exception when caching data frames)
 Key: ARROW-9036
 URL: https://issues.apache.org/jira/browse/ARROW-9036
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.16.0
Reporter: Gaurangi Saxena


I get an NPE when I try to cache a DataFrame in Spark with Arrow as the read format.

 

Stack Trace - 

java.lang.NullPointerException
    at org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:61)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:649)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:106)
    at com.google.cloud.spark.bigquery.ArrowBinaryIterator$ArrowReaderIterator.hasNext(ArrowBinaryIterator.scala:84)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9037) [C++/C-ABI] unable to import array with null count == -1 (which could be exported)

2020-06-04 Thread Zhuo Peng (Jira)
Zhuo Peng created ARROW-9037:


 Summary: [C++/C-ABI] unable to import array with null count == -1 
(which could be exported)
 Key: ARROW-9037
 URL: https://issues.apache.org/jira/browse/ARROW-9037
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.17.1
Reporter: Zhuo Peng


If an Array is created with null_count == -1 but without any nulls (and thus no 
null bitmap buffer), then ArrayData.null_count will remain -1 when exporting if 
it is never computed. The exported C struct then also has null_count == -1 
[1]. But when importing, if null_count != 0 while no null bitmap buffer is 
present, an error [2] is raised.

[1] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560

[2] 
https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9038) [C++] Improve BitBlockCounter

2020-06-04 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-9038:
---

 Summary: [C++] Improve BitBlockCounter
 Key: ARROW-9038
 URL: https://issues.apache.org/jira/browse/ARROW-9038
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai


ARROW-9029 implemented BitBlockCounter. There are opportunities to improve its 
popcount performance, per this review comment: 
https://github.com/apache/arrow/pull/7346#discussion_r435005226
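
For illustration only (a Python/numpy sketch of the idea, not the C++ implementation): BitBlockCounter-style processing counts the set validity bits one fixed-size block at a time, so kernels can take a fast path for all-valid or all-null blocks.

```python
import numpy as np

def block_popcounts(bitmap: bytes, block_bytes: int = 8) -> list:
    """Count set bits per block of a validity bitmap (8 bytes = 64 slots)."""
    arr = np.frombuffer(bitmap, dtype=np.uint8)
    return [int(np.unpackbits(arr[i:i + block_bytes]).sum())
            for i in range(0, len(arr), block_bytes)]

# Two 64-bit blocks: one all-valid, one with 4 valid bits per byte.
counts = block_popcounts(bytes([0xFF] * 8 + [0x0F] * 8))
assert counts == [64, 32]
```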



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9039) py_bytes created by pyarrow 0.11.1 cannot be deserialized by more recent versions

2020-06-04 Thread Yoav Git (Jira)
Yoav Git created ARROW-9039:
---

 Summary: py_bytes created by pyarrow 0.11.1 cannot be deserialized 
by more recent versions
 Key: ARROW-9039
 URL: https://issues.apache.org/jira/browse/ARROW-9039
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1, 0.11.1
 Environment: python, windows
Reporter: Yoav Git


I have been saving dataframes into mongodb using:

{{import pandas as pd; import pyarrow as pa}}
{{df = pd.DataFrame([[1,2,3],[2,3,4]], columns = ['x','y','z'])}}

{{byte = pa.serialize(df).to_buffer().to_pybytes()}}

and then reading back using:

{{df = pa.deserialize(pa.py_buffer(memoryview(byte)))}}

However, pyarrow serialization is not compatible across versions: 0.11.1 and 
0.15.1 can each read the pybytes they themselves created, but neither can read 
the other's.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Unsubscribe

2020-06-04 Thread Zhuo Jia Dai
-- 
ZJ

zhuojia@gmail.com


[jira] [Created] (ARROW-9040) [Python][Parquet]"_ParquetDatasetV2" fail to read with columns and use_pandas_metadata=True

2020-06-04 Thread cmsxbc (Jira)
cmsxbc created ARROW-9040:
-

 Summary: [Python][Parquet]"_ParquetDatasetV2" fail to read with 
columns and use_pandas_metadata=True
 Key: ARROW-9040
 URL: https://issues.apache.org/jira/browse/ARROW-9040
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
Reporter: cmsxbc


When calling _ParquetDatasetV2.read(columns=['column'], use_pandas_metadata=True),

"TypeError: unhashable type: 'dict'" is raised from 
{code:java}
index_columns = set(_get_pandas_index_columns(metadata))
{code}
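
For context, the pandas metadata's index_columns list can contain a dict (e.g. describing a RangeIndex) rather than only column-name strings, and dicts are unhashable, so calling set() over it fails. A minimal reproduction (the metadata content below is illustrative):

```python
# index_columns as it may appear in pandas metadata: a RangeIndex is
# described by a dict instead of a column-name string.
index_columns = [{"kind": "range", "name": None,
                  "start": 0, "stop": 3, "step": 1}]
try:
    set(index_columns)          # what the failing line effectively does
except TypeError as exc:
    message = str(exc)

assert "unhashable" in message
```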



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9041) overloaded virtual function "arrow::io::Writable::Write" is only partially overridden in class

2020-06-04 Thread Karthikeyan Natarajan (Jira)
Karthikeyan Natarajan created ARROW-9041:


 Summary: overloaded virtual function "arrow::io::Writable::Write" 
is only partially overridden in class 
 Key: ARROW-9041
 URL: https://issues.apache.org/jira/browse/ARROW-9041
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.15.0
Reporter: Karthikeyan Natarajan


The following warnings appear:

cpp/build/arrow/install/include/arrow/io/file.h(189): warning: overloaded 
virtual function "arrow::io::Writable::Write" is only partially overridden in 
class "arrow::io::MemoryMappedFile"

cpp/build/arrow/install/include/arrow/io/memory.h(98): warning: overloaded 
virtual function "arrow::io::Writable::Write" is only partially overridden in 
class "arrow::io::MockOutputStream"

cpp/build/arrow/install/include/arrow/io/memory.h(116): warning: overloaded 
virtual function "arrow::io::Writable::Write" is only partially overridden in 
class "arrow::io::FixedSizeBufferWriter"

The suggested solution is to add `using Writable::Write;` in a protected/private section of the derived classes.

[https://isocpp.org/wiki/faq/strange-inheritance#hiding-rule]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)