[jira] [Created] (ARROW-5909) [Java] Optimize ByteFunctionHelpers equals & compare logic

2019-07-10 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5909:
-

 Summary: [Java] Optimize ByteFunctionHelpers equals & compare logic
 Key: ARROW-5909
 URL: https://issues.apache.org/jira/browse/ARROW-5909
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Currently it first compares Long values and then, if length < 8, compares Byte 
values.

Add logic to compare Int values when 4 < length < 8.
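
A sketch of the intended structure, in plain Python for illustration only 
(the actual change is in Java's ByteFunctionHelpers; the function below is a 
hypothetical stand-in, using unsigned big-endian chunks so chunk comparison 
matches byte-wise comparison):

{code}
def compare(buf_a, buf_b, length):
    pos = 0
    # Compare 8 bytes (a Long) at a time while at least 8 bytes remain.
    while length - pos >= 8:
        a = int.from_bytes(buf_a[pos:pos + 8], "big")
        b = int.from_bytes(buf_b[pos:pos + 8], "big")
        if a != b:
            return -1 if a < b else 1
        pos += 8
    # Proposed addition: compare 4 bytes (an Int) when 4-7 bytes remain.
    if length - pos >= 4:
        a = int.from_bytes(buf_a[pos:pos + 4], "big")
        b = int.from_bytes(buf_b[pos:pos + 4], "big")
        if a != b:
            return -1 if a < b else 1
        pos += 4
    # Fall back to byte-wise comparison for the remaining 0-3 bytes.
    for i in range(pos, length):
        if buf_a[i] != buf_b[i]:
            return -1 if buf_a[i] < buf_b[i] else 1
    return 0
{code}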

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-10 Thread Fan Liya
Hi all,


We are thinking of providing varchar/varbinary vectors with a different
memory layout, one that exists in a wide range of systems. The memory layout
differs from that of VarCharVector in the following ways:


   1. Instead of storing (start offset, end offset), the new layout stores
      (start offset, length).

   2. The content of varchars may not be in a consecutive memory region.
      Instead, it can be at arbitrary memory addresses.


Due to these differences in memory layout, converting data between existing
systems and VarCharVectors incurs performance overhead.
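
For concreteness, here is a small pyarrow sketch of the existing layout,
where value i spans [offsets[i], offsets[i+1]) of a single contiguous data
buffer (this assumes an array without nulls; buffers may be padded, hence
the slicing):

{code}
import struct
import pyarrow as pa

arr = pa.array(["ab", "cde", "f"])
validity, offsets, data = arr.buffers()

# int32 offsets: value i occupies data[offsets[i]:offsets[i+1]]
n = len(arr) + 1
print(struct.unpack("<%di" % n, offsets.to_pybytes()[:4 * n]))  # (0, 2, 5, 6)
print(data.to_pybytes()[:6])  # b'abcdef' -- one consecutive region

# The proposed layout would instead store (start offset, length) pairs,
# with each value free to live in a separate, non-contiguous region.
{code}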

The above difference 1 seems insignificant, while difference 2 is difficult
to overcome. However, the scenario behind difference 2 is prevalent in
practice: for example, we store strings in a series of memory segments.
Whenever a segment is full, we request a new one. These memory
segments may not be consecutive, because other processes/threads are also
requesting/releasing memory segments in the meantime.

So we are wondering if it is possible to support such a memory layout in
Arrow. I think there are more systems that are trying to adopt Arrow
but are hindered by this difficulty.

Would you please give your valuable feedback?


Best,

Liya Fan




Re: [Discuss] Are Union.typeIds worth keeping?

2019-07-10 Thread Jacques Nadeau
I was also supportive of this pattern. We definitely have used it before to
optimize in certain cases.

On Wed, Jul 10, 2019, 2:40 PM Wes McKinney  wrote:

> On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman 
> wrote:
> >
> > In this scenario option A (include child arrays for each child type, even
> > if that type is not observed) seems like the clearly correct choice to
> me.
> > It yields a more intuitive layout for the union array and incurs no
> runtime
> > overhead (since the absent children are empty/null arrays).
>
> I am not sure this is right. The child arrays still occupy memory in
> the Sparse Union case (where all child arrays have the same length).
> In order to satisfy the requirements of the IPC protocol, the child
> arrays need to be of the same type as the types in the union. In the
> Dense Union case, the not-present children will have length 0.
>
> >
> > > why not allow them to be flexible in this regard?
> >
> > I would say that if code doesn't add anything except cognitive overhead
> > then it's worthwhile to remove it.
>
> The cognitive overhead comes for the Arrow library implementer --
> users of the libraries aren't required to deal with this detail
> necessarily. The type ids are optional, after all. Even if it is
> removed, you still have ids, so whether it's
>
> type 0, id=0
> type 1, id=1
> type 2, id=2
>
> or
>
> type 0, id=3
> type 1, id=7
> type 2, id=10
>
> the difference is in the second case, you have to look up the code
> corresponding to each type rather than assuming that the type's
> position and its code are the same.
>
> In processing, branching should occur at the Type level, so a function
> to process a child looks like
>
> ProcessChild(child, child_id, ...)
>
> In either case you have to match a child with its id that appears in the
> data.
>
> Anyway, since Julien and I are responsible for introducing this
> concept in the early stages of the project I'm interested to hear more
> from others. Note that this doesn't serve to resolve the
> Union-of-Nested-Types problem that has prevented the development of
> integration tests between Java and C++.
>
> >
> > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney 
> wrote:
> >
> > > hi Ben,
> > >
> > > Some applications use static type ids for various data types. Let's
> > > consider one possibility:
> > >
> > > BOOLEAN: 0
> > > INT32: 1
> > > DOUBLE: 2
> > > STRING (UTF8): 3
> > >
> > > If you were parsing JSON and constructing unions while parsing, you
> > > might encounter some types, but not all. So if we _don't_ have the
> > > option of having type ids in the metadata then we are left with some
> > > unsatisfactory options:
> > >
> > > A: Include all types in the resulting union, even if they are
> unobserved,
> > > or
> > > B: Assign type id dynamically to types when they are observed
> > >
> > > Option B is potentially bad because it does not parallelize across
> > > threads or nodes.
> > >
> > > So I do think the feature is useful. It does make the implementations
> > > of unions more complex, though, so it does not come without cost. But
> > > unions are already the most complex tool we have in our nested data
> > > toolbox, so why not allow them to be flexible in this regard?
> > >
> > > In any case I'm -0 on making changes, but would be interested in
> > > feedback of others if there is strong sentiment about deprecating the
> > > feature.
> > >
> > > - Wes
> > >
> > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman  >
> > > wrote:
> > > >
> > > > The Union.typeIds property is confusing and its utility is unclear.
> I'd
> > > > like to remove it (or at least document it better). Unless anyone
> knows a
> > > > real advantage for keeping it I plan to assemble a PR to drop it
> from the
> > > > format and the C++ implementation.
> > > >
> > > > ARROW-257 ( resolved by pull request
> > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > > optional
> > > > typeIds property (in the C++ implementation, this is
> > > > UnionType::type_codes). Prior to that pull request each element
> (int8) in
> > > > the type_ids (second) buffer of a union array was the index of a
> child
> > > > array. Thus a type_ids buffer beginning with 5 indicated that the
> union
> > > > array began with a value from child_data[5]. After that change to
> > > interpret
> > > > a type_id of 5 one must look through the typeIds property and the
> index
> > > at
> > > > which a 5 is found is the index of the corresponding child array.
> > > >
> > > > The change was made to allow unused child arrays to be dropped; for
> > > example
> > > > if a union type were predefined with 64 members then an array of
> nearly
> > > > identical type containing only int32 and utf8 values would only be
> > > required
> > > > to have two child arrays. Note: the union types are not exactly
> identical
> > > > even though they contain identical members; their typeIds properties
> will
> > > > differ.
> > > >
> > > > However unused child arrays can be replaced by null arrays (which are
> > > > almost equally lightweight as they require no heap allocation). I'm also
> > > > unaware of a use case for predefined type_ids; if they are application
> > > > specific then I think it's out of scope for arrow to maintain a
> > > > child_index <-> type_id mapping. It seems that the optimization has
> > > > questionable merit and does not warrant the added complexity.

[jira] [Created] (ARROW-5907) base64 support of bytes-like

2019-07-10 Thread Litchy (JIRA)
Litchy created ARROW-5907:
-

 Summary: base64 support of bytes-like
 Key: ARROW-5907
 URL: https://issues.apache.org/jira/browse/ARROW-5907
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 0.14.0
Reporter: Litchy
 Fix For: 0.14.0


Currently pyarrow data cannot be encoded by base64, because it is not 
bytes-like.

A possible scenario: we want to push data (like an ndarray) to Redis from 
Python and get it from another language, like Java, using Arrow arrays to 
interchange data between Python and Java.

Adding this feature would support in-queue and out-queue operations through 
systems like Redis.
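
A sketch of a possible workaround today, assuming the IPC stream format is an 
acceptable wire representation; the extra copy via to_pybytes() is what this 
issue would eliminate:

{code}
import base64
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])

# Serialize the batch to a pyarrow Buffer using the IPC stream format.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
buf = sink.getvalue()

# If the Buffer is rejected as "not bytes-like", copying it into a
# Python bytes object first sidesteps the problem, at the cost of a copy.
encoded = base64.b64encode(buf.to_pybytes())
{code}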



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-10 Thread Joris Van den Bossche
I personally prefer 0.14.1 over 0.15.0. I think that is clearer in
communication, as we are fixing regressions of the 0.14.0 release.

(but I haven't been involved much in releases, so certainly no strong
opinion)

Joris


On Wed, Jul 10, 2019 at 15:07, Wes McKinney  wrote:

> hi folks,
>
> Are there any opinions / strong feelings about the two options:
>
> * Prepare patch 0.14.1 release from a maintenance branch
> * Release 0.15.0 out of master
>
> Aside from the Parquet forward compatibility issues we're still
> discussing, and Eric's C# patch PR 4836, are there any other issues
> that need to be fixed before we go down one of these paths?
>
> Would anyone like to help with release management? I can do so if
> necessary, but I've already done a lot of release management :)
>
> - Wes
>
> On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney  wrote:
> >
> > Hi Eric -- of course!
> >
> > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt 
> > 
> wrote:
> >>
> >> Can we propose getting changes other than Python or Parquet related
> into this release?
> >>
> >> For example, I found a critical issue in the C# implementation that, if
> possible, I'd like to get included in a patch release.
> https://github.com/apache/arrow/pull/4836
> >>
> >> Eric
> >>
> >> -Original Message-
> >> From: Wes McKinney 
> >> Sent: Tuesday, July 9, 2019 7:59 AM
> >> To: dev@arrow.apache.org
> >> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package
> problems, Parquet forward compatibility problems
> >>
> >> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > > If the problems can be resolved quickly, I should think we could cut
> >> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
> >> > > from a maintenance branch or out of master -- any thoughts about
> >> > > this (cutting from master is definitely easier)?
> >> >
> >> > How about just releasing 0.15.0 from master?
> >> > It'll be simpler than creating a patch release.
> >> >
> >>
> >> I'd be fine with that, too.
> >>
> >> >
> >> > Thanks,
> >> > --
> >> > kou
> >> >
> >> > In  nmvwuy8wxxddcctobuuamy4ee...@mail.gmail.com>
> >> >   "[DISCUSS] Need for 0.14.1 release due to Python package problems,
> Parquet forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500,
> >> >   Wes McKinney  wrote:
> >> >
> >> > > hi folks,
> >> > >
> >> > > Perhaps unsurprisingly due to the expansion of our Python packages,
> >> > > a number of things are broken in 0.14.0 that we should fix sooner
> >> > > than the next major release. I'll try to send a complete list to
> >> > > this thread to give a status within a day or two. Other problems may
> >> > > arise in the next 48 hours as more people install the package.
> >> > >
> >> > > If the problems can be resolved quickly, I should think we could cut
> >> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
> >> > > from a maintenance branch or out of master -- any thoughts about
> >> > > this (cutting from master is definitely easier)?
> >> > >
> >> > > Would someone (who is not Kou) be able to assist with creating the
> RC?
> >> > >
> >> > > Thanks,
> >> > > Wes
>


Re: [DRAFT] Apache Arrow ASF Board Report July 2019

2019-07-10 Thread Jacques Nadeau
Looks good to me. Thanks for pulling together.

On Wed, Jul 10, 2019 at 2:49 PM Wes McKinney  wrote:

> any comments about this? The report is due
>
> On Sun, Jul 7, 2019 at 6:02 PM Wes McKinney  wrote:
> >
> > ## Description:
> >
> > Apache Arrow is a cross-language development platform for in-memory
> > data. It specifies a standardized language-independent columnar memory
> > format for flat and hierarchical data, organized for efficient
> > analytic operations on modern hardware. It also provides computational
> > libraries and zero-copy streaming messaging and interprocess
> > communication. Languages currently supported include C, C++, C#, Go,
> > Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
> >
> > ## Issues:
> > - There are no issues requiring board attention at this time
> >
> > ## Activity:
> > - The community is discussing a 1.0.0 release featuring
> >   forward-looking binary format stability guarantees. Given the
> >   nature of the project, this is obviously an important milestone
> >   for adoption and user support
> > - Since the last report, a new Buildbot-based CI system has been
> >   connected to apache/arrow to provide additional build capacity, with
> >   a bot system called "ursabot" to provide on demand builds, benchmark
> >   comparisons, and other tools to assist the developer community
> >
> > ## Health report:
> > - We have been having significant problems with CI build times and are
> >   discussing strategies to decouple our development process from the
> >   shared pool of ASF-managed cloud CI resources like Travis CI and
> >   Appveyor
> > - The community is healthy, though there were some concerns
> >   around the 0.14.0 release vote and we are discussing
> >   conventions around handling issues raised during release
> >   candidate vetting.
> >
> > ## PMC changes:
> >
> >  - Currently 26 PMC members.
> >  - No new PMC members added in the last 3 months
> >  - Last PMC addition was Andrew Grove on Sun Feb 03 2019
> >
> > ## Committer base changes:
> >
> >  - Currently 43 committers.
> >  - New committers:
> > - Francois Saint-Jacques was added as a committer on Wed Jun 12 2019
> > - Neville Dipale was added as a committer on Mon May 13 2019
> > - Praveen Kumar has also been invited to be a committer and accepted,
> >   but has not been added to the roster in whimsy yet
> >
> > ## Releases:
> >
> >  - 0.14.0 was released on Wed Jul 03 2019
> >
> > ## JIRA activity:
> >
> >  - 735 JIRA tickets created in the last 3 months
> >  - 690 JIRA tickets closed/resolved in the last 3 months
>


[jira] [Created] (ARROW-5906) [CI] Set -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF in builds running in Travis CI, maybe all docker-compose builds by default

2019-07-10 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5906:
---

 Summary: [CI] Set -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF in builds 
running in Travis CI, maybe all docker-compose builds by default
 Key: ARROW-5906
 URL: https://issues.apache.org/jira/browse/ARROW-5906
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This setting should be disabled in general unless we are trying to debug 
something. It makes logs much more verbose



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5905) [Python] support conversion to decimal type from floats?

2019-07-10 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5905:


 Summary: [Python] support conversion to decimal type from floats?
 Key: ARROW-5905
 URL: https://issues.apache.org/jira/browse/ARROW-5905
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


We currently allow constructing a decimal array from decimal.Decimal objects or 
from ints:

{code}
In [14]: pa.array([1, 0], type=pa.decimal128(2))
  
Out[14]: 

[
  1,
  0
]

In [31]: pa.array([decimal.Decimal('0.1'), decimal.Decimal('0.2')], 
pa.decimal128(2, 1))
  
Out[31]: 

[
  0.1,
  0.2
]
{code}

but not from floats (or strings):

{code}
In [18]: pa.array([0.1, 0.2], pa.decimal128(2)) 
  
...
ArrowTypeError: int or Decimal object expected, got float
{code}

Is this something we would like to support?

There are for sure precision issues you run into, but if the decimal type is 
fully specified, it seems clear what the user wants. In general, since decimal 
objects in pandas are not that easy to work with, many people might have plain 
float columns that they want to convert to decimal. 
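
In the meantime, a sketch of a user-side workaround is to convert through 
decimal.Decimal explicitly (how to round is left to the caller):

{code}
import decimal

import pyarrow as pa

floats = [0.1, 0.2]

# Going through str() yields Decimal('0.1') rather than the full binary
# expansion that decimal.Decimal(0.1) would produce.
arr = pa.array([decimal.Decimal(str(x)) for x in floats], pa.decimal128(2, 1))
{code}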



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DRAFT] Apache Arrow ASF Board Report July 2019

2019-07-10 Thread Wes McKinney
any comments about this? The report is due

On Sun, Jul 7, 2019 at 6:02 PM Wes McKinney  wrote:
>
> ## Description:
>
> Apache Arrow is a cross-language development platform for in-memory
> data. It specifies a standardized language-independent columnar memory
> format for flat and hierarchical data, organized for efficient
> analytic operations on modern hardware. It also provides computational
> libraries and zero-copy streaming messaging and interprocess
> communication. Languages currently supported include C, C++, C#, Go,
> Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
>
> ## Issues:
> - There are no issues requiring board attention at this time
>
> ## Activity:
> - The community is discussing a 1.0.0 release featuring
>   forward-looking binary format stability guarantees. Given the
>   nature of the project, this is obviously an important milestone
>   for adoption and user support
> - Since the last report, a new Buildbot-based CI system has been
>   connected to apache/arrow to provide additional build capacity, with
>   a bot system called "ursabot" to provide on demand builds, benchmark
>   comparisons, and other tools to assist the developer community
>
> ## Health report:
> - We have been having significant problems with CI build times and are
>   discussing strategies to decouple our development process from the
>   shared pool of ASF-managed cloud CI resources like Travis CI and
>   Appveyor
> - The community is healthy, though there were some concerns
>   around the 0.14.0 release vote and we are discussing
>   conventions around handling issues raised during release
>   candidate vetting.
>
> ## PMC changes:
>
>  - Currently 26 PMC members.
>  - No new PMC members added in the last 3 months
>  - Last PMC addition was Andrew Grove on Sun Feb 03 2019
>
> ## Committer base changes:
>
>  - Currently 43 committers.
>  - New committers:
> - Francois Saint-Jacques was added as a committer on Wed Jun 12 2019
> - Neville Dipale was added as a committer on Mon May 13 2019
> - Praveen Kumar has also been invited to be a committer and accepted,
>   but has not been added to the roster in whimsy yet
>
> ## Releases:
>
>  - 0.14.0 was released on Wed Jul 03 2019
>
> ## JIRA activity:
>
>  - 735 JIRA tickets created in the last 3 months
>  - 690 JIRA tickets closed/resolved in the last 3 months


Re: [Discuss] Are Union.typeIds worth keeping?

2019-07-10 Thread Wes McKinney
On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman  wrote:
>
> In this scenario option A (include child arrays for each child type, even
> if that type is not observed) seems like the clearly correct choice to me.
> It yields a more intuitive layout for the union array and incurs no runtime
> overhead (since the absent children are empty/null arrays).

I am not sure this is right. The child arrays still occupy memory in
the Sparse Union case (where all child arrays have the same length).
In order to satisfy the requirements of the IPC protocol, the child
arrays need to be of the same type as the types in the union. In the
Dense Union case, the not-present children will have length 0.

>
> > why not allow them to be flexible in this regard?
>
> I would say that if code doesn't add anything except cognitive overhead
> then it's worthwhile to remove it.

The cognitive overhead comes for the Arrow library implementer --
users of the libraries aren't required to deal with this detail
necessarily. The type ids are optional, after all. Even if it is
removed, you still have ids, so whether it's

type 0, id=0
type 1, id=1
type 2, id=2

or

type 0, id=3
type 1, id=7
type 2, id=10

the difference is in the second case, you have to look up the code
corresponding to each type rather than assuming that the type's
position and its code are the same.

In processing, branching should occur at the Type level, so a function
to process a child looks like

ProcessChild(child, child_id, ...)

In either case you have to match a child with its id that appears in the data.

Anyway, since Julien and I are responsible for introducing this
concept in the early stages of the project I'm interested to hear more
from others. Note that this doesn't serve to resolve the
Union-of-Nested-Types problem that has prevented the development of
integration tests between Java and C++.

>
> On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney  wrote:
>
> > hi Ben,
> >
> > Some applications use static type ids for various data types. Let's
> > consider one possibility:
> >
> > BOOLEAN: 0
> > INT32: 1
> > DOUBLE: 2
> > STRING (UTF8): 3
> >
> > If you were parsing JSON and constructing unions while parsing, you
> > might encounter some types, but not all. So if we _don't_ have the
> > option of having type ids in the metadata then we are left with some
> > unsatisfactory options:
> >
> > A: Include all types in the resulting union, even if they are unobserved,
> > or
> > B: Assign type id dynamically to types when they are observed
> >
> > Option B is potentially bad because it does not parallelize across
> > threads or nodes.
> >
> > So I do think the feature is useful. It does make the implementations
> > of unions more complex, though, so it does not come without cost. But
> > unions are already the most complex tool we have in our nested data
> > toolbox, so why not allow them to be flexible in this regard?
> >
> > In any case I'm -0 on making changes, but would be interested in
> > feedback of others if there is strong sentiment about deprecating the
> > feature.
> >
> > - Wes
> >
> > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman 
> > wrote:
> > >
> > > The Union.typeIds property is confusing and its utility is unclear. I'd
> > > like to remove it (or at least document it better). Unless anyone knows a
> > > real advantage for keeping it I plan to assemble a PR to drop it from the
> > > format and the C++ implementation.
> > >
> > > ARROW-257 ( resolved by pull request
> > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > optional
> > > typeIds property (in the C++ implementation, this is
> > > UnionType::type_codes). Prior to that pull request each element (int8) in
> > > the type_ids (second) buffer of a union array was the index of a child
> > > array. Thus a type_ids buffer beginning with 5 indicated that the union
> > > array began with a value from child_data[5]. After that change to
> > interpret
> > > a type_id of 5 one must look through the typeIds property and the index
> > at
> > > which a 5 is found is the index of the corresponding child array.
> > >
> > > The change was made to allow unused child arrays to be dropped; for
> > example
> > > if a union type were predefined with 64 members then an array of nearly
> > > identical type containing only int32 and utf8 values would only be
> > required
> > > to have two child arrays. Note: the union types are not exactly identical
> > > even though they contain identical members; their typeIds properties will
> > > differ.
> > >
> > > However unused child arrays can be replaced by null arrays (which are
> > > almost equally lightweight as they require no heap allocation). I'm also
> > > unaware of a use case for predefined type_ids; if they are application
> > > specific then I think it's out of scope for arrow to maintain a
> > child_index
> > > <-> type_id mapping. It seems that the optimization has questionable
> > merit
> > > and does not warrant the added complexity.
> >


Re: [Discuss] Are Union.typeIds worth keeping?

2019-07-10 Thread Ben Kietzman
In this scenario option A (include child arrays for each child type, even
if that type is not observed) seems like the clearly correct choice to me.
It yields a more intuitive layout for the union array and incurs no runtime
overhead (since the absent children are empty/null arrays).

> why not allow them to be flexible in this regard?

I would say that if code doesn't add anything except cognitive overhead
then it's worthwhile to remove it.

On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney  wrote:

> hi Ben,
>
> Some applications use static type ids for various data types. Let's
> consider one possibility:
>
> BOOLEAN: 0
> INT32: 1
> DOUBLE: 2
> STRING (UTF8): 3
>
> If you were parsing JSON and constructing unions while parsing, you
> might encounter some types, but not all. So if we _don't_ have the
> option of having type ids in the metadata then we are left with some
> unsatisfactory options:
>
> A: Include all types in the resulting union, even if they are unobserved,
> or
> B: Assign type id dynamically to types when they are observed
>
> Option B is potentially bad because it does not parallelize across
> threads or nodes.
>
> So I do think the feature is useful. It does make the implementations
> of unions more complex, though, so it does not come without cost. But
> unions are already the most complex tool we have in our nested data
> toolbox, so why not allow them to be flexible in this regard?
>
> In any case I'm -0 on making changes, but would be interested in
> feedback of others if there is strong sentiment about deprecating the
> feature.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman 
> wrote:
> >
> > The Union.typeIds property is confusing and its utility is unclear. I'd
> > like to remove it (or at least document it better). Unless anyone knows a
> > real advantage for keeping it I plan to assemble a PR to drop it from the
> > format and the C++ implementation.
> >
> > ARROW-257 ( resolved by pull request
> > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> optional
> > typeIds property (in the C++ implementation, this is
> > UnionType::type_codes). Prior to that pull request each element (int8) in
> > the type_ids (second) buffer of a union array was the index of a child
> > array. Thus a type_ids buffer beginning with 5 indicated that the union
> > array began with a value from child_data[5]. After that change to
> interpret
> > a type_id of 5 one must look through the typeIds property and the index
> at
> > which a 5 is found is the index of the corresponding child array.
> >
> > The change was made to allow unused child arrays to be dropped; for
> example
> > if a union type were predefined with 64 members then an array of nearly
> > identical type containing only int32 and utf8 values would only be
> required
> > to have two child arrays. Note: the union types are not exactly identical
> > even though they contain identical members; their typeIds properties will
> > differ.
> >
> > However unused child arrays can be replaced by null arrays (which are
> > almost equally lightweight as they require no heap allocation). I'm also
> > unaware of a use case for predefined type_ids; if they are application
> > specific then I think it's out of scope for arrow to maintain a
> child_index
> > <-> type_id mapping. It seems that the optimization has questionable
> merit
> > and does not warrant the added complexity.
>


[jira] [Created] (ARROW-5904) [Java] [Plasma] Fix compilation of Plasma Java client

2019-07-10 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-5904:
-

 Summary: [Java] [Plasma] Fix compilation of Plasma Java client
 Key: ARROW-5904
 URL: https://issues.apache.org/jira/browse/ARROW-5904
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


This is broken since the introduction of user-defined Status messages:
{code:java}
external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:
 In function '_jobject* 
Java_org_apache_arrow_plasma_PlasmaClientJNI_create(JNIEnv*, jclass, jlong, 
jbyteArray, jint, jbyteArray)':
external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:114:9:
 error: 'class arrow::Status' has no member named 'IsPlasmaObjectExists'
   if (s.IsPlasmaObjectExists()) {
 ^
external/plasma/cpp/src/plasma/lib/java/org_apache_arrow_plasma_PlasmaClientJNI.cc:120:9:
 error: 'class arrow::Status' has no member named 'IsPlasmaStoreFull'
   if (s.IsPlasmaStoreFull()) {
 ^
{code}
[~guoyuhong85] Can you add this codepath to the test so we can catch this kind 
of breakage more quickly in the future?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow biweekly sync call today at 12pm US/Eastern / 16:00 UTC

2019-07-10 Thread Neal Richardson
Attendees:
Hatem Helal
Uwe Korn
Micah Kornfield
Wes McKinney
Prudhvi Porandla
Neal Richardson
Krisztián Szűcs

Topics discussed:

Issues with 0.14:
* Python manylinux2010 wheels broken, runtime dependency on lz4: fixed in
master, bad wheels removed
* Python macOS wheels have runtime dependency on homebrew openssl: also
fixed in master
* Parquet (C++ library): a new way to annotate logical types was recently
added; for forward compatibility, both old and new metadata are set. This
affects interpretation of timestamps; Arrow handled UTC normalization but the
Parquet format didn't. Timestamps written in the new format and read by an old
reader come in as integers. See https://issues.apache.org/jira/browse/ARROW-5878,
https://issues.apache.org/jira/browse/ARROW-5888,
https://issues.apache.org/jira/browse/ARROW-5889
* C# issue: https://issues.apache.org/jira/browse/ARROW-5887
* Question of whether to release a bugfix-only 0.14.1 release or a 0.15.0
release from master: raised on the mailing list today

Removing arrow::Column from C++
Flatbuffers alignment: discuss addressing in 1.0.0
Getting lots of Java PRs: need to get more Java reviewer bandwidth to keep
up

On Wed, Jul 10, 2019 at 7:07 AM Wes McKinney  wrote:

> All are welcome at
>
> https://meet.google.com/vtm-teks-phx
>


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-10 Thread Wes McKinney
hi folks,

Are there any opinions / strong feelings about the two options:

* Prepare patch 0.14.1 release from a maintenance branch
* Release 0.15.0 out of master

Aside from the Parquet forward compatibility issues we're still
discussing, and Eric's C# patch PR 4836, are there any other issues
that need to be fixed before we go down one of these paths?

Would anyone like to help with release management? I can do so if
necessary, but I've already done a lot of release management :)

- Wes

On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney  wrote:
>
> Hi Eric -- of course!
>
> On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt 
>  wrote:
>>
>> Can we propose getting changes other than Python or Parquet related into 
>> this release?
>>
>> For example, I found a critical issue in the C# implementation that, if 
>> possible, I'd like to get included in a patch release.  
>> https://github.com/apache/arrow/pull/4836
>>
>> Eric
>>
>> -Original Message-
>> From: Wes McKinney 
>> Sent: Tuesday, July 9, 2019 7:59 AM
>> To: dev@arrow.apache.org
>> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package 
>> problems, Parquet forward compatibility problems
>>
>> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei  wrote:
>> >
>> > Hi,
>> >
>> > > If the problems can be resolved quickly, I should think we could cut
>> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
>> > > from a maintenance branch or out of master -- any thoughts about
>> > > this (cutting from master is definitely easier)?
>> >
>> > How about just releasing 0.15.0 from master?
>> > It'll be simpler than creating a patch release.
>> >
>>
>> I'd be fine with that, too.
>>
>> >
>> > Thanks,
>> > --
>> > kou
>> >
>> > In 
>> >   "[DISCUSS] Need for 0.14.1 release due to Python package problems, 
>> > Parquet forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500,
>> >   Wes McKinney  wrote:
>> >
>> > > hi folks,
>> > >
>> > > Perhaps unsurprisingly due to the expansion of our Python packages,
>> > > a number of things are broken in 0.14.0 that we should fix sooner
>> > > than the next major release. I'll try to send a complete list to
>> > > this thread to give a status within a day or two. Other problems may
>> > > arise in the next 48 hours as more people install the package.
>> > >
>> > > If the problems can be resolved quickly, I should think we could cut
>> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
>> > > from a maintenance branch or out of master -- any thoughts about
>> > > this (cutting from master is definitely easier)?
>> > >
>> > > Would someone (who is not Kou) be able to assist with creating the RC?
>> > >
>> > > Thanks,
>> > > Wes


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-10 Thread Wes McKinney
I did my best to remove the class from the GLib bindings -- there are
probably some conventions around API versions that I did not respect,
so I will need some help from others on GLib and Ruby.

MATLAB and R are also affected, but should be relatively simple to change.

I'll wait to hear more feedback from others before investing more of
my time in the project

- Wes

On Tue, Jul 9, 2019 at 8:18 PM Wes McKinney  wrote:
>
> Thanks for the feedback.
>
> I just posted a PR that removes the class in the C++ and Python
> libraries, hopefully this will help with the discussion. That I was
> able to do it in less than a day should be good evidence that the
> abstraction may be superfluous
>
> https://github.com/apache/arrow/pull/4841
>
> On Tue, Jul 9, 2019 at 4:26 PM Tim Swast  wrote:
> >
> > FWIW, I found the Column class to be confusing in Python. It felt redundant
> > / unneeded to actually create Tables.
> >
> > On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney  wrote:
> >
> > > On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou  wrote:
> > > >
> > > >
> > > > > On 08/07/2019 at 23:17, Wes McKinney wrote:
> > > > >
> > > > > I'm concerned about continuing to maintain the Column class as it's
> > > > > spilling complexity into computational libraries and bindings alike.
> > > > >
> > > > > The Python Column class for example mostly forwards method calls to
> > > > > the underlying ChunkedArray
> > > > >
> > > > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > > >
> > > > > If the developer wants to construct a Table or insert a new "column",
> > > > > Column objects must generally be constructed, leading to boilerplate
> > > > > without clear benefit.
> > > >
> > > > We could simply add the desired ChunkedArray-based convenience methods
> > > > without removing the Column-based APIs.
> > > >
> > > > I don't know if it's really cumbersome to maintain the Column class.
> > > > It's generally a very stable part of the API, and the Column class is
> > > > just a thin wrapper over a ChunkedArray + a field.
> > > >
> > >
> > > The indirection that it produces in public APIs I have found to be a
> > > nuisance, though (for example, doing things with the result of
> > > table[i] in Python).
> > >
> > > I'm about halfway through a patch to remove it, I'll let people review
> > > the work to assess the before-and-after.
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >


Re: [Discuss] Are Union.typeIds worth keeping?

2019-07-10 Thread Wes McKinney
hi Ben,

Some applications use static type ids for various data types. Let's
consider one possibility:

BOOLEAN: 0
INT32: 1
DOUBLE: 2
STRING (UTF8): 3

If you were parsing JSON and constructing unions while parsing, you
might encounter some types, but not all. So if we _don't_ have the
option of having type ids in the metadata then we are left with some
unsatisfactory options:

A: Include all types in the resulting union, even if they are unobserved, or
B: Assign type id dynamically to types when they are observed

Option B is potentially bad because it does not parallelize across
threads or nodes.
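
A plain-Python sketch of why option B breaks down (names are illustrative,
not an Arrow API):

{code}
# Option A / static ids: every worker uses the same fixed mapping, so
# union arrays built independently are directly mergeable.
STATIC_IDS = {"bool": 0, "int32": 1, "double": 2, "utf8": 3}

# Option B / dynamic ids: each worker assigns codes in order of first
# observation, so two workers scanning different rows can give the same
# type different codes -- their outputs can't be merged without a
# coordination or remapping pass.
def make_dynamic_encoder():
    seen = {}
    def encode(value_type):
        return seen.setdefault(value_type, len(seen))
    return encode

w1, w2 = make_dynamic_encoder(), make_dynamic_encoder()
w1("utf8"); w1("int32")   # utf8 -> 0, int32 -> 1
w2("int32"); w2("utf8")   # int32 -> 0, utf8 -> 1, conflicting with w1
{code}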

So I do think the feature is useful. It does make the implementations
of unions more complex, though, so it does not come without cost. But
unions are already the most complex tool we have in our nested data
toolbox, so why not allow them to be flexible in this regard?

In any case I'm -0 on making changes, but would be interested in
feedback of others if there is strong sentiment about deprecating the
feature.

- Wes

On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman  wrote:
>
> The Union.typeIds property is confusing and its utility is unclear. I'd
> like to remove it (or at least document it better). Unless anyone knows a
> real advantage for keeping it I plan to assemble a PR to drop it from the
> format and the C++ implementation.
>
> ARROW-257 ( resolved by pull request
> https://github.com/apache/arrow/pull/143 ) extended Unions with an optional
> typeIds property (in the C++ implementation, this is
> UnionType::type_codes). Prior to that pull request each element (int8) in
> the type_ids (second) buffer of a union array was the index of a child
> array. Thus a type_ids buffer beginning with 5 indicated that the union
> array began with a value from child_data[5]. After that change to interpret
> a type_id of 5 one must look through the typeIds property and the index at
> which a 5 is found is the index of the corresponding child array.
>
> The change was made to allow unused child arrays to be dropped; for example
> if a union type were predefined with 64 members then an array of nearly
> identical type containing only int32 and utf8 values would only be required
> to have two child arrays. Note: the union types are not exactly identical
> even though they contain identical members; their typeIds properties will
> differ.
>
> However unused child arrays can be replaced by null arrays (which are
> almost equally lightweight as they require no heap allocation). I'm also
> unaware of a use case for predefined type_ids; if they are application
> specific then I think it's out of scope for arrow to maintain a child_index
> <-> type_id mapping. It seems that the optimization has questionable merit
> and does not warrant the added complexity.


[Discuss] Are Union.typeIds worth keeping?

2019-07-10 Thread Ben Kietzman
The Union.typeIds property is confusing and its utility is unclear. I'd
like to remove it (or at least document it better). Unless anyone knows a
real advantage for keeping it I plan to assemble a PR to drop it from the
format and the C++ implementation.

ARROW-257 ( resolved by pull request
https://github.com/apache/arrow/pull/143 ) extended Unions with an optional
typeIds property (in the C++ implementation, this is
UnionType::type_codes). Prior to that pull request each element (int8) in
the type_ids (second) buffer of a union array was the index of a child
array. Thus a type_ids buffer beginning with 5 indicated that the union
array began with a value from child_data[5]. After that change to interpret
a type_id of 5 one must look through the typeIds property and the index at
which a 5 is found is the index of the corresponding child array.
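
In other words (a plain-Python sketch of the lookup, not an Arrow API):

{code}
type_ids_buffer = [5, 5, 3]   # per-value type codes in a union array

# Before ARROW-257: the code in the buffer *is* the child index.
child_index_old = type_ids_buffer[0]                 # -> child_data[5]

# After ARROW-257: the code is looked up in the union type's typeIds
# (UnionType::type_codes in C++) to find the child index.
union_type_ids = [3, 5]
child_index_new = union_type_ids.index(type_ids_buffer[0])   # -> child_data[1]
{code}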

The change was made to allow unused child arrays to be dropped; for example
if a union type were predefined with 64 members then an array of nearly
identical type containing only int32 and utf8 values would only be required
to have two child arrays. Note: the union types are not exactly identical
even though they contain identical members; their typeIds properties will
differ.

However unused child arrays can be replaced by null arrays (which are
almost equally lightweight as they require no heap allocation). I'm also
unaware of a use case for predefined type_ids; if they are application
specific then I think it's out of scope for arrow to maintain a child_index
<-> type_id mapping. It seems that the optimization has questionable merit
and does not warrant the added complexity.


[jira] [Created] (ARROW-5903) [Java] Set methods in DecimalVector are slow

2019-07-10 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5903:
-

 Summary: [Java] Set methods in DecimalVector are slow
 Key: ARROW-5903
 URL: https://issues.apache.org/jira/browse/ARROW-5903
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


The methods perform a bounds check on each byte of the input buffer and each 
byte of the output buffer. Avoiding this repetitive work improves performance 
by a factor of 2x to 3x.
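
The shape of the fix, sketched in runnable Python (the actual change is in 
Java's DecimalVector; the check() helper below is a hypothetical stand-in for 
the buffer bounds check):

{code}
WIDTH = 16  # decimal128 values are 16 bytes wide

def check(buf, start, length):
    if start < 0 or start + length > len(buf):
        raise IndexError("out of bounds")

def set_slow(dst, index, src):
    # One bounds check per byte read and per byte written.
    for i in range(WIDTH):
        check(src, i, 1)
        check(dst, index * WIDTH + i, 1)
        dst[index * WIDTH + i] = src[i]

def set_fast(dst, index, src):
    # Check each range once up front, then copy with no per-byte checks.
    check(src, 0, WIDTH)
    check(dst, index * WIDTH, WIDTH)
    dst[index * WIDTH:(index + 1) * WIDTH] = src[:WIDTH]

dst = bytearray(WIDTH * 4)          # room for four decimal values
set_fast(dst, 2, bytes(range(16)))  # write value #2
{code}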



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Spark and Arrow Flight

2019-07-10 Thread Wes McKinney
Of course, it might make just as much sense in Apache Spark. Probably
worth bringing up with that community, too

On Wed, Jul 10, 2019 at 12:37 PM Wes McKinney  wrote:
>
> hi Ryan -- I was thinking that this might be built separately from the
> main Java project. We don't have a model in the codebase yet for
> libraries that depend on the core libraries (this could be in an apps/
> directory at the top level, so apps/spark-flight-source or something).
> So the development procedure would be to build and install the Arrow
> libraries first and then build the Spark-Flight source as a follow up.
>
> I think there would be a lot of benefit to maintaining common
> development infrastructure -- for example, we could set up
> docker-compose tasks to spin up nodes to simulate a distributed system
> for testing and benchmarking purposes, and utilize common CI systems.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 12:28 PM Ryan Murray  wrote:
> >
> > Hey Wes,
> >
> > Would be happy to! Jacques and I had originally thought to try and get it
> > into Spark but perhaps Arrow might be a better home. I think the only issue
> > is whether we want to bring Spark jars and their dependencies into Arrow.
> > One challenge I have had so far with the connector is managing the
> > transitive arrow dependencies from Spark, the connector only works on
> > relatively recent versions of Spark and potentially can create circular
> > arrow dependencies. I think this issue will be better once 1.0.0 is done
> > and we can rely on a stable format/api.
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney  wrote:
> >
> > > Hi Ryan, have you thought about developing this inside Apache Arrow?
> > >
> > > On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler  wrote:
> > >
> > > > Great, thanks Ryan! I'll take a look
> > > >
> > > > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray  wrote:
> > > >
> > > > > Hi Bryan,
> > > > >
> > > > > I have an implementation of option #3 nearly ready for a PR. I will
> > > > mention
> > > > > you when I publish it.
> > > > >
> > > > > The working prototype for the Spark connector is here:
> > > > > https://github.com/rymurr/flight-spark-source. It technically works
> > > (and
> > > > > is
> > > > > very fast!) however the implementation is pretty dodgy and needs to be
> > > > > cleaned up before ready for prime time. I plan to have it ready to go
> > > for
> > > > > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please 
> > > > > shout
> > > > if
> > > > > you have any comments or are interested in contributing!
> > > > >
> > > > > Best,
> > > > > Ryan
> > > > >
> > > > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:
> > > > >
> > > > > > I'm in favor of option #3 also, but not sure what the best thing to
> > > do
> > > > > with
> > > > > > the existing FlightInfo response is. I'm definitely interested in
> > > > > > connecting Spark with Flight, can you share more details of your 
> > > > > > work
> > > > or
> > > > > is
> > > > > > it planned to be open sourced?
> > > > > >
> > > > > > Thanks,
> > > > > > Bryan
> > > > > >
> > > > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou 
> > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Either #3 or #4 for me.  If #3, the default GetSchema
> > > implementation
> > > > > can
> > > > > > > rely on calling GetFlightInfo.
> > > > > > >
> > > > > > >
> > > > > > > On 01/07/2019 at 22:50, David Li wrote:
> > > > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > > > >
> > > > > > > > We've been thinking about a similar issue, where sometimes we
> > > want
> > > > > > > > just the schema, but the service can't necessarily return the
> > > > schema
> > > > > > > > without fetching data - right now we return a sentinel value in
> > > > > > > > GetFlightInfo, but a separate RPC would let us explicitly
> > > indicate
> > > > an
> > > > > > > > error.
> > > > > > > >
> > > > > > > > I might be missing something though - what happens between step 
> > > > > > > > 1
> > > > and
> > > > > > > > 2 that makes the endpoints available? Would it make sense to use
> > > > > > > > DoAction to cause the backend to "prepare" the endpoints, and
> > > have
> > > > > the
> > > > > > > > result of that be an encoded schema? So then the flow would be
> > > > > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > David
> > > > > > > >
> > > > > > > > On 7/1/19, Wes McKinney  wrote:
> > > > > > > >> My inclination is either #2 or #3. #4 is an option of course,
> > > but
> > > > I
> > > > > > > >> like the more structured solution of explicitly requesting the
> > > > > schema
> > > > > > > >> given a descriptor.
> > > > > > > >>
> > > > > > > >> In both cases, it's possible that schemas are sent twice, e.g.
> > > if
> > > > > you
> > > > > > > >> call GetSchema and then later call GetFlightInfo and so you
> > > > receive
> > > > > > > >> the schema again. The schema is optional, so if it became a
> > > > > > > >> performance problem then a particular server might return the
> > > > > > > >> schema as null from GetFlightInfo.

Re: Spark and Arrow Flight

2019-07-10 Thread Wes McKinney
hi Ryan -- I was thinking that this might be built separately from the
main Java project. We don't have a model in the codebase yet for
libraries that depend on the core libraries (this could be in an apps/
directory at the top level, so apps/spark-flight-source or something).
So the development procedure would be to build and install the Arrow
libraries first and then build the Spark-Flight source as a follow up.

I think there would be a lot of benefit to maintaining common
development infrastructure -- for example, we could set up
docker-compose tasks to spin up nodes to simulate a distributed system
for testing and benchmarking purposes, and utilize common CI systems.

- Wes

On Wed, Jul 10, 2019 at 12:28 PM Ryan Murray  wrote:
>
> Hey Wes,
>
> Would be happy to! Jacques and I had originally thought to try and get it
> into Spark but perhaps Arrow might be a better home. I think the only issue
> is whether we want to bring Spark jars and their dependencies into Arrow.
> One challenge I have had so far with the connector is managing the
> transitive arrow dependencies from Spark, the connector only works on
> relatively recent versions of Spark and potentially can create circular
> arrow dependencies. I think this issue will be better once 1.0.0 is done
> and we can rely on a stable format/api.
>
> Best,
> Ryan
>
> On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney  wrote:
>
> > Hi Ryan, have you thought about developing this inside Apache Arrow?
> >
> > On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler  wrote:
> >
> > > Great, thanks Ryan! I'll take a look
> > >
> > > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray  wrote:
> > >
> > > > Hi Bryan,
> > > >
> > > > I have an implementation of option #3 nearly ready for a PR. I will
> > > mention
> > > > you when I publish it.
> > > >
> > > > The working prototype for the Spark connector is here:
> > > > https://github.com/rymurr/flight-spark-source. It technically works
> > (and
> > > > is
> > > > very fast!) however the implementation is pretty dodgy and needs to be
> > > > cleaned up before ready for prime time. I plan to have it ready to go
> > for
> > > > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please shout
> > > if
> > > > you have any comments or are interested in contributing!
> > > >
> > > > Best,
> > > > Ryan
> > > >
> > > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:
> > > >
> > > > > I'm in favor of option #3 also, but not sure what the best thing to
> > do
> > > > with
> > > > > the existing FlightInfo response is. I'm definitely interested in
> > > > > connecting Spark with Flight, can you share more details of your work
> > > or
> > > > is
> > > > > it planned to be open sourced?
> > > > >
> > > > > Thanks,
> > > > > Bryan
> > > > >
> > > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou 
> > > > wrote:
> > > > >
> > > > > >
> > > > > > Either #3 or #4 for me.  If #3, the default GetSchema
> > implementation
> > > > can
> > > > > > rely on calling GetFlightInfo.
> > > > > >
> > > > > >
> > > > > > On 01/07/2019 at 22:50, David Li wrote:
> > > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > > >
> > > > > > > We've been thinking about a similar issue, where sometimes we
> > want
> > > > > > > just the schema, but the service can't necessarily return the
> > > schema
> > > > > > > without fetching data - right now we return a sentinel value in
> > > > > > > GetFlightInfo, but a separate RPC would let us explicitly
> > indicate
> > > an
> > > > > > > error.
> > > > > > >
> > > > > > > I might be missing something though - what happens between step 1
> > > and
> > > > > > > 2 that makes the endpoints available? Would it make sense to use
> > > > > > > DoAction to cause the backend to "prepare" the endpoints, and
> > have
> > > > the
> > > > > > > result of that be an encoded schema? So then the flow would be
> > > > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > > > >
> > > > > > > Best,
> > > > > > > David
> > > > > > >
> > > > > > > On 7/1/19, Wes McKinney  wrote:
> > > > > > >> My inclination is either #2 or #3. #4 is an option of course,
> > but
> > > I
> > > > > > >> like the more structured solution of explicitly requesting the
> > > > schema
> > > > > > >> given a descriptor.
> > > > > > >>
> > > > > > >> In both cases, it's possible that schemas are sent twice, e.g.
> > if
> > > > you
> > > > > > >> call GetSchema and then later call GetFlightInfo and so you
> > > receive
> > > > > > >> the schema again. The schema is optional, so if it became a
> > > > > > >> performance problem then a particular server might return the
> > > schema
> > > > > > >> as null from GetFlightInfo.
> > > > > > >>
> > > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > > request
> > > > > > >> that returns _both_ the schema and the query plan.
> > > > > > >>
> > > > > > >> Thoughts from others?
> > > > > > >>
> > > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <
> > > 

Re: Spark and Arrow Flight

2019-07-10 Thread Ryan Murray
Hey Wes,

Would be happy to! Jacques and I had originally thought to try and get it
into Spark but perhaps Arrow might be a better home. I think the only issue
is whether we want to bring Spark jars and their dependencies into Arrow.
One challenge I have had so far with the connector is managing the
transitive arrow dependencies from Spark, the connector only works on
relatively recent versions of Spark and potentially can create circular
arrow dependencies. I think this issue will be better once 1.0.0 is done
and we can rely on a stable format/api.

Best,
Ryan

On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney  wrote:

> Hi Ryan, have you thought about developing this inside Apache Arrow?
>
> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler  wrote:
>
> > Great, thanks Ryan! I'll take a look
> >
> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray  wrote:
> >
> > > Hi Bryan,
> > >
> > > I have an implementation of option #3 nearly ready for a PR. I will
> > mention
> > > you when I publish it.
> > >
> > > The working prototype for the Spark connector is here:
> > > https://github.com/rymurr/flight-spark-source. It technically works
> (and
> > > is
> > > very fast!) however the implementation is pretty dodgy and needs to be
> > > cleaned up before ready for prime time. I plan to have it ready to go
> for
> > > the Arrow 1.0.0 release as an apache 2.0 licensed project. Please shout
> > if
> > > you have any comments or are interested in contributing!
> > >
> > > Best,
> > > Ryan
> > >
> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler  wrote:
> > >
> > > > I'm in favor of option #3 also, but not sure what the best thing to
> do
> > > with
> > > > the existing FlightInfo response is. I'm definitely interested in
> > > > connecting Spark with Flight, can you share more details of your work
> > or
> > > is
> > > > it planned to be open sourced?
> > > >
> > > > Thanks,
> > > > Bryan
> > > >
> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > > >
> > > > > Either #3 or #4 for me.  If #3, the default GetSchema
> implementation
> > > can
> > > > > rely on calling GetFlightInfo.
> > > > >
> > > > >
> > > > > On 01/07/2019 at 22:50, David Li wrote:
> > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > >
> > > > > > We've been thinking about a similar issue, where sometimes we
> want
> > > > > > just the schema, but the service can't necessarily return the
> > schema
> > > > > > without fetching data - right now we return a sentinel value in
> > > > > > GetFlightInfo, but a separate RPC would let us explicitly
> indicate
> > an
> > > > > > error.
> > > > > >
> > > > > > I might be missing something though - what happens between step 1
> > and
> > > > > > 2 that makes the endpoints available? Would it make sense to use
> > > > > > DoAction to cause the backend to "prepare" the endpoints, and
> have
> > > the
> > > > > > result of that be an encoded schema? So then the flow would be
> > > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > On 7/1/19, Wes McKinney  wrote:
> > > > > >> My inclination is either #2 or #3. #4 is an option of course,
> but
> > I
> > > > > >> like the more structured solution of explicitly requesting the
> > > schema
> > > > > >> given a descriptor.
> > > > > >>
> > > > > >> In both cases, it's possible that schemas are sent twice, e.g.
> if
> > > you
> > > > > >> call GetSchema and then later call GetFlightInfo and so you
> > receive
> > > > > >> the schema again. The schema is optional, so if it became a
> > > > > >> performance problem then a particular server might return the
> > schema
> > > > > >> as null from GetFlightInfo.
> > > > > >>
> > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > request
> > > > > >> that returns _both_ the schema and the query plan.
> > > > > >>
> > > > > >> Thoughts from others?
> > > > > >>
> > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <
> > jacq...@apache.org>
> > > > > wrote:
> > > > > >>>
> > > > > >>> My initial inclination is towards #3 but I'd be curious what
> > others
> > > > > >>> think.
> > > > > >>> In the case of #3, I wonder if it makes sense to then pull the
> > > Schema
> > > > > off
> > > > > >>> the GetFlightInfo response...
> > > > > >>>
> > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <
> rym...@dremio.com>
> > > > > wrote:
> > > > > >>>
> > > > >  Hi All,
> > > > > 
> > > > >  I have been working on building an arrow flight source for
> > spark.
> > > > The
> > > > >  goal
> > > > >  here is for Spark to be able to use a group of arrow flight
> > > > endpoints
> > > > >  to
> > > > >  get a dataset pulled over to spark in parallel.
> > > > > 
> > > > >  I am unsure of the best model for the spark <-> flight
> > > conversation
> > > > > and
> > > > >  wanted to get your opinion on the best way to go.
> > > > > 
> > > > >  I am breaking up 

Re: [DISCUSS] Release cadence and release vote conventions

2019-07-10 Thread Wes McKinney
On Sun, Jul 7, 2019 at 7:40 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > in future releases we should
> > institute a minimum 24-hour "quiet period" after any community
> > feedback on a release candidate to allow issues to be examined
> > further.
>
> I agree with this. I'll do so when I do a release manager in
> the future.
>
> > To be able to release more often, two things have to happen:
> >
> > * More PMC members must engage with the release management role,
> > process, and tools
> > * Continued improvements to release tooling to make the process less
> > painful for the release manager. For example, it seems we may want to
> > find a different place than Bintray to host binary artifacts
> > temporarily during release votes
>
> My opinion is that we need to build a nightly release system.
>
> It uses dev/release/NN-*.sh to build .tar.gz and binary
> artifacts from the .tar.gz.
> It also uses dev/release/verify-release-candidate.* to
> verify the built .tar.gz and binary artifacts.
> It also uses dev/release/post-NN-*.sh to do post release
> tasks. (Some tasks, such as uploading a package to a packaging
> system, will be dry-run.)
>

I agree that having a turn-key release system that's capable of
producing nightly packages is the way to go. That way any problems
that would block a release will come up as they happen rather than
piling up until the very end like they are now.

> I needed 10 or more changes for dev/release/ to create
> 0.14.0 RC0. (Some of them are still in my local stashes. I
> don't have time to create pull requests for them yet,
> because I postponed some tasks of my main business. I'll
> create pull requests after I finish those postponed tasks.)
>

Thanks. I'll follow up on the 0.14.1/0.15.0 thread -- since we need to
release again soon because of problems with 0.14.0 please let us know
what patches will be needed to make another release.

> If we fix problems related to dev/release/ in our normal
> development process, the release process will be less painful.
>
> The biggest problem for 0.14.0 RC0 is java/pom.xml related:
>   https://github.com/apache/arrow/pull/4717
>
> It was difficult for me because I don't have Java
> knowledge. The release manager needs help from many developers
> because the release manager may not have knowledge of all
> supported languages. Apache Arrow supports over 10
> languages.
>
>
> As for the Bintray API limit problem, we'll be able to resolve it.
> I was added to https://bintray.com/apache/ members:
>
>   https://issues.apache.org/jira/browse/INFRA-18698
>
> I'll be able to use Bintray API without limitation in the
> future. Release managers should also request the same thing.
>

This is good, I will add myself. Other PMC members should also add themselves.

>
> Thanks,
> --
> kou
>
> In 
>   "[DISCUSS] Release cadence and release vote conventions" on Sat, 6 Jul 2019 
> 16:28:50 -0500,
>   Wes McKinney  wrote:
>
> > hi folks,
> >
> > As a reminder, particularly since we have many new community members
> > (some of whom have never been involved with an ASF project before),
> > releases are approved exclusively by the PMC and in general releases
> > cannot be vetoed. In spite of that, we strive to make releases that
> > have unanimous (either by explicit +1 or lazy consent) support of the
> > PMC. So it is better to have unanimous 5 +1 votes than 6 +1 votes with
> > a -1 dissenting vote.
> >
> > On the 0.14.0 vote, as with previous release votes, some issues with
> > the release were raised by members of the community, whether build or
> > test-related problems or other failures. Technically speaking, such
> > issues have no _direct_ bearing on whether a release vote passes, only
> > on whether PMC members vote +1, 0, or -1. A PMC member is allowed to
> > change their vote based on new information -- for example, if I voted
> > +1 on a release and then someone reported a serious licensing issue,
> > then I would revise my vote to -1.
> >
> > On the RC0 vote thread, Jacques wrote [1]
> >
> > "A release vote should last until we arrive at consensus. When an
> > issue is potentially identified, those that have voted should be given
> > ample time to change their vote and others that may have been lazy
> > consenters should be given time to chime in. There is no maximum
> > amount of time a vote can be open. Allowing at least 24 hours after an
> > objection is raised is a pretty minimum expectation unless the
> > objector removes their objection.
> >
> > Note that Apache is more focused on consensus than timing (as opposed to
> > virtually all other organizations in the world)."
> >
> > I agree with this and my opinion is that in future releases we should
> > institute a minimum 24-hour "quiet period" after any community
> > feedback on a release candidate to allow issues to be examined
> > further. If someone finds a potential problem, and no negative votes
> > are cast or changed, then the vote can close.
> >
> > As a related matter, it seems clear to me that 

Arrow biweekly sync call today at 12pm US/Eastern / 16:00 UTC

2019-07-10 Thread Wes McKinney
All are welcome at

https://meet.google.com/vtm-teks-phx


Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"

2019-07-10 Thread Wes McKinney
On Wed, Jul 10, 2019 at 12:43 AM Micah Kornfield  wrote:
>
> Hi Eric,
> Short answer: I think your understanding matches what I was proposing.  
> Longer answer below.
>
>> So, for example, we release library v1.0.0 in a few months and then library 
>> v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java didn't 
>> make any breaking API changes from 1.0.0. But C# made 3 API breaking 
>> changes. This would be acceptable?
>
> Yes.  I think all language bindings are undergoing rapid enough iteration 
> that we are making at least a few small breaking API changes on each release 
> even though we try to avoid it.  I think it will be worth having further 
> discussions on the release process once at least a few languages get to a 
> more stable point.
>

I agree with this. I think we are a pretty long ways away from making
API stability _guarantees_ in any of the implementations, though we
certainly should try to be courteous about the changes we do make, to
allow for graceful transitions over a period of 1-2 releases if
possible.

> Thanks,
> Micah
>
> On Tue, Jul 9, 2019 at 2:26 PM Eric Erhardt  
> wrote:
>>
>> Just to be sure I fully understand the proposal:
>>
>> For the Library Version, we are going to increment the MAJOR version on 
>> every normal release, and increment the MINOR version if we need to release 
>> a patch/bug fix type of release.
>>
>> Since SemVer allows for API breaking changes on MAJOR versions, this 
>> basically means, each library (C++, Python, C#, Java, etc) _can_ introduce 
>> API breaking changes on every normal release (like we have been with the 
>> 0.x.0 releases).
>>
>> So, for example, we release library v1.0.0 in a few months and then library 
>> v2.0.0 a few months after that.  In v2.0.0, C++, Python, and Java didn't 
>> make any breaking API changes from 1.0.0. But C# made 3 API breaking 
>> changes. This would be acceptable?
>>
>> If my understanding above is correct, then I think this is a good plan. 
>> Initially I was concerned that the C# library wouldn't be free to make API 
>> breaking changes after the version becomes `1.0.0`. The C# library is still 
>> pretty inadequate, and I have a feeling there are a few things that will 
>> need to change about it in the future. But with the above plan, this concern 
>> won't be a problem.
>>
>> Eric
>>
>> -Original Message-
>> From: Micah Kornfield 
>> Sent: Monday, July 1, 2019 10:02 PM
>> To: Wes McKinney 
>> Cc: dev@arrow.apache.org
>> Subject: Re: [Discuss] Compatibility Guarantees and Versioning Post "1.0.0"
>>
>> Hi Wes,
>> Thanks for your response.  In regards to the protocol negotiation your 
>> description of feature reporting (snipped below) is along the lines of what 
>> I was thinking.  It might not be necessary for 1.0.0, but at some point 
>> might become useful.
>>
>>
>> >  Note that we don't really have a mechanism for clients and servers to
>> > report to each other what features they support, so this could help
>> > with that for applications where it might matter.
>>
>>
>> Thanks,
>> Micah
>>
>>
>> On Mon, Jul 1, 2019 at 12:54 PM Wes McKinney  wrote:
>>
>> > hi Micah,
>> >
>> > Sorry for the delay in feedback. I looked at the document and it seems
>> > like a reasonable perspective about forward- and
>> > backward-compatibility.
>> >
>> > It seems like the main thing you are proposing is to apply Semantic
>> > Versioning to Format and Library versions separately. That's an
>> > interesting idea, my thought had been to have a version number that is
>> > FORMAT_VERSION.LIBRARY_VERSION.PATCH_VERSION. But your proposal is
>> > more flexible in some ways, so let me clarify for others reading
>> >
>> > In what you are proposing, the next release would be:
>> >
>> > Format version: 1.0.0
>> > Library version: 1.0.0
>> >
>> > Suppose that 20 major versions down the road we stand at
>> >
>> > Format version: 1.5.0
>> > Library version: 20.0.0
>> >
>> > The minor version of the Format would indicate that there are
>> > additions, like new elements in the Type union, but otherwise backward
>> > and forward compatible. So the Minor version means "new things, but
>> > old clients will not be disrupted if those new things are not used".
>> > We've already been doing this since the V4 Format iteration but we
>> > have not had a way to signal that there may be new features. As a
>> > corollary to this, I wonder if we should create a dual version in the
>> > metadata
>> >
>> > PROTOCOL VERSION: (what is currently MetadataVersion, V2)
>> > FEATURE VERSION: not tracked at all
>> >
>> > So Minor version bumps in the format would trigger a bump in the
>> > FeatureVersion. Note that we don't really have a mechanism for clients
>> > and servers to report to each other what features they support, so
>> > this could help with that when for applications where it might matter.
>> >
>> > Should backward/forward compatibility be disrupted in the future, then
>> > a change to the major version would be 
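
As a concrete illustration of the dual-version scheme quoted above, the compatibility rule can be reduced to a few lines of code. This is a purely illustrative sketch (the class and method names are hypothetical, not part of any Arrow API): protocol-version changes are breaking, while feature-version changes are additive.

public final class VersionInfo {
    private final int protocolVersion; // bumped only for breaking format changes
    private final int featureVersion;  // bumped for additive changes (e.g. new types)

    public VersionInfo(int protocolVersion, int featureVersion) {
        this.protocolVersion = protocolVersion;
        this.featureVersion = featureVersion;
    }

    /** A reader can consume a producer's data iff the protocol versions match. */
    public boolean canRead(VersionInfo producer) {
        return this.protocolVersion == producer.protocolVersion;
    }

    /** True when the reader also understands every feature the producer may use. */
    public boolean supportsAllFeatures(VersionInfo producer) {
        return canRead(producer) && this.featureVersion >= producer.featureVersion;
    }
}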

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-10 Thread Wes McKinney
The issue is fairly esoteric, so it will probably take more time to
collect feedback. I could create a C++ implementation of this if it
helps with the process.

On Wed, Jul 10, 2019 at 2:25 AM Micah Kornfield  wrote:
>
> Does anybody else have thoughts on this?   Other language contributors?
>
> It seems like we still might not have enough of a consensus for a vote?
>
> Thanks,
> Micah
>
>
>
>
> On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney  wrote:
>
> > Correct. The encapsulated IPC message will just be 4 bytes bigger.
> >
> > On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou  wrote:
> >
> > >
> > > I guess I still don't understand how the IPC stream format works :-/
> > >
> > > To put it clearly: what happens in Flight?  Will a Flight message
> > > automatically get the "stream continuation message" in front of it?
> > >
> > >
> > > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou 
> > > wrote:
> > > >>
> > > >>
> > > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > > >>> Thanks for the references.
> > > >>>
> > > >>> If we decided to make a change around this, we could call the first 4
> > > >>> bytes a stream continuation marker to make it slightly less ugly
> > > >>>
> > > >>> * 0xFFFFFFFF: continue
> > > >>> * 0x00000000: stop
> > > >>
> > > >> Do you mean it would be a separate IPC message?
> > > >
> > > > No, I think this is only about how we could change the message prefix
> > > > from 4 bytes to 8 bytes
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > > >
> > > > Currently a 0x00000000 (0 metadata size) is used as an end-of-stream
> > > > marker. So what I was saying is that the first 8 bytes could be
> > > >
> > > > <4 bytes: stream continuation> <4 bytes: metadata size>
> > > >
> > > >>
> > > >>
> > > >>>
> > > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > wrote:
> > > 
> > >  Hi Wes,
> > >  I'm not an expert on this either, my inclination mostly comes from
> > > some research I've done.  I think it is important to distinguish two
> > cases:
> > >  1.  unaligned access at the processor instruction level
> > >  2.  undefined behavior
> > > 
> > >  From my reading unaligned access is fine on most modern
> > architectures
> > > and it seems the performance penalty has mostly been eliminated.
> > > 
> > >  Undefined behavior is a compiler/language concept.  The problem is
> > > the compiler can choose to do anything in UB scenarios, not just the
> > > "obvious" translation.  Specifically, the compiler is under no obligation
> > > to generate the unaligned access instructions, and if it doesn't, SEGVs
> > > ensue.  Two examples, both of which relate to SIMD optimizations, are
> > linked
> > > below.
> > > 
> > >  I tend to be on the conservative side with this type of thing but if
> > > we have experts on the ML that can offer a more informed opinion, I
> > > would love to hear it.
> > > 
> > >  [1]
> > > http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> > >  [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > > 
> > >  On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney 
> > > wrote:
> > > >
> > > > The <0xFFFFFFFF> solution is downright ugly but I
> > think
> > > > it's one of the only ways that achieves
> > > >
> > > > * backward compatibility (new clients can read old data)
> > > > * opt-in forward compatibility (if we want to go to the labor of
> > > doing
> > > > so, sort of dangerous)
> > > > * old clients receiving new data do not blow up (they will see a
> > > > metadata length of -1)
> > > >
> > > > NB 0xFFFFFFFF would look like:
> > > >
> > > > In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> > > > Out[13]: array([4294967295,128], dtype=uint32)
> > > >
> > > > In [14]: np.array([(2 << 32) - 1, 128],
> > > > dtype=np.uint32).view(np.int32)
> > > > Out[14]: array([ -1, 128], dtype=int32)
> > > >
> > > > In [15]: np.array([(2 << 32) - 1, 128],
> > > dtype=np.uint32).view(np.uint8)
> > > > Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0],
> > dtype=uint8)
> > > >
> > > > Flatbuffers are 32-bit limited so we don't need all 64 bits.
> > > >
> > > > Do you know in what circumstances unaligned reads from Flatbuffers
> > > > might cause an issue? I do not know enough about UB but my
> > > > understanding is that it causes issues on some specialized
> > platforms
> > > > whereas for most modern x86-64 processors and compilers it is not
> > > really
> > > > an issue (though perhaps a performance issue)
> > > >
> > > > On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <
> > > emkornfi...@gmail.com> wrote:
> > > >>
> > > >> At least on the read-side we can make this detectable by using
> > something like <0xFFFFFFFF> instead 
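
To make the proposal debated above concrete, here is a minimal Java sketch (standard library only) of how the discussed 8-byte prefix could be written and read: a 4-byte 0xFFFFFFFF continuation marker followed by a 4-byte little-endian metadata length, with a zero marker signalling end-of-stream. This only illustrates the idea under discussion, not committed Arrow code.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class MessagePrefix {
    // Continuation marker; an old reader interprets these 4 bytes as
    // a metadata length of -1 rather than crashing on new-format data.
    static final int CONTINUATION = 0xFFFFFFFF;

    /** Encodes the proposed 8-byte prefix for a message. */
    static byte[] encode(int metadataLength) {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(CONTINUATION);
        buf.putInt(metadataLength);
        return buf.array();
    }

    /** Returns the metadata length, or -1 when the stream has ended. */
    static int decode(byte[] prefix) {
        ByteBuffer buf = ByteBuffer.wrap(prefix).order(ByteOrder.LITTLE_ENDIAN);
        int marker = buf.getInt();
        if (marker == 0) {
            return -1; // 0x00000000: end-of-stream, as in the current format
        }
        return buf.getInt();
    }
}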

[jira] [Created] (ARROW-5902) [Java] Implement HashTable for dictionary encoding

2019-07-10 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5902:
-

 Summary: [Java] Implement HashTable for dictionary encoding
 Key: ARROW-5902
 URL: https://issues.apache.org/jira/browse/ARROW-5902
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Ji Liu
Assignee: Ji Liu


As discussed in [https://github.com/apache/arrow/pull/4792]

Implement a hash table that stores only the hash & index; meanwhile, add a 
check-equal function to the ValueVector API.
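
A rough sketch of the idea (the names here are hypothetical, not the eventual Arrow code): the table keeps only hash values and indices, and asks the vector itself to confirm equality, so values never have to be materialized as boxed Java objects.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionaryEncoderSketch {
    /** Minimal stand-in for the proposed ValueVector hash/check-equal hooks. */
    interface IndexedValues {
        int valueCount();
        int hashCodeAt(int index);
        boolean equalAt(int leftIndex, int rightIndex);
    }

    /** For each slot, returns the index of the first occurrence of its value. */
    static int[] encode(IndexedValues values) {
        Map<Integer, List<Integer>> table = new HashMap<>();
        int[] codes = new int[values.valueCount()];
        for (int i = 0; i < values.valueCount(); i++) {
            List<Integer> bucket =
                table.computeIfAbsent(values.hashCodeAt(i), k -> new ArrayList<>());
            int code = -1;
            for (int candidate : bucket) {
                if (values.equalAt(candidate, i)) { // resolves hash collisions
                    code = candidate;
                    break;
                }
            }
            if (code == -1) {
                bucket.add(i); // first time this value is seen
                code = i;
            }
            codes[i] = code;
        }
        return codes;
    }
}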



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5901) [Rust] Implement PartialEq to compare array and json values

2019-07-10 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5901:
-

 Summary: [Rust] Implement PartialEq to compare array and json 
values
 Key: ARROW-5901
 URL: https://issues.apache.org/jira/browse/ARROW-5901
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Renjie Liu
Assignee: Renjie Liu


Useful in tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5900) [Gandiva] [Java] Decimal precision,scale bounds check

2019-07-10 Thread Praveen Kumar Desabandu (JIRA)
Praveen Kumar Desabandu created ARROW-5900:
--

 Summary: [Gandiva] [Java] Decimal precision,scale bounds check
 Key: ARROW-5900
 URL: https://issues.apache.org/jira/browse/ARROW-5900
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Praveen Kumar Desabandu
Assignee: Praveen Kumar Desabandu


Currently we accept any decimal precision value; we need bounds checking to 
ensure it is between 1 and 38 inclusive.
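
A hedged sketch of the kind of validation described (the method name and the scale rule are assumptions; the actual Gandiva check may differ):

final class DecimalBoundsCheck {
    static void validate(int precision, int scale) {
        if (precision < 1 || precision > 38) {
            throw new IllegalArgumentException(
                "Decimal precision must be in [1, 38], got: " + precision);
        }
        // Assumed convention: scale is non-negative and does not exceed precision.
        if (scale < 0 || scale > precision) {
            throw new IllegalArgumentException(
                "Decimal scale must be in [0, precision], got: " + scale);
        }
    }
}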



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5899) [Python][Packaging] Bundle uriparser.dll in windows wheels

2019-07-10 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5899:
--

 Summary: [Python][Packaging] Bundle uriparser.dll in windows 
wheels 
 Key: ARROW-5899
 URL: https://issues.apache.org/jira/browse/ARROW-5899
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 1.0.0


The windows nightly wheel builds are failing: 
https://ci.appveyor.com/project/Ursa-Labs/crossbow/builds/25688922 probably 
caused by 88fcb09, but it's hard to tell because the error message 
"ImportError: DLL load failed: The specified module could not be found." is not 
very descriptive.

Theoretically it shouldn't affect the 0.14 release because 88fcb09 was added 
afterwards.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5898) [Java] Provide functionality to efficiently compute hash code for arbitrary memory segment

2019-07-10 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5898:
---

 Summary: [Java] Provide functionality to efficiently compute hash 
code for arbitrary memory segment
 Key: ARROW-5898
 URL: https://issues.apache.org/jira/browse/ARROW-5898
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This issue adds functionality to efficiently compute the hash code for a 
consecutive memory region. This functionality is important in practical 
scenarios because it helps:

* Avoid unnecessary memory copies.

* Avoid repeated conversions between Java objects & Arrow buffers.

Since the algorithm for calculating the hash code has significant performance 
implications, we need to design an interface so that different algorithms can 
easily be introduced as plug-ins.
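
For illustration only (the names are assumptions, not a committed API, and a byte[] stands in for the off-heap Arrow buffer), the plug-in boundary could look like this, with one simple algorithm behind it:

/** Pluggable hash algorithm over a raw memory region. */
interface MemoryHasher {
    int hash(byte[] region, int offset, int length);
}

/** One possible plug-in: the classic 31-based polynomial hash. */
final class PolynomialHasher implements MemoryHasher {
    @Override
    public int hash(byte[] region, int offset, int length) {
        int h = 0;
        for (int i = 0; i < length; i++) {
            h = 31 * h + region[offset + i]; // same recurrence as String.hashCode
        }
        return h;
    }
}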



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-10 Thread Micah Kornfield
Does anybody else have thoughts on this?   Other language contributors?

It seems like we still might not have enough of a consensus for a vote?

Thanks,
Micah




On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney  wrote:

> Correct. The encapsulated IPC message will just be 4 bytes bigger.
>
> On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou  wrote:
>
> >
> > I guess I still don't understand how the IPC stream format works :-/
> >
> > To put it clearly: what happens in Flight?  Will a Flight message
> > automatically get the "stream continuation message" in front of it?
> >
> >
> > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou 
> > wrote:
> > >>
> > >>
> > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > >>> Thanks for the references.
> > >>>
> > >>> If we decided to make a change around this, we could call the first 4
> > >>> bytes a stream continuation marker to make it slightly less ugly
> > >>>
> > >>> * 0xFFFFFFFF: continue
> > >>> * 0x00000000: stop
> > >>
> > >> Do you mean it would be a separate IPC message?
> > >
> > > No, I think this is only about how we could change the message prefix
> > > from 4 bytes to 8 bytes
> > >
> > >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > >
> > > Currently a 0x00000000 (0 metadata size) is used as an end-of-stream
> > > marker. So what I was saying is that the first 8 bytes could be
> > >
> > > <4 bytes: stream continuation> <4 bytes: metadata size>
> > >
> > >>
> > >>
> > >>>
> > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> emkornfi...@gmail.com>
> > wrote:
> > 
> >  Hi Wes,
> >  I'm not an expert on this either, my inclination mostly comes from
> > some research I've done.  I think it is important to distinguish two
> cases:
> >  1.  unaligned access at the processor instruction level
> >  2.  undefined behavior
> > 
> >  From my reading unaligned access is fine on most modern
> architectures
> > and it seems the performance penalty has mostly been eliminated.
> > 
> >  Undefined behavior is a compiler/language concept.  The problem is
> > the compiler can choose to do anything in UB scenarios, not just the
> > "obvious" translation.  Specifically, the compiler is under no obligation
> > to generate the unaligned access instructions, and if it doesn't, SEGVs
> > ensue.  Two examples, both of which relate to SIMD optimizations, are
> linked
> > below.
> > 
> >  I tend to be on the conservative side with this type of thing but if
> > we have experts on the ML that can offer a more informed opinion, I
> > would love to hear it.
> > 
> >  [1]
> > http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> >  [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > 
> >  On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney 
> > wrote:
> > >
> > > The <0xFFFFFFFF> solution is downright ugly but I
> think
> > > it's one of the only ways that achieves
> > >
> > > * backward compatibility (new clients can read old data)
> > > * opt-in forward compatibility (if we want to go to the labor of
> > doing
> > > so, sort of dangerous)
> > > * old clients receiving new data do not blow up (they will see a
> > > metadata length of -1)
> > >
> > > NB 0xFFFFFFFF would look like:
> > >
> > > In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> > > Out[13]: array([4294967295,128], dtype=uint32)
> > >
> > > In [14]: np.array([(2 << 32) - 1, 128],
> > > dtype=np.uint32).view(np.int32)
> > > Out[14]: array([ -1, 128], dtype=int32)
> > >
> > > In [15]: np.array([(2 << 32) - 1, 128],
> > dtype=np.uint32).view(np.uint8)
> > > Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0],
> dtype=uint8)
> > >
> > > Flatbuffers are 32-bit limited so we don't need all 64 bits.
> > >
> > > Do you know in what circumstances unaligned reads from Flatbuffers
> > > might cause an issue? I do not know enough about UB but my
> > > understanding is that it causes issues on some specialized
> platforms
> > > whereas for most modern x86-64 processors and compilers it is not
> > really
> > > an issue (though perhaps a performance issue)
> > >
> > > On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <
> > emkornfi...@gmail.com> wrote:
> > >>
> > >> At least on the read-side we can make this detectable by using
> > something like <0xFFFFFFFF> instead of int64_t.  On the
> write
> > side we would need some sort of default mode that we could flip on/off if
> > we wanted to maintain compatibility.
> > >>
> > >> I should say I think we should fix it.  Undefined behavior is
> > unpaid debt that might never be collected or might cause things to fail
> in
> > difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> > >>
> > >> -Micah
> > >>
> > >> On Sun, Jun 30, 2019 at 3:17 PM