Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Micah Kornfield
A few thoughts:
- I think we should iron out the remaining incompatibilities between Java
and C++ before going to 1.0.0 (at least Union and NullType), and I'm not
sure I will have time to get to them before the next release, so I would
prefer to aim for the subsequent release to make it 1.0.0
- For 1.0.0 should we change the metadata format version to a new naming
scheme [1] (seems like more of a hassle than it is worth)?
- I'm a little concerned about the implications for forward-compatibility
restrictions for format changes.  For instance the large list types would
not be forward compatible (at least by some definitions), similarly if we
deal with compression [2] it would also seem to not be forward compatible.
Would this mean we bump the format version number for each change even
though they would be backwards compatible?
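
To make the forward-compatibility concern above concrete, here is a minimal
Python sketch. It assumes a hypothetical model in which each reader knows a
fixed set of type names; `READER_V1_TYPES`, `READER_V2_TYPES` and `can_read`
are illustrative names, not part of any Arrow library.

```python
# Hypothetical model: a reader supports a fixed set of logical type names.
READER_V1_TYPES = frozenset({"int32", "utf8", "list"})
# A newer reader additionally understands a large list type.
READER_V2_TYPES = READER_V1_TYPES | {"large_list"}


def can_read(reader_types, file_types):
    """Return True if the reader knows every type used in the file."""
    return set(file_types) <= set(reader_types)


old_file = ["int32", "list"]        # written before large_list existed
new_file = ["int32", "large_list"]  # written by a newer library

# Backward compatible: the newer reader still reads old data.
backward_ok = can_read(READER_V2_TYPES, old_file)
# Not forward compatible: the older reader meets a type it does not know.
forward_ok = can_read(READER_V1_TYPES, new_file)

print(backward_ok, forward_ok)
```

Under this model an older reader only fails on files that actually use the
new type, which is one reason the question of when to bump the format version
is not clear-cut.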

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
[2] https://issues.apache.org/jira/browse/ARROW-300

On Fri, Jun 7, 2019 at 12:42 PM Wes McKinney  wrote:

> I agree re: marketing value of a 1.0.0 release.
>
> For the record, I think we should continue to allow the API of each
> respective library component to evolve freely and allow the
> individuals developing each to decide how to handle deprecations, API
> changes, etc., as we have up until this point. The project is still
> very much in "innovation mode" across the board, but some parts may
> grow more conservative than others. Having roughly time-based releases
> encourages everyone to be ready-to-release at any given time, and we
> develop a steady cadence of getting new functionality and
> improvements/fixes out the door.
>
> On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou  wrote:
> >
> >
> > I think there's a marketing merit to issuing a 1.0.0 release.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 07/06/2019 à 20:05, Wes McKinney a écrit :
> > > So one idea is that we could call the next release 1.14.0. So the
> > > second number is the API version number. This encodes a sequencing of
> > > the evolution of the API. The library APIs are already decoupled from
> > > the binary serialization protocol, so I think we merely have to state
> > > that API changes and protocol changes are not related to each other.
> > >
> > > On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau wrote:
> > >>
> > >> It brings up an interesting point... do we couple the stability of the apis
> > >> with the stability of the protocol. If the protocol is stable, we should
> > >> start providing guarantees for it. How do we want to express these
> > >> different velocities?
> > >>
> > >> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou wrote:
> > >>
> > >>>
> > >>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
> > >>>> On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou wrote:
> > >>>>
> > >>>>> Hi Wes,
> > >>>>>
> > >>>>> Le 07/06/2019 à 17:42, Wes McKinney a écrit :
> > >>>>>>
> > >>>>>> I think
> > >>>>>> this would have a lot of benefits for project onlookers to remove
> > >>>>>> various warnings around the codebase around stability and cautions
> > >>>>>> against persistence of protocol data. It's fair to say that if we _do_
> > >>>>>> make changes in the future, that there will be a transition path to
> > >>>>>> migrate persisted data, should it ever come to that.
> > >>>>>
> > >>>>> I think that's a good idea, but perhaps the stability promise shouldn't
> > >>>>> cover the Flight protocol yet?
> > >>>>
> > >>>> Agreed.
> > >>>>
> > >>>>>> I would suggest a "1.0.0" release either as our next release (instead
> > >>>>>> of 0.14.0) or the release right after that (if we need more time to
> > >>>>>> get affairs in order), with the guidance for users of:
> > >>>>>
> > >>>>> I think we should first do a regular 0.14.0 with all that's on our plate
> > >>>>> right now, then work towards a 1.0.0 as the release following that.
> > >>>>
> > >>>> What is different from your perspective? If the protocol hasn't changed in
> > >>>> over a year, why not call it 1.0?
> > >>>
> > >>> I would say that perhaps some API cleanup is in order.  Remove
> > >>> deprecated ones, review experimental APIs, perhaps mark experimental
> > >>> certain APIs that we forgot to...
> > >>>
> > >>> Regards
> > >>>
> > >>> Antoine.
> > >>>
>


[jira] [Created] (ARROW-5532) Field Metadata Not Read

2019-06-07 Thread Trey Hakanson (JIRA)
Trey Hakanson created ARROW-5532:


 Summary: Field Metadata Not Read
 Key: ARROW-5532
 URL: https://issues.apache.org/jira/browse/ARROW-5532
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.13.0
 Environment: Mac OSX 10.14, Chrome 74
Reporter: Trey Hakanson


Field metadata is not read when using `@apache-arrow/ts@0.13.0`. Example below 
also uses `pyarrow==0.13.0`

Steps to reproduce:

Adding metadata:

```py
import pyarrow as pa
import pandas as pd

source = "sample.csv"
output = "sample.arrow"
df = pd.read_csv(source)
table = pa.Table.from_pandas(df)
schema = pa.schema([
 column.field.add_metadata({"foo": "bar"})
 for column
 in table.columns
])
writer = pa.RecordBatchFileWriter(output, schema)
writer.write(table)
writer.close()
```

Reading field metadata using `pyarrow`:

```py
import pyarrow as pa

source = "sample.arrow"
field = "foo"
reader = pa.RecordBatchFileReader(source)
reader.schema.field_by_name(field).metadata # Correctly shows `{"foo": "bar"}`
```

Reading field metadata using `@apache-arrow/ts`:

```ts
import { Table, Field, Type } from "@apache-arrow/ts";

const url = "https://example.com/sample.arrow";
const buf = await fetch(url).then(res => res.arrayBuffer());
const table = Table.from([new Uint8Array(buf)]);
for (let field of table.schema.fields) {
 field.metadata; // Incorrectly shows an empty map
}
```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

2019-06-07 Thread Wes McKinney
Great, thanks Jacques. I'll kick off a vote thread so we can hopefully
get this approved
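
As a rough illustration of the pass-through behavior discussed in the quoted
messages below, here is a small Python sketch of a filter service that does
not understand extension types. The `Field` and `Batch` classes, the
`filter_batch` function and the type name "example.ipv6" are hypothetical
stand-ins, not the API of any Arrow implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical in-memory stand-ins for a schema field and a record batch.
@dataclass
class Field:
    name: str
    type_name: str
    custom_metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Batch:
    fields: List[Field]
    rows: List[dict]


def filter_batch(batch: Batch, predicate: Callable[[dict], bool]) -> Batch:
    """Keep rows matching the predicate and reuse the fields untouched.

    The service never inspects custom_metadata, so extension-type keys it
    does not recognize are carried into the outgoing batch unchanged.
    """
    kept = [row for row in batch.rows if predicate(row)]
    return Batch(fields=batch.fields, rows=kept)


incoming = Batch(
    fields=[
        Field("id", "int64"),
        Field("addr", "binary", {"ARROW:extension:name": "example.ipv6"}),
    ],
    rows=[{"id": 1, "addr": b"\x00" * 16}, {"id": 2, "addr": b"\x01" * 16}],
)
outgoing = filter_batch(incoming, lambda row: row["id"] > 1)
```

This only works if the implementation preserves custom_metadata in the first
place, which is the consistency concern raised in the thread.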

On Fri, Jun 7, 2019 at 3:27 PM Jacques Nadeau  wrote:
>
> I'm good with this. The consistent separator is a good improvement.
>
> On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney  wrote:
>
> > hey Jacques,
> >
> > On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau  wrote:
> > >
> > > Thanks for pushing this along. I think it is important. Sorry I'm coming
> > > late to the conversation. Couple thoughts:
> > >
> > > - Should we reconsider having this be an independent optional field as
> > > opposed to overloading custom_metadata? It avoids having the weird
> > > string prefixing behavior
> >
> > This is one option that we've discussed. The downside of this is that
> > it becomes another piece of metadata that Arrow implementations need
> > to mind when they are passing through IPC messages. The idea is that
> > "dumb" readers can simply ignore the metadata but pass it along in a
> > subsequent message. For example, suppose a simplistic data
> > service/microservice that evaluates a filter against record batches
> > coming through. There might be columns with extension types that come
> > through that the service does not recognize.
> >
> > In some implementations the custom_metadata member is preserved in
> > schemas and survives IPC round trips, but this is a feature that IMHO
> > should be implemented consistently in all Arrow implementations. For
> > example, I believe that Java drops the custom_metadata as soon as the
> > IPC protocol is parsed.
> >
> > Admittedly, this is not a huge issue, so if you had an extra member of
> > Field like
> >
> > table ExtensionType {
> >   name: string
> >   metadata: string
> > }
> >
> > ...
> >
> > table Field {
> >   ...
> >   custom_type : ExtensionType
> > }
> >
> > then that would work, too. It's more obtrusive to implementations as
> > readers that do not recognize a type should still mind this metadata
> > and pass it along in subsequent messages. If we embed in
> > custom_metadata then this happens automatically (assuming that
> > custom_metadata is preserved...)
> >
> > > - I'd be inclined to be much more stringent about type naming. Maybe even
> > > make the name multiple parts to force the issue?
> >
> > I just updated my PR https://github.com/apache/arrow/pull/4332 to say
> > also that colon ":" is the designated namespace separator and I've
> > made the metadata keys
> >
> > ARROW:extension:name
> > ARROW:extension:metadata
> >
> > As far as the actual type name, since it's application-defined, it
> > might be better to leave this up to the developer-user. If we defined
> > any "built-in extension types" (things like UUID come to mind) we
> > might want to have a pseudo-namespace like "builtin.uuid",
> > "builtin.ipv6", etc. for these
> >
> > Let me know what you think -- it would be great to start a vote on this
> > soon.
> >
> > Thanks
> > Wes
> >
> > >
> > > On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney 
> > wrote:
> > >
> > > > hi Micah,
> > > >
> > > > I have just updated my PR per your comments with more examples of
> > > > extension types.
> > > >
> > > > https://github.com/apache/arrow/pull/4332
> > > >
> > > > Are there more comments about this? I can start a vote in a couple of
> > > > days absent further opinions.
> > > >
> > > > Can someone volunteer to review David's Java PR? I would like to move
> > > > this along so we have a chance of having working extension types in
> > > > the 0.14 release. A number of people are also interested in bridging
> > > > between pandas's ExtensionArray facility (for custom DataFrame column
> > > > types [1]) and Arrow's ExtensionType
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > [1]: https://pandas.pydata.org/pandas-docs/stable/development/extending.html
> > > >
> > > > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield wrote:
> > > > >
> > > > > Hi Wes,
> > > > > Like I said, I think this approach looks good. What I'm looking
> > > > > for is a little more documentation/examples on how additional types
> > > > > would be handled. I think Tensor would be a good example; we also had
> > > > > questions about INET addresses previously, so maybe this would be
> > > > > another good illustrative example. Providing examples of serialized
> > > > > metadata in the docs would be useful (clarifying that these are opaque
> > > > > binary blobs that will be passed along to extension type factories?)
> > > > >
> > > > > In this regard, I think it might be good to provide further
> > > > > recommendations for the names of extension types. What do you think
> > > > > about recommending that organizations/projects namespace them
> > > > > according to some convention, so that there aren't conflicts and
> > > > > extensions can be shared?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > >
> > > > >
> > > > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney 
> > > > wrote:
> > > > >>
> > > > >>
> > > > 

Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

2019-06-07 Thread Jacques Nadeau
I'm good with this. The consistent separator is a good improvement.
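
For readers skimming the thread, the key scheme quoted below can be sketched
in Python roughly as follows. Only the two key strings and the "builtin.uuid"
name come from the discussion; `REGISTRY`, `resolve_extension` and the
factory signature are assumptions made for illustration.

```python
# Metadata keys as given in the thread; everything else is illustrative.
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"

# Hypothetical registry mapping extension names to factories that take the
# serialized metadata string and return some type description.
REGISTRY = {
    "builtin.uuid": lambda serialized: ("uuid", serialized),
}


def resolve_extension(custom_metadata):
    """Return the extension type for a field, or None if not applicable."""
    name = custom_metadata.get(EXTENSION_NAME_KEY)
    if name is None:
        return None  # plain field without an extension type
    factory = REGISTRY.get(name)
    if factory is None:
        return None  # unrecognized extension: caller keeps the storage type
    return factory(custom_metadata.get(EXTENSION_METADATA_KEY, ""))


known = resolve_extension(
    {EXTENSION_NAME_KEY: "builtin.uuid", EXTENSION_METADATA_KEY: "v1"}
)
unknown = resolve_extension({EXTENSION_NAME_KEY: "acme.money"})
plain = resolve_extension({})
```

A reader that does not recognize the name can ignore the keys entirely and
still pass them along, which is the argument for keeping them in
custom_metadata.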

On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney  wrote:

> hey Jacques,
>
> On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau  wrote:
> >
> > Thanks for pushing this along. I think it is important. Sorry I'm coming
> > late to the conversation. Couple thoughts:
> >
> > - Should we reconsider having this be an independent optional field as
> > opposed to overloading custom_metadata? It avoids having the weird
> > string prefixing behavior
>
> This is one option that we've discussed. The downside of this is that
> it becomes another piece of metadata that Arrow implementations need
> to mind when they are passing through IPC messages. The idea is that
> "dumb" readers can simply ignore the metadata but pass it along in a
> subsequent message. For example, suppose a simplistic data
> service/microservice that evaluates a filter against record batches
> coming through. There might be columns with extension types that come
> through that the service does not recognize.
>
> In some implementations the custom_metadata member is preserved in
> schemas and survives IPC round trips, but this is a feature that IMHO
> should be implemented consistently in all Arrow implementations. For
> example, I believe that Java drops the custom_metadata as soon as the
> IPC protocol is parsed.
>
> Admittedly, this is not a huge issue, so if you had an extra member of
> Field like
>
> table ExtensionType {
>   name: string
>   metadata: string
> }
>
> ...
>
> table Field {
>   ...
>   custom_type : ExtensionType
> }
>
> then that would work, too. It's more obtrusive to implementations as
> readers that do not recognize a type should still mind this metadata
> and pass it along in subsequent messages. If we embed in
> custom_metadata then this happens automatically (assuming that
> custom_metadata is preserved...)
>
> > - I'd be inclined to be much more stringent about type naming. Maybe even
> > make the name multiple parts to force the issue?
>
> I just updated my PR https://github.com/apache/arrow/pull/4332 to say
> also that colon ":" is the designated namespace separator and I've
> made the metadata keys
>
> ARROW:extension:name
> ARROW:extension:metadata
>
> As far as the actual type name, since it's application-defined, it
> might be better to leave this up to the developer-user. If we defined
> any "built-in extension types" (things like UUID come to mind) we
> might want to have a pseudo-namespace like "builtin.uuid",
> "builtin.ipv6", etc. for these
>
> Let me know what you think -- it would be great to start a vote on this
> soon.
>
> Thanks
> Wes
>
> >
> > On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney 
> wrote:
> >
> > > hi Micah,
> > >
> > > I have just updated my PR per your comments with more examples of
> > > extension types.
> > >
> > > https://github.com/apache/arrow/pull/4332
> > >
> > > Are there more comments about this? I can start a vote in a couple of
> > > days absent further opinions.
> > >
> > > Can someone volunteer to review David's Java PR? I would like to move
> > > this along so we have a chance of having working extension types in
> > > the 0.14 release. A number of people are also interested in bridging
> > > between pandas's ExtensionArray facility (for custom DataFrame column
> > > types [1]) and Arrow's ExtensionType
> > >
> > > Thanks
> > > Wes
> > >
> > > [1]: https://pandas.pydata.org/pandas-docs/stable/development/extending.html
> > >
> > > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield wrote:
> > > >
> > > > Hi Wes,
> > > > Like I said, I think this approach looks good. What I'm looking
> > > > for is a little more documentation/examples on how additional types
> > > > would be handled. I think Tensor would be a good example; we also had
> > > > questions about INET addresses previously, so maybe this would be
> > > > another good illustrative example. Providing examples of serialized
> > > > metadata in the docs would be useful (clarifying that these are opaque
> > > > binary blobs that will be passed along to extension type factories?)
> > > >
> > > > In this regard, I think it might be good to provide further
> > > > recommendations for the names of extension types. What do you think
> > > > about recommending that organizations/projects namespace them
> > > > according to some convention, so that there aren't conflicts and
> > > > extensions can be shared?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > >
> > > >
> > > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney 
> > > wrote:
> > > >>
> > > >>
> > > >>
> > > >> On Sat, May 18, 2019, 1:58 PM Wes McKinney 
> wrote:
> > > >>>
> > > >>> Hi Micah,
> > > >>>
> > > >>> The use cases I'm aware of are mostly coming from proprietary
> > > applications. My idea was for the extension metadata to be as
> unobtrusive
> > > as possible. The only alternative as I see it would be to have an
> Extension
> > > value in the Type union which would be more intrusive to 

Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Wes McKinney
I agree re: marketing value of a 1.0.0 release.

For the record, I think we should continue to allow the API of each
respective library component to evolve freely and allow the
individuals developing each to decide how to handle deprecations, API
changes, etc., as we have up until this point. The project is still
very much in "innovation mode" across the board, but some parts may
grow more conservative than others. Having roughly time-based releases
encourages everyone to be ready-to-release at any given time, and we
develop a steady cadence of getting new functionality and
improvements/fixes out the door.

On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou  wrote:
>
>
> I think there's a marketing merit to issuing a 1.0.0 release.
>
> Regards
>
> Antoine.
>
>
> Le 07/06/2019 à 20:05, Wes McKinney a écrit :
> > So one idea is that we could call the next release 1.14.0. So the
> > second number is the API version number. This encodes a sequencing of
> > the evolution of the API. The library APIs are already decoupled from
> > the binary serialization protocol, so I think we merely have to state
> > that API changes and protocol changes are not related to each other.
> >
> > On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau  wrote:
> >>
> >> It brings up an interesting point... do we couple the stability of the apis
> >> with the stability of the protocol. If the protocol is stable, we should
> >> start providing guarantees for it. How do we want to express these
> >> different velocities?
> >>
> >> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou  wrote:
> >>
> >>>
> >>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
> >>>> On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou wrote:
> >>>>
> >>>>> Hi Wes,
> >>>>>
> >>>>> Le 07/06/2019 à 17:42, Wes McKinney a écrit :
> >>>>>>
> >>>>>> I think
> >>>>>> this would have a lot of benefits for project onlookers to remove
> >>>>>> various warnings around the codebase around stability and cautions
> >>>>>> against persistence of protocol data. It's fair to say that if we _do_
> >>>>>> make changes in the future, that there will be a transition path to
> >>>>>> migrate persisted data, should it ever come to that.
> >>>>>
> >>>>> I think that's a good idea, but perhaps the stability promise shouldn't
> >>>>> cover the Flight protocol yet?
> >>>>
> >>>> Agreed.
> >>>>
> >>>>>> I would suggest a "1.0.0" release either as our next release (instead
> >>>>>> of 0.14.0) or the release right after that (if we need more time to
> >>>>>> get affairs in order), with the guidance for users of:
> >>>>>
> >>>>> I think we should first do a regular 0.14.0 with all that's on our plate
> >>>>> right now, then work towards a 1.0.0 as the release following that.
> >>>>
> >>>> What is different from your perspective? If the protocol hasn't changed in
> >>>> over a year, why not call it 1.0?
> >>>
> >>> I would say that perhaps some API cleanup is in order.  Remove
> >>> deprecated ones, review experimental APIs, perhaps mark experimental
> >>> certain APIs that we forgot to...
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>


[jira] [Created] (ARROW-5531) [Python] Support binary/utf8 in Array.from_buffers

2019-06-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5531:
---

 Summary: [Python] Support binary/utf8 in Array.from_buffers
 Key: ARROW-5531
 URL: https://issues.apache.org/jira/browse/ARROW-5531
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.14.0


See discussion in ARROW-2607
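
As background on what constructing a binary/utf8 array from buffers involves,
the following pure-Python sketch decodes the three buffers of a utf8 array
(validity bitmap, int32 offsets, data) as described by the Arrow columnar
format. It is not the pyarrow `Array.from_buffers` implementation, and the
function name `decode_utf8_array` is hypothetical.

```python
import struct


def decode_utf8_array(length, validity, offsets, data):
    """Decode the buffers of a utf8 array into a list of str or None.

    validity: bitmap bytes (bit i set means slot i is non-null), or None
              when the array has no nulls.
    offsets:  (length + 1) little-endian int32 values.
    data:     concatenated UTF-8 bytes of all values.
    """
    starts = struct.unpack("<%di" % (length + 1), offsets)
    values = []
    for i in range(length):
        # Bits are numbered from the least significant bit of each byte.
        is_valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            values.append(data[starts[i]:starts[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


# Buffers for ["ab", None, "cde"]: slots 0 and 2 are valid.
validity = bytes([0b00000101])
offsets = struct.pack("<4i", 0, 2, 2, 5)
data = b"abcde"
decoded = decode_utf8_array(3, validity, offsets, data)
```

Unlike fixed-width types, a binary or utf8 array needs the extra offsets
buffer, so it has three buffers instead of two.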





Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Antoine Pitrou


I think there's a marketing merit to issuing a 1.0.0 release.

Regards

Antoine.


Le 07/06/2019 à 20:05, Wes McKinney a écrit :
> So one idea is that we could call the next release 1.14.0. So the
> second number is the API version number. This encodes a sequencing of
> the evolution of the API. The library APIs are already decoupled from
> the binary serialization protocol, so I think we merely have to state
> that API changes and protocol changes are not related to each other.
> 
> On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau  wrote:
>>
>> It brings up an interesting point... do we couple the stability of the apis
>> with the stability of the protocol. If the protocol is stable, we should
>> start providing guarantees for it. How do we want to express these
>> different velocities?
>>
>> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou  wrote:
>>
>>>
>>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
>>>> On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou wrote:
>>>>
>>>>> Hi Wes,
>>>>>
>>>>> Le 07/06/2019 à 17:42, Wes McKinney a écrit :
>>>>>>
>>>>>> I think
>>>>>> this would have a lot of benefits for project onlookers to remove
>>>>>> various warnings around the codebase around stability and cautions
>>>>>> against persistence of protocol data. It's fair to say that if we _do_
>>>>>> make changes in the future, that there will be a transition path to
>>>>>> migrate persisted data, should it ever come to that.
>>>>>
>>>>> I think that's a good idea, but perhaps the stability promise shouldn't
>>>>> cover the Flight protocol yet?
>>>>
>>>> Agreed.
>>>>
>>>>>> I would suggest a "1.0.0" release either as our next release (instead
>>>>>> of 0.14.0) or the release right after that (if we need more time to
>>>>>> get affairs in order), with the guidance for users of:
>>>>>
>>>>> I think we should first do a regular 0.14.0 with all that's on our plate
>>>>> right now, then work towards a 1.0.0 as the release following that.
>>>>
>>>> What is different from your perspective? If the protocol hasn't changed in
>>>> over a year, why not call it 1.0?
>>>
>>> I would say that perhaps some API cleanup is in order.  Remove
>>> deprecated ones, review experimental APIs, perhaps mark experimental
>>> certain APIs that we forgot to...
>>>
>>> Regards
>>>
>>> Antoine.
>>>


[jira] [Created] (ARROW-5530) [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null behavior

2019-06-07 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5530:
-

 Summary: [C++] Add options to ValueCount/Unique/DictEncode kernel 
to toggle null behavior
 Key: ARROW-5530
 URL: https://issues.apache.org/jira/browse/ARROW-5530
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques








Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Wes McKinney
So one idea is that we could call the next release 1.14.0. So the
second number is the API version number. This encodes a sequencing of
the evolution of the API. The library APIs are already decoupled from
the binary serialization protocol, so I think we merely have to state
that API changes and protocol changes are not related to each other.

On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau  wrote:
>
> It brings up an interesting point... do we couple the stability of the apis
> with the stability of the protocol. If the protocol is stable, we should
> start providing guarantees for it. How do we want to express these
> different velocities?
>
> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou  wrote:
>
> >
> > Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
> > > On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou 
> > wrote:
> > >
> > >> Hi Wes,
> > >>
> > >> Le 07/06/2019 à 17:42, Wes McKinney a écrit :
> > >>>
> > >>> I think
> > >>> this would have a lot of benefits for project onlookers to remove
> > >>> various warnings around the codebase around stability and cautions
> > >>> against persistence of protocol data. It's fair to say that if we _do_
> > >>> make changes in the future, that there will be a transition path to
> > >>> migrate persisted data, should it ever come to that.
> > >>
> > >> I think that's a good idea, but perhaps the stability promise shouldn't
> > >> cover the Flight protocol yet?
> > >
> > > Agreed.
> > >
> > >>> I would suggest a "1.0.0" release either as our next release (instead
> > >>> of 0.14.0) or the release right after that (if we need more time to
> > >>> get affairs in order), with the guidance for users of:
> > >>
> > >> I think we should first do a regular 0.14.0 with all that's on our plate
> > >> right now, then work towards a 1.0.0 as the release following that.
> > >
> > > What is different from your perspective? If the protocol hasn't changed
> > in
> > > over a year, why not call it 1.0?
> >
> > I would say that perhaps some API cleanup is in order.  Remove
> > deprecated ones, review experimental APIs, perhaps mark experimental
> > certain APIs that we forgot to...
> >
> > Regards
> >
> > Antoine.
> >


Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Jacques Nadeau
It brings up an interesting point... do we couple the stability of the apis
with the stability of the protocol. If the protocol is stable, we should
start providing guarantees for it. How do we want to express these
different velocities?

On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou  wrote:

>
> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
> > On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou 
> wrote:
> >
> >> Hi Wes,
> >>
> >> Le 07/06/2019 à 17:42, Wes McKinney a écrit :
> >>>
> >>> I think
> >>> this would have a lot of benefits for project onlookers to remove
> >>> various warnings around the codebase around stability and cautions
> >>> against persistence of protocol data. It's fair to say that if we _do_
> >>> make changes in the future, that there will be a transition path to
> >>> migrate persisted data, should it ever come to that.
> >>
> >> I think that's a good idea, but perhaps the stability promise shouldn't
> >> cover the Flight protocol yet?
> >
> > Agreed.
> >
> >>> I would suggest a "1.0.0" release either as our next release (instead
> >>> of 0.14.0) or the release right after that (if we need more time to
> >>> get affairs in order), with the guidance for users of:
> >>
> >> I think we should first do a regular 0.14.0 with all that's on our plate
> >> right now, then work towards a 1.0.0 as the release following that.
> >
> > What is different from your perspective? If the protocol hasn't changed
> in
> > over a year, why not call it 1.0?
>
> I would say that perhaps some API cleanup is in order.  Remove
> deprecated ones, review experimental APIs, perhaps mark experimental
> certain APIs that we forgot to...
>
> Regards
>
> Antoine.
>


[jira] [Created] (ARROW-5529) [Flight] Allow serving with multiple TLS certificates

2019-06-07 Thread David Li (JIRA)
David Li created ARROW-5529:
---

 Summary: [Flight] Allow serving with multiple TLS certificates
 Key: ARROW-5529
 URL: https://issues.apache.org/jira/browse/ARROW-5529
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC
Reporter: David Li
Assignee: David Li


We should allow serving a Flight service with more than one TLS certificate. 
This makes health checking easier in large deployments.





Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Antoine Pitrou


Le 07/06/2019 à 19:44, Jacques Nadeau a écrit :
> On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou  wrote:
> 
>> Hi Wes,
>>
>> Le 07/06/2019 à 17:42, Wes McKinney a écrit :
>>>
>>> I think
>>> this would have a lot of benefits for project onlookers to remove
>>> various warnings around the codebase around stability and cautions
>>> against persistence of protocol data. It's fair to say that if we _do_
>>> make changes in the future, that there will be a transition path to
>>> migrate persisted data, should it ever come to that.
>>
>> I think that's a good idea, but perhaps the stability promise shouldn't
>> cover the Flight protocol yet?
> 
> Agreed.
> 
>>> I would suggest a "1.0.0" release either as our next release (instead
>>> of 0.14.0) or the release right after that (if we need more time to
>>> get affairs in order), with the guidance for users of:
>>
>> I think we should first do a regular 0.14.0 with all that's on our plate
>> right now, then work towards a 1.0.0 as the release following that.
> 
> What is different from your perspective? If the protocol hasn't changed in
> over a year, why not call it 1.0?

I would say that perhaps some API cleanup is in order.  Remove
deprecated ones, review experimental APIs, perhaps mark experimental
certain APIs that we forgot to...

Regards

Antoine.


Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Jacques Nadeau
On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou  wrote:

>
> Hi Wes,
>
> Le 07/06/2019 à 17:42, Wes McKinney a écrit :
> >
> > I think
> > this would have a lot of benefits for project onlookers to remove
> > various warnings around the codebase around stability and cautions
> > against persistence of protocol data. It's fair to say that if we _do_
> > make changes in the future, that there will be a transition path to
> > migrate persisted data, should it ever come to that.
>
> I think that's a good idea, but perhaps the stability promise shouldn't
> cover the Flight protocol yet?
>

Agreed.


>
> > I would suggest a "1.0.0" release either as our next release (instead
> > of 0.14.0) or the release right after that (if we need more time to
> > get affairs in order), with the guidance for users of:
>
> I think we should first do a regular 0.14.0 with all that's on our plate
> right now, then work towards a 1.0.0 as the release following that.
>

What is different from your perspective? If the protocol hasn't changed in
over a year, why not call it 1.0?


Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Antoine Pitrou


Hi Wes,

Le 07/06/2019 à 17:42, Wes McKinney a écrit :
> 
> I think
> this would have a lot of benefits for project onlookers to remove
> various warnings around the codebase around stability and cautions
> against persistence of protocol data. It's fair to say that if we _do_
> make changes in the future, that there will be a transition path to
> migrate persisted data, should it ever come to that.

I think that's a good idea, but perhaps the stability promise shouldn't
cover the Flight protocol yet?

> I would suggest a "1.0.0" release either as our next release (instead
> of 0.14.0) or the release right after that (if we need more time to
> get affairs in order), with the guidance for users of:

I think we should first do a regular 0.14.0 with all that's on our plate
right now, then work towards a 1.0.0 as the release following that.

> PROTOCOL VERSION (1): Protocol version, so libraries bearing 1.x.y
> will be forward and backwards compatible (though new metadata fields
> introduced in newer versions will be dropped in older readers)
> MAJOR VERSION (0): API changes possible (and indeed, likely) from
> major release to major release
> MINOR VERSION (0): No API changes, bug fix only release
> 
> Thoughts on the above?

That sounds reasonable to me.

Regards

Antoine.
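
The PROTOCOL.MAJOR.MINOR scheme quoted above could be modeled along these
lines. This is a sketch under the assumption that version strings always have
exactly three integer parts; `ArrowVersion`, `parse_version`, `same_protocol`
and `api_may_differ` are hypothetical names, not project tooling.

```python
from typing import NamedTuple


class ArrowVersion(NamedTuple):
    protocol: int  # binary protocol version
    major: int     # API changes possible between values
    minor: int     # bug-fix-only releases


def parse_version(text):
    """Parse a 'PROTOCOL.MAJOR.MINOR' string such as '1.14.0'."""
    protocol, major, minor = (int(part) for part in text.split("."))
    return ArrowVersion(protocol, major, minor)


def same_protocol(a, b):
    """Libraries sharing a protocol number are meant to exchange data."""
    return a.protocol == b.protocol


def api_may_differ(a, b):
    """The API may change whenever the protocol or major number differs."""
    return (a.protocol, a.major) != (b.protocol, b.major)


v_old = parse_version("1.0.0")
v_new = parse_version("1.14.2")
```

With this reading, `v_old` and `v_new` interoperate at the protocol level
even though their library APIs may have diverged.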


Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Jacques Nadeau
You didn't mention it specifically, but one big thing I'd like to get into
the next release is prepackaged Flight tools for C++, Java and Python.

As far as 1.0, I think it's time and would vote for naming the next release
1.0. Thanks for bringing this up!

On Fri, Jun 7, 2019 at 8:43 AM Wes McKinney  wrote:

> hi folks,
>
> Our last release 0.13.0 occurred at the end of March. I think it would
> be good to plot a course for the next release (0.14.0?) as soon as
> possible. There are still a number of issues (such as the shared
> library duplication issue in the Python wheels) that I think might
> discourage us from releasing right now. Do you think that pushing for
> a release candidate by the end of June is reasonable?
>
> As a second matter (and this can be split off into a separate
> discussion thread), the Arrow format and binary protocol have been
> stable effectively since the 0.8.0 release which occurred in December
> 2017. While we have some details yet to iron out in compatibility
> testing between implementations (for example, the Union question, see
> mailing list discussion [1]) and new features (e.g. 64-bit offset
> binary/string/list types), in theory these should not prevent us
> necessarily from making a declaration of protocol stability. I think
> this would have a lot of benefits for project onlookers to remove
> various warnings around the codebase around stability and cautions
> against persistence of protocol data. It's fair to say that if we _do_
> make changes in the future, that there will be a transition path for
> migrating persisted data, should it ever come to that.
>
> I would suggest a "1.0.0" release either as our next release (instead
> of 0.14.0) or the release right after that (if we need more time to
> get affairs in order), with the guidance for users of:
>
> PROTOCOL VERSION (1): Protocol version, so libraries bearing 1.x.y
> will be forward and backwards compatible (though new metadata fields
> introduced in newer versions will be dropped in older readers)
> MAJOR VERSION (0): API changes possible (and indeed, likely) from
> major release to major release
> MINOR VERSION (0): No API changes, bug fix only release
>
> Thoughts on the above?
>
> Thanks
> Wes
>
> [1]:
> https://lists.apache.org/thread.html/e54e8ec096f665a8aef94155de3b6c567258c0d15209de4b966dd8da@%3Cdev.arrow.apache.org%3E
>


[jira] [Created] (ARROW-5528) Concatenate() crashes when concatenating empty binary arrays.

2019-06-07 Thread Zhuo Peng (JIRA)
Zhuo Peng created ARROW-5528:


         Summary: Concatenate() crashes when concatenating empty binary arrays.
             Key: ARROW-5528
             URL: https://issues.apache.org/jira/browse/ARROW-5528
         Project: Apache Arrow
      Issue Type: Bug
      Components: C++
Affects Versions: 0.13.0
        Reporter: Zhuo Peng
         Fix For: 0.14.0


[https://github.com/brills/arrow/commit/42063bb5297f34d9b98e831264c47add2da68591]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability

2019-06-07 Thread Wes McKinney
hi folks,

Our last release 0.13.0 occurred at the end of March. I think it would
be good to plot a course for the next release (0.14.0?) as soon as
possible. There are still a number of issues (such as the shared
library duplication issue in the Python wheels) that I think might
discourage us from releasing right now. Do you think that pushing for
a release candidate by the end of June is reasonable?

As a second matter (and this can be split off into a separate
discussion thread), the Arrow format and binary protocol have been
stable effectively since the 0.8.0 release which occurred in December
2017. While we have some details yet to iron out in compatibility
testing between implementations (for example, the Union question, see
mailing list discussion [1]) and new features (e.g. 64-bit offset
binary/string/list types), in theory these should not prevent us
necessarily from making a declaration of protocol stability. I think
this would have a lot of benefits for project onlookers to remove
various warnings around the codebase around stability and cautions
against persistence of protocol data. It's fair to say that if we _do_
make changes in the future, that there will be a transition path for
migrating persisted data, should it ever come to that.

I would suggest a "1.0.0" release either as our next release (instead
of 0.14.0) or the release right after that (if we need more time to
get affairs in order), with the guidance for users of:

PROTOCOL VERSION (1): Protocol version, so libraries bearing 1.x.y
will be forward and backwards compatible (though new metadata fields
introduced in newer versions will be dropped in older readers)
MAJOR VERSION (0): API changes possible (and indeed, likely) from
major release to major release
MINOR VERSION (0): No API changes, bug fix only release
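
A minimal sketch of how the three components above could be read, assuming
a version string of the form PROTOCOL.MAJOR.MINOR. The names ArrowVersion,
parse_version, protocol_compatible, api_may_differ and read_metadata are
hypothetical and are not part of any Arrow library:

```python
from typing import NamedTuple


class ArrowVersion(NamedTuple):
    """PROTOCOL.MAJOR.MINOR, following the scheme proposed above."""
    protocol: int
    major: int
    minor: int


def parse_version(text: str) -> ArrowVersion:
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected PROTOCOL.MAJOR.MINOR, got {text!r}")
    return ArrowVersion(*(int(p) for p in parts))


def protocol_compatible(a: str, b: str) -> bool:
    # Libraries sharing the protocol component can read each other's data,
    # in both directions.
    return parse_version(a).protocol == parse_version(b).protocol


def api_may_differ(a: str, b: str) -> bool:
    # API changes are allowed between major releases; a release that only
    # bumps the minor component is bug-fix only.
    va, vb = parse_version(a), parse_version(b)
    return (va.protocol, va.major) != (vb.protocol, vb.major)


def read_metadata(fields: dict, known_keys: set) -> dict:
    # An older reader keeps the metadata fields it knows about and drops
    # fields introduced by newer writers.
    return {k: v for k, v in fields.items() if k in known_keys}
```

Under these assumptions, "1.0.0" and "1.3.2" would be protocol compatible
even though their APIs may differ, while "1.3.0" and "1.3.1" would differ
only by bug fixes.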

Thoughts on the above?

Thanks
Wes

[1]: 
https://lists.apache.org/thread.html/e54e8ec096f665a8aef94155de3b6c567258c0d15209de4b966dd8da@%3Cdev.arrow.apache.org%3E


Re: Apache Arrow Tutorial at VLDB

2019-06-07 Thread Wes McKinney
hi Emanuel,

I'd be interested in participating in the workshop (and attending VLDB
generally). How long would the tutorial / presentation be expected to
be? There might be others from the community who could help out with
preparing materials.

Thanks,
Wes

On Thu, Jun 6, 2019 at 12:27 PM Emanuel Zgraggen  wrote:
>
> Hi,
>
> Not sure if this is the appropriate place to ask this, but I thought I'd
> give it a shot:
>
> I'm a postdoc in MIT's database group and I'm co-organizing a workshop on
> Big Data Open Source Systems (http://boss.dima.tu-berlin.de/) at this year's
> VLDB in Los Angeles. VLDB is one of the leading academic conferences on
> data management and database research (http://vldb.org/2019/).
>
> We are still looking for a few folks to give tutorials at the workshop and
> I was wondering if someone from the Arrow dev community would be interested
> in doing such a tutorial on Arrow?
>
> Apache Arrow has gained quite some interest in the academic community
> (including us here at MIT) and I think it would be great for more people to
> learn about it.
>
> Best,
> Emanuel


[jira] [Created] (ARROW-5527) [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data

2019-06-07 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5527:

         Summary: [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data
             Key: ARROW-5527
             URL: https://issues.apache.org/jira/browse/ARROW-5527
         Project: Apache Arrow
      Issue Type: Improvement
        Reporter: Francois Saint-Jacques


The current implementation uses `std::vector` and `std::string` with unbounded 
size. The refactor would take a memory pool in the constructor for buffer 
management and would get rid of vectors.

This will have the side effect of propagating Status to some calls (notably 
insert due to Upsize failing to resize).
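
A rough model of the idea, written in Python rather than C++ and not taken
from the Arrow code base. MemoryPool, PoolExhausted and BinaryMemoTable are
hypothetical names; the exception stands in for a non-OK Status, and only
the value bytes are pool-backed here (the offsets list and lookup dict stay
ordinary Python containers for brevity):

```python
class PoolExhausted(Exception):
    """Stands in for a non-OK Status returned by an allocation."""


class MemoryPool:
    """Hypothetical pool that refuses to hand out more than `limit` bytes."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.allocated = 0

    def grow(self, buf: bytearray, new_size: int) -> bytearray:
        extra = new_size - len(buf)
        if extra < 0:
            raise ValueError("grow() cannot shrink a buffer")
        if self.allocated + extra > self.limit:
            raise PoolExhausted(f"cannot grow buffer to {new_size} bytes")
        self.allocated += extra
        return buf + bytearray(extra)


class BinaryMemoTable:
    """Maps each distinct bytes value to a small integer id."""

    def __init__(self, pool: MemoryPool, initial_capacity: int = 16) -> None:
        self._pool = pool
        self._data = pool.grow(bytearray(), initial_capacity)
        self._used = 0
        self._offsets = [0]   # plain list, kept simple for the sketch
        self._index = {}      # value -> memo id

    def __len__(self) -> int:
        return len(self._index)

    def get_or_insert(self, value: bytes) -> int:
        existing = self._index.get(value)
        if existing is not None:
            return existing
        needed = self._used + len(value)
        if needed > len(self._data):
            # Upsize: may raise PoolExhausted; the table is left unchanged.
            new_capacity = max(needed, 2 * len(self._data))
            self._data = self._pool.grow(self._data, new_capacity)
        self._data[self._used:needed] = value
        self._used = needed
        self._offsets.append(needed)
        memo_id = len(self._index)
        self._index[value] = memo_id
        return memo_id

    def value(self, memo_id: int) -> bytes:
        start, end = self._offsets[memo_id], self._offsets[memo_id + 1]
        return bytes(self._data[start:end])
```

The point of the sketch is that insertion becomes fallible once storage is
bounded by a pool, so callers of get_or_insert have to handle the failure
case, which is what propagating Status would mean in the C++ code.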





[jira] [Created] (ARROW-5526) [Developer] Add more prominent notice to GitHub issue template to direct bug reports to JIRA

2019-06-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5526:

         Summary: [Developer] Add more prominent notice to GitHub issue template to direct bug reports to JIRA
             Key: ARROW-5526
             URL: https://issues.apache.org/jira/browse/ARROW-5526
         Project: Apache Arrow
      Issue Type: Improvement
      Components: Developer Tools
        Reporter: Wes McKinney
         Fix For: 0.14.0


Many people are deleting the issue template without reading it. I will change 
the template to feature a more prominent notice about filing bug reports in JIRA.

Recent examples:

* https://github.com/apache/arrow/issues/4489
* https://github.com/apache/arrow/issues/4495





[jira] [Created] (ARROW-5525) Enable continuous fuzzing

2019-06-07 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5525:


         Summary: Enable continuous fuzzing
             Key: ARROW-5525
             URL: https://issues.apache.org/jira/browse/ARROW-5525
         Project: Apache Arrow
      Issue Type: Test
      Components: C++
        Reporter: Marco Neumann


Since fuzzing kinda only works if done as a continuous background job, we 
should find a way of doing so. This likely requires a service other than 
Travis. Basic requirements are:
 * master builds should be submitted for fuzzing
 * project members should be informed about new crashes (ideally not via public 
issue due to potential security impact)


