Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
A few thoughts:

- I think we should iron out the remaining incompatibilities between Java and C++ before going to 1.0.0 (at least Union and NullType). I'm not sure I will have time to get to them before the next release, so I would prefer to aim for the subsequent release to be 1.0.0.
- For 1.0.0, should we change the metadata format version to a new naming scheme [1]? (It seems like more of a hassle than it is worth.)
- I'm a little concerned about the implications of forward-compatibility restrictions for format changes. For instance, the large list types would not be forward compatible (at least by some definitions); similarly, if we deal with compression [2], it would also seem not to be forward compatible. Would this mean we bump the format version number for each change even though they would be backwards compatible?

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
[2] https://issues.apache.org/jira/browse/ARROW-300

On Fri, Jun 7, 2019 at 12:42 PM Wes McKinney wrote:
> I agree re: marketing value of a 1.0.0 release.
>
> For the record, I think we should continue to allow the API of each
> respective library component to evolve freely and allow the
> individuals developing each to decide how to handle deprecations, API
> changes, etc., as we have up until this point. The project is still
> very much in "innovation mode" across the board, but some parts may
> grow more conservative than others. Having roughly time-based releases
> encourages everyone to be ready-to-release at any given time, and we
> develop a steady cadence of getting new functionality and
> improvements/fixes out the door.
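For readers following the forward-compatibility concern in this message, here is a minimal Python sketch of the distinction. The type names and the two reader sets are illustrative placeholders, not Arrow's actual type registry, and `can_read` is a hypothetical helper rather than any library function.

```python
# Types an older reader understands, and the same set extended with one
# new type. "LargeList" stands in for any type added after the older
# reader shipped (an assumption made for illustration).
OLD_READER_TYPES = {"Int", "Utf8", "List"}
NEW_READER_TYPES = OLD_READER_TYPES | {"LargeList"}


def can_read(reader_types, schema_types):
    """A reader can consume a schema only if it knows every type in it."""
    return set(schema_types) <= set(reader_types)


old_file = ["Int", "List"]        # written before the new type existed
new_file = ["Int", "LargeList"]   # written using the new type

# Backward compatible: the newer reader still reads older data.
print(can_read(NEW_READER_TYPES, old_file))
# Not forward compatible: the older reader cannot read the newer data.
print(can_read(OLD_READER_TYPES, new_file))
```

Under this reading, adding a type keeps backward compatibility but breaks forward compatibility only for data that actually uses the new type, which is the case the question about bumping the format version turns on.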
[jira] [Created] (ARROW-5532) Field Metadata Not Read
Trey Hakanson created ARROW-5532:

Summary: Field Metadata Not Read
Key: ARROW-5532
URL: https://issues.apache.org/jira/browse/ARROW-5532
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.13.0
Environment: Mac OSX 10.14, Chrome 74
Reporter: Trey Hakanson

Field metadata is not read when using `@apache-arrow/ts@0.13.0`. The example below also uses `pyarrow==0.13.0`.

Steps to reproduce:

Adding metadata:

```py
import pyarrow as pa
import pandas as pd

source = "sample.csv"
output = "sample.arrow"

df = pd.read_csv(source)
table = pa.Table.from_pandas(df)

# Attach {"foo": "bar"} as field-level metadata to every column's field
schema = pa.schema([
    column.field.add_metadata({"foo": "bar"})
    for column in table.columns
])

writer = pa.RecordBatchFileWriter(output, schema)
writer.write(table)
writer.close()
```

Reading field metadata using `pyarrow`:

```py
import pyarrow as pa

source = "sample.arrow"
field = "foo"

reader = pa.RecordBatchFileReader(source)
reader.schema.field_by_name(field).metadata
# Correctly shows `{"foo": "bar"}`
```

Reading field metadata using `@apache-arrow/ts`:

```ts
import { Table, Field, Type } from "@apache-arrow/ts";

const url = "https://example.com/sample.arrow";
const buf = await fetch(url).then(res => res.arrayBuffer());
const table = Table.from([new Uint8Array(buf)]);

for (let field of table.schema.fields) {
  field.metadata; // Incorrectly shows an empty map
}
```

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol
Great, thanks Jacques. I'll kick off a vote thread so we can hopefully get this approved.

On Fri, Jun 7, 2019 at 3:27 PM Jacques Nadeau wrote:
> I'm good with this. The consistent separator is a good improvement.
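As a rough illustration of the `custom_metadata` approach discussed in this thread, the following Python sketch models a field's custom metadata as a plain dict of strings and uses the two keys from the PR, `ARROW:extension:name` and `ARROW:extension:metadata`. The helper names, the registry, and the factory signature are hypothetical and do not correspond to any Arrow library API; representing the serialized metadata as a string is also an assumption.

```python
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"


def tag_extension(custom_metadata, type_name, serialized):
    """Return a copy of a field's custom_metadata carrying the extension keys."""
    tagged = dict(custom_metadata)
    tagged[EXTENSION_NAME_KEY] = type_name
    tagged[EXTENSION_METADATA_KEY] = serialized
    return tagged


def resolve_extension(custom_metadata, registry):
    """Look up the extension type named in custom_metadata.

    Returns None when the name is missing or not registered. The
    custom_metadata dict itself is never modified, so a reader that does
    not recognize the type can pass it along unchanged.
    """
    name = custom_metadata.get(EXTENSION_NAME_KEY)
    factory = registry.get(name)
    if factory is None:
        return None
    return factory(custom_metadata.get(EXTENSION_METADATA_KEY, ""))


# Hypothetical registry: maps a type name to a factory that receives the
# opaque serialized metadata.
registry = {"builtin.uuid": lambda serialized: ("uuid", serialized)}
meta = tag_extension({"foo": "bar"}, "builtin.uuid", "")
```

A service that has no entry for the name gets `None` back from `resolve_extension` and can forward `meta` as-is, which is the "dumb reader" behavior described in the discussion.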
Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol
I'm good with this. The consistent separator is a good improvement.

On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney wrote:
> hey Jacques,
>
> On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau wrote:
> >
> > Thanks for pushing this along. I think it is important. Sorry I'm coming
> > late to the conversation. Couple thoughts:
> >
> > - Should we reconsider having this be an independent optional field as
> > opposed to overloading custom_metadata? It avoids having the weird string
> > prefixing behavior
>
> This is one option that we've discussed. The downside of this is that
> it becomes another piece of metadata that Arrow implementations need
> to mind when they are passing through IPC messages. The idea is that
> "dumb" readers can simply ignore the metadata but pass it along in a
> subsequent message. For example, suppose a simplistic data
> service/microservice that evaluates a filter against record batches
> coming through. There might be columns with extension types that come
> through that the service does not recognize.
>
> In some implementations the custom_metadata member is preserved in
> schemas and survives IPC round trips, but this is a feature that IMHO
> should be implemented consistently in all Arrow implementations. For
> example, I believe that Java drops the custom_metadata as soon as the
> IPC protocol is parsed.
>
> Admittedly, this is not a huge issue, so if you had an extra member of
> Field like
>
> table ExtensionType {
>   name: string;
>   metadata: string;
> }
>
> ...
>
> table Field {
>   ...
>   custom_type: ExtensionType;
> }
>
> then that would work, too. It's more obtrusive to implementations as
> readers that do not recognize a type should still mind this metadata
> and pass it along in subsequent messages. If we embed in
> custom_metadata then this happens automatically (assuming that
> custom_metadata is preserved...)
>
> > - I'd be inclined to be much more stringent about type naming. Maybe even
> > make the name multiple parts to force the issue?
>
> I just updated my PR https://github.com/apache/arrow/pull/4332 to say
> also that colon ":" is the designated namespace separator and I've
> made the metadata keys
>
> ARROW:extension:name
> ARROW:extension:metadata
>
> As far as the actual type name, since it's application-defined, it
> might be better to leave this up to the developer-user. If we defined
> any "built-in extension types" (things like UUID come to mind) we
> might want to have a pseudo-namespace like "builtin.uuid",
> "builtin.ipv6", etc. for these
>
> Let me know what you think -- it would be great to start a vote on this
> soon.
>
> Thanks
> Wes
>
> > On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney wrote:
> >
> > > hi Micah,
> > >
> > > I have just updated my PR per your comments with more examples of
> > > extension types.
> > >
> > > https://github.com/apache/arrow/pull/4332
> > >
> > > Are there more comments about this? I can start a vote in a couple of
> > > days absent further opinions.
> > >
> > > Can someone volunteer to review David's Java PR? I would like to move
> > > this along so we have a chance of having working extension types in
> > > the 0.14 release. A number of people are also interested in bridging
> > > between pandas's ExtensionArray facility (for custom DataFrame column
> > > types [1]) and Arrow's ExtensionType
> > >
> > > Thanks
> > > Wes
> > >
> > > [1]: https://pandas.pydata.org/pandas-docs/stable/development/extending.html
> > >
> > > On Sat, May 18, 2019 at 6:25 PM Micah Kornfield wrote:
> > > >
> > > > Hi Wes,
> > > > Like I said, I think this approach looks good. I think what I'm looking
> > > > for is a little more documentation/examples on how additional types would
> > > > be handled. I think Tensor would be a good example; we also had questions
> > > > about INET addresses previously, maybe this would be another good
> > > > illustrative example. Providing examples of serialized metadata in the
> > > > docs would be useful (clarifying that these are opaque binary blobs that
> > > > will be passed along to extension type factories?)
> > > >
> > > > In this regard, I think it might be good to provide further
> > > > recommendations for the name of extension types: What do you think about
> > > > recommending that organizations/projects namespace them according to some
> > > > convention, so that there aren't conflicts and extensions can be shared?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Sat, May 18, 2019 at 12:00 PM Wes McKinney wrote:
> > > > >
> > > > > On Sat, May 18, 2019, 1:58 PM Wes McKinney wrote:
> > > > > >
> > > > > > Hi Micah,
> > > > > >
> > > > > > The use cases I'm aware of are mostly coming from proprietary
> > > > > > applications. My idea was for the extension metadata to be as unobtrusive
> > > > > > as possible. The only alternative as I see it would be to have an Extension
> > > > > > value in the Type union which would be more intrusive to
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
I agree re: marketing value of a 1.0.0 release.

For the record, I think we should continue to allow the API of each respective library component to evolve freely and allow the individuals developing each to decide how to handle deprecations, API changes, etc., as we have up until this point. The project is still very much in "innovation mode" across the board, but some parts may grow more conservative than others. Having roughly time-based releases encourages everyone to be ready-to-release at any given time, and we develop a steady cadence of getting new functionality and improvements/fixes out the door.

On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou wrote:
> I think there's a marketing merit to issuing a 1.0.0 release.
>
> Regards
>
> Antoine.
[jira] [Created] (ARROW-5531) [Python] Support binary/utf8 in Array.from_buffers
Wes McKinney created ARROW-5531:

Summary: [Python] Support binary/utf8 in Array.from_buffers
Key: ARROW-5531
URL: https://issues.apache.org/jira/browse/ARROW-5531
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Wes McKinney
Fix For: 0.14.0

See discussion in ARROW-2607
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
I think there's a marketing merit to issuing a 1.0.0 release.

Regards

Antoine.

On 7 Jun 2019 at 20:05, Wes McKinney wrote:
> So one idea is that we could call the next release 1.14.0. So the
> second number is the API version number. This encodes a sequencing of
> the evolution of the API. The library APIs are already decoupled from
> the binary serialization protocol, so I think we merely have to state
> that API changes and protocol changes are not related to each other.
[jira] [Created] (ARROW-5530) [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null behavior
Francois Saint-Jacques created ARROW-5530:

Summary: [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null behavior
Key: ARROW-5530
URL: https://issues.apache.org/jira/browse/ARROW-5530
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Francois Saint-Jacques
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
So one idea is that we could call the next release 1.14.0. So the second number is the API version number. This encodes a sequencing of the evolution of the API. The library APIs are already decoupled from the binary serialization protocol, so I think we merely have to state that API changes and protocol changes are not related to each other.

On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau wrote:
> It brings up an interesting point... do we couple the stability of the apis
> with the stability of the protocol. If the protocol is stable, we should
> start providing guarantees for it. How do we want to express these
> different velocities?
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
It brings up an interesting point... do we couple the stability of the APIs with the stability of the protocol? If the protocol is stable, we should start providing guarantees for it. How do we want to express these different velocities?

On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou wrote:
> I would say that perhaps some API cleanup is in order. Remove
> deprecated ones, review experimental APIs, perhaps mark experimental
> certain APIs that we forgot to...
>
> Regards
>
> Antoine.
[jira] [Created] (ARROW-5529) [Flight] Allow serving with multiple TLS certificates
David Li created ARROW-5529:

Summary: [Flight] Allow serving with multiple TLS certificates
Key: ARROW-5529
URL: https://issues.apache.org/jira/browse/ARROW-5529
Project: Apache Arrow
Issue Type: Improvement
Components: FlightRPC
Reporter: David Li
Assignee: David Li

We should allow serving a Flight service with more than one TLS certificate. This makes health checking easier in large deployments.
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
On 7 Jun 2019 at 19:44, Jacques Nadeau wrote:
> On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou wrote:
>
>> Hi Wes,
>>
>> On 7 Jun 2019 at 17:42, Wes McKinney wrote:
>>>
>>> I think
>>> this would have a lot of benefits for project onlookers to remove
>>> various warnings around the codebase around stability and cautions
>>> against persistence of protocol data. It's fair to say that if we _do_
>>> make changes in the future, that there will be a transition path for
>>> migrating persisted data, should it ever come to that.
>>
>> I think that's a good idea, but perhaps the stability promise shouldn't
>> cover the Flight protocol yet?
>
> Agreed.
>
>>> I would suggest a "1.0.0" release either as our next release (instead
>>> of 0.14.0) or the release right after that (if we need more time to
>>> get affairs in order), with the guidance for users of:
>>
>> I think we should first do a regular 0.14.0 with all that's on our plate
>> right now, then work towards a 1.0.0 as the release following that.
>
> What is different from your perspective? If the protocol hasn't changed in
> over a year, why not call it 1.0?

I would say that perhaps some API cleanup is in order. Remove deprecated ones, review experimental APIs, perhaps mark experimental certain APIs that we forgot to...

Regards

Antoine.
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou wrote:
>
> Hi Wes,
>
> On 7 Jun 2019 at 17:42, Wes McKinney wrote:
> >
> > I think
> > this would have a lot of benefits for project onlookers to remove
> > various warnings around the codebase around stability and cautions
> > against persistence of protocol data. It's fair to say that if we _do_
> > make changes in the future, that there will be a transition path for
> > migrating persisted data, should it ever come to that.
>
> I think that's a good idea, but perhaps the stability promise shouldn't
> cover the Flight protocol yet?

Agreed.

> > I would suggest a "1.0.0" release either as our next release (instead
> > of 0.14.0) or the release right after that (if we need more time to
> > get affairs in order), with the guidance for users of:
>
> I think we should first do a regular 0.14.0 with all that's on our plate
> right now, then work towards a 1.0.0 as the release following that.

What is different from your perspective? If the protocol hasn't changed in over a year, why not call it 1.0?
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
Hi Wes,

On 7 Jun 2019 at 17:42, Wes McKinney wrote:
>
> I think
> this would have a lot of benefits for project onlookers to remove
> various warnings around the codebase around stability and cautions
> against persistence of protocol data. It's fair to say that if we _do_
> make changes in the future, that there will be a transition path for
> migrating persisted data, should it ever come to that.

I think that's a good idea, but perhaps the stability promise shouldn't cover the Flight protocol yet?

> I would suggest a "1.0.0" release either as our next release (instead
> of 0.14.0) or the release right after that (if we need more time to
> get affairs in order), with the guidance for users of:

I think we should first do a regular 0.14.0 with all that's on our plate right now, then work towards a 1.0.0 as the release following that.

> PROTOCOL VERSION (1): Protocol version, so libraries bearing 1.x.y
> will be forward and backwards compatible (though new metadata fields
> introduced in newer versions will be dropped in older readers)
> MAJOR VERSION (0): API changes possible (and indeed, likely) from
> major release to major release
> MINOR VERSION (0): No API changes, bug fix only release
>
> Thoughts on the above?

That sounds reasonable to me.

Regards

Antoine.
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
You didn't mention it specifically, but one big thing I'd like to get into the next release is prepackaged Flight tools for C++, Java and Python.

As far as 1.0, I think it's time and would vote for naming the next release 1.0. Thanks for bringing this up!

On Fri, Jun 7, 2019 at 8:43 AM Wes McKinney wrote:
> hi folks,
>
> Our last release 0.13.0 occurred at the end of March. I think it would
> be good to plot a course for the next release (0.14.0?) as soon as
> possible. There are still a number of issues (such as the shared
> library duplication issue in the Python wheels) that I think might
> discourage us from releasing right now. Do you think that pushing for
> a release candidate by the end of June is reasonable?
>
> As a second matter (and this can be split off into a separate
> discussion thread), the Arrow format and binary protocol have been
> stable effectively since the 0.8.0 release which occurred in December
> 2017. While we have some details yet to iron out in compatibility
> testing between implementations (for example, the Union question, see
> mailing list discussion [1]) and new features (e.g. 64-bit offset
> binary/string/list types), in theory these should not prevent us
> necessarily from making a declaration of protocol stability. I think
> this would have a lot of benefits for project onlookers to remove
> various warnings around the codebase around stability and cautions
> against persistence of protocol data. It's fair to say that if we _do_
> make changes in the future, that there will be a transition path for
> migrating persisted data, should it ever come to that.
>
> I would suggest a "1.0.0" release either as our next release (instead
> of 0.14.0) or the release right after that (if we need more time to
> get affairs in order), with the guidance for users of:
>
> PROTOCOL VERSION (1): Protocol version, so libraries bearing 1.x.y
> will be forward and backwards compatible (though new metadata fields
> introduced in newer versions will be dropped in older readers)
> MAJOR VERSION (0): API changes possible (and indeed, likely) from
> major release to major release
> MINOR VERSION (0): No API changes, bug fix only release
>
> Thoughts on the above?
>
> Thanks
> Wes
>
> [1]: https://lists.apache.org/thread.html/e54e8ec096f665a8aef94155de3b6c567258c0d15209de4b966dd8da@%3Cdev.arrow.apache.org%3E
[jira] [Created] (ARROW-5528) Concatenate() crashes when concatenating empty binary arrays.
Zhuo Peng created ARROW-5528:

Summary: Concatenate() crashes when concatenating empty binary arrays.
Key: ARROW-5528
URL: https://issues.apache.org/jira/browse/ARROW-5528
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.13.0
Reporter: Zhuo Peng
Fix For: 0.14.0

[https://github.com/brills/arrow/commit/42063bb5297f34d9b98e831264c47add2da68591]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
hi folks,

Our last release 0.13.0 occurred at the end of March. I think it would be good to plot a course for the next release (0.14.0?) as soon as possible. There are still a number of issues (such as the shared library duplication issue in the Python wheels) that I think might discourage us from releasing right now. Do you think that pushing for a release candidate by the end of June is reasonable?

As a second matter (and this can be split off into a separate discussion thread), the Arrow format and binary protocol have been stable effectively since the 0.8.0 release which occurred in December 2017. While we have some details yet to iron out in compatibility testing between implementations (for example, the Union question, see mailing list discussion [1]) and new features (e.g. 64-bit offset binary/string/list types), in theory these should not prevent us necessarily from making a declaration of protocol stability. I think this would have a lot of benefits for project onlookers to remove various warnings around the codebase around stability and cautions against persistence of protocol data. It's fair to say that if we _do_ make changes in the future, that there will be a transition path to migrate persisted data, should it ever come to that.

I would suggest a "1.0.0" release either as our next release (instead of 0.14.0) or the release right after that (if we need more time to get affairs in order), with the guidance for users of:

PROTOCOL VERSION (1): Protocol version, so libraries bearing 1.x.y will be forward and backwards compatible (though new metadata fields introduced in newer versions will be dropped in older readers)
MAJOR VERSION (0): API changes possible (and indeed, likely) from major release to major release
MINOR VERSION (0): No API changes, bug fix only release

Thoughts on the above?

Thanks
Wes

[1]: https://lists.apache.org/thread.html/e54e8ec096f665a8aef94155de3b6c567258c0d15209de4b966dd8da@%3Cdev.arrow.apache.org%3E
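To make the proposed PROTOCOL.MAJOR.MINOR reading concrete, here is a minimal Python sketch of how a consumer might interpret such version strings. It is only an illustration of the scheme as described in the message above: `ProposedVersion`, `protocol_compatible` and `api_may_differ` are hypothetical names invented for this example and are not part of any Arrow library.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ProposedVersion:
    """Hypothetical holder for the proposed PROTOCOL.MAJOR.MINOR scheme."""

    protocol: int  # protocol (format) version
    major: int     # API changes possible between major releases
    minor: int     # bug-fix-only releases, no API changes

    @classmethod
    def parse(cls, text: str) -> "ProposedVersion":
        # Split "1.2.3" into its three numeric components.
        parts = text.strip().split(".")
        if len(parts) != 3:
            raise ValueError(f"expected PROTOCOL.MAJOR.MINOR, got {text!r}")
        protocol, major, minor = (int(part) for part in parts)
        return cls(protocol, major, minor)


def protocol_compatible(a: ProposedVersion, b: ProposedVersion) -> bool:
    # Assumption: libraries sharing the first number can exchange data
    # in both directions (forward and backward compatible).
    return a.protocol == b.protocol


def api_may_differ(a: ProposedVersion, b: ProposedVersion) -> bool:
    # Assumption: the library API is only guaranteed unchanged when both
    # the protocol and the major numbers match.
    return (a.protocol, a.major) != (b.protocol, b.major)


old = ProposedVersion.parse("1.0.0")
new = ProposedVersion.parse("1.2.1")
print(protocol_compatible(old, new))  # True: both speak protocol version 1
print(api_may_differ(old, new))       # True: major release 0 vs 2
```

Under this reading, a reader built from 1.0.0 could still consume data written by 1.2.1 (dropping any metadata fields it does not know), while code written against the 1.0.0 API might need changes to build against 1.2.1.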
Re: Apache Arrow Tutorial at VLDB
hi Emanuel,

I'd be interested in participating in the workshop (and attending VLDB generally). How long would the tutorial / presentation be expected to be? There might be others from the community who could help out with preparing materials.

Thanks,
Wes

On Thu, Jun 6, 2019 at 12:27 PM Emanuel Zgraggen wrote:
>
> Hi,
>
> Not sure if this is the appropriate place to ask this, but I thought I'd give it a shot:
>
> I'm a postdoc in MIT's database group and I'm co-organizing a workshop on Big Data Open Source Systems (http://boss.dima.tu-berlin.de/) at this year's VLDB in Los Angeles. VLDB is one of the leading academic conferences on data management and database research (http://vldb.org/2019/).
>
> We are still looking for a few folks to give tutorials at the workshop and I was wondering if someone from the Arrow dev community would be interested in doing such a tutorial on Arrow?
>
> Apache Arrow has gained quite some interest in the academic community (including us here at MIT) and I think it would be great for more people to learn about it.
>
> Best,
> Emanuel
[jira] [Created] (ARROW-5527) [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data
Francois Saint-Jacques created ARROW-5527:

Summary: [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data
Key: ARROW-5527
URL: https://issues.apache.org/jira/browse/ARROW-5527
Project: Apache Arrow
Issue Type: Improvement
Reporter: Francois Saint-Jacques

The current implementation uses `std::vector` and `std::string` with unbounded size. The refactor would take a memory pool in the constructor for buffer management and would get rid of vectors. This will have the side effect of propagating Status to some calls (notably insert due to Upsize failing to resize).
[jira] [Created] (ARROW-5526) [Developer] Add more prominent notice to GitHub issue template to direct bug reports to JIRA
Wes McKinney created ARROW-5526:

Summary: [Developer] Add more prominent notice to GitHub issue template to direct bug reports to JIRA
Key: ARROW-5526
URL: https://issues.apache.org/jira/browse/ARROW-5526
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Wes McKinney
Fix For: 0.14.0

Many people are deleting the issue template without reading it. I will change the template to feature a more prominent notice about filing bug reports in JIRA.

Recent examples:
* https://github.com/apache/arrow/issues/4489
* https://github.com/apache/arrow/issues/4495
[jira] [Created] (ARROW-5525) Enable continuous fuzzing
Marco Neumann created ARROW-5525:

Summary: Enable continuous fuzzing
Key: ARROW-5525
URL: https://issues.apache.org/jira/browse/ARROW-5525
Project: Apache Arrow
Issue Type: Test
Components: C++
Reporter: Marco Neumann

Since fuzzing only really works if done as a continuous background job, we should find a way of running it that way. This likely requires a service other than Travis. Basic requirements are:
* master builds should be submitted for fuzzing
* project members should be informed about new crashes (ideally not via public issue due to potential security impact)