[jira] [Created] (ARROW-5550) [C++] Refactor Buffers method on concatenate to consolidate code.
Micah Kornfield created ARROW-5550: -- Summary: [C++] Refactor Buffers method on concatenate to consolidate code. Key: ARROW-5550 URL: https://issues.apache.org/jira/browse/ARROW-5550 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Micah Kornfield See https://github.com/apache/arrow/pull/4498/files for reference. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5549) [C++][Docs] Summarize function argument type guidelines in developers/cpp.rst
Wes McKinney created ARROW-5549: --- Summary: [C++][Docs] Summarize function argument type guidelines in developers/cpp.rst Key: ARROW-5549 URL: https://issues.apache.org/jira/browse/ARROW-5549 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 0.14.0 We have a number of spoken and unspoken guidelines around argument passing -- some of them are codified in the Google style guide while others (e.g. use of smart pointers as function arguments) are applied via convention and enforced in code reviews. I propose to add a section to make each case explicit so that our code can become more hygienic and our code reviews less intense
[jira] [Created] (ARROW-5548) [Documentation] http://arrow.apache.org/docs/latest/ is not latest
Neal Richardson created ARROW-5548: -- Summary: [Documentation] http://arrow.apache.org/docs/latest/ is not latest Key: ARROW-5548 URL: https://issues.apache.org/jira/browse/ARROW-5548 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Website Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 0.14.0 In testing out the Dockerfile for building the docs, I noticed it created an asf-site/docs/latest directory at the end. Out of curiosity, I went to [http://arrow.apache.org/docs/latest/], and it reports a version of {{0.11.1.dev473+g6ed02454}}, which is not close to "latest". I'd like to see this "latest" site get updated automatically. I'm working on getting this Docker setup complete (cf. https://issues.apache.org/jira/browse/ARROW-5497), and once that's working, it should be feasible to add a Travis-CI job to update /docs/latest on every commit to apache/arrow master. cc [~wesmckinn]
Re: [DISCUSS] 32- and 64-bit decimal types
On Mon, Jun 10, 2019 at 4:18 PM Wes McKinney wrote: > > On the 1.0.0 protocol discussion, one item that we've skirted for some > time is other decimal sizes: > > https://issues.apache.org/jira/browse/ARROW-2009 > > I understand this is a loaded subject since a deliberate decision was > made to remove types from the initial Java implementation of Arrow > that was forked from Apache Drill. However, it's a friction point that > has come up in a number of scenarios as many database and storage > systems have 32- and 64-bit variants for low precision decimal data. > As an example Apache Kudu [1] has all three types, and the Parquet > columnar format allows not only 32/64 bit storage but fixed size > binary (size a function of precision) and variable-length binary > encoding [2]. > > One of the arguments against using these types in a computational > setting is that many mathematical operations will necessarily trigger > an up-promotion to a larger type. It's hard for us to predict how > people will use the Arrow format, though, and the current situation is > forcing an up-promotion regardless of how the format is being used, > even for simple data transport > > In anticipation of long-term needs, I would suggest a possible solution of: > > * Adding bitWidth field to Decimal table in Schema.fbs [3] with > default value of 128 > * Constraining bit widths to 32, 64, and 128 bits for the time being > * Permit storage of smaller precision decimals in larger storage like > we have now BTW, even if we do not allow 32/64 bit decimals in the format, we should consider adding a bitWidth field with static value 128 as a matter of future-proofing the metadata. This change would make it so that old readers are unable to see the bitWidth field, so the addition would not be possible without bumping the protocol version. 
> > If this isn't deemed desirable by the community, decimal extension > types could be employed for serialization-free transport for smaller > decimals, but I view this as suboptimal. > > Interested in the thoughts of others. > > thanks > Wes > > [1]: > https://github.com/apache/kudu/blob/master/src/kudu/common/common.proto#L55 > [2]: > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal > [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L121
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
Sounds good. On Mon, Jun 10, 2019 at 11:06 AM Wes McKinney wrote: > Hi all, > > OK, it sounds like there is reasonable consensus behind the plan: > > * Make a 0.14.0 release in the near future (later this month?) > * Publicize that the next release will be 1.0.0, in a "speak now or > hold your peace" fashion > * Release 1.0.0 as following release. I would suggest not waiting too > long, so late August / early September time frame > > I'm going to continue grooming the 0.14.0 backlog to help refine the > scope of what still needs to be done for C++/Python to get the next > release out. If the stakeholders in various project subcomponents > could also groom the backlog and mark any blockers, that would be very > helpful. > > I suggest shooting for a release candidate for 0.14.0 either the week > of June 24 or July 1 (depending on where things stand) > > Thanks > Wes > > On Mon, Jun 10, 2019 at 2:39 AM Sutou Kouhei wrote: > > > > Hi, > > > > I think that 0.14.0 is better for the next version. > > > > People who don't try Apache Arrow yet to wait 1.0.0 will use > > Apache Arrow when we release 1.0.0. If 1.0.0 satisfies them, > > we will get more users and contributors by 1.0.0. They may > > not care protocol stability. They may just care "1.0.0". > > > > We'll be able to release less problem 1.0.0 by releasing > > 0.14.0 as RC for 1.0.0. 0.14.0 will be used more people than > > 1.0.0-RCX. 0.14.0 users will find critical problems. 
> > > > > > Thanks, > > -- > > kou > > > > In > > "Re: [DISCUSS] Timing of release and making a 1.0.0 release marking > Arrow protocol stability" on Fri, 7 Jun 2019 22:28:22 -0700, > > Micah Kornfield wrote: > > > > > A few thoughts: > > > - I think we should iron out the remaining incompatibilities between > java > > > and C++ before going to 1.0.0 (at least Union and NullType), and I'm > not > > > sure I will have time to them before the next release, so I would > prefer to > > > try to aim for the subsequent release to make it 1.0.0 > > > - For 1.0.0 should we change the metadata format version to a new > naming > > > scheme [1] (seems like more of a hassle then it is worth)? > > > - I'm a little concerned about the implications for > forward-compatibility > > > restrictions for format changes. For instance the large list types > would > > > not be forward compatible (at least by some definitions), similarly if > we > > > deal with compression [2] it would also seem to not be forward > compatible. > > > Would this mean we bump the format version number for each change even > > > though they would be backwards compatible? > > > > > > Thanks, > > > Micah > > > > > > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22 > > > [2] https://issues.apache.org/jira/browse/ARROW-300 > > > > > > On Fri, Jun 7, 2019 at 12:42 PM Wes McKinney > wrote: > > > > > >> I agree re: marketing value of a 1.0.0 release. > > >> > > >> For the record, I think we should continue to allow the API of each > > >> respective library component to evolve freely and allow the > > >> individuals developing each to decide how to handle deprecations, API > > >> changes, etc., as we have up until this point. The project is still > > >> very much in "innovation mode" across the board, but some parts may > > >> grow more conservative than others. 
Having roughly time-based releases > > >> encourages everyone to be ready-to-release at any given time, and we > > >> develop a steady cadence of getting new functionality and > > >> improvements/fixes out the door. > > >> > > >> On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou > wrote: > > >> > > > >> > > > >> > I think there's a marketing merit to issuing a 1.0.0 release. > > >> > > > >> > Regards > > >> > > > >> > Antoine. > > >> > > > >> > > > >> > Le 07/06/2019 à 20:05, Wes McKinney a écrit : > > >> > > So one idea is that we could call the next release 1.14.0. So the > > >> > > second number is the API version number. This encodes a > sequencing of > > >> > > the evolution of the API. The library APIs are already decoupled > from > > >> > > the binary serialization protocol, so I think we merely have to > state > > >> > > that API changes and protocol changes are not related to each > other. > > >> > > > > >> > > On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau < > jacq...@apache.org> > > >> wrote: > > >> > >> > > >> > >> It brings up an interesting point... do we couple the stability > of > > >> the apis > > >> > >> with the stability of the protocol. If the protocol is stable, we > > >> should > > >> > >> start providing guarantees for it. How do we want to express > these > > >> > >> different velocities? > > >> > >> > > >> > >> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou < > anto...@python.org> > > >> wrote: > > >> > >> > > >> > >>> > > >> > >>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit : > > >> > On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou < > anto...@python.org> > > >> > >>> wrote: > > >> > > > >> > > Hi Wes, > > >> >
[DISCUSS] 32- and 64-bit decimal types
On the 1.0.0 protocol discussion, one item that we've skirted for some time is other decimal sizes:

https://issues.apache.org/jira/browse/ARROW-2009

I understand this is a loaded subject since a deliberate decision was made to remove types from the initial Java implementation of Arrow that was forked from Apache Drill. However, it's a friction point that has come up in a number of scenarios, as many database and storage systems have 32- and 64-bit variants for low precision decimal data. As an example, Apache Kudu [1] has all three types, and the Parquet columnar format allows not only 32/64 bit storage but fixed size binary (size a function of precision) and variable-length binary encoding [2].

One of the arguments against using these types in a computational setting is that many mathematical operations will necessarily trigger an up-promotion to a larger type. It's hard for us to predict how people will use the Arrow format, though, and the current situation is forcing an up-promotion regardless of how the format is being used, even for simple data transport.

In anticipation of long-term needs, I would suggest a possible solution of:

* Adding a bitWidth field to the Decimal table in Schema.fbs [3] with a default value of 128
* Constraining bit widths to 32, 64, and 128 bits for the time being
* Permitting storage of smaller precision decimals in larger storage like we have now

If this isn't deemed desirable by the community, decimal extension types could be employed for serialization-free transport for smaller decimals, but I view this as suboptimal.

Interested in the thoughts of others.

thanks
Wes

[1]: https://github.com/apache/kudu/blob/master/src/kudu/common/common.proto#L55
[2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
[3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L121
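For concreteness, the Parquet rules referenced in [2] fix the precision cutoffs for the smaller storage widths: an int32 can hold decimals of up to 9 digits of precision, an int64 up to 18, and a 128-bit value up to 38. A minimal sketch of that precision-to-bit-width mapping (the function name is illustrative, not an Arrow API):

```python
# Sketch: pick the smallest proposed decimal bit width (32, 64, or 128)
# that can represent a given precision. Cutoffs follow the Parquet
# convention: 9 digits fit in int32, 18 in int64, 38 in 128 bits.
def smallest_decimal_bit_width(precision: int) -> int:
    if not 1 <= precision <= 38:
        raise ValueError(f"unsupported precision: {precision}")
    if precision <= 9:
        return 32
    if precision <= 18:
        return 64
    return 128
```

Under the third bullet above, nothing would stop a writer from still storing a precision-5 decimal in 128 bits; this function only shows the smallest width the metadata change would permit.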
[jira] [Created] (ARROW-5547) [C++][Flight] arrow-flight.pc isn't provided
Sutou Kouhei created ARROW-5547: --- Summary: [C++][Flight] arrow-flight.pc isn't provided Key: ARROW-5547 URL: https://issues.apache.org/jira/browse/ARROW-5547 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: Sutou Kouhei
[jira] [Created] (ARROW-5546) [C#] Remove IArrowArray and use Array base class.
Eric Erhardt created ARROW-5546: --- Summary: [C#] Remove IArrowArray and use Array base class. Key: ARROW-5546 URL: https://issues.apache.org/jira/browse/ARROW-5546 Project: Apache Arrow Issue Type: Improvement Components: C# Affects Versions: 0.13.0 Reporter: Eric Erhardt In .NET libraries, we have historically favored classes (abstract or otherwise) over interfaces. See [Choosing Between Classes and Interfaces|https://docs.microsoft.com/en-us/previous-versions/dotnet/netframework-4.0/ms229013(v%3dvs.100)]. The main reasoning is that you can add members to a class over time, but once you ship an interface, it can never be changed; you can only add new interfaces. In light of this, we should remove the IArrowArray interface and instead just use the base `Array` class as the abstraction for all Arrow arrays. As part of this, we should also consider renaming `Array` to `ArrowArray`, since it conflicts with the very common System.Array type in .NET.
[VOTE] Formalizing "Extension Type" metadata in Arrow binary protocol
hi folks,

In two mailing list threads [1] [2] we have discussed adding an "extension type" mechanism to the Arrow binary/IPC protocol. The idea is to be able to "annotate" built-in Arrow data types with a type name and serialized type data/metadata so that users can implement their own custom columnar data containers that contain application-defined business logic not built in to the Arrow libraries. This is designed to be non-obtrusive: readers who are not aware of an extension type can interact with the built-in Arrow type opaquely, and propagate the extension metadata unmodified.

As two examples:

* "uuid" may annotate "fixed size binary of value width 16 bytes"
* "latitude-longitude" may annotate "struct" or similar

An implementation may provide specialized columnar containers with additional business logic around manipulating such data in-memory as required for application development.

We also have prototype implementations of this mechanism ready to go in C++ and Java. I have proposed language additions to the specification [3] and the C++ implementation with the following tenets:

- The custom_metadata Flatbuffers field shall use the colon character ":" as a namespace separator
- "ARROW" is designated as a reserved namespace in custom_metadata, for example "ARROW:property"
- There may be multiple levels of namespacing, for example: "ARROW:myorg:property_name"
- Extension type fields "ARROW:extension:name" and "ARROW:extension:metadata" are reserved in custom_metadata to enable serialization of extension type information
- The details of implementation and how extension types are exposed to library users are implementation dependent

Please vote to accept these changes (see [3] for the actual changes). The vote will be open for at least 72 hours.

[ ] +1: Adopt these changes into the Arrow columnar format specification
[ ] +0: . . .
[ ] -1: I disagree because . . .
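The reserved-key tenets above can be illustrated with a small sketch. Only the two "ARROW:extension:*" key names come from the proposal; the dict standing in for a field and its custom_metadata is purely illustrative:

```python
# Sketch of the reserved-key convention: an extension type rides along as
# two reserved entries in a field's custom_metadata key/value pairs.
def make_extension_metadata(name: str, serialized: str) -> dict:
    """Build the two reserved custom_metadata entries for an extension type."""
    return {
        "ARROW:extension:name": name,
        "ARROW:extension:metadata": serialized,
    }

# Example from the text: a "uuid" extension annotating fixed size binary
# of value width 16 bytes. The surrounding dict is a stand-in for a real
# Flatbuffers Field, not an Arrow structure.
uuid_field = {
    "type": "fixed_size_binary",
    "byte_width": 16,
    "custom_metadata": make_extension_metadata("uuid", ""),
}
```

A reader that does not recognize "uuid" can still process the column as plain 16-byte fixed size binary and pass the two metadata entries through unchanged.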
Here is my vote: +1 [1]: https://lists.apache.org/thread.html/96c3f5fe64f45a4c5ccac0562dbfd356b76cd722aa521100b5988d40@%3Cdev.arrow.apache.org%3E [2]: https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E [3]: https://github.com/apache/arrow/pull/4332
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
Hi all,

OK, it sounds like there is reasonable consensus behind the plan:

* Make a 0.14.0 release in the near future (later this month?)
* Publicize that the next release will be 1.0.0, in a "speak now or hold your peace" fashion
* Release 1.0.0 as the following release. I would suggest not waiting too long, so a late August / early September time frame

I'm going to continue grooming the 0.14.0 backlog to help refine the scope of what still needs to be done for C++/Python to get the next release out. If the stakeholders in various project subcomponents could also groom the backlog and mark any blockers, that would be very helpful.

I suggest shooting for a release candidate for 0.14.0 either the week of June 24 or July 1 (depending on where things stand).

Thanks
Wes

On Mon, Jun 10, 2019 at 2:39 AM Sutou Kouhei wrote: > > Hi, > > I think that 0.14.0 is better for the next version. > > People who don't try Apache Arrow yet to wait 1.0.0 will use > Apache Arrow when we release 1.0.0. If 1.0.0 satisfies them, > we will get more users and contributors by 1.0.0. They may > not care protocol stability. They may just care "1.0.0". > > We'll be able to release less problem 1.0.0 by releasing > 0.14.0 as RC for 1.0.0. 0.14.0 will be used more people than > 1.0.0-RCX. 0.14.0 users will find critical problems. > > > Thanks, > -- > kou > > In > "Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow > protocol stability" on Fri, 7 Jun 2019 22:28:22 -0700, > Micah Kornfield wrote: > > > A few thoughts: > > - I think we should iron out the remaining incompatibilities between java > > and C++ before going to 1.0.0 (at least Union and NullType), and I'm not > > sure I will have time to them before the next release, so I would prefer to > > try to aim for the subsequent release to make it 1.0.0 > > - For 1.0.0 should we change the metadata format version to a new naming > > scheme [1] (seems like more of a hassle then it is worth)? 
> > - I'm a little concerned about the implications for forward-compatibility > > restrictions for format changes. For instance the large list types would > > not be forward compatible (at least by some definitions), similarly if we > > deal with compression [2] it would also seem to not be forward compatible. > > Would this mean we bump the format version number for each change even > > though they would be backwards compatible? > > > > Thanks, > > Micah > > > > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22 > > [2] https://issues.apache.org/jira/browse/ARROW-300 > > > > On Fri, Jun 7, 2019 at 12:42 PM Wes McKinney wrote: > > > >> I agree re: marketing value of a 1.0.0 release. > >> > >> For the record, I think we should continue to allow the API of each > >> respective library component to evolve freely and allow the > >> individuals developing each to decide how to handle deprecations, API > >> changes, etc., as we have up until this point. The project is still > >> very much in "innovation mode" across the board, but some parts may > >> grow more conservative than others. Having roughly time-based releases > >> encourages everyone to be ready-to-release at any given time, and we > >> develop a steady cadence of getting new functionality and > >> improvements/fixes out the door. > >> > >> On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou wrote: > >> > > >> > > >> > I think there's a marketing merit to issuing a 1.0.0 release. > >> > > >> > Regards > >> > > >> > Antoine. > >> > > >> > > >> > Le 07/06/2019 à 20:05, Wes McKinney a écrit : > >> > > So one idea is that we could call the next release 1.14.0. So the > >> > > second number is the API version number. This encodes a sequencing of > >> > > the evolution of the API. The library APIs are already decoupled from > >> > > the binary serialization protocol, so I think we merely have to state > >> > > that API changes and protocol changes are not related to each other. 
> >> > > > >> > > On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau > >> wrote: > >> > >> > >> > >> It brings up an interesting point... do we couple the stability of > >> the apis > >> > >> with the stability of the protocol. If the protocol is stable, we > >> should > >> > >> start providing guarantees for it. How do we want to express these > >> > >> different velocities? > >> > >> > >> > >> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou > >> wrote: > >> > >> > >> > >>> > >> > >>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit : > >> > On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou > >> > >>> wrote: > >> > > >> > > Hi Wes, > >> > > > >> > > Le 07/06/2019 à 17:42, Wes McKinney a écrit : > >> > >> > >> > >> I think > >> > >> this would have a lot of benefits for project onlookers to remove > >> > >> various warnings around the codebase around stability and cautions > >> > >> against persistence of protocol data. It's fair to say that if we > >> _do_ > >> > >> make changes in the future, th
[jira] [Created] (ARROW-5545) Clarify expectation of UTC values for timestamps with time zones in C++ API docs
TP Boudreau created ARROW-5545: -- Summary: Clarify expectation of UTC values for timestamps with time zones in C++ API docs Key: ARROW-5545 URL: https://issues.apache.org/jira/browse/ARROW-5545 Project: Apache Arrow Issue Type: Improvement Reporter: TP Boudreau Assignee: TP Boudreau Fix For: 0.14.0 For timestamp datatypes, if the timezone parameter is non-empty, the int64 array values in the associated column are assumed to be normalized to UTC. This requirement should be made clear to the C++ API user. (It can be inferred from the flatbuffers schema, but that internal implementation document probably wouldn't ordinarily be consulted by a C++ API consumer.)
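The expectation described here — a timestamp column with a non-empty time zone stores UTC-normalized int64 values — can be sketched with the stdlib. `to_utc_micros` is an illustrative helper, not part of any Arrow API:

```python
# Sketch of the UTC-normalization rule: convert an aware wall-clock
# datetime to UTC and store the epoch offset in microseconds, the kind
# of value a timestamp[us, tz=...] column would hold.
from datetime import datetime, timedelta, timezone

def to_utc_micros(dt: datetime) -> int:
    if dt.tzinfo is None:
        raise ValueError("timestamp-with-zone columns require aware datetimes")
    utc = dt.astimezone(timezone.utc)
    return int(utc.timestamp() * 1_000_000)

# 2019-06-10 12:00 at UTC-4 and 2019-06-10 16:00 UTC are the same instant,
# so they must produce the same stored value.
minus_four = timezone(timedelta(hours=-4))
assert to_utc_micros(datetime(2019, 6, 10, 12, 0, tzinfo=minus_four)) == \
    to_utc_micros(datetime(2019, 6, 10, 16, 0, tzinfo=timezone.utc))
```

The column's time zone string then only describes how to display the values, not how they are stored.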
[jira] [Created] (ARROW-5544) [Archery] should not return non-zero in `benchmark diff` sub command on regression
Francois Saint-Jacques created ARROW-5544: - Summary: [Archery] should not return non-zero in `benchmark diff` sub command on regression Key: ARROW-5544 URL: https://issues.apache.org/jira/browse/ARROW-5544 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques When a regression is detected but the command ran successfully, it should return zero. Currently it returns the number of regressions. This is to play better with ursabot. It should be left to the user to decide what to do with the JSON data.
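The requested behaviour, exit zero whenever the command itself ran successfully and report regressions only in the output, might look like the following. `report_benchmark_diff` is a hypothetical stand-in, not archery's actual code:

```python
# Hypothetical sketch of the exit-code convention this issue asks for:
# print the comparison as JSON, count regressions for the caller, but
# return 0 because the run itself succeeded.
import json

def report_benchmark_diff(comparisons: list) -> int:
    regressions = [c for c in comparisons if c.get("regression")]
    print(json.dumps({"comparisons": comparisons,
                      "regressions": len(regressions)}))
    # Returning len(regressions) here (the old behaviour) would make any
    # detected regression look like a command failure to the shell.
    return 0
```

A CI driver such as ursabot can then distinguish "the benchmark run failed" (non-zero exit) from "the run succeeded and found regressions" (zero exit, inspect the JSON).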
[jira] [Created] (ARROW-5543) [Documentation] Migrate FAQ page to Sphinx / rst around release time
Wes McKinney created ARROW-5543: --- Summary: [Documentation] Migrate FAQ page to Sphinx / rst around release time Key: ARROW-5543 URL: https://issues.apache.org/jira/browse/ARROW-5543 Project: Apache Arrow Issue Type: Improvement Reporter: Wes McKinney Fix For: 0.14.0 In ARROW-973, a Markdown page with the FAQ was added. When we are close to publishing a new version of the Sphinx site, it would make sense to move the FAQ to the main docs project and link to it from the project front page.
[jira] [Created] (ARROW-5542) [Java] Bootstrap initial developer documentation in docs/source/developers/java.rst
Wes McKinney created ARROW-5542: --- Summary: [Java] Bootstrap initial developer documentation in docs/source/developers/java.rst Key: ARROW-5542 URL: https://issues.apache.org/jira/browse/ARROW-5542 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Wes McKinney Fix For: 0.14.0 The project lacks prose documentation about Java development. I propose to begin a section about it in the Sphinx project
[jira] [Created] (ARROW-5541) [R] cast from negative int32 to uint32 and uint64 are now safe
Romain François created ARROW-5541: -- Summary: [R] cast from negative int32 to uint32 and uint64 are now safe Key: ARROW-5541 URL: https://issues.apache.org/jira/browse/ARROW-5541 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Romain François Assignee: Romain François Fix For: 0.14.0 The tests just need some updates.
[jira] [Created] (ARROW-5540) pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
Michał Kujawski created ARROW-5540: -- Summary: pa.lib.tzinfo_to_string(tz) throws ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string Key: ARROW-5540 URL: https://issues.apache.org/jira/browse/ARROW-5540 Project: Apache Arrow Issue Type: Bug Reporter: Michał Kujawski *Overview:* When trying to save a DataFrame to Parquet, an error is thrown while parsing a column with the following properties:
{code:java}
dtype: datetime64[ns, tzoffset(None, -14400)]
dtype.tz: tzoffset(None, -14400)
{code}
*Error:*
{code:java}
ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
{code}
*Error stack:*
{code:java}
File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
File "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 480, in dataframe_to_arrays types)
File "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 209, in construct_metadata field_name=sanitized_name)
File "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 153, in get_column_metadata string_dtype, extra_metadata = get_extension_dtype_info(column)
File "/home/koojav/projects/toptal/teftel/.venv/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 126, in get_extension_dtype_info metadata = {'timezone': pa.lib.tzinfo_to_string(dtype.tz)}
File "pyarrow/types.pxi", line 1149, in pyarrow.lib.tzinfo_to_string
ValueError: Unable to convert timezone `tzoffset(None, -14400)` to string
{code}
*Libraries:*
* pandas 0.24.2
* pyarrow 0.13.0
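For context, the string pyarrow is trying to produce here is a fixed-offset form like "-04:00". A stdlib sketch of that formatting (`offset_to_string` is illustrative, not pyarrow's implementation); one possible workaround, depending on what the installed pyarrow version accepts, is converting the column's dateutil `tzoffset` to a stdlib `datetime.timezone` or a named zone before writing:

```python
# Sketch: render a fixed UTC offset as the "+HH:MM" / "-HH:MM" string
# form that timezone metadata uses. tzoffset(None, -14400) corresponds
# to timezone(timedelta(seconds=-14400)), i.e. UTC-4.
from datetime import timedelta, timezone

def offset_to_string(tz: timezone) -> str:
    seconds = int(tz.utcoffset(None).total_seconds())
    sign = "+" if seconds >= 0 else "-"
    minutes = abs(seconds) // 60
    return f"{sign}{minutes // 60:02d}:{minutes % 60:02d}"

# The offset from the bug report, expressed as a stdlib timezone:
minus_four = timezone(timedelta(seconds=-14400))
```

This only covers fixed offsets; zones with DST rules need a named-zone string (e.g. "America/New_York") rather than an offset.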
[jira] [Created] (ARROW-5539) [Java] Test failure
Antoine Pitrou created ARROW-5539: - Summary: [Java] Test failure Key: ARROW-5539 URL: https://issues.apache.org/jira/browse/ARROW-5539 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Antoine Pitrou I know next to nothing about Java ecosystems. I'm trying to build and test locally, and get the following failures:
{code}
[ERROR] Tests run: 6, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.011 s <<< FAILURE! - in io.netty.buffer.TestArrowBuf
[ERROR] testSetBytesSliced(io.netty.buffer.TestArrowBuf) Time elapsed: 0.004 s <<< ERROR!
java.lang.NoSuchMethodError: io.netty.buffer.ArrowBuf.setBytes(ILjava/nio/ByteBuffer;II)Lio/netty/buffer/ArrowBuf;
at io.netty.buffer.TestArrowBuf.testSetBytesSliced(TestArrowBuf.java:100)
[ERROR] testSetBytesUnsliced(io.netty.buffer.TestArrowBuf) Time elapsed: 0 s <<< ERROR!
java.lang.NoSuchMethodError: io.netty.buffer.ArrowBuf.setBytes(ILjava/nio/ByteBuffer;II)Lio/netty/buffer/ArrowBuf;
at io.netty.buffer.TestArrowBuf.testSetBytesUnsliced(TestArrowBuf.java:121)
12:27:49.541 [main] WARN o.apache.arrow.memory.BoundsChecking - "drill.enable_unsafe_memory_access" has been renamed to "arrow.enable_unsafe_memory_access"
12:27:49.543 [main] WARN o.apache.arrow.memory.BoundsChecking - "arrow.enable_unsafe_memory_access" can be set to: true (to not check) or false (to check, default)
12:27:49.617 [main] WARN o.apache.arrow.memory.BoundsChecking - "drill.enable_unsafe_memory_access" has been renamed to "arrow.enable_unsafe_memory_access"
12:27:49.619 [main] WARN o.apache.arrow.memory.BoundsChecking - "arrow.enable_unsafe_memory_access" can be set to: true (to not check) or false (to check, default)
{code}
Java version is the following:
{code}
$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
{code}
I'm on Ubuntu 18.04. Perhaps I need another JVM?
[Discuss][Java][Typical use cases for dictionary encoding string vectors]
Hi all,

This concerns issue ARROW-3396. I have summarized the problem (please check whether my understanding is correct) and proposed some solutions. Please give your valuable feedback. For details, please see: https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing

Thank you in advance.

Best,
Liya Fan
Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability
Hi,

I think that 0.14.0 is better for the next version.

People who haven't tried Apache Arrow yet and are waiting for 1.0.0 will start using it when we release 1.0.0. If 1.0.0 satisfies them, we will gain more users and contributors through 1.0.0. They may not care about protocol stability. They may just care about "1.0.0".

We'll be able to release a less problematic 1.0.0 by treating 0.14.0 as an RC for 1.0.0. 0.14.0 will be used by more people than a 1.0.0-RCX would be, so 0.14.0 users will find the critical problems.

Thanks,
--
kou

In "Re: [DISCUSS] Timing of release and making a 1.0.0 release marking Arrow protocol stability" on Fri, 7 Jun 2019 22:28:22 -0700, Micah Kornfield wrote: > A few thoughts: > - I think we should iron out the remaining incompatibilities between java > and C++ before going to 1.0.0 (at least Union and NullType), and I'm not > sure I will have time to them before the next release, so I would prefer to > try to aim for the subsequent release to make it 1.0.0 > - For 1.0.0 should we change the metadata format version to a new naming > scheme [1] (seems like more of a hassle then it is worth)? > - I'm a little concerned about the implications for forward-compatibility > restrictions for format changes. For instance the large list types would > not be forward compatible (at least by some definitions), similarly if we > deal with compression [2] it would also seem to not be forward compatible. > Would this mean we bump the format version number for each change even > though they would be backwards compatible? > > Thanks, > Micah > > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22 > [2] https://issues.apache.org/jira/browse/ARROW-300 > > On Fri, Jun 7, 2019 at 12:42 PM Wes McKinney wrote: > >> I agree re: marketing value of a 1.0.0 release. 
>> >> For the record, I think we should continue to allow the API of each >> respective library component to evolve freely and allow the >> individuals developing each to decide how to handle deprecations, API >> changes, etc., as we have up until this point. The project is still >> very much in "innovation mode" across the board, but some parts may >> grow more conservative than others. Having roughly time-based releases >> encourages everyone to be ready-to-release at any given time, and we >> develop a steady cadence of getting new functionality and >> improvements/fixes out the door. >> >> On Fri, Jun 7, 2019 at 1:25 PM Antoine Pitrou wrote: >> > >> > >> > I think there's a marketing merit to issuing a 1.0.0 release. >> > >> > Regards >> > >> > Antoine. >> > >> > >> > Le 07/06/2019 à 20:05, Wes McKinney a écrit : >> > > So one idea is that we could call the next release 1.14.0. So the >> > > second number is the API version number. This encodes a sequencing of >> > > the evolution of the API. The library APIs are already decoupled from >> > > the binary serialization protocol, so I think we merely have to state >> > > that API changes and protocol changes are not related to each other. >> > > >> > > On Fri, Jun 7, 2019 at 12:58 PM Jacques Nadeau >> wrote: >> > >> >> > >> It brings up an interesting point... do we couple the stability of >> the apis >> > >> with the stability of the protocol. If the protocol is stable, we >> should >> > >> start providing guarantees for it. How do we want to express these >> > >> different velocities? 
>> > >> >> > >> On Fri, Jun 7, 2019 at 10:48 AM Antoine Pitrou >> wrote: >> > >> >> > >>> >> > >>> Le 07/06/2019 à 19:44, Jacques Nadeau a écrit : >> > On Fri, Jun 7, 2019 at 10:25 AM Antoine Pitrou >> > >>> wrote: >> > >> > > Hi Wes, >> > > >> > > Le 07/06/2019 à 17:42, Wes McKinney a écrit : >> > >> >> > >> I think >> > >> this would have a lot of benefits for project onlookers to remove >> > >> various warnings around the codebase around stability and cautions >> > >> against persistence of protocol data. It's fair to say that if we >> _do_ >> > >> make changes in the future, that there will be a transition path >> for >> > >> migrate persisted data, should it ever come to that. >> > > >> > > I think that's a good idea, but perhaps the stability promise >> shouldn't >> > > cover the Flight protocol yet? >> > >> > Agreed. >> > >> > >> I would suggest a "1.0.0" release either as our next release >> (instead >> > >> of 0.14.0) or the release right after that (if we need more time >> to >> > >> get affairs in order), with the guidance for users of: >> > > >> > > I think we should first do a regular 0.14.0 with all that's on our >> plate >> > > right now, then work towards a 1.0.0 as the release following that. >> > >> > What is different from your perspective? If the protocol hasn't >> changed >> > >>> in >> > over a year, why not call it 1.0? >> > >>> >> > >>> I would say that perhaps some API cleanup is in order. Remove >> > >>> deprecated ones, review experimental APIs, perhaps mark experimental >> > >>> certain