Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
I'm trying to work out the exact steps in my mind for a migration. It seems like one approach is:

1. Add a code change that throws a clear exception when it encounters -1 for the size. In Java the reasonable place seems to be at [1] (there might be more?). The exception should state that the current stream reader isn't compatible with 1.0.0 streams (we should have similar exceptions in each language). We can add a note about the environment variable from step 2 if we decide to do it. Release this change as 0.15.0 or 0.14.2 and ensure at least Spark upgrades to this version.
2. Change the reader implementation to support reading 1.0.0 streams while remaining backwards compatible with pre-1.0.0 streams. Change the writer implementation to default to writing 1.0.0 streams, but add an environment variable that makes it write backwards-compatible streams (writer compatibility seems like it should be optional). Release this as 1.0.0.
3. If provided, remove the environment variable switch in a later release.

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/9fe728c86caaf9ceb1827159eb172ff81fb98550/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageChannelReader.java#L67

On Thu, Jul 18, 2019 at 8:58 PM Wes McKinney wrote: > To be clear, we could make a patch 0.14.x release that includes the > necessary compatibility changes. I presume Spark will be able to upgrade to > a new patch release (I'd be surprised if not, otherwise how can you get > security fixes)? > > On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler wrote: > > > Hey Wes, > > I understand we don't want to burden 1.0 by maintaining compatibility and > > that is fine with me. I'm just trying to figure out how to best handle this > > situation so Spark users won't get a cryptic error message. It sounds like > > it will need to be handled on the Spark side to not allow mixing 1.0 and > > pre-1.0 versions.
I'm not too sure how much a 0.15.0 release with > > compatibility would help, it might depend on when things get released but > > we can discuss that in another thread. > > > > On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney > wrote: > > > > > hi Bryan -- well, the reason for the current 0.x version is precisely > > > to avoid a situation where we are making decisions on the basis of > > > maintaining forward / backward compatibility. > > > > > > One possible way forward on this is to make a 0.15.0 (0.14.2, so there > > > is less trouble for Spark to upgrade) release that supports reading > > > _both_ old and new variants of the protocol. > > > > > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler > wrote: > > > > > > > > Are we going to say that Arrow 1.0 is not compatible with any version > > > > before? My concern is that Spark 2.4.x might get stuck on Arrow Java > > > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not > > > work. > > > > In Spark 3.0.0, though it will be no problem to update both Java and > > > Python > > > > to 1.0. Having a compatibility mode so that new readers/writers can > > work > > > > with old readers using a 4-byte prefix would solve the problem, but > if > > we > > > > don't want to do this will pyarrow be able to raise an error that > > clearly > > > > the new version does not support the old protocol? For example, > would > > a > > > > pyarrow reader see the 0x and raise something like "PyArrow > > > > detected an old protocol and cannot continue, please use a version < > > > 1.0.0"? > > > > > > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney > > > wrote: > > > > > > > > > Hi Francois -- copying the metadata into memory isn't the end of > the > > > world > > > > > but it's a pretty ugly wart. This affects every IPC protocol > message > > > > > everywhere. > > > > > > > > > > We have an opportunity to address the wart now but such a fix > > > post-1.0.0 > > > > > will be much more difficult. 
> > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques < > > > > > fsaintjacq...@gmail.com> wrote: > > > > > > If the data buffers are still aligned, then I don't think we should > > > > > > add a breaking change just for avoiding the copy on the metadata? I'd > > > > > > expect said metadata to be small enough that zero-copy doesn't really > > > > > > affect performance. > > > > > > François > > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield < > > > emkornfi...@gmail.com> wrote: > > > > > > > While working on trying to fix undefined behavior for unaligned memory > > > > > > > accesses [1], I ran into an issue with the IPC specification [2] which > > > > > > > prevents us from ever achieving zero-copy memory mapping and having aligned > > > > > > > accesses (i.e. clean UBSan runs). > > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses. > > > > > > > In the IPC format we align each message to 8-byte boundaries. We then > > > > > > > write an int32_t to denote the size of the flatbuffer metadata, followed > > > > > > > immediately by the flatbuffer metadata. This means the flatbuffer metadata > > > > > > > will never be 8-byte aligned. > > > > > > > Do people care? A simple fix would be to use int64_t instead of int32_t > > > > > > > for the length. However, any fix essentially breaks all previous client > > > > > > > library versions or incurs a memory copy. > > > > > > > [1] https://github.com/apache/arrow/pull/4757 > > > > > > > [2] https://arrow.apache.org/docs/ipc.html
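The pre-1.0.0 detection in step 1 above can be sketched as follows — a minimal Python illustration (the real change would live in each language's stream reader, e.g. MessageChannelReader in Java); the assumption, per the proposal, is that an old-style reader decodes the first four bytes of the new framing as a length of -1:

```python
import struct

def read_message_length(stream):
    """Read the 4-byte little-endian length prefix of an IPC message.

    A pre-1.0.0 reader that encounters the proposed 1.0.0 framing would
    decode the marker as -1, so turn that into a clear error instead of
    a cryptic failure further downstream.
    """
    prefix = stream.read(4)
    if len(prefix) < 4:
        return None  # end of stream
    (length,) = struct.unpack('<i', prefix)
    if length == -1:
        raise ValueError(
            "This stream reader is not compatible with Arrow 1.0.0 "
            "streams; upgrade the reader or configure the writer to "
            "emit backwards-compatible streams.")
    return length
```

With this in place a 0.14.2/0.15.0 reader fails fast with an actionable message rather than handing garbage to the flatbuffer parser.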
[jira] [Created] (ARROW-5986) [Java] Code cleanup for dictionary encoding
Ji Liu created ARROW-5986: - Summary: [Java] Code cleanup for dictionary encoding Key: ARROW-5986 URL: https://issues.apache.org/jira/browse/ARROW-5986 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu In the last few weeks, we did some refactoring of dictionary encoding. Since the newly designed hash table for {{DictionaryEncoder}} and the {{hashCode}} & {{equals}} APIs in {{ValueVector}} are already checked in, some classes are no longer used, such as {{DictionaryEncodingHashTable}}, {{BaseBinaryVector}}, and the related benchmarks & unit tests. Fortunately, these changes did not make it into version 0.14, which makes it possible to remove them. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
To be clear, we could make a patch 0.14.x release that includes the necessary compatibility changes. I presume Spark will be able to upgrade to a new patch release (I'd be surprised if not, otherwise how can you get security fixes)? On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler wrote: > Hey Wes, > I understand we don't want to burden 1.0 by maintaining compatibility and > that is fine with me. I'm just try to figure out how to best handle this > situation so Spark users won't get a cryptic error message. It sounds like > it will need to be handled on the Spark side to not allow mixing 1.0 and > pre-1.0 versions. I'm not too sure how much a 0.15.0 release with > compatibility would help, it might depend on when things get released but > we can discuss that in another thread. > > On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney wrote: > > > hi Bryan -- well, the reason for the current 0.x version is precisely > > to avoid a situation where we are making decisions on the basis of > > maintaining forward / backward compatibility. > > > > One possible way forward on this is to make a 0.15.0 (0.14.2, so there > > is less trouble for Spark to upgrade) release that supports reading > > _both_ old and new variants of the protocol. > > > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler wrote: > > > > > > Are we going to say that Arrow 1.0 is not compatible with any version > > > before? My concern is that Spark 2.4.x might get stuck on Arrow Java > > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not > > work. > > > In Spark 3.0.0, though it will be no problem to update both Java and > > Python > > > to 1.0. Having a compatibility mode so that new readers/writers can > work > > > with old readers using a 4-byte prefix would solve the problem, but if > we > > > don't want to do this will pyarrow be able to raise an error that > clearly > > > the new version does not support the old protocol? 
For example, would > a > > > pyarrow reader see the 0x and raise something like "PyArrow > > > detected an old protocol and cannot continue, please use a version < > > 1.0.0"? > > > > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney > > wrote: > > > > > > > Hi Francois -- copying the metadata into memory isn't the end of the > > world > > > > but it's a pretty ugly wart. This affects every IPC protocol message > > > > everywhere. > > > > > > > > We have an opportunity to address the wart now but such a fix > > post-1.0.0 > > > > will be much more difficult. > > > > > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques < > > > > fsaintjacq...@gmail.com> wrote: > > > > > > > > > If the data buffers are still aligned, then I don't think we should > > > > > add a breaking change just for avoiding the copy on the metadata? > I'd > > > > > expect said metadata to be small enough that zero-copy doesn't > really > > > > > affect performance. > > > > > > > > > > François > > > > > > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield < > > emkornfi...@gmail.com> > > > > > wrote: > > > > > > > > > > > > While working on trying to fix undefined behavior for unaligned > > memory > > > > > > accesses [1], I ran into an issue with the IPC specification [2] > > which > > > > > > prevents us from ever achieving zero-copy memory mapping and > having > > > > > aligned > > > > > > accesses (i.e. clean UBSan runs). > > > > > > > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned > > > > accesses. > > > > > > > > > > > > In the IPC format we align each message to 8-byte boundaries. We > > then > > > > > > write a int32_t integer to to denote the size of flat buffer > > metadata, > > > > > > followed immediately by the flatbuffer metadata. This means the > > > > > > flatbuffer metadata will never be 8 byte aligned. > > > > > > > > > > > > Do people care? A simple fix would be to use int64_t instead of > > > > int32_t > > > > > > for length. 
However, any fix essentially breaks all previous > > client > > > > > > library versions or incurs a memory copy. > > > > > > > > > > > > [1] https://github.com/apache/arrow/pull/4757 > > > > > > [2] https://arrow.apache.org/docs/ipc.html > > > > > > > > > > > >
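Micah's alignment point can be checked with a few lines of arithmetic: messages start on 8-byte boundaries, so a 4-byte length prefix leaves the flatbuffer metadata at offset 4 (mod 8). A small illustration (the 8-byte-prefix case stands in for any fix that widens or pads the prefix, such as the int64_t suggestion):

```python
def metadata_offset(message_start, prefix_bytes):
    """Offset where the flatbuffer metadata begins, given an
    8-byte-aligned message start and a length prefix of the given size."""
    assert message_start % 8 == 0
    return message_start + prefix_bytes

# Current format: an int32 prefix means the metadata is never 8-byte aligned.
assert all(metadata_offset(start, 4) % 8 == 4 for start in range(0, 64, 8))
# Any 8-byte prefix (e.g. int64_t length) keeps the metadata aligned.
assert all(metadata_offset(start, 8) % 8 == 0 for start in range(0, 64, 8))
```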
Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
Hey Wes, I understand we don't want to burden 1.0 by maintaining compatibility and that is fine with me. I'm just trying to figure out how to best handle this situation so Spark users won't get a cryptic error message. It sounds like it will need to be handled on the Spark side to not allow mixing 1.0 and pre-1.0 versions. I'm not too sure how much a 0.15.0 release with compatibility would help; it might depend on when things get released, but we can discuss that in another thread. On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney wrote: > hi Bryan -- well, the reason for the current 0.x version is precisely > to avoid a situation where we are making decisions on the basis of > maintaining forward / backward compatibility. > > One possible way forward on this is to make a 0.15.0 (0.14.2, so there > is less trouble for Spark to upgrade) release that supports reading > _both_ old and new variants of the protocol. > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler wrote: > > > > Are we going to say that Arrow 1.0 is not compatible with any version > > before? My concern is that Spark 2.4.x might get stuck on Arrow Java > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not > work. > > In Spark 3.0.0, though it will be no problem to update both Java and > Python > > to 1.0. Having a compatibility mode so that new readers/writers can work > > with old readers using a 4-byte prefix would solve the problem, but if we > > don't want to do this will pyarrow be able to raise an error that clearly > > the new version does not support the old protocol? For example, would a > > pyarrow reader see the 0x and raise something like "PyArrow > > detected an old protocol and cannot continue, please use a version < > 1.0.0"? > > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney > wrote: > > > > > Hi Francois -- copying the metadata into memory isn't the end of the > world > > > but it's a pretty ugly wart. This affects every IPC protocol message > > > everywhere.
> > > > > > We have an opportunity to address the wart now but such a fix > post-1.0.0 > > > will be much more difficult. > > > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques < > > > fsaintjacq...@gmail.com> wrote: > > > > > > > If the data buffers are still aligned, then I don't think we should > > > > add a breaking change just for avoiding the copy on the metadata? I'd > > > > expect said metadata to be small enough that zero-copy doesn't really > > > > affect performance. > > > > > > > > François > > > > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield < > emkornfi...@gmail.com> > > > > wrote: > > > > > > > > > > While working on trying to fix undefined behavior for unaligned > memory > > > > > accesses [1], I ran into an issue with the IPC specification [2] > which > > > > > prevents us from ever achieving zero-copy memory mapping and having > > > > aligned > > > > > accesses (i.e. clean UBSan runs). > > > > > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned > > > accesses. > > > > > > > > > > In the IPC format we align each message to 8-byte boundaries. We > then > > > > > write a int32_t integer to to denote the size of flat buffer > metadata, > > > > > followed immediately by the flatbuffer metadata. This means the > > > > > flatbuffer metadata will never be 8 byte aligned. > > > > > > > > > > Do people care? A simple fix would be to use int64_t instead of > > > int32_t > > > > > for length. However, any fix essentially breaks all previous > client > > > > > library versions or incurs a memory copy. > > > > > > > > > > [1] https://github.com/apache/arrow/pull/4757 > > > > > [2] https://arrow.apache.org/docs/ipc.html > > > > > > > >
Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter
Thanks a lot for Wes's and Liya's feedback. Agreed that parsing performance of CSV files is important, and I just found a benchmark of Java CSV libraries [1][2] which shows FastCSV has obvious advantages. Anyway, I will test it myself. Thanks, Ji Liu [1] https://raw.githubusercontent.com/osiegmar/FastCSV/master/benchmark.png [2] https://github.com/osiegmar/FastCSV -- From: Fan Liya Send Time: 2019-07-19 (Friday) 10:14 To: dev Cc: Ji Liu ; Micah Kornfield Subject: Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter Hi Ji, Thanks for proposing this. CSV adapter sounds like a useful feature. Best, Liya Fan On Fri, Jul 19, 2019 at 12:31 AM Wes McKinney wrote: We wrote a custom reader in C++ since performance of parsing CSV files matters a lot -- we wanted to do multi-threaded execution of conversion steps, also. I don't know what the performance of commons-csv is but it might be worth doing some benchmarks to see. On Thu, Jul 18, 2019 at 4:35 AM Ji Liu wrote: > > Hi all, > > Seems there is no adapter on the Java side to convert CSV data to Arrow data, > which C++ has. Now we already have a JDBC adapter, an Orc adapter, and an Avro > adapter (in progress), and I think an adapter for CSV would probably also be > nice. > After a brief discussion with @Micah Kornfield, Apache commons-csv [1] seems an > efficient CSV parser that we could potentially leverage, but I don't know if > there are other better options. Any inputs and comments would be appreciated. > > Thanks, > Ji Liu [1] https://commons.apache.org/proper/commons-csv/
Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter
Hi Ji, Thanks for proposing this. CSV adapter sounds like a useful feature. Best, Liya Fan On Fri, Jul 19, 2019 at 12:31 AM Wes McKinney wrote: > We wrote a custom reader in C++ since performance of parsing CSV files > matters a lot -- we wanted to do multi-threaded execution of > conversion steps, also. I don't know what the performance of > commons-csv is but it might be worth doing some benchmarks to see. > > On Thu, Jul 18, 2019 at 4:35 AM Ji Liu wrote: > > > > Hi all, > > > > Seems there is no adapter to convert CSV data to Arrow data in Java side > which C++ has. Now we already have JDBC adapter, Orc adapter and Avro > adapter (In progress), I think an adapter for CSV would probably also be > nice. > > After a brief discuss with @Micah Kornfield, Apache commons-csv [1] > seems an efficient CSV parser that we could potentially leverage but I > don't know if there are other better options. Any inputs and comments would > be appreciated. > > > > Thanks, > > Ji Liu[1]https://commons.apache.org/proper/commons-csv/ >
[jira] [Created] (ARROW-5984) [C++] Provide method on AdaptiveIntBuilder for appending integer Array types
Wes McKinney created ARROW-5984: --- Summary: [C++] Provide method on AdaptiveIntBuilder for appending integer Array types Key: ARROW-5984 URL: https://issues.apache.org/jira/browse/ARROW-5984 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney For Int8/16/32, it is not currently possible to do a bulk append -- This message was sent by Atlassian JIRA (v7.6.14#76016)
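For context on what "adaptive" means here, a hedged Python sketch (illustrative names, not Arrow's C++ API): bulk-appending an integer array involves scanning it once to decide the narrowest signed width the builder must promote to:

```python
def smallest_signed_width(values):
    """Return the narrowest signed integer bit width (8/16/32/64) that
    holds every value -- the promotion decision an adaptive int builder
    would make once per bulk append instead of once per value."""
    for bits in (8, 16, 32, 64):
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        if all(lo <= v <= hi for v in values):
            return bits
    raise OverflowError("value does not fit in 64 bits")
```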
[jira] [Created] (ARROW-5983) [C++] Provide bulk method on TypedBufferBuilder for appending a bitmap
Wes McKinney created ARROW-5983: --- Summary: [C++] Provide bulk method on TypedBufferBuilder for appending a bitmap Key: ARROW-5983 URL: https://issues.apache.org/jira/browse/ARROW-5983 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney We have {{TypedBufferBuilder::UnsafeAppend}} for an array of bytes (where non-zero becomes 1), but it would be useful to have also {{UnsafeAppendBits}} so that bitmaps coming from {{arrow::Array}} can also be appended -- This message was sent by Atlassian JIRA (v7.6.14#76016)
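A hedged sketch of the requested operation (Python for illustration; names hypothetical): appending validity bits into an LSB-ordered bitmap, which a bulk {{UnsafeAppendBits}} would do without per-bit bounds checks:

```python
def append_bits(bitmap, length, bits):
    """Append each bit to an LSB-ordered bitmap (bytearray), growing it
    as needed; returns the updated bitmap and new logical length."""
    for bit in bits:
        byte_i, bit_i = divmod(length, 8)
        if byte_i == len(bitmap):
            bitmap.append(0)
        if bit:
            bitmap[byte_i] |= 1 << bit_i
        length += 1
    return bitmap, length
```

For example, appending [1, 0, 1, 1] to an empty bitmap yields a first byte of 0b1101.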
[jira] [Created] (ARROW-5982) [C++] Add methods to append dictionary values and dictionary indices directly into DictionaryBuilder
Wes McKinney created ARROW-5982: --- Summary: [C++] Add methods to append dictionary values and dictionary indices directly into DictionaryBuilder Key: ARROW-5982 URL: https://issues.apache.org/jira/browse/ARROW-5982 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 In scenarios where a developer has an array of dictionary indices already that reference a known dictionary, it is useful to be able to insert the indices directly, circumventing the hash table lookup. The developer will be responsible for keeping things consistent -- This message was sent by Atlassian JIRA (v7.6.14#76016)
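The distinction can be sketched in Python (hypothetical names, not Arrow's API): today each appended value pays a hash-table lookup, while the proposal adds a path for callers that already hold indices into a known dictionary:

```python
def encode_via_hash_lookup(values, dictionary):
    """What DictionaryBuilder effectively does today: each appended
    value is looked up in a hash table to find its dictionary index."""
    index = {value: i for i, value in enumerate(dictionary)}
    return [index[value] for value in values]

def append_indices_directly(indices, dictionary):
    """The proposed fast path: the caller already has indices that
    reference a known dictionary, so no lookup is needed; the caller is
    responsible for the indices being consistent with the dictionary."""
    assert all(0 <= i < len(dictionary) for i in indices)
    return list(indices)
```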
Re: [VOTE] Release Apache Arrow 0.14.1 - RC0
+1 (binding) I ran the following on Debian GNU/Linux sid: * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh source 0.14.1 0 * dev/release/verify-release-candidate.sh binaries 0.14.1 0 with: * gcc (Debian 8.3.0-7) 8.3.0 * openjdk version "1.8.0_212" * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741) [x86_64-linux] * Node.JS v12.1.0 * go version go1.11.6 linux/amd64 * nvidia-cuda-dev 9.2.148-7 I re-ran the C# tests several times with the following command line: TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1 dev/release/verify-release-candidate.sh source 0.14.1 0 But "sourcelink test" always fails: + sourcelink test artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb The operation was canceled. I don't think that this is a blocker. Thanks, -- kou In "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019 04:54:33 +0200, Krisztián Szűcs wrote: > Hi, > > I would like to propose the following release candidate (RC0) of Apache > Arrow version 0.14.1. This is a patch release consisting of 47 resolved > JIRA issues[1]. > > This release candidate is based on commit: > 5f564424c71cef12619522cdde59be5f69b31b68 [2] > > The source release rc0 is hosted at [3]. > The binary artifacts are hosted at [4][5][6][7]. > The changelog is located at [8]. > > Please download, verify checksums and signatures, run the unit tests, > and vote on the release. See [9] for how to validate a release candidate. > > The vote will be open for at least 72 hours. > > [ ] +1 Release this as Apache Arrow 0.14.1 > [ ] +0 > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> > [1]: > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1 > [2]: > https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68 > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0 > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0 > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0 > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0 > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0 > [8]: > https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md > [9]: > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[jira] [Created] (ARROW-5981) [C++] DictionaryBuilder initialization with Array can fail silently
Wes McKinney created ARROW-5981: --- Summary: [C++] DictionaryBuilder initialization with Array can fail silently Key: ARROW-5981 URL: https://issues.apache.org/jira/browse/ARROW-5981 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 See https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_dict.cc#L267 I think it would be better to expose {{InsertValues}} on {{DictionaryBuilder}} and initialize from a known dictionary that way -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5980) Missing libarrow.so and libarrow_python.so when installing pyarrow
Haowei Yu created ARROW-5980: Summary: Missing libarrow.so and libarrow_python.so when installing pyarrow Key: ARROW-5980 URL: https://issues.apache.org/jira/browse/ARROW-5980 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.14.0 Reporter: Haowei Yu I have installed pyarrow 0.14.0, but it seems that by default the symlinks for libarrow.so and libarrow_python.so are not provided; only the version-suffixed .so files are. Hence, I cannot use the output of pyarrow.get_libraries() and pyarrow.get_library_dirs() to build my link options. If the symlinks were provided, I could pass the following to the linker to specify the libraries to link, e.g. g++ -L/ -larrow -larrow_python. However, right now the ld output complains about not being able to find -larrow and -larrow_python. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
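One possible workaround sketch, not an official fix: create the un-suffixed symlinks inside pyarrow's library directory (e.g. the first entry of pyarrow.get_library_dirs()) so that -larrow and -larrow_python resolve at link time. This assumes the wheel ships version-suffixed files such as libarrow.so.14:

```python
import glob
import os

def ensure_unsuffixed_symlinks(lib_dir, names=("arrow", "arrow_python")):
    """For each libNAME.so.<version> present in lib_dir, create a
    libNAME.so symlink so the linker can resolve '-lNAME'."""
    created = []
    for name in names:
        versioned = sorted(glob.glob(os.path.join(lib_dir, "lib%s.so.*" % name)))
        if not versioned:
            continue
        link = os.path.join(lib_dir, "lib%s.so" % name)
        if not os.path.exists(link):
            os.symlink(os.path.basename(versioned[-1]), link)
            created.append(link)
    return created
```

Usage would be something like ensure_unsuffixed_symlinks(pyarrow.get_library_dirs()[0]) before invoking g++.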
Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
hi Bryan -- well, the reason for the current 0.x version is precisely to avoid a situation where we are making decisions on the basis of maintaining forward / backward compatibility. One possible way forward on this is to make a 0.15.0 (0.14.2, so there is less trouble for Spark to upgrade) release that supports reading _both_ old and new variants of the protocol. On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler wrote: > > Are we going to say that Arrow 1.0 is not compatible with any version > before? My concern is that Spark 2.4.x might get stuck on Arrow Java > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work. > In Spark 3.0.0, though it will be no problem to update both Java and Python > to 1.0. Having a compatibility mode so that new readers/writers can work > with old readers using a 4-byte prefix would solve the problem, but if we > don't want to do this will pyarrow be able to raise an error that clearly > the new version does not support the old protocol? For example, would a > pyarrow reader see the 0x and raise something like "PyArrow > detected an old protocol and cannot continue, please use a version < 1.0.0"? > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney wrote: > > > Hi Francois -- copying the metadata into memory isn't the end of the world > > but it's a pretty ugly wart. This affects every IPC protocol message > > everywhere. > > > > We have an opportunity to address the wart now but such a fix post-1.0.0 > > will be much more difficult. > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques < > > fsaintjacq...@gmail.com> wrote: > > > > > If the data buffers are still aligned, then I don't think we should > > > add a breaking change just for avoiding the copy on the metadata? I'd > > > expect said metadata to be small enough that zero-copy doesn't really > > > affect performance. 
> > > > > > François > > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield > > > wrote: > > > > > > > > While working on trying to fix undefined behavior for unaligned memory > > > > accesses [1], I ran into an issue with the IPC specification [2] which > > > > prevents us from ever achieving zero-copy memory mapping and having > > > aligned > > > > accesses (i.e. clean UBSan runs). > > > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned > > accesses. > > > > > > > > In the IPC format we align each message to 8-byte boundaries. We then > > > > write a int32_t integer to to denote the size of flat buffer metadata, > > > > followed immediately by the flatbuffer metadata. This means the > > > > flatbuffer metadata will never be 8 byte aligned. > > > > > > > > Do people care? A simple fix would be to use int64_t instead of > > int32_t > > > > for length. However, any fix essentially breaks all previous client > > > > library versions or incurs a memory copy. > > > > > > > > [1] https://github.com/apache/arrow/pull/4757 > > > > [2] https://arrow.apache.org/docs/ipc.html > > > > >
[jira] [Created] (ARROW-5979) [FlightRPC] Expose (de)serialization of protocol types
lidavidm created ARROW-5979: --- Summary: [FlightRPC] Expose (de)serialization of protocol types Key: ARROW-5979 URL: https://issues.apache.org/jira/browse/ARROW-5979 Project: Apache Arrow Issue Type: New Feature Components: FlightRPC Reporter: lidavidm It would be nice to be able to serialize/deserialize Flight types (e.g. FlightInfo) to/from the binary representations, in order to interoperate with systems that might want to provide (say) Flight tickets or FlightInfo without using the Flight protocol. For instance, you might have a search server that exposes a REST interface and wants to provide FlightInfo objects for Flight clients, without having to listen on a separate port. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses
Are we going to say that Arrow 1.0 is not compatible with any version before? My concern is that Spark 2.4.x might get stuck on Arrow Java 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work. In Spark 3.0.0, though, it will be no problem to update both Java and Python to 1.0. Having a compatibility mode so that new readers/writers can work with old readers using a 4-byte prefix would solve the problem, but if we don't want to do this, will pyarrow be able to raise an error stating clearly that the new version does not support the old protocol? For example, would a pyarrow reader see the 0x and raise something like "PyArrow detected an old protocol and cannot continue, please use a version < 1.0.0"? On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney wrote: > Hi Francois -- copying the metadata into memory isn't the end of the world > but it's a pretty ugly wart. This affects every IPC protocol message > everywhere. > > We have an opportunity to address the wart now but such a fix post-1.0.0 > will be much more difficult. > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > > > If the data buffers are still aligned, then I don't think we should > > add a breaking change just for avoiding the copy on the metadata? I'd > > expect said metadata to be small enough that zero-copy doesn't really > > affect performance. > > > > François > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield > > wrote: > > > > > > While working on trying to fix undefined behavior for unaligned memory > > > accesses [1], I ran into an issue with the IPC specification [2] which > > > prevents us from ever achieving zero-copy memory mapping and having > > aligned > > > accesses (i.e. clean UBSan runs). > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned > accesses. > > > > > > In the IPC format we align each message to 8-byte boundaries.
We then > > > write a int32_t integer to to denote the size of flat buffer metadata, > > > followed immediately by the flatbuffer metadata. This means the > > > flatbuffer metadata will never be 8 byte aligned. > > > > > > Do people care? A simple fix would be to use int64_t instead of > int32_t > > > for length. However, any fix essentially breaks all previous client > > > library versions or incurs a memory copy. > > > > > > [1] https://github.com/apache/arrow/pull/4757 > > > [2] https://arrow.apache.org/docs/ipc.html > > >
RE: [VOTE] Release Apache Arrow 0.14.1 - RC0
+1 Tested: - C# source verification on Ubuntu 18 - I verified the C# source contained the fixes for the two issues I needed fixed in this patch. -Original Message- From: Krisztián Szűcs Sent: Tuesday, July 16, 2019 9:55 PM To: dev@arrow.apache.org Subject: [VOTE] Release Apache Arrow 0.14.1 - RC0 Hi, I would like to propose the following release candidate (RC0) of Apache Arrow version 0.14.1. This is a patch release consisting of 47 resolved JIRA issues[1]. This release candidate is based on commit: 5f564424c71cef12619522cdde59be5f69b31b68 [2] The source release rc0 is hosted at [3]. The binary artifacts are hosted at [4][5][6][7]. The changelog is located at [8]. Please download, verify checksums and signatures, run the unit tests, and vote on the release. See [9] for how to validate a release candidate. The vote will be open for at least 72 hours. [ ] +1 Release this as Apache Arrow 0.14.1 [ ] +0 [ ] -1 Do not release this as Apache Arrow 0.14.1 because... [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1 [2]: https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68 [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0 [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0 [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0 [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0 [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0 [8]: https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md [9]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[jira] [Created] (ARROW-5978) [FlightRPC] [Java] Integration test client doesn't close buffers
lidavidm created ARROW-5978:
-------------------------------
Summary: [FlightRPC] [Java] Integration test client doesn't close buffers
Key: ARROW-5978
URL: https://issues.apache.org/jira/browse/ARROW-5978
Project: Apache Arrow
Issue Type: Test
Components: FlightRPC, Integration, Java
Affects Versions: 0.14.0
Reporter: lidavidm
Assignee: lidavidm
Fix For: 1.0.0

The integration test client doesn't close any of the clients or free any of the buffers it creates. Trying to do so leads to a leak problem in the dictionary vector case.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-5977) Method for read_csv to limit which columns are read?
Jordan Samuels created ARROW-5977:
-------------------------------------
Summary: Method for read_csv to limit which columns are read?
Key: ARROW-5977
URL: https://issues.apache.org/jira/browse/ARROW-5977
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.14.0
Reporter: Jordan Samuels

In pandas there is pd.read_csv(usecols=...), but I can't see a way to do this in pyarrow.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
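Until pyarrow grows a usecols-style option, one stopgap is to filter columns while parsing and hand only the surviving columns to pyarrow afterwards. A minimal sketch using only Python's stdlib csv module (the function and column names are illustrative, not a pyarrow API):

```python
import csv
import io

def read_csv_columns(text, usecols):
    """Parse CSV text, keeping only the columns named in usecols."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: row[name] for name in usecols} for row in reader]

data = "a,b,c\n1,2,3\n4,5,6\n"
rows = read_csv_columns(data, usecols=["a", "c"])
print(rows)  # → [{'a': '1', 'c': '3'}, {'a': '4', 'c': '6'}]
```

This trades the speed of pyarrow's native CSV reader for memory savings; a built-in option in pyarrow itself (as the issue requests) would keep both.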
[DISCUSS][JAVA] Implement a CSV to Arrow adapter
Hi all,

It seems there is no adapter to convert CSV data to Arrow data on the Java side, which C++ has. We already have a JDBC adapter, an ORC adapter, and an Avro adapter (in progress), so I think an adapter for CSV would also be nice. After a brief discussion with @Micah Kornfield, Apache Commons CSV [1] seems to be an efficient CSV parser that we could potentially leverage, but I don't know if there are better options. Any input and comments would be appreciated.

Thanks,
Ji Liu

[1] https://commons.apache.org/proper/commons-csv/
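Whichever parser is chosen, the heart of such an adapter is pivoting row-oriented CSV records into the columnar layout that Arrow vectors hold. A language-agnostic sketch of that pivot (shown in Python with the stdlib csv module standing in for Commons CSV; this is not the proposed Java API):

```python
import csv
import io

def csv_to_columns(text):
    """Pivot row-oriented CSV records into per-column value lists --
    the shape an adapter would then load into Arrow vectors."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                       # first record supplies column names
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):    # distribute each row across columns
            columns[name].append(value)
    return columns

table = csv_to_columns("id,name\n1,foo\n2,bar\n")
print(table)  # → {'id': ['1', '2'], 'name': ['foo', 'bar']}
```

A real Java adapter would additionally need type inference or a user-supplied schema, batching into VectorSchemaRoots, and null handling, which is where a robust parser like Commons CSV earns its keep.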
Re: [VOTE] Release Apache Arrow 0.14.1 - RC0
Hey Zhuo,

On Thu, Jul 18, 2019 at 2:23 AM Zhuo Peng wrote:
> Hi Krisztián,
>
> Sorry if it's too late, but is it possible to also include
> https://github.com/apache/arrow/pull/4883 in the release?

It's late because I'm away from keyboard; Sunday is the closest day when I could draft another release candidate. If other issues come up with RC0 and the vote doesn't pass, then we can include it in RC1.

> This would help
> resolve https://github.com/apache/arrow/issues/4472 .
>
> Thanks,
>
> Zhuo
>
> On Wed, Jul 17, 2019 at 3:00 AM Antoine Pitrou wrote:
> >
> > +1 (binding).
> >
> > Tested on Ubuntu 18.04.2 (x86-64) with CUDA enabled:
> >
> > - binaries verification worked fine
> > - source verification worked until the npm step, which failed (I don't
> > have npm installed)
> >
> > Regards
> >
> > Antoine.
> >
> > Le 17/07/2019 à 04:54, Krisztián Szűcs a écrit :
> > > Hi,
> > >
> > > I would like to propose the following release candidate (RC0) of Apache
> > > Arrow version 0.14.1. This is a patch release consisting of 47 resolved
> > > JIRA issues [1].
> > >
> > > This release candidate is based on commit:
> > > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> > >
> > > The source release rc0 is hosted at [3].
> > > The binary artifacts are hosted at [4][5][6][7].
> > > The changelog is located at [8].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [9] for how to validate a release candidate.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow 0.14.1
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> > >
> > > [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
> > > [2]: https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
> > > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
> > > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
> > > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
> > > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
> > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
> > > [8]: https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
> > > [9]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[jira] [Created] (ARROW-5976) [C++] RETURN_IF_ERROR(ctx) should be namespaced
Micah Kornfield created ARROW-5976:
--------------------------------------
Summary: [C++] RETURN_IF_ERROR(ctx) should be namespaced
Key: ARROW-5976
URL: https://issues.apache.org/jira/browse/ARROW-5976
Project: Apache Arrow
Issue Type: Improvement
Reporter: Micah Kornfield
Assignee: Micah Kornfield
Fix For: 1.0.0

RETURN_IF_ERROR is a common macro name; it shouldn't be exposed in a header file without being namespaced to Arrow.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-5975) [C++][Gandiva] Add method to cast Date(in Milliseconds) to timestamp
Prudhvi Porandla created ARROW-5975:
---------------------------------------
Summary: [C++][Gandiva] Add method to cast Date(in Milliseconds) to timestamp
Key: ARROW-5975
URL: https://issues.apache.org/jira/browse/ARROW-5975
Project: Apache Arrow
Issue Type: Task
Components: C++ - Gandiva
Affects Versions: 1.0.0
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla
Fix For: 1.0.0

Add a castTIMESTAMP_date64(date64) method in Gandiva. The input date is in milliseconds.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
Re: [DISCUSS] Release cadence and release vote conventions
I can help as well, but I'm not exactly sure where to start. It seems like there are already some JIRAs opened [1] for improving the release process. Could someone more familiar with the process pick out the highest-priority ones? Do more need to be opened?

Thanks,
Micah

[1] https://issues.apache.org/jira/browse/ARROW-2880?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(%22Developer%20Tools%22%2C%20Packaging)%20and%20summary%20~%20Release

On Sat, Jul 13, 2019 at 7:17 AM Wes McKinney wrote:
> To be effective at improving the life of release managers, the nightly
> release process really should use as close as possible to the same
> scripts that the RM uses to produce the release. Otherwise we could
> have a situation where the nightlies succeed but there is some problem
> that either fails an RC or is unable to be produced at all.
>
> On Sat, Jul 13, 2019 at 9:12 AM Andy Grove wrote:
> >
> > I would like to volunteer to help with Java and Rust release process work,
> > especially nightly releases.
> >
> > Although I'm not that familiar with the Java implementation of Arrow, I
> > have been using Java and Maven for a very long time.
> >
> > Do we envisage a single nightly release process that releases all languages
> > simultaneously? Or do we want a separate process per language, with
> > different maintainers?
> >
> > On Wed, Jul 10, 2019 at 8:18 AM Wes McKinney wrote:
> > >
> > > On Sun, Jul 7, 2019 at 7:40 PM Sutou Kouhei wrote:
> > > >
> > > > Hi,
> > > >
> > > > > in future releases we should
> > > > > institute a minimum 24-hour "quiet period" after any community
> > > > > feedback on a release candidate to allow issues to be examined
> > > > > further.
> > > >
> > > > I agree with this. I'll do so when I act as release manager in
> > > > the future.
> > > >
> > > > > To be able to release more often, two things have to happen:
> > > > >
> > > > > * More PMC members must engage with the release management role,
> > > > >   process, and tools
> > > > > * Continued improvements to release tooling to make the process less
> > > > >   painful for the release manager. For example, it seems we may want to
> > > > >   find a different place than Bintray to host binary artifacts
> > > > >   temporarily during release votes
> > > >
> > > > My opinion is that we need to build a nightly release system.
> > > >
> > > > It uses dev/release/NN-*.sh to build the .tar.gz and binary
> > > > artifacts from the .tar.gz.
> > > > It also uses dev/release/verify-release-candidate.* to
> > > > verify the built .tar.gz and binary artifacts.
> > > > It also uses dev/release/post-NN-*.sh to do post-release
> > > > tasks. (Some tasks, such as uploading a package to a packaging
> > > > system, will be dry-run.)
> > >
> > > I agree that having a turn-key release system that's capable of
> > > producing nightly packages is the way to go. That way any problems
> > > that would block a release will come up as they happen rather than
> > > piling up until the very end like they are now.
> > >
> > > > I needed 10 or more changes to dev/release/ to create
> > > > 0.14.0 RC0. (Some of them are still in my local stashes. I
> > > > don't have time to create pull requests for them yet, because
> > > > I postponed some tasks of my main business. I'll create pull
> > > > requests after I finish the postponed tasks of my main business.)
> > >
> > > Thanks. I'll follow up on the 0.14.1/0.15.0 thread -- since we need to
> > > release again soon because of problems with 0.14.0, please let us know
> > > what patches will be needed to make another release.
> > >
> > > > If we fix problems related to dev/release/ in our normal
> > > > development process, the release process will be less painful.
> > > >
> > > > The biggest problem for 0.14.0 RC0 was java/pom.xml related:
> > > > https://github.com/apache/arrow/pull/4717
> > > >
> > > > It was difficult for me because I don't have Java
> > > > knowledge. The release manager needs help from many developers
> > > > because the release manager may not have knowledge of all
> > > > supported languages. Apache Arrow supports over 10 languages.
> > > >
> > > > For the Bintray API limit problem, we'll be able to resolve it.
> > > > I was added to the https://bintray.com/apache/ members:
> > > >
> > > > https://issues.apache.org/jira/browse/INFRA-18698
> > > >
> > > > I'll be able to use the Bintray API without limitation in the
> > > > future. Release managers should also request the same thing.
> > >
> > > This is good, I will add myself. Other PMC members should also add
> > > themselves.
> > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In <lsowxqxidjapc_cofguksj...@mail.gmail.com>
> > > > "[DISCUSS] Release cadence and release vote conventions" on Sat, 6 Jul 2019 16:28:50 -0500,
> > > > Wes McKinney wrote:
> > > >
> > > > > hi folks,
> > > > >
> > > > > As a reminder,
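To make the nightly-release idea in this thread concrete: the job would chain the same dev/release scripts the RM runs, with upload steps dry-run. A hypothetical orchestration sketch in Python (the specific script names and the --dry-run flag are assumptions for illustration, not existing Arrow tooling):

```python
# Sketch of a nightly job driving the RM's own dev/release scripts,
# per the dev/release/NN-*.sh convention described above. Script names
# and the --dry-run flag are illustrative assumptions.

RELEASE_STEPS = [
    "dev/release/00-prepare.sh",
    "dev/release/02-source.sh",
]
VERIFY = "dev/release/verify-release-candidate.sh"
POST_STEPS = ["dev/release/post-03-binary.sh"]

def plan_nightly(dry_run=True):
    """Return the ordered commands a nightly job would execute:
    build, then verify, then post-release tasks (uploads dry-run)."""
    cmds = [[step] for step in RELEASE_STEPS]
    cmds.append([VERIFY])
    for step in POST_STEPS:
        cmd = [step]
        if dry_run:
            cmd.append("--dry-run")  # simulate uploads nightly
        cmds.append(cmd)
    return cmds

for cmd in plan_nightly():
    print(" ".join(cmd))
```

Running the identical scripts nightly (rather than a parallel CI-only pipeline) is what makes failures surface before an RC is cut, which is Wes's point above.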