Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-18 Thread Micah Kornfield
I'm trying to work out the exact steps in my mind for a migration. It seems
like one approach is:

1.  Add a code change which throws a clear exception when it encounters -1
for the size.  In Java the reasonable place seems to be at [1] (there might be
more?).  The exception should state that the current stream reader isn't
compatible with version 1.0.0 streams (we should have similar ones in each
language).  We can add a note about the environment variable in step 2 if we
decide to do it.  Release this change as 0.15.0 or 0.14.2 and ensure at least
Spark upgrades to this version.  (A rough sketch of this check follows the
list.)

2.  Change the reader implementation to support reading both 1.0.0 streams
and pre-1.0.0 streams.  Change the writer implementation to default to
writing 1.0.0 streams, but add an environment variable that makes it write
backwards-compatible streams (writer compatibility seems like it should be
optional).  Release this as 1.0.0.  (A reader sketch for this follows the
references below.)

3.  If the environment variable switch was provided, remove it in a later release.
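
For step 1, here is a minimal sketch of the guard. The class and method names
are illustrative only (the real hook point in Java is [1]), but it shows the
shape of the check:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.ReadableByteChannel;

public class LegacyMessageLengthReader {
  /** Reads the 4-byte little-endian length prefix and rejects 1.0.0-format streams. */
  static int readMessageLength(ReadableByteChannel in) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    while (buf.hasRemaining()) {
      if (in.read(buf) < 0) {
        return 0; // end of stream (truncation handling elided in this sketch)
      }
    }
    buf.flip();
    int length = buf.getInt();
    if (length == -1) { // 0xFFFFFFFF, the value a 1.0.0-format writer emits here
      throw new IOException(
          "The current stream reader isn't compatible with version 1.0.0 streams; "
              + "please upgrade to a reader that supports the 1.0.0 format.");
    }
    return length;
  }
}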

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/9fe728c86caaf9ceb1827159eb172ff81fb98550/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageChannelReader.java#L67
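
For step 2, the reader could dispatch on the first word to accept both
framings. Another rough sketch; the marker value and layout are assumptions,
since the exact 1.0.0 format is still being discussed:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.ReadableByteChannel;

public class DualFormatLengthReader {
  /**
   * Returns the metadata length, handling both the legacy framing
   * (int32 size, then metadata) and a 1.0.0-style framing that prefixes
   * the size with a 0xFFFFFFFF continuation marker.
   */
  static int readMetadataLength(ReadableByteChannel in) throws IOException {
    int first = readInt(in);
    if (first != -1) {  // -1 is 0xFFFFFFFF
      return first;     // legacy pre-1.0.0 stream: the first word is the size
    }
    return readInt(in); // 1.0.0-style stream: the size follows the marker
  }

  private static int readInt(ReadableByteChannel in) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
    while (buf.hasRemaining()) {
      if (in.read(buf) < 0) {
        throw new IOException("unexpected end of stream");
      }
    }
    buf.flip();
    return buf.getInt();
  }
}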

On Thu, Jul 18, 2019 at 8:58 PM Wes McKinney  wrote:

> To be clear, we could make a patch 0.14.x release that includes the
> necessary compatibility changes. I presume Spark will be able to upgrade to
> a new patch release (I'd be surprised if not, otherwise how can you get
> security fixes)?
>
> On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler  wrote:
>
> > Hey Wes,
> > I understand we don't want to burden 1.0 by maintaining compatibility and
> > that is fine with me. I'm just trying to figure out how to best handle this
> > situation so Spark users won't get a cryptic error message. It sounds like
> > it will need to be handled on the Spark side to not allow mixing 1.0 and
> > pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
> > compatibility would help, it might depend on when things get released but
> > we can discuss that in another thread.
> >
> > On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney  wrote:
> >
> > > hi Bryan -- well, the reason for the current 0.x version is precisely
> > > to avoid a situation where we are making decisions on the basis of
> > > maintaining forward / backward compatibility.
> > >
> > > One possible way forward on this is to make a 0.15.0 (0.14.2, so there
> > > is less trouble for Spark to upgrade) release that supports reading
> > > _both_ old and new variants of the protocol.
> > >
> > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler  wrote:
> > > >
> > > > Are we going to say that Arrow 1.0 is not compatible with any version
> > > > before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> > > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work.
> > > > In Spark 3.0.0, though, it will be no problem to update both Java and Python
> > > > to 1.0. Having a compatibility mode so that new readers/writers can work
> > > > with old readers using a 4-byte prefix would solve the problem, but if we
> > > > don't want to do this, will pyarrow be able to raise an error that clearly
> > > > states the new version does not support the old protocol?  For example, would a
> > > > pyarrow reader see the 0x and raise something like "PyArrow
> > > > detected an old protocol and cannot continue, please use a version <
> > > > 1.0.0"?
> > > >
> > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney  wrote:
> > > >
> > > > > Hi Francois -- copying the metadata into memory isn't the end of the
> > > > > world but it's a pretty ugly wart. This affects every IPC protocol
> > > > > message everywhere.
> > > > >
> > > > > We have an opportunity to address the wart now but such a fix
> > > > > post-1.0.0 will be much more difficult.
> > > > >
> > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > > > > fsaintjacq...@gmail.com> wrote:
> > > > >
> > > > > > If the data buffers are still aligned, then I don't think we should
> > > > > > add a breaking change just for avoiding the copy on the metadata? I'd
> > > > > > expect said metadata to be small enough that zero-copy doesn't really
> > > > > > affect performance.
> > > > > >
> > > > > > François
> > > > > >
> > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <emkornfi...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > While working on trying to fix undefined behavior for unaligned memory
> > > > > > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > > > > > prevents us from ever achieving zero-copy memory mapping and having
> > > > > > > aligned accesses (i.e. clean UBSan runs).
> > > > > > >
> > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> > > > > > >
> > > > > > > In the IPC format we align each message to 8-byte boundaries.

[jira] [Created] (ARROW-5986) [Java] Code cleanup for dictionary encoding

2019-07-18 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5986:
-

 Summary: [Java] Code cleanup for dictionary encoding
 Key: ARROW-5986
 URL: https://issues.apache.org/jira/browse/ARROW-5986
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


In the last few weeks, we did some refactoring in dictionary encoding.

Since the newly designed hash table for {{DictionaryEncoder}} and the {{hashCode}} & 
{{equals}} APIs in {{ValueVector}} are already checked in, some classes are no longer 
used, like {{DictionaryEncodingHashTable}}, {{BaseBinaryVector}} and related 
benchmarks & UTs.

Fortunately, these changes did not make it into version 0.14, which makes it possible 
to remove them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-18 Thread Wes McKinney
To be clear, we could make a patch 0.14.x release that includes the
necessary compatibility changes. I presume Spark will be able to upgrade to
a new patch release (I'd be surprised if not, otherwise how can you get
security fixes)?

On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler  wrote:

> Hey Wes,
> I understand we don't want to burden 1.0 by maintaining compatibility and
> that is fine with me. I'm just trying to figure out how to best handle this
> situation so Spark users won't get a cryptic error message. It sounds like
> it will need to be handled on the Spark side to not allow mixing 1.0 and
> pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
> compatibility would help, it might depend on when things get released but
> we can discuss that in another thread.
>
> On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney  wrote:
>
> > hi Bryan -- well, the reason for the current 0.x version is precisely
> > to avoid a situation where we are making decisions on the basis of
> > maintaining forward / backward compatibility.
> >
> > One possible way forward on this is to make a 0.15.0 (0.14.2, so there
> > is less trouble for Spark to upgrade) release that supports reading
> > _both_ old and new variants of the protocol.
> >
> > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler  wrote:
> > >
> > > Are we going to say that Arrow 1.0 is not compatible with any version
> > > before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work.
> > > In Spark 3.0.0, though, it will be no problem to update both Java and Python
> > > to 1.0. Having a compatibility mode so that new readers/writers can work
> > > with old readers using a 4-byte prefix would solve the problem, but if we
> > > don't want to do this, will pyarrow be able to raise an error that clearly
> > > states the new version does not support the old protocol?  For example, would a
> > > pyarrow reader see the 0x and raise something like "PyArrow
> > > detected an old protocol and cannot continue, please use a version <
> > > 1.0.0"?
> > >
> > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney  wrote:
> > >
> > > > Hi Francois -- copying the metadata into memory isn't the end of the
> > > > world but it's a pretty ugly wart. This affects every IPC protocol message
> > > > everywhere.
> > > >
> > > > We have an opportunity to address the wart now but such a fix
> > > > post-1.0.0 will be much more difficult.
> > > >
> > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > > > fsaintjacq...@gmail.com> wrote:
> > > >
> > > > > If the data buffers are still aligned, then I don't think we should
> > > > > add a breaking change just for avoiding the copy on the metadata? I'd
> > > > > expect said metadata to be small enough that zero-copy doesn't really
> > > > > affect performance.
> > > > >
> > > > > François
> > > > >
> > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <emkornfi...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > While working on trying to fix undefined behavior for unaligned memory
> > > > > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > > > > prevents us from ever achieving zero-copy memory mapping and having
> > > > > > aligned accesses (i.e. clean UBSan runs).
> > > > > >
> > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> > > > > >
> > > > > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > > > > write an int32_t integer to denote the size of the flatbuffer metadata,
> > > > > > followed immediately by the flatbuffer metadata.  This means the
> > > > > > flatbuffer metadata will never be 8-byte aligned.
> > > > > >
> > > > > > Do people care?  A simple fix would be to use int64_t instead of
> > > > > > int32_t for the length.  However, any fix essentially breaks all previous
> > > > > > client library versions or incurs a memory copy.
> > > > > >
> > > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > > [2] https://arrow.apache.org/docs/ipc.html
> > > > >
> > > >
> >
>


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-18 Thread Bryan Cutler
Hey Wes,
I understand we don't want to burden 1.0 by maintaining compatibility and
that is fine with me. I'm just trying to figure out how to best handle this
situation so Spark users won't get a cryptic error message. It sounds like
it will need to be handled on the Spark side to not allow mixing 1.0 and
pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
compatibility would help, it might depend on when things get released but
we can discuss that in another thread.

On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney  wrote:

> hi Bryan -- well, the reason for the current 0.x version is precisely
> to avoid a situation where we are making decisions on the basis of
> maintaining forward / backward compatibility.
>
> One possible way forward on this is to make a 0.15.0 (0.14.2, so there
> is less trouble for Spark to upgrade) release that supports reading
> _both_ old and new variants of the protocol.
>
> On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler  wrote:
> >
> > Are we going to say that Arrow 1.0 is not compatible with any version
> > before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work.
> > In Spark 3.0.0, though, it will be no problem to update both Java and Python
> > to 1.0. Having a compatibility mode so that new readers/writers can work
> > with old readers using a 4-byte prefix would solve the problem, but if we
> > don't want to do this, will pyarrow be able to raise an error that clearly
> > states the new version does not support the old protocol?  For example, would a
> > pyarrow reader see the 0x and raise something like "PyArrow
> > detected an old protocol and cannot continue, please use a version <
> > 1.0.0"?
> >
> > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney  wrote:
> >
> > > Hi Francois -- copying the metadata into memory isn't the end of the
> > > world but it's a pretty ugly wart. This affects every IPC protocol message
> > > everywhere.
> > >
> > > We have an opportunity to address the wart now but such a fix
> > > post-1.0.0 will be much more difficult.
> > >
> > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > > fsaintjacq...@gmail.com> wrote:
> > >
> > > > If the data buffers are still aligned, then I don't think we should
> > > > add a breaking change just for avoiding the copy on the metadata? I'd
> > > > expect said metadata to be small enough that zero-copy doesn't really
> > > > affect performance.
> > > >
> > > > François
> > > >
> > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <emkornfi...@gmail.com>
> > > > wrote:
> > > > >
> > > > > While working on trying to fix undefined behavior for unaligned memory
> > > > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > > > prevents us from ever achieving zero-copy memory mapping and having
> > > > > aligned accesses (i.e. clean UBSan runs).
> > > > >
> > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> > > > >
> > > > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > > > write an int32_t integer to denote the size of the flatbuffer metadata,
> > > > > followed immediately by the flatbuffer metadata.  This means the
> > > > > flatbuffer metadata will never be 8-byte aligned.
> > > > >
> > > > > Do people care?  A simple fix would be to use int64_t instead of
> > > > > int32_t for the length.  However, any fix essentially breaks all previous
> > > > > client library versions or incurs a memory copy.
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > [2] https://arrow.apache.org/docs/ipc.html
> > > >
> > >
>


Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter

2019-07-18 Thread Ji Liu
Thanks a lot for Wes's and Liya's feedback.

Agreed that parsing performance of CSV files is important. I just found a 
benchmark of Java CSV libraries [1][2] which shows FastCSV has clear 
advantages. Anyway, I will test it myself.


Thanks,
Ji Liu

[1] https://raw.githubusercontent.com/osiegmar/FastCSV/master/benchmark.png
[2] https://github.com/osiegmar/FastCSV


--
From:Fan Liya 
Send Time: Friday, July 19, 2019 10:14
To:dev 
Cc:Ji Liu ; Micah Kornfield 
Subject:Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter

Hi Ji,

Thanks for proposing this. CSV adapter sounds like a useful feature.

Best,
Liya Fan
On Fri, Jul 19, 2019 at 12:31 AM Wes McKinney  wrote:
We wrote a custom reader in C++ since performance of parsing CSV files
 matters a lot -- we wanted to do multi-threaded execution of
 conversion steps, also. I don't know what the performance of
 commons-csv is but it might be worth doing some benchmarks to see.

 On Thu, Jul 18, 2019 at 4:35 AM Ji Liu  wrote:
 >
 > Hi all,
 >
 > Seems there is no adapter to convert CSV data to Arrow data on the Java side,
 > which C++ has.  Now we already have a JDBC adapter, an ORC adapter and an Avro
 > adapter (in progress), so I think an adapter for CSV would probably also be
 > nice.
 > After a brief discussion with @Micah Kornfield, Apache commons-csv [1] seems an
 > efficient CSV parser that we could potentially leverage, but I don't know if
 > there are other better options. Any inputs and comments would be appreciated.
 >
 > Thanks,
 > Ji Liu
 >
 > [1] https://commons.apache.org/proper/commons-csv/


Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter

2019-07-18 Thread Fan Liya
Hi Ji,

Thanks for proposing this. CSV adapter sounds like a useful feature.

Best,
Liya Fan

On Fri, Jul 19, 2019 at 12:31 AM Wes McKinney  wrote:

> We wrote a custom reader in C++ since performance of parsing CSV files
> matters a lot -- we wanted to do multi-threaded execution of
> conversion steps, also. I don't know what the performance of
> commons-csv is but it might be worth doing some benchmarks to see.
>
> On Thu, Jul 18, 2019 at 4:35 AM Ji Liu  wrote:
> >
> > Hi all,
> >
> > Seems there is no adapter to convert CSV data to Arrow data on the Java side,
> > which C++ has.  Now we already have a JDBC adapter, an ORC adapter and an Avro
> > adapter (in progress), so I think an adapter for CSV would probably also be
> > nice.
> > After a brief discussion with @Micah Kornfield, Apache commons-csv [1] seems
> > an efficient CSV parser that we could potentially leverage, but I don't know
> > if there are other better options. Any inputs and comments would be appreciated.
> >
> > Thanks,
> > Ji Liu
> >
> > [1] https://commons.apache.org/proper/commons-csv/
>


[jira] [Created] (ARROW-5984) [C++] Provide method on AdaptiveIntBuilder for appending integer Array types

2019-07-18 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5984:
---

 Summary: [C++] Provide method on AdaptiveIntBuilder for appending 
integer Array types
 Key: ARROW-5984
 URL: https://issues.apache.org/jira/browse/ARROW-5984
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


For Int8/16/32, it is not currently possible to do a bulk append of an integer Array.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5983) [C++] Provide bulk method on TypedBufferBuilder for appending a bitmap

2019-07-18 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5983:
---

 Summary: [C++] Provide bulk method on TypedBufferBuilder for 
appending a bitmap
 Key: ARROW-5983
 URL: https://issues.apache.org/jira/browse/ARROW-5983
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We have {{TypedBufferBuilder::UnsafeAppend}} for an array of bytes (where 
non-zero becomes 1), but it would be useful to also have {{UnsafeAppendBits}} 
so that bitmaps coming from {{arrow::Array}} can also be appended.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5982) [C++] Add methods to append dictionary values and dictionary indices directly into DictionaryBuilder

2019-07-18 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5982:
---

 Summary: [C++] Add methods to append dictionary values and 
dictionary indices directly into DictionaryBuilder
 Key: ARROW-5982
 URL: https://issues.apache.org/jira/browse/ARROW-5982
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


In scenarios where a developer already has an array of dictionary indices that 
reference a known dictionary, it is useful to be able to insert the indices 
directly, circumventing the hash table lookup. The developer will be 
responsible for keeping things consistent.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-18 Thread Sutou Kouhei
+1 (binding)

I ran the following on Debian GNU/Linux sid:

  * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 
CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh source 0.14.1 0
  * dev/release/verify-release-candidate.sh binaries 0.14.1 0

with:

  * gcc (Debian 8.3.0-7) 8.3.0
  * openjdk version "1.8.0_212"
  * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741) [x86_64-linux]
  * Node.JS v12.1.0
  * go version go1.11.6 linux/amd64
  * nvidia-cuda-dev 9.2.148-7

I re-ran the C# tests several times with the following command line:

  TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1 
dev/release/verify-release-candidate.sh source 0.14.1 0

But "sourcelink test" is always failed:

  + sourcelink test 
artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
  The operation was canceled.

I don't think that this is a blocker.


Thanks,
--
kou

In 
  "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019 04:54:33 +0200,
  Krisztián Szűcs  wrote:

> Hi,
> 
> I would like to propose the following release candidate (RC0) of Apache
> Arrow version 0.14.1. This is a patch release consisting of 47 resolved
> JIRA issues[1].
> 
> This release candidate is based on commit:
> 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> 
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release candidate.
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow 0.14.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> 
> [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
> [2]:
> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
> [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
> [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
> [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
> [8]:
> https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
> [9]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


[jira] [Created] (ARROW-5981) [C++] DictionaryBuilder initialization with Array can fail silently

2019-07-18 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5981:
---

 Summary: [C++] DictionaryBuilder initialization with Array can 
fail silently
 Key: ARROW-5981
 URL: https://issues.apache.org/jira/browse/ARROW-5981
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


See

https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_dict.cc#L267

I think it would be better to expose {{InsertValues}} on {{DictionaryBuilder}} 
and initialize from a known dictionary that way.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5980) Missing libarrow.so and libarrow_python.so when installing pyarrow

2019-07-18 Thread Haowei Yu (JIRA)
Haowei Yu created ARROW-5980:


 Summary: Missing libarrow.so and libarrow_python.so when 
installing pyarrow
 Key: ARROW-5980
 URL: https://issues.apache.org/jira/browse/ARROW-5980
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.14.0
Reporter: Haowei Yu


I have installed pyarrow 0.14.0, but it seems that by default you did not 
provide symlinks for libarrow.so and libarrow_python.so. Only .so files with a 
version suffix are provided. Hence, I cannot use the output of 
pyarrow.get_libraries() and pyarrow.get_library_dirs() to build my link options.

If you provide symlinks, I can pass the following to the linker to specify the 
libraries to link, e.g. g++ -L/ -larrow -larrow_python

However, right now, the ld output complains about not being able to find 
-larrow and -larrow_python.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-18 Thread Wes McKinney
hi Bryan -- well, the reason for the current 0.x version is precisely
to avoid a situation where we are making decisions on the basis of
maintaining forward / backward compatibility.

One possible way forward on this is to make a 0.15.0 (0.14.2, so there
is less trouble for Spark to upgrade) release that supports reading
_both_ old and new variants of the protocol.

On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler  wrote:
>
> Are we going to say that Arrow 1.0 is not compatible with any version
> before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work.
> In Spark 3.0.0, though, it will be no problem to update both Java and Python
> to 1.0. Having a compatibility mode so that new readers/writers can work
> with old readers using a 4-byte prefix would solve the problem, but if we
> don't want to do this, will pyarrow be able to raise an error that clearly
> states the new version does not support the old protocol?  For example, would a
> pyarrow reader see the 0x and raise something like "PyArrow
> detected an old protocol and cannot continue, please use a version < 1.0.0"?
>
> On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney  wrote:
>
> > Hi Francois -- copying the metadata into memory isn't the end of the world
> > but it's a pretty ugly wart. This affects every IPC protocol message
> > everywhere.
> >
> > We have an opportunity to address the wart now but such a fix post-1.0.0
> > will be much more difficult.
> >
> > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > fsaintjacq...@gmail.com> wrote:
> >
> > > If the data buffers are still aligned, then I don't think we should
> > > add a breaking change just for avoiding the copy on the metadata? I'd
> > > expect said metadata to be small enough that zero-copy doesn't really
> > > affect performance.
> > >
> > > François
> > >
> > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield 
> > > wrote:
> > > >
> > > > While working on trying to fix undefined behavior for unaligned memory
> > > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > > prevents us from ever achieving zero-copy memory mapping and having
> > > > aligned accesses (i.e. clean UBSan runs).
> > > >
> > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> > > >
> > > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > > write an int32_t integer to denote the size of the flatbuffer metadata,
> > > > followed immediately by the flatbuffer metadata.  This means the
> > > > flatbuffer metadata will never be 8-byte aligned.
> > > >
> > > > Do people care?  A simple fix would be to use int64_t instead of int32_t
> > > > for the length.  However, any fix essentially breaks all previous client
> > > > library versions or incurs a memory copy.
> > > >
> > > > [1] https://github.com/apache/arrow/pull/4757
> > > > [2] https://arrow.apache.org/docs/ipc.html
> > >
> >


[jira] [Created] (ARROW-5979) [FlightRPC] Expose (de)serialization of protocol types

2019-07-18 Thread lidavidm (JIRA)
lidavidm created ARROW-5979:
---

 Summary: [FlightRPC] Expose (de)serialization of protocol types
 Key: ARROW-5979
 URL: https://issues.apache.org/jira/browse/ARROW-5979
 Project: Apache Arrow
  Issue Type: New Feature
  Components: FlightRPC
Reporter: lidavidm


It would be nice to be able to serialize/deserialize Flight types (e.g. 
FlightInfo) to/from the binary representations, in order to interoperate with 
systems that might want to provide (say) Flight tickets or FlightInfo without 
using the Flight protocol. For instance, you might have a search server that 
exposes a REST interface and wants to provide FlightInfo objects for Flight 
clients, without having to listen on a separate port.
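
At the protobuf level the round trip is straightforward; a sketch using only
the standard protobuf-java API (the Flight-specific wrapper types and method
names are deliberately not assumed here, since exposing them is exactly what
this issue asks for):

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Message;

final class ProtoCodec {
  /** Wire bytes that another system (e.g. a REST service) can hand out. */
  static byte[] serialize(Message msg) {
    return msg.toByteArray();
  }

  /** Rebuilds a message of the same type from its wire bytes. */
  @SuppressWarnings("unchecked")
  static <T extends Message> T deserialize(T prototype, byte[] bytes)
      throws InvalidProtocolBufferException {
    return (T) prototype.getParserForType().parseFrom(bytes);
  }
}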



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-18 Thread Bryan Cutler
Are we going to say that Arrow 1.0 is not compatible with any version
before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work.
In Spark 3.0.0, though, it will be no problem to update both Java and Python
to 1.0. Having a compatibility mode so that new readers/writers can work
with old readers using a 4-byte prefix would solve the problem, but if we
don't want to do this, will pyarrow be able to raise an error that clearly
states the new version does not support the old protocol?  For example, would a
pyarrow reader see the 0x and raise something like "PyArrow
detected an old protocol and cannot continue, please use a version < 1.0.0"?
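
For context, here is a sketch of the two framings under discussion in the
quoted thread below. The marker value (0xFFFFFFFF, i.e. a -1 length) and the
padding rule are assumptions, since the final 1.0.0 layout has not been
decided:

// pre-1.0 framing:  <int32 metadata_size> <flatbuffer metadata> <body>
//   -> within an 8-byte-aligned message the metadata starts at offset 4,
//      so it is never 8-byte aligned.
// possible fix:     <int32 0xFFFFFFFF> <int32 metadata_size> <metadata, padded> <body>
//   -> the metadata starts at offset 8, and the leading marker (invalid as a
//      pre-1.0 length) lets new readers tell the two formats apart.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.WritableByteChannel;

public class AlignedMessageWriter {
  static void writeMessage(WritableByteChannel out, ByteBuffer metadata) throws IOException {
    int size = metadata.remaining();
    int padded = (size + 7) & ~7; // pad the metadata so the body stays 8-byte aligned
    ByteBuffer prefix = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
    prefix.putInt(0xFFFFFFFF);    // continuation marker
    prefix.putInt(padded);
    prefix.flip();
    out.write(prefix);
    out.write(metadata);
    out.write(ByteBuffer.allocate(padded - size)); // zero padding
  }
}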

On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney  wrote:

> Hi Francois -- copying the metadata into memory isn't the end of the world
> but it's a pretty ugly wart. This affects every IPC protocol message
> everywhere.
>
> We have an opportunity to address the wart now but such a fix post-1.0.0
> will be much more difficult.
>
> On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > If the data buffers are still aligned, then I don't think we should
> > add a breaking change just for avoiding the copy on the metadata? I'd
> > expect said metadata to be small enough that zero-copy doesn't really
> > affect performance.
> >
> > François
> >
> > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield 
> > wrote:
> > >
> > > While working on trying to fix undefined behavior for unaligned memory
> > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > prevents us from ever achieving zero-copy memory mapping and having
> > > aligned accesses (i.e. clean UBSan runs).
> > >
> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> > >
> > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > write an int32_t integer to denote the size of the flatbuffer metadata,
> > > followed immediately by the flatbuffer metadata.  This means the
> > > flatbuffer metadata will never be 8-byte aligned.
> > >
> > > Do people care?  A simple fix would be to use int64_t instead of int32_t
> > > for the length.  However, any fix essentially breaks all previous client
> > > library versions or incurs a memory copy.
> > >
> > > [1] https://github.com/apache/arrow/pull/4757
> > > [2] https://arrow.apache.org/docs/ipc.html
> >
>


RE: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-18 Thread Eric Erhardt
+1

Tested:
- C# source verification on Ubuntu 18
- I verified the C# source contained the fixes for the two issues I needed 
fixed in this patch.

-Original Message-
From: Krisztián Szűcs  
Sent: Tuesday, July 16, 2019 9:55 PM
To: dev@arrow.apache.org
Subject: [VOTE] Release Apache Arrow 0.14.1 - RC0

Hi,

I would like to propose the following release candidate (RC0) of Apache Arrow 
version 0.14.1. This is a patch release consisting of 47 resolved JIRA issues[1].

This release candidate is based on commit:
5f564424c71cef12619522cdde59be5f69b31b68 [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7].
The changelog is located at [8].

Please download, verify checksums and signatures, run the unit tests, and vote 
on the release. See [9] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 0.14.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow 0.14.1 because...

[1]:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
[2]:
https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
[4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
[5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
[6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
[7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
[8]:
https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
[9]:
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


[jira] [Created] (ARROW-5978) [FlightRPC] [Java] Integration test client doesn't close buffers

2019-07-18 Thread lidavidm (JIRA)
lidavidm created ARROW-5978:
---

 Summary: [FlightRPC] [Java] Integration test client doesn't close 
buffers
 Key: ARROW-5978
 URL: https://issues.apache.org/jira/browse/ARROW-5978
 Project: Apache Arrow
  Issue Type: Test
  Components: FlightRPC, Integration, Java
Affects Versions: 0.14.0
Reporter: lidavidm
Assignee: lidavidm
 Fix For: 1.0.0


The integration test client doesn't close any of the clients or free any of the 
buffers it creates.

Trying to do so leads to a leak problem on the dictionary vector case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5977) Method for read_csv to limit which columns are read?

2019-07-18 Thread Jordan Samuels (JIRA)
Jordan Samuels created ARROW-5977:
-

 Summary: Method for read_csv to limit which columns are read?
 Key: ARROW-5977
 URL: https://issues.apache.org/jira/browse/ARROW-5977
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.14.0
Reporter: Jordan Samuels


In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this in 
pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[DISCUSS][JAVA] Implement a CSV to Arrow adapter

2019-07-18 Thread Ji Liu
Hi all,

Seems there is no adapter to convert CSV data to Arrow data on the Java side, which 
C++ has.  Now we already have a JDBC adapter, an ORC adapter and an Avro adapter (in 
progress), so I think an adapter for CSV would probably also be nice.
After a brief discussion with @Micah Kornfield, Apache commons-csv [1] seems an 
efficient CSV parser that we could potentially leverage, but I don't know if 
there are other better options. Any inputs and comments would be appreciated.

Thanks,
Ji Liu

[1] https://commons.apache.org/proper/commons-csv/
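
A rough sketch of what a commons-csv-based adapter could look like. The
all-string typing and the tiny inline input are just for illustration; a real
adapter would infer or accept a schema and batch the output:

import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvToArrowSketch {
  public static void main(String[] args) throws Exception {
    Reader input = new StringReader("name,city\nalice,nyc\nbob,sfo\n");
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
        VarCharVector names = new VarCharVector("name", allocator);
        VarCharVector cities = new VarCharVector("city", allocator);
        CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(input)) {
      names.allocateNew();
      cities.allocateNew();
      int row = 0;
      for (CSVRecord record : parser) { // commons-csv streams records lazily
        names.setSafe(row, record.get("name").getBytes(StandardCharsets.UTF_8));
        cities.setSafe(row, record.get("city").getBytes(StandardCharsets.UTF_8));
        row++;
      }
      names.setValueCount(row);
      cities.setValueCount(row);
      System.out.println(names.getObject(0)); // alice
    }
  }
}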

Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-18 Thread Krisztián Szűcs
Hey Zhuo,

On Thu, Jul 18, 2019 at 2:23 AM Zhuo Peng  wrote:

> Hi Krisztián,
>
> Sorry if it's too late, but is it possible to also include
> https://github.com/apache/arrow/pull/4883 in the release?

It's late because I'm away from keyboard; Sunday is the earliest day when
I could draft another release candidate.
If other issues come up with RC0 and the vote doesn't pass, then we
can include it in RC1.

> This would help
> resolve https://github.com/apache/arrow/issues/4472 .
>
> Thanks,
>
> Zhuo
>
> On Wed, Jul 17, 2019 at 3:00 AM Antoine Pitrou  wrote:
>
> >
> > +1 (binding).
> >
> > Tested on Ubuntu 18.04.2 (x86-64) with CUDA enabled:
> >
> > - binaries verification worked fine
> > - source verification worked until the npm step, which failed (I don't
> > have npm installed)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 17/07/2019 à 04:54, Krisztián Szűcs a écrit :
> > > Hi,
> > >
> > > I would like to propose the following release candidate (RC0) of Apache
> > > Arrow version 0.14.1. This is a patch release consisting of 47 resolved
> > > JIRA issues[1].
> > >
> > > This release candidate is based on commit:
> > > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> > >
> > > The source release rc0 is hosted at [3].
> > > The binary artifacts are hosted at [4][5][6][7].
> > > The changelog is located at [8].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [9] for how to validate a release
> candidate.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow 0.14.1
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> > >
> > > [1]:
> > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
> > > [2]:
> > > https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
> > > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
> > > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
> > > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
> > > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
> > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
> > > [8]:
> > > https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
> > > [9]:
> > > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > >
> >
>


[jira] [Created] (ARROW-5976) [C++] RETURN_IF_ERROR(ctx) should be namespaced

2019-07-18 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5976:
--

 Summary: [C++] RETURN_IF_ERROR(ctx) should be namespaced
 Key: ARROW-5976
 URL: https://issues.apache.org/jira/browse/ARROW-5976
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield
Assignee: Micah Kornfield
 Fix For: 1.0.0


RETURN_IF_ERROR is a common macro name; it shouldn't be exposed in a header file 
without being namespaced to Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5975) [C++][Gandiva] Add method to cast Date(in Milliseconds) to timestamp

2019-07-18 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-5975:
---

 Summary: [C++][Gandiva] Add method to cast Date(in Milliseconds) 
to timestamp
 Key: ARROW-5975
 URL: https://issues.apache.org/jira/browse/ARROW-5975
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Affects Versions: 1.0.0
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla
 Fix For: 1.0.0


Add a castTIMESTAMP_date64(date64) method in Gandiva. The input date is in 
milliseconds.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS] Release cadence and release vote conventions

2019-07-18 Thread Micah Kornfield
I can help as well, but I'm not exactly sure where to start.  It seems like
there are already some JIRAs opened [1]
for improving the release?  Could someone more familiar with the process
pick out the highest priority ones? Do more need to be opened?

Thanks,
Micah

[1]
https://issues.apache.org/jira/browse/ARROW-2880?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(%22Developer%20Tools%22%2C%20Packaging)%20and%20summary%20~%20Release

On Sat, Jul 13, 2019 at 7:17 AM Wes McKinney  wrote:

> To be effective at improving the life of release managers, the nightly
> release process really should use as close as possible to the same
> scripts that the RM uses to produce the release. Otherwise we could
> have a situation where the nightlies succeed but there is some problem
> that either fails an RC or is unable to be produced at all.
>
> On Sat, Jul 13, 2019 at 9:12 AM Andy Grove  wrote:
> >
> > I would like to volunteer to help with Java and Rust release process
> work,
> > especially nightly releases.
> >
> > Although I'm not that familiar with the Java implementation of Arrow, I
> > have been using Java and Maven for a very long time.
> >
> > Do we envisage a single nightly release process that releases all
> languages
> > simultaneously? or do we want separate process per language, with
> different
> > maintainers?
> >
> >
> >
> > On Wed, Jul 10, 2019 at 8:18 AM Wes McKinney 
> wrote:
> >
> > > On Sun, Jul 7, 2019 at 7:40 PM Sutou Kouhei 
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > in future releases we should
> > > > > institute a minimum 24-hour "quiet period" after any community
> > > > > feedback on a release candidate to allow issues to be examined
> > > > > further.
> > > >
> > > > I agree with this. I'll do so when I do a release manager in
> > > > the future.
> > > >
> > > > > To be able to release more often, two things have to happen:
> > > > >
> > > > > * More PMC members must engage with the release management role,
> > > > > process, and tools
> > > > > * Continued improvements to release tooling to make the process
> less
> > > > > painful for the release manager. For example, it seems we may want
> to
> > > > > find a different place than Bintray to host binary artifacts
> > > > > temporarily during release votes
> > > >
> > > > My opinion is that we need to build a nightly release system.
> > > >
> > > > It uses dev/release/NN-*.sh to build .tar.gz and binary
> > > > artifacts from the .tar.gz.
> > > > It also uses dev/release/verify-release-candidate.* to
> > > > verify build .tar.gz and binary artifacts.
> > > > It also uses dev/release/post-NN-*.sh to do post release
> > > > tasks. (Some tasks such as uploading a package to packaging
> > > > system will be dry-run.)
> > > >
> > >
> > > I agree that having a turn-key release system that's capable of
> > > producing nightly packages is the way to go. That way any problems
> > > that would block a release will come up as they happen rather than
> > > piling up until the very end like they are now.
> > >
> > > > I needed 10 or more changes for dev/release/ to create
> > > > 0.14.0 RC0. (Some of them are still in my local stashes. I
> > > > don't have time to create pull requests for them
> > > > yet, because I postponed some tasks of my main
> > > > business. I'll create pull requests after I finish the
> > > > postponed tasks of my main business.)
> > > >
> > >
> > > Thanks. I'll follow up on the 0.14.1/0.15.0 thread -- since we need to
> > > release again soon because of problems with 0.14.0 please let us know
> > > what patches will be needed to make another release.
> > >
> > > > If we fix problems related to dev/release/ in our normal
> > > > development process, the release process will be less painful.
> > > >
> > > > The biggest problem for 0.14.0 RC0 is java/pom.xml related:
> > > >   https://github.com/apache/arrow/pull/4717
> > > >
> > > > It was difficult for me because I don't have Java
> > > > knowledge. Release manager needs help from many developers
> > > > because release manager may not have knowledge of all
> > > > supported languages. Apache Arrow supports over 10
> > > > languages.
> > > >
> > > >
> > > > As for the Bintray API limit problem, we'll be able to resolve it.
> > > > I was added to https://bintray.com/apache/ members:
> > > >
> > > >   https://issues.apache.org/jira/browse/INFRA-18698
> > > >
> > > > I'll be able to use Bintray API without limitation in the
> > > > future. Release managers should also request the same thing.
> > > >
> > >
> > > This is good, I will add myself. Other PMC members should also add
> > > themselves.
> > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In  lsowxqxidjapc_cofguksj...@mail.gmail.com>
> > > >   "[DISCUSS] Release cadence and release vote conventions" on Sat, 6
> Jul
> > > 2019 16:28:50 -0500,
> > > >   Wes McKinney  wrote:
> > > >
> > > > > hi folks,
> > > > >
> > > > > As a reminder,