Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-22 Thread Micah Kornfield
I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney  wrote:

> I'd be satisfied with fixing the Flatbuffer alignment issue either in
> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> 0.15.0 with this change sooner rather than later might be prudent.
>
> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
> wrote:
> >
> >
> > Hello,
> >
> > Recently we've discussed breaking the IPC format to fix a long-standing
> > alignment issue.  See this discussion:
> >
> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> >
> > Should we first do a 0.15.0 in order to get those format fixes right?
> > Once that is settled, we can then move on to the 1.0.0 release.
> >
> > Regards
> >
> > Antoine.
>


Re: [Memo] API Behavior changes

2019-07-22 Thread Fan Liya
@Wes McKinney,

Thanks for the good suggestion.

Best,
Liya Fan

On Mon, Jul 22, 2019 at 8:23 PM Wes McKinney  wrote:

> You could also use labels in JIRA to mark issues that introduce API changes
>
> On Mon, Jul 22, 2019 at 4:42 AM Fan Liya  wrote:
> >
> > @Uwe L. Korn
> >
> > Thanks a lot for the good suggestion.
> > I will create a new file to track the changes.
> >
> > Best,
> > Liya Fan
> >
> > On Mon, Jul 22, 2019 at 5:03 PM Uwe L. Korn  wrote:
> >
> > > > Hello Liya,
> > >
> > > what about having this as part of the repository, e.g.
> > > java/api-changes.md? We have an auto-generated changelog that is quite
> > > verbose but having such documentation for consumers of the Java library
> > > > would be really helpful, as it gives more densely packed information on
> > > upgrading versions.
> > >
> > > Cheers
> > > Uwe
> > >
> > > On Mon, Jul 22, 2019, at 4:54 AM, Fan Liya wrote:
> > > > Hi all,
> > > >
> > > > Let's track the API behavior changes in this email thread, so as not to
> > > forget
> > > > about them for the next release.
> > > >
> > > > ARROW-5842 : the
> > > > semantics of lastSet in ListVector changed. In the past, it referred
> to the
> > > > next index that would be set; now it points to the last index that is
> > > > actually set.
> > > >
> > > > ARROW-5973 : The
> > > > semantics of the get methods for VarCharVector, VarBinaryVector, and
> > > > FixedSizeBinaryVector changed. In the past, if the validity bit was
> clear,
> > > > the methods threw an IllegalStateException when
> > > > NULL_CHECKING_ENABLED was set, or returned an empty object when the
> flag was
> > > > not set. Now, the get methods return null if the validity bit is
> > > clear.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > >
>


Re: [DISCUSS][JAVA] Designs & goals for readers/writers

2019-07-22 Thread Micah Kornfield
Hi Wes,
Are there currently files that need to be moved?

Thanks,
Micah

On Monday, July 22, 2019, Wes McKinney  wrote:

> Sort of tangentially related, but while we are on the topic:
>
> Please, if you would, avoid checking binary test data files into the
> main repository. Use https://github.com/apache/arrow-testing if you
> truly need to check in binary data -- something to look out for in
> code reviews
>
> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield 
> wrote:
> >
> > Hi Jacques,
> > Thanks for the clarifications. I think the distinction is useful.
> >
> > If people want to write adapters for Arrow, I see that as useful but very
> > > different than writing native implementations and we should try to
> create a
> > > clear delineation between the two.
> >
> >
> > What do you think about creating a "contrib" directory and moving the
> JDBC
> > and Avro adapters into it? We should also probably provide more
> description
> > in pom.xml to make it clear for downstream consumers.
> >
> > We should probably come up with a name other than adapters for
> > readers/writers ("converters"?) and use it in the directory structure for
> > the existing Orc implementation?
> >
> > Thanks,
> > Micah
> >
> >
> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau 
> wrote:
> >
> > > As I read through your responses, I think it might be useful to talk
> about
> > > adapters versus native Arrow readers/writers. Adapters are something
> that
> > > adapt an existing API to produce and/or consume Arrow data. A native
> > > reader/writer is something that understands the format directly and
> does not
> > > have intermediate representations or APIs the data moves through beyond
> > > those that need to be used to complete work.
> > >
> > > If people want to write adapters for Arrow, I see that as useful but
> very
> > > different than writing native implementations and we should try to
> create a
> > > clear delineation between the two.
> > >
> > > Further comments inline.
> > >
> > >
> > >> Could you expand on what level of detail you would like to see a
> design
> > >> document?
> > >>
> > >
> > > A couple of paragraphs seems sufficient. These are the goals of the
> > > implementation. We target existing functionality X. It is an adapter.
> Or it
> > > is a native impl. These are the expected memory and processing
> > > characteristics, etc.  I've never been one for a huge amount of design
> but
> > > I've seen a number of recent patches appear where there is no upfront
> > > discussion. Making sure that multiple people buy into a design is the best
> way to
> > > ensure long-term maintenance and use.
> > >
> > >
> > >> I think this should be optional (the same argument below about
> predicates
> > >> apply so I won't repeat them).
> > >>
> > >
> > > Per my comments above, maybe adapter versus native reader clarifies
> > > things. For example, I've been working on a native Avro reader
> > > implementation. It is little more than chicken scratch at this point
> but
> > > its goals, vision and design are very different from the adapter that
> is
> > > being produced at the moment.
> > >
> > >
> > >> Can you clarify the intent of this objective.  Is it mainly to tie in
> with
> > >> the existing Java arrow memory book keeping?  Performance?  Something
> > >> else?
> > >>
> > >
> > > Arrow is designed to be off-heap. If you have large variable amounts of
> > > on-heap memory in an application, it starts to make it very hard to
> make
> > > decisions about off-heap versus on-heap memory since those divisions
> are by
> > > and large static in nature. It's fine for short lived applications but
> for
> > > long lived applications, if you're working with a large amount of
> data, you
> > > want to keep most of your memory in one pool. In the context of Arrow,
> this
> > > is going to naturally be off-heap memory.
> > >
> > >
> > >> I'm afraid this might lead to a "perfect is the enemy of the good"
> > >> situation.  Starting off with a known good implementation of
> conversion to
> > >> Arrow can allow us to both to profile hot-spots and provide a
> comparison
> > >> of
> > >> implementations to verify correctness.
> > >>
> > >
> > > I'm not clear what message we're sending as a community if we produce
> low
> > > performance components. The whole point of Arrow is to increase performance,
> not
> > > decrease it. I'm targeting good, not perfect. At the same time, from my
> > > perspective, Arrow development should not be approached in the same way
> > > that general Java app development should be. If we hold a high
> standard,
> > > we'll have fewer total integrations initially but I think we'll solve
> more
> > > real world problems.
> > >
> > > There is also the question of how widely adoptable we want Arrow
> libraries
> > >> to be.
> > >> It isn't surprising to me that Impala's Avro reader is an order of
> >> magnitude faster than the stock Java one.  As far as I know, Impala's
> is a
> > >> C++ implementation that does JIT with LLVM.  We could try to use it
> 

Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-22 Thread Jacques Nadeau
There are two main things that have been important to us in Dremio around
threading:

Separate threading model from algorithms. We chose to do parallelization at
the engine level instead of the operation level. This allows us to
substantially increase parallelization while still maintaining a strong
thread prioritization model. This contrasts with some systems like Apache
Impala, which chose to implement threading at the operation level. This has
ultimately hurt their ability for individual workloads to scale out within
a node. See the experimental features around MT_DOP, where they tried to
retreat from this model and struggled to do so. It serves as an example of
the challenges if you don't separate data algorithms from threading early
on in design [1]. This intention was core to how we designed Gandiva, where
an external driver makes decisions around threading and the actual
algorithm only does small amounts of work before yielding to the driver.
This allows a driver to make parallelization and scheduling decisions
without having to know the internals of the algorithm. (In Dremio, these
are all covered under the interfaces described in Operator [2] and its
subclasses, which together provide a very simple set of operation states
for the driver to understand.)

The second is that the majority of the data we work with these days is
primarily in high latency cloud storage. While we may stage data locally, a
huge amount of reads are impacted by the performance of cloud stores. To
cover these performance behaviors we did two things: the first was to
introduce a very simple-to-use async reading interface for data, seen at
[3], and the second was to introduce a collaborative way for individual
tasks to declare their blocking state to a central coordinator [4]. Happy
to cover these in more detail if people are interested. In general, using
these techniques has allowed us to tune many systems to a point where the
(highly) variable latency of cloud stores like S3 and ADLS can be mostly
cloaked by
aggressive read ahead and what we call predictive pipelining (where reading
is guided based on latency performance characteristics along with knowledge
of columnar formats like Parquet).

[1]
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_mt_dop.html#mt_dop
[2]
https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/spi/Operator.java
[3]
https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/store/dfs/async/AsyncByteReader.java
[4]
https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/threads/sharedres/SharedResourceManager.java

On Mon, Jul 22, 2019 at 9:56 AM Antoine Pitrou  wrote:

>
> On 22/07/2019 18:52, Wes McKinney wrote:
> >
> > Probably the way is to introduce async-capable read APIs into the file
> > interfaces. For example:
> >
> > file->ReadAsyncBlock(thread_ctx, ...);
> >
> > That way the file implementation can decide whether asynchronous logic
> > is actually needed.
> > I doubt very much that a one-size-fits-all
> > concurrency solution can be developed -- in some applications
> > coarse-grained IO and CPU task scheduling may be warranted, but we
> > need to have a solution for finer-grained scenarios where
> >
> > * In the memory-mapped case, there is no overhead and
> > * The programming model is not too burdensome to the library developer
>
> Well, the asynchronous I/O programming model *will* be burdensome at
> least until C++ gets coroutines (which may happen in C++20, and
> therefore be usable somewhere around 2024 for Arrow?).
>
> Regards
>
> Antoine.
>


Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-22 Thread Krisztián Szűcs
On Tue, Jul 23, 2019 at 12:31 AM Krisztián Szűcs 
wrote:

> The remaining tasks are:
> - Updating website (after https://github.com/apache/arrow/pull/4922 is
> merged)
>
I'm generating the apidocs and updating the changelog.
I can send the ANNOUNCEMENT once the site gets updated.

> - Update JavaScript packages
>
Paul has published the JS packages.

> - Update R packages
>
Would anyone like to help with it?

>
> On Mon, Jul 22, 2019 at 9:52 PM Krisztián Szűcs 
> wrote:
>
>> Added a warning about that.
>>
>> On Mon, Jul 22, 2019 at 9:38 PM Wes McKinney  wrote:
>>
>>> hi folks -- we had a small snafu with the post-release tasks because
>>> this patch release did not follow our normal release procedure where
>>> the release candidate is usually based off of master.
>>>
>>> When we prepare a patch release that is based on backported commits
>>> into a maintenance branch, we DO NOT need to rebase master or any PRs.
>>> So we need to update the release management instructions to indicate
>>> that these steps should be skipped for future patch releases (or any
>>> release that isn't based on master at some point in time).
>>>
>>> - Wes
>>>
>>> On Mon, Jul 22, 2019 at 10:46 AM Krisztián Szűcs
>>>  wrote:
>>> >
>>> > Hi,
>>> >
>>> > The 0.14.1 RC0 vote carries with 4 binding +1 (and 1 non-binding +1)
>>> votes.
>>> > Thanks for helping verify the RC!
>>> > I'm moving on to the post-release tasks [1] once github resolves its
>>> > partially
>>> > degraded service issues [2]. Any help is appreciated.
>>> >
>>> > - Krisztian
>>> >
>>> > [1]:
>>> >
>>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
>>> > [2]: https://www.githubstatus.com/
>>> >
>>> > On Mon, Jul 22, 2019 at 4:23 PM Krisztián Szűcs <
>>> szucs.kriszt...@gmail.com>
>>> > wrote:
>>> >
>>> > > +1 (binding)
>>> > >
>>> > > Ran both the source and binary verification scripts on macOS Mojave.
>>> > > Also tested the wheels in python docker containers and on OSX.
>>> > >
>>> > > On Thu, Jul 18, 2019 at 11:48 PM Sutou Kouhei 
>>> wrote:
>>> > >
>>> > >> +1 (binding)
>>> > >>
>>> > >> I ran the following on Debian GNU/Linux sid:
>>> > >>
>>> > >>   * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
>>> > >> CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh
>>> source
>>> > >> 0.14.1 0
>>> > >>   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
>>> > >>
>>> > >> with:
>>> > >>
>>> > >>   * gcc (Debian 8.3.0-7) 8.3.0
>>> > >>   * openjdk version "1.8.0_212"
>>> > >>   * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741)
>>> [x86_64-linux]
>>> > >>   * Node.JS v12.1.0
>>> > >>   * go version go1.11.6 linux/amd64
>>> > >>   * nvidia-cuda-dev 9.2.148-7
>>> > >>
>>> > >> I sometimes re-ran the C# tests with the following command line:
>>> > >>
>>> > >>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1
>>> > >> dev/release/verify-release-candidate.sh source 0.14.1 0
>>> > >>
>>> > >> But "sourcelink test" is always failed:
>>> > >>
>>> > >>   + sourcelink test
>>> > >> artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
>>> > >>   The operation was canceled.
>>> > >>
>>> > >> I don't think that this is a blocker.
>>> > >>
>>> > >>
>>> > >> Thanks,
>>> > >> --
>>> > >> kou
>>> > >>
>>> > >> In <
>>> cahm19a5jpetwjj4uj-1zoqjzqdcejj-ky673uv83jtfcoyp...@mail.gmail.com>
>>> > >>   "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019
>>> 04:54:33
>>> > >> +0200,
>>> > >>   Krisztián Szűcs  wrote:
>>> > >>
>>> > >> > Hi,
>>> > >> >
>>> > >> > I would like to propose the following release candidate (RC0) of
>>> Apache
>>> > >> > Arrow version 0.14.1. This is a patch release consisting of 47
>>> resolved
>>> > >> > JIRA issues[1].
>>> > >> >
>>> > >> > This release candidate is based on commit:
>>> > >> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
>>> > >> >
>>> > >> > The source release rc0 is hosted at [3].
>>> > >> > The binary artifacts are hosted at [4][5][6][7].
>>> > >> > The changelog is located at [8].
>>> > >> >
>>> > >> > Please download, verify checksums and signatures, run the unit
>>> tests,
>>> > >> > and vote on the release. See [9] for how to validate a release
>>> > >> candidate.
>>> > >> >
>>> > >> > The vote will be open for at least 72 hours.
>>> > >> >
>>> > >> > [ ] +1 Release this as Apache Arrow 0.14.1
>>> > >> > [ ] +0
>>> > >> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
>>> > >> >
>>> > >> > [1]:
>>> > >> >
>>> > >>
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
>>> > >> > [2]:
>>> > >> >
>>> > >>
>>> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
>>> > >> > [3]:
>>> > >>
>>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
>>> > >> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
>>> > >> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
>>> 

[jira] [Created] (ARROW-6010) [Release] JAVA_HOME is improperly set in the gen apidocs docker container

2019-07-22 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-6010:
--

 Summary: [Release] JAVA_HOME is improperly set in the gen apidocs 
docker container
 Key: ARROW-6010
 URL: https://issues.apache.org/jira/browse/ARROW-6010
 Project: Apache Arrow
  Issue Type: Task
Reporter: Krisztian Szucs


Maven and openjdk are installed by both the system package manager and conda, 
and JAVA_HOME is set to /opt/conda eventually.

So javadoc fails with: [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-site-plugin:3.5.1:site (default-site) on project 
arrow-java-root: Error generating maven-javadoc-plugin:3.0.0-M1:aggregate: 
Unable to find javadoc command: The environment variable JAVA_HOME is not 
correctly set. -> [Help 1]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6009) [Release][JS] Ignore NPM errors in the javascript release script

2019-07-22 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-6009:
--

 Summary: [Release][JS] Ignore NPM errors in the javascript release 
script
 Key: ARROW-6009
 URL: https://issues.apache.org/jira/browse/ARROW-6009
 Project: Apache Arrow
  Issue Type: Task
Reporter: Krisztian Szucs


Use {npx lerna exec --no-bail -- npm publish} in the npm-release.sh script.

cc [~paultaylor]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6008) [Release] Don't parallelize the bintray upload script

2019-07-22 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-6008:
--

 Summary: [Release] Don't parallelize the bintray upload script
 Key: ARROW-6008
 URL: https://issues.apache.org/jira/browse/ARROW-6008
 Project: Apache Arrow
  Issue Type: Task
Reporter: Krisztian Szucs
 Fix For: 1.0.0
 Attachments: binary-upload.patch

It was spawning a lot of docker containers, which resulted in fragile uploads.
Patch provided by [~kou] is attached.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6007) [Release] Use SNAPSHOT versions in pom.xml files after release

2019-07-22 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-6007:
--

 Summary: [Release] Use SNAPSHOT versions in pom.xml files after 
release
 Key: ARROW-6007
 URL: https://issues.apache.org/jira/browse/ARROW-6007
 Project: Apache Arrow
  Issue Type: Task
Reporter: Krisztian Szucs
 Fix For: 1.0.0


Before the 0.14.1 release I had to change the version numbers to have -SNAPSHOT 
suffixes; otherwise mvn refused to prepare the release.

See the required commit: 
https://github.com/apache/arrow/commit/f533bc539e9ce4342d1b04966a7cd6aa5c1a1412



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-22 Thread Krisztián Szűcs
The remaining tasks are:
- Updating website (after https://github.com/apache/arrow/pull/4922 is
merged)
- Update JavaScript packages
- Update R packages

On Mon, Jul 22, 2019 at 9:52 PM Krisztián Szűcs 
wrote:

> Added a warning about that.
>
> On Mon, Jul 22, 2019 at 9:38 PM Wes McKinney  wrote:
>
>> hi folks -- we had a small snafu with the post-release tasks because
>> this patch release did not follow our normal release procedure where
>> the release candidate is usually based off of master.
>>
>> When we prepare a patch release that is based on backported commits
>> into a maintenance branch, we DO NOT need to rebase master or any PRs.
>> So we need to update the release management instructions to indicate
>> that these steps should be skipped for future patch releases (or any
>> release that isn't based on master at some point in time).
>>
>> - Wes
>>
>> On Mon, Jul 22, 2019 at 10:46 AM Krisztián Szűcs
>>  wrote:
>> >
>> > Hi,
>> >
>> > The 0.14.1 RC0 vote carries with 4 binding +1 (and 1 non-binding +1)
>> votes.
>> > Thanks for helping verify the RC!
>> > I'm moving on to the post-release tasks [1] once github resolves its
>> > partially
>> > degraded service issues [2]. Any help is appreciated.
>> >
>> > - Krisztian
>> >
>> > [1]:
>> >
>> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
>> > [2]: https://www.githubstatus.com/
>> >
>> > On Mon, Jul 22, 2019 at 4:23 PM Krisztián Szűcs <
>> szucs.kriszt...@gmail.com>
>> > wrote:
>> >
>> > > +1 (binding)
>> > >
>> > > Ran both the source and binary verification scripts on macOS Mojave.
>> > > Also tested the wheels in python docker containers and on OSX.
>> > >
>> > > On Thu, Jul 18, 2019 at 11:48 PM Sutou Kouhei 
>> wrote:
>> > >
>> > >> +1 (binding)
>> > >>
>> > >> I ran the following on Debian GNU/Linux sid:
>> > >>
>> > >>   * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
>> > >> CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh source
>> > >> 0.14.1 0
>> > >>   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
>> > >>
>> > >> with:
>> > >>
>> > >>   * gcc (Debian 8.3.0-7) 8.3.0
>> > >>   * openjdk version "1.8.0_212"
>> > >>   * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741)
>> [x86_64-linux]
>> > >>   * Node.JS v12.1.0
>> > >>   * go version go1.11.6 linux/amd64
>> > >>   * nvidia-cuda-dev 9.2.148-7
>> > >>
>> > >> I sometimes re-ran the C# tests with the following command line:
>> > >>
>> > >>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1
>> > >> dev/release/verify-release-candidate.sh source 0.14.1 0
>> > >>
>> > >> But "sourcelink test" is always failed:
>> > >>
>> > >>   + sourcelink test
>> > >> artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
>> > >>   The operation was canceled.
>> > >>
>> > >> I don't think that this is a blocker.
>> > >>
>> > >>
>> > >> Thanks,
>> > >> --
>> > >> kou
>> > >>
>> > >> In <
>> cahm19a5jpetwjj4uj-1zoqjzqdcejj-ky673uv83jtfcoyp...@mail.gmail.com>
>> > >>   "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019
>> 04:54:33
>> > >> +0200,
>> > >>   Krisztián Szűcs  wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > I would like to propose the following release candidate (RC0) of
>> Apache
>> > >> > Arrow version 0.14.1. This is a patch release consisting of 47
>> resolved
>> > >> > JIRA issues[1].
>> > >> >
>> > >> > This release candidate is based on commit:
>> > >> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
>> > >> >
>> > >> > The source release rc0 is hosted at [3].
>> > >> > The binary artifacts are hosted at [4][5][6][7].
>> > >> > The changelog is located at [8].
>> > >> >
>> > >> > Please download, verify checksums and signatures, run the unit
>> tests,
>> > >> > and vote on the release. See [9] for how to validate a release
>> > >> candidate.
>> > >> >
>> > >> > The vote will be open for at least 72 hours.
>> > >> >
>> > >> > [ ] +1 Release this as Apache Arrow 0.14.1
>> > >> > [ ] +0
>> > >> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
>> > >> >
>> > >> > [1]:
>> > >> >
>> > >>
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
>> > >> > [2]:
>> > >> >
>> > >>
>> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
>> > >> > [3]:
>> > >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
>> > >> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
>> > >> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
>> > >> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
>> > >> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
>> > >> > [8]:
>> > >> >
>> > >>
>> https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
>> > >> > [9]:
>> > >> >
>> > >>
>> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>> > >>
>> > >
>>
>


[jira] [Created] (ARROW-6006) [C++] Error reading an empty IPC stream with a dictionary-encoded column

2019-07-22 Thread Steven Fackler (JIRA)
Steven Fackler created ARROW-6006:
-

 Summary: [C++] Error reading an empty IPC stream with a 
dictionary-encoded column
 Key: ARROW-6006
 URL: https://issues.apache.org/jira/browse/ARROW-6006
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Steven Fackler


 
{code:java}
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

void check(arrow::Status status) {
  if (!status.ok()) {
    status.Abort();
  }
}

int main() {
  auto type = arrow::dictionary(arrow::int8(), arrow::utf8());
  auto f0 = arrow::field("f0", type);
  auto schema = arrow::schema({f0});

  std::shared_ptr<arrow::io::BufferOutputStream> os;
  check(arrow::io::BufferOutputStream::Create(0, arrow::default_memory_pool(),
                                              &os));

  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
  check(arrow::ipc::RecordBatchStreamWriter::Open(&*os, schema, &writer));
  check(writer->Close());

  std::shared_ptr<arrow::Buffer> buffer;
  check(os->Finish(&buffer));

  arrow::io::BufferReader is(buffer);
  std::shared_ptr<arrow::RecordBatchReader> reader;
  check(arrow::ipc::RecordBatchStreamReader::Open(&is, &reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  check(reader->ReadNext(&batch));
}
{code}
 
{noformat}
-- Arrow Fatal Error --
Invalid: Expected message in stream, was null or length 0{noformat}
It seems like this was caused by 
[https://github.com/apache/arrow/commit/e68ca7f9aed876a1afcad81a417afb87c94ee951],
 which moved the dictionary values from the DataType to the array itself.

I initially thought I could work around this by writing a zero-length table but 
that doesn't seem to actually work.
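
For reference, here is a rough pyarrow translation of the reproduction above 
(my own sketch, assuming the 0.14-era Python IPC API; not part of the original 
report):

{code:java}
import pyarrow as pa

# Schema with a dictionary-encoded column.
schema = pa.schema([pa.field("f0", pa.dictionary(pa.int8(), pa.string()))])

# Write an IPC stream that contains no record batches.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, schema)
writer.close()

# Reading the empty stream back should yield an empty table, but errors out.
reader = pa.ipc.open_stream(sink.getvalue())
table = reader.read_all()
{code}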

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6005) arrow::FileReader::GetRecordBatchReader() does not behave as documented since ARROW-1012

2019-07-22 Thread Martin (JIRA)
Martin created ARROW-6005:
-

 Summary: arrow::FileReader::GetRecordBatchReader() does not behave 
as documented since ARROW-1012
 Key: ARROW-6005
 URL: https://issues.apache.org/jira/browse/ARROW-6005
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1, 0.14.0
Reporter: Martin


GetRecordBatchReader() should

"Return a RecordBatchReader of row groups selected from row_group_indices, the
ordering in row_group_indices matters." (that is what the doxygen string says),

*but:*

Since change ARROW-1012, it ignores the {{row_group_indices}} argument.

The {{row_group_indices_}} member of the {{RowGroupRecordBatchReader}} that is 
created is never used.

Either the documentation should be changed, or the behavior should be reverted. 
I would prefer the latter, as I no longer know how to make sure that only 
specific row groups are read...
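
For comparison, a sketch of the workaround I would expect at the Python level 
(hypothetical file name; pyarrow.parquet can drive row-group reads explicitly):

{code:java}
import pyarrow.parquet as pq

# Read only row groups 3 and 1, in that order, by issuing the reads
# ourselves instead of relying on GetRecordBatchReader().
pf = pq.ParquetFile("data.parquet")  # assumed file name
tables = [pf.read_row_group(i) for i in (3, 1)]
{code}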



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6004) [C++] CSV reader ignore_empty_lines option doesn't handle empty lines

2019-07-22 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6004:
--

 Summary: [C++] CSV reader ignore_empty_lines option doesn't handle 
empty lines
 Key: ARROW-6004
 URL: https://issues.apache.org/jira/browse/ARROW-6004
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson


Followup to https://issues.apache.org/jira/browse/ARROW-5747. If 
{{ignore_empty_lines}} is false and there are empty lines, it fails to parse 
(again, with {{Invalid: Empty CSV file}}).
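
A minimal sketch of the failing setup (my own invented data, assuming the 
option as exposed in pyarrow.csv):

{code:java}
import pyarrow as pa
from pyarrow import csv

# Input with an empty line between two data rows.
data = pa.py_buffer(b"a,b\n1,2\n\n3,4\n")
table = csv.read_csv(
    pa.BufferReader(data),
    parse_options=csv.ParseOptions(ignore_empty_lines=False),
)
# Expected: the empty line is surfaced as a row (or a clear error);
# observed: pyarrow.lib.ArrowInvalid: Empty CSV file
{code}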



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6003) [C++] Better input validation and error messaging in CSV reader

2019-07-22 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6003:
--

 Summary: [C++] Better input validation and error messaging in CSV 
reader
 Key: ARROW-6003
 URL: https://issues.apache.org/jira/browse/ARROW-6003
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson


Followup to https://issues.apache.org/jira/browse/ARROW-5747. The error 
message(s) are not great when you give bad input. For example, if I give too 
many or too few {{column_names}}, the error I get is {{Invalid: Empty CSV 
file}}. In fact, that's about the only error message I've seen from the CSV 
reader, no matter what I've thrown at it.

It would be better if error messages were more specific so that I as a user 
might know how to fix my bad input.
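
A sketch of the mismatched column_names case described above (invented data, 
and assuming the column_names option is exposed in pyarrow.csv.ReadOptions):

{code:java}
import pyarrow as pa
from pyarrow import csv

# Two column names supplied for three columns of data.
buf = pa.py_buffer(b"1,2,3\n4,5,6\n")
table = csv.read_csv(
    pa.BufferReader(buf),
    read_options=csv.ReadOptions(column_names=["a", "b"]),
)
# Observed: pyarrow.lib.ArrowInvalid: Empty CSV file
# A message naming the column-count mismatch would be far more useful.
{code}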



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6002) [C++][Gandiva] TestCastFunctions does not test int64 casting

2019-07-22 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6002:


 Summary: [C++][Gandiva] TestCastFunctions does not test int64 
casting
 Key: ARROW-6002
 URL: https://issues.apache.org/jira/browse/ARROW-6002
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Benjamin Kietzman


{{outputs[2]}} (corresponding to the cast from float32) is checked twice 
(https://github.com/apache/arrow/pull/4817/files#diff-2e911c4dcae01ea2d3ce200892a0179aR478)
while {{outputs[1]}} (corresponding to the cast from int64) is not checked.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-22 Thread Krisztián Szűcs
Added a warning about that.

On Mon, Jul 22, 2019 at 9:38 PM Wes McKinney  wrote:

> hi folks -- we had a small snafu with the post-release tasks because
> this patch release did not follow our normal release procedure where
> the release candidate is usually based off of master.
>
> When we prepare a patch release that is based on backported commits
> into a maintenance branch, we DO NOT need to rebase master or any PRs.
> So we need to update the release management instructions to indicate
> that these steps should be skipped for future patch releases (or any
> release that isn't based on master at some point in time).
>
> - Wes
>
> On Mon, Jul 22, 2019 at 10:46 AM Krisztián Szűcs
>  wrote:
> >
> > Hi,
> >
> > The 0.14.1 RC0 vote carries with 4 binding +1 (and 1 non-binding +1)
> votes.
> > Thanks for helping verify the RC!
> > I'm moving on to the post-release tasks [1] once github resolves its
> > partially
> > degraded service issues [2]. Any help is appreciated.
> >
> > - Krisztian
> >
> > [1]:
> >
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
> > [2]: https://www.githubstatus.com/
> >
> > On Mon, Jul 22, 2019 at 4:23 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Ran both the source and binary verification scripts on macOS Mojave.
> > > Also tested the wheels in python docker containers and on OSX.
> > >
> > > On Thu, Jul 18, 2019 at 11:48 PM Sutou Kouhei 
> wrote:
> > >
> > >> +1 (binding)
> > >>
> > >> I ran the following on Debian GNU/Linux sid:
> > >>
> > >>   * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
> > >> CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh source
> > >> 0.14.1 0
> > >>   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
> > >>
> > >> with:
> > >>
> > >>   * gcc (Debian 8.3.0-7) 8.3.0
> > >>   * openjdk version "1.8.0_212"
> > >>   * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741)
> [x86_64-linux]
> > >>   * Node.JS v12.1.0
> > >>   * go version go1.11.6 linux/amd64
> > >>   * nvidia-cuda-dev 9.2.148-7
> > >>
> > >> I sometimes re-ran the C# tests with the following command line:
> > >>
> > >>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1
> > >> dev/release/verify-release-candidate.sh source 0.14.1 0
> > >>
> > >> But "sourcelink test" is always failed:
> > >>
> > >>   + sourcelink test
> > >> artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
> > >>   The operation was canceled.
> > >>
> > >> I don't think that this is a blocker.
> > >>
> > >>
> > >> Thanks,
> > >> --
> > >> kou
> > >>
> > >> In <
> cahm19a5jpetwjj4uj-1zoqjzqdcejj-ky673uv83jtfcoyp...@mail.gmail.com>
> > >>   "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019
> 04:54:33
> > >> +0200,
> > >>   Krisztián Szűcs  wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I would like to propose the following release candidate (RC0) of
> Apache
> > >> > Arrow version 0.14.1. This is a patch release consisting of 47
> resolved
> > >> > JIRA issues[1].
> > >> >
> > >> > This release candidate is based on commit:
> > >> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> > >> >
> > >> > The source release rc0 is hosted at [3].
> > >> > The binary artifacts are hosted at [4][5][6][7].
> > >> > The changelog is located at [8].
> > >> >
> > >> > Please download, verify checksums and signatures, run the unit
> tests,
> > >> > and vote on the release. See [9] for how to validate a release
> > >> candidate.
> > >> >
> > >> > The vote will be open for at least 72 hours.
> > >> >
> > >> > [ ] +1 Release this as Apache Arrow 0.14.1
> > >> > [ ] +0
> > >> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> > >> >
> > >> > [1]:
> > >> >
> > >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
> > >> > [2]:
> > >> >
> > >>
> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
> > >> > [3]:
> > >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
> > >> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
> > >> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
> > >> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
> > >> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
> > >> > [8]:
> > >> >
> > >>
> https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
> > >> > [9]:
> > >> >
> > >>
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > >>
> > >
>


Re: [DISCUSS][JAVA] Designs & goals for readers/writers

2019-07-22 Thread Wes McKinney
Sort of tangentially related, but while we are on the topic:

Please, if you would, avoid checking binary test data files into the
main repository. Use https://github.com/apache/arrow-testing if you
truly need to check in binary data -- something to look out for in
code reviews

On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield  wrote:
>
> Hi Jacques,
> Thanks for the clarifications. I think the distinction is useful.
>
> If people want to write adapters for Arrow, I see that as useful but very
> > different than writing native implementations and we should try to create a
> > clear delineation between the two.
>
>
> What do you think about creating a "contrib" directory and moving the JDBC
> and Avro adapters into it? We should also probably provide more description
> in pom.xml to make it clear for downstream consumers.
>
> We should probably come up with a name other than adapters for
> readers/writers ("converters"?) and use it in the directory structure for
> the existing Orc implementation?
>
> Thanks,
> Micah
>
>
> On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau  wrote:
>
> > As I read through your responses, I think it might be useful to talk about
> > adapters versus native Arrow readers/writers. Adapters are something that
> > adapt an existing API to produce and/or consume Arrow data. A native
> > reader/writer is something that understands the format directly and does not
> > have intermediate representations or APIs the data moves through beyond
> > those that need to be used to complete work.
> >
> > If people want to write adapters for Arrow, I see that as useful but very
> > different than writing native implementations and we should try to create a
> > clear delineation between the two.
> >
> > Further comments inline.
> >
> >
> >> Could you expand on what level of detail you would like to see a design
> >> document?
> >>
> >
> > A couple of paragraphs seems sufficient. These are the goals of the
> > implementation. We target existing functionality X. It is an adapter. Or it
> > is a native impl. These are the expected memory and processing
> > characteristics, etc.  I've never been one for a huge amount of design but
> > I've seen a number of recent patches appear where there is no upfront
> > discussion. Making sure that multiple people buy into a design is the best way to
> > ensure long-term maintenance and use.
> >
> >
> >> I think this should be optional (the same argument below about predicates
> >> apply so I won't repeat them).
> >>
> >
> > Per my comments above, maybe adapter versus native reader clarifies
> > things. For example, I've been working on a native Avro reader
> > implementation. It is little more than chicken scratch at this point but
> > its goals, vision and design are very different from the adapter that is
> > being produced at the moment.
> >
> >
> >> Can you clarify the intent of this objective.  Is it mainly to tie in with
> >> the existing Java arrow memory book keeping?  Performance?  Something
> >> else?
> >>
> >
> > Arrow is designed to be off-heap. If you have large variable amounts of
> > on-heap memory in an application, it starts to make it very hard to make
> > decisions about off-heap versus on-heap memory since those divisions are by
> > and large static in nature. It's fine for short lived applications but for
> > long lived applications, if you're working with a large amount of data, you
> > want to keep most of your memory in one pool. In the context of Arrow, this
> > is going to naturally be off-heap memory.
> >
> >
> >> I'm afraid this might lead to a "perfect is the enemy of the good"
> >> situation.  Starting off with a known good implementation of conversion to
> >> Arrow can allow us to both to profile hot-spots and provide a comparison
> >> of
> >> implementations to verify correctness.
> >>
> >
> > I'm not clear what message we're sending as a community if we produce low
> > performance components. The whole point of Arrow is to increase performance, not
> > decrease it. I'm targeting good, not perfect. At the same time, from my
> > perspective, Arrow development should not be approached in the same way
> > that general Java app development should be. If we hold a high standard,
> > we'll have fewer total integrations initially but I think we'll solve more
> > real world problems.
> >
> > There is also the question of how widely adoptable we want Arrow libraries
> >> to be.
> >> It isn't surprising to me that Impala's Avro reader is an order of
> >> magnitude faster than the stock Java one.  As far as I know, Impala's is a
> >> C++ implementation that does JIT with LLVM.  We could try to use it as a
> >> basis for converting to Arrow but I think this might limit adoption in
> >> some
> >> circumstances.  Some organizations/people might be hesitant to adopt the
> >> technology due to:
> >> 1.  Use of JNI.
> >> 2.  Use of LLVM to do JIT.
> >>
> >> It seems that as long as we have a reasonably general interface to
> >> data-sources we should be able to optimize/refactor aggressively

Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-22 Thread Wes McKinney
hi folks -- we had a small snafu with the post-release tasks because
this patch release did not follow our normal release procedure where
the release candidate is usually based off of master.

When we prepare a patch release that is based on backported commits
into a maintenance branch, we DO NOT need to rebase master or any PRs.
So we need to update the release management instructions to indicate
that these steps should be skipped for future patch releases (or any
release that isn't based on master at some point in time).

- Wes

On Mon, Jul 22, 2019 at 10:46 AM Krisztián Szűcs
 wrote:
>
> Hi,
>
> The 0.14.1 RC0 vote carries with 4 binding +1 (and 1 non-binding +1) votes.
> Thanks for helping verify the RC!
> I'm moving on to the post-release tasks [1] once github resolves its
> partially
> degraded service issues [2]. Any help is appreciated.
>
> - Krisztian
>
> [1]:
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
> [2]: https://www.githubstatus.com/
>
> On Mon, Jul 22, 2019 at 4:23 PM Krisztián Szűcs 
> wrote:
>
> > +1 (binding)
> >
> > Ran both the source and binary verification scripts on macOS Mojave.
> > Also tested the wheels in python docker containers and on OSX.
> >
> > On Thu, Jul 18, 2019 at 11:48 PM Sutou Kouhei  wrote:
> >
> >> +1 (binding)
> >>
> >> I ran the followings on Debian GNU/Linux sid:
> >>
> >>   * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
> >> CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh source
> >> 0.14.1 0
> >>   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
> >>
> >> with:
> >>
> >>   * gcc (Debian 8.3.0-7) 8.3.0
> >>   * openjdk version "1.8.0_212"
> >>   * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741) [x86_64-linux]
> >>   * Node.JS v12.1.0
> >>   * go version go1.11.6 linux/amd64
> >>   * nvidia-cuda-dev 9.2.148-7
> >>
> >> I re-run C# tests by the following command line sometimes:
> >>
> >>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1
> >> dev/release/verify-release-candidate.sh source 0.14.1 0
> >>
> >> But "sourcelink test" is always failed:
> >>
> >>   + sourcelink test
> >> artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
> >>   The operation was canceled.
> >>
> >> I don't think that this is a broker.
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>   "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019 04:54:33
> >> +0200,
> >>   Krisztián Szűcs  wrote:
> >>
> >> > Hi,
> >> >
> >> > I would like to propose the following release candidate (RC0) of Apache
> >> > Arrow version 0.14.1. This is a patch release consiting of 47 resolved
> >> > JIRA issues[1].
> >> >
> >> > This release candidate is based on commit:
> >> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> >> >
> >> > The source release rc0 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7].
> >> > The changelog is located at [8].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> >> > and vote on the release. See [9] for how to validate a release
> >> candidate.
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow 0.14.1
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> >> >
> >> > [1]:
> >> >
> >> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
> >> > [2]:
> >> >
> >> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
> >> > [3]:
> >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
> >> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
> >> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
> >> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
> >> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
> >> > [8]:
> >> >
> >> https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
> >> > [9]:
> >> >
> >> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> >>
> >


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-22 Thread Wes McKinney
I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou  wrote:
>
>
> Hello,
>
> Recently we've discussed breaking the IPC format to fix a long-standing
> alignment issue.  See this discussion:
> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>
> Should we first do a 0.15.0 in order to get those format fixes right?
> Once that is settled, we can then move on to the 1.0.0 release.
>
> Regards
>
> Antoine.


Re: [DISCUSS] [Gandiva] Adding query plan to Gandiva protobuf definition

2019-07-22 Thread Andy Grove
Thanks, Jacques and Wes.

I agree that this needs discussion and a design document. I have put
together this Google doc to get the ball rolling:

https://docs.google.com/document/d/1Uv1FmPs7uYMLoJUH1EF0oxm-ujtz1h1tJFl0zN60TIg/edit?usp=sharing

Thanks,

Andy.

On Mon, Jul 22, 2019 at 6:39 AM Wes McKinney  wrote:

> I agree that I'd also like to see a design / goals document to clarify
> the scope (and the non-goals, too)
>
> In general, I would hesitate to add anything higher level to the
> Gandiva protos -- there is already confusion from people who believe
> that Gandiva is a "query engine" where it is actually a query engine
> subsystem (execution kernel compiler/generator). See for example the
> thread just a week ago [1]
>
> If you add higher level query plan structures to the proto file, I
> fear it will generate more confusion. If the plan ends up being to
> have a larger proto file, it would be good to move it someplace that
> isn't Gandiva-specific and clearly indicate that Gandiva is
> responsible for code generation for certain structures in the proto.
> We can also address some of these issues through better project
> documentation and READMEs.
>
> [1]:
> https://lists.apache.org/thread.html/212db05e98549f5938f3af41dade51d7a3e47255178a6c76652adc79@%3Cdev.arrow.apache.org%3E
>
> On Sun, Jul 21, 2019 at 4:23 PM Jacques Nadeau  wrote:
> >
> > Some thoughts:
> >
> >1. I think it would make sense to start with a design
> >discussion/document about the goals and what we think is
> implementation
> >specific versus generally applicable. In general, a distributed
> execution
> >plan seems pretty implementation specific. My sense is that you'd
> never run
> >a distributed execution plan outside of the knowledge of the
> particular
> >execution environment it is running within. Part of that is that
> >distributed execution usually also includes lifecycle management. For
> example, if
> >you're going to have work-stealing  or early termination in your
> execution
> >engine, those are operations that stitch into execution coordination
> (and
> >thus a specific impl). If distributed execution is always engine
> specific,
> >why try to create a general one for multiple engines?
> >2. With regards to making Gandiva protos more generic: I'd like to see
> >more clarity on #1. On one hand, extending things so they are reused
> is
> >good. On the other hand, the more consumers of an interface, the more
> >overloads/non-impls you have for each consumer of it.
> >
> >
> > On Sat, Jul 20, 2019 at 10:18 AM Andy Grove 
> wrote:
> >
> > > I recently created a small PoC of distributed query execution on
> Kubernetes
> > > using the Rust implementation of Apache Arrow and the DataFusion query
> > > engine [1].
> > >
> > > This PoC uses gRPC to pass query plans to executor nodes and the proto
> file
> > > [2] is largely based on the Gandiva proto file [3]. The PoC is very
> basic
> > > but I think it demonstrates the power of having query plans as part of
> the
> > > proto file. This would allow distributed applications to be built
> based on
> > > Arrow standards in a way that is not dependent on any particular
> > > implementation of Arrow and would even allow mixing and matching query
> > > engines.
> > >
> > > I wanted to start this discussion to see what the appetite is here for
> > > accepting PRs to add query plan structures to the Gandiva proto file
> and
> > > also whether we can consider making this an Arrow proto file rather
> than
> > > being Gandiva-specific, over time.
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > [1] https://github.com/andygrove/ballista
> > >
> > > [2]
> > >
> > >
> https://github.com/andygrove/ballista/blob/master/proto/ballista/ballista.proto
> > >
> > > [3]
> > >
> > >
> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> > >
>


[jira] [Created] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)
David Lee created ARROW-6001:


 Summary: Add from_pydict(), from_pylist() and to_pylist() to 
pyarrow.Table + improve pandas.to_dict()
 Key: ARROW-6001
 URL: https://issues.apache.org/jira/browse/ARROW-6001
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: David Lee


I noticed that pyarrow.Table.to_pydict() exists, but 
pyarrow.Table.from_pydict() doesn't. There is a proposed ticket to create one, 
but it doesn't take into account potential mismatches between column order 
and number of columns.

I've attached some code I've written which I've been using to convert arrow 
tables to ordered dictionaries and to lists of dictionaries. I've also included 
an example where this can be used to speed up pandas.to_dict() by a factor of 
20x.

 
{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table


def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row]
                                  for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns}
                  for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row]
                   for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist


def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table


def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow_table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row]
                            for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
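
For illustration, a hypothetical round trip through the helpers above (the 
sample rows are invented, and pyarrow is assumed to be imported as pa):

{code:java}
rows = [{"id": 1, "name": "a"}, {"id": 2}]        # second row misses "name"
table = from_pylist(rows, names=["id", "name"])   # missing keys become null
print(table.num_rows)       # 2
print(to_pylist(table)[1])  # {'id': 2, 'name': None}
{code}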
Here are my benchmarks comparing pandas -> arrow -> python against pandas.to_dict():

 
{code:java}
# benchmark panda conversion to python objects.
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))
start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))
start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in 
arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))
star

[jira] [Created] (ARROW-6000) [Python] Expose LargeBinaryType and LargeStringType

2019-07-22 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6000:
-

 Summary: [Python] Expose LargeBinaryType and LargeStringType
 Key: ARROW-6000
 URL: https://issues.apache.org/jira/browse/ARROW-6000
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 1.0.0
Reporter: Antoine Pitrou
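
As a sketch of what this could look like, the Python exposure might mirror the 
existing pa.binary()/pa.string() factories. The names below are assumptions, 
not a settled API:

{code:java}
import pyarrow as pa

# Assumed factory names, by analogy with pa.binary() and pa.string():
t_bin = pa.large_binary()   # binary with 64-bit offsets
t_str = pa.large_string()   # utf8 with 64-bit offsets
field = pa.field("payload", t_bin)
{code}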






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-22 Thread Antoine Pitrou


Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:
https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right?
Once that is settled, we can then move on to the 1.0.0 release.

Regards

Antoine.


Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-22 Thread Antoine Pitrou


On 22/07/2019 18:52, Wes McKinney wrote:
> 
> Probably the way is to introduce async-capable read APIs into the file
> interfaces. For example:
> 
> file->ReadAsyncBlock(thread_ctx, ...);
> 
> That way the file implementation can decide whether asynchronous logic
> is actually needed.
> I doubt very much that a one-size-fits-all
> concurrency solution can be developed -- in some applications
> coarse-grained IO and CPU task scheduling may be warranted, but we
> need to have a solution for finer-grained scenarios where
> 
> * In the memory-mapped case, there is no overhead and
> * The programming model is not too burdensome to the library developer

Well, the asynchronous I/O programming model *will* be burdensome at
least until C++ gets coroutines (which may happen in C++20, and
therefore be usable somewhere around 2024 for Arrow?).

Regards

Antoine.


Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-22 Thread Wes McKinney
On Mon, Jul 22, 2019 at 11:42 AM Antoine Pitrou  wrote:
>
> On Mon, 22 Jul 2019 11:07:43 -0500
> Wes McKinney  wrote:
> >
> > Right, which is why I'm suggesting a simple model that lets threads
> > waiting on IO yield so that other threads can execute.
>
> If you are doing memory-mapped IO, how do you plan to tell whether and
> when you'll be going to wait for IO?
>

Probably the way is to introduce async-capable read APIs into the file
interfaces. For example:

file->ReadAsyncBlock(thread_ctx, ...);

That way the file implementation can decide whether asynchronous logic
is actually needed. I doubt very much that a one-size-fits-all
concurrency solution can be developed -- in some applications
coarse-grained IO and CPU task scheduling may be warranted, but we
need to have a solution for finer-grained scenarios where

* In the memory-mapped case, there is no overhead and
* The programming model is not too burdensome to the library developer

> Regards
>
> Antoine.
>
>


Re: Error building cuDF on new Arrow with std::variant backport

2019-07-22 Thread Keith Kraus
We're working on that now, will report back once we have something more 
concrete to act on. Thanks!

-Keith

On 7/22/19, 12:51 PM, "Antoine Pitrou"  wrote:


Hi Keith,

Can you try to further reduce your reproducer until you find
the offending construct?

Regards

Antoine.


On 22/07/2019 18:46, Keith Kraus wrote:
> I temporarily removed the csr related code that has the namespace clash
> and confirmed that the same compilation warnings and errors still occur.
> 
> On 7/20/19, 1:03 AM, "Micah Kornfield"  wrote:
> 
> The namespace collision is a definite possibility, especially if you are
> using g++ which seems to be less smart about inferring types vs methods
> than clang is.
> 
> On Fri, Jul 19, 2019 at 9:28 PM Paul Taylor  wrote:
> 
> > Hi Micah,
> >
> > We were able to build Arrow standalone with both c++ 11 and 14, but cuDF
> > needs c++ 14.
> >
> > I found this line[1] in one of our cuda files after sending and realized
> > we may have a collision/polluted namespace. Does that sound like a
> > possibility?
> >
> > Thanks,
> > Paul
> >
> > 1.
> > https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30
> >
> > On 7/19/19 8:41 PM, Micah Kornfield wrote:
> >
> > Hi Paul,
> > This actually looks like it might be a problem with arrow-4800.   Did the
> > build of arrow use c++14 or c++11?
> >
> > Thanks,
> > Micah
> >
> > On Friday, July 19, 2019, Paul Taylor  wrote:
> >
> >> We're updating cuDF to Arrow 0.14 but encountering errors building that
> >> look related to PR #4259 . We
> >> can build Arrow itself, but we can't build cuDF when we include Arrow
> >> headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.
> >>
> >> Has anyone seen these before or know of a fix?
> >>
> >> Thanks,
> >>
> >> Paul
> >>
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>>
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
> >>> 'void arrow::Result<T>::AssignVariant(mpark::variant<T, arrow::Status, const char*>&&)':
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
> >>> expected primary-expression before 'const'
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
> >>> expected ')' before 'const'
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
> >>> 'void arrow::Result<T>::AssignVariant(const mpark::variant<T, arrow::Status, const char*>&)':
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:24: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:32: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
> >>> expected primary-expression before 'const'
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
> >>> expected ')' before 'const'

Re: Error building cuDF on new Arrow with std::variant backport

2019-07-22 Thread Antoine Pitrou


Hi Keith,

Can you try to further reduce your reproducer until you find
the offending construct?

Regards

Antoine.


On 22/07/2019 18:46, Keith Kraus wrote:
> I temporarily removed the csr related code that has the namespace clash and 
> confirmed that the same compilation warnings and errors still occur.
> 
> On 7/20/19, 1:03 AM, "Micah Kornfield"  wrote:
> 
> The namespace collision is a definite possibility, especially if you are
> using g++ which seems to be less smart about inferring types vs methods
> than clang is.
> 
> On Fri, Jul 19, 2019 at 9:28 PM Paul Taylor 
> wrote:
> 
> > Hi Micah,
> >
> > We were able to build Arrow standalone with both c++ 11 and 14, but cuDF
> > needs c++ 14.
> >
> > I found this line[1] in one of our cuda files after sending and realized
> > we may have a collision/polluted namespace. Does that sound like a
> > possibility?
> >
> > Thanks,
> > Paul
> >
> > 1.
> > 
> https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30
> >
> > On 7/19/19 8:41 PM, Micah Kornfield wrote:
> >
> > Hi Paul,
> > This actually looks like it might be a problem with arrow-4800.   Did the
> > build of arrow use c++14 or c++11?
> >
> > Thanks,
> > Micah
> >
> > On Friday, July 19, 2019, Paul Taylor  wrote:
> >
> >> We're updating cuDF to Arrow 0.14 but encountering errors building that
> >> look related to PR #4259 . We
> >> can build Arrow itself, but we can't build cuDF when we include Arrow
> >> headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.
> >>
> >> Has anyone seen these before or know of a fix?
> >>
> >> Thanks,
> >>
> >> Paul
> >>
> >> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>>
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
> >>> 'void arrow::Result<T>::AssignVariant(mpark::variant<T, arrow::Status, const char*>&&)':
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
> >>> expected primary-expression before 'const'
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
> >>> expected ')' before 'const'
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
> >>> 'void arrow::Result<T>::AssignVariant(const mpark::variant<T, arrow::Status, const char*>&)':
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:24: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:32: error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
> >>> expected primary-expression before 'const'
> >>>  variant_.~variant<T, Status, const char*>();
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
> >>> expected ')' before 'const'
> >>>
> >>
> >>
> >
> 
> 
> 
> ---
> This email message is for the sole use of the intended recipient(s) and may 
> contain
> confidential information.  Any unauthorized review, use, disclosure or 
> distribution
> is prohibited.  If you are not the intended recipient, please contact the 
> sender by
> reply email and destroy all copies of the original message.
> ---

Re: Error building cuDF on new Arrow with std::variant backport

2019-07-22 Thread Keith Kraus
I temporarily removed the csr related code that has the namespace clash and 
confirmed that the same compilation warnings and errors still occur.

On 7/20/19, 1:03 AM, "Micah Kornfield"  wrote:

The namespace collision is a definite possibility, especially if you are
using g++ which seems to be less smart about inferring types vs methods
than clang is.

On Fri, Jul 19, 2019 at 9:28 PM Paul Taylor  wrote:

> Hi Micah,
>
> We were able to build Arrow standalone with both c++ 11 and 14, but cuDF
> needs c++ 14.
>
> I found this line[1] in one of our cuda files after sending and realized
> we may have a collision/polluted namespace. Does that sound like a
> possibility?
>
> Thanks,
> Paul
>
> 1.
> 
https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30
>
> On 7/19/19 8:41 PM, Micah Kornfield wrote:
>
> Hi Paul,
> This actually looks like it might be a problem with arrow-4800.   Did the
> build of arrow use c++14 or c++11?
>
> Thanks,
> Micah
>
> On Friday, July 19, 2019, Paul Taylor  wrote:
>
>> We're updating cuDF to Arrow 0.14 but encountering errors building that
>> look related to PR #4259 . We
>> can build Arrow itself, but we can't build cuDF when we include Arrow
>> headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.
>>
>> Has anyone seen these before or know of a fix?
>>
>> Thanks,
>>
>> Paul
>>
>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
>>> warning: attribute does not apply to any entity
>>>
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
>>> 'void arrow::Result<T>::AssignVariant(mpark::variant<T, arrow::Status, const char*>&&)':
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant<T, Status, const char*>();
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant<T, Status, const char*>();
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
>>> expected primary-expression before 'const'
>>>  variant_.~variant<T, Status, const char*>();
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
>>> expected ')' before 'const'
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
>>> 'void arrow::Result<T>::AssignVariant(const mpark::variant<T, arrow::Status, const char*>&)':
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:24: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant<T, Status, const char*>();
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:32: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant<T, Status, const char*>();
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
>>> expected primary-expression before 'const'
>>>  variant_.~variant<T, Status, const char*>();
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
>>> expected ')' before 'const'
>>>
>>
>>
>



---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
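
Judging from the error text in this thread, the offending construct appears
to be the pseudo-destructor call with explicit template arguments in
result.h. A reduced sketch of the same shape, using std::variant (C++17)
and made-up names rather than the actual Arrow code, together with the
usual type-alias workaround:

#include <new>
#include <utility>
#include <variant>

template <typename T>
struct Holder {
  std::variant<T, int, const char*> v_;

  void Reset() {
    // Some g++ versions reject explicit template arguments in a
    // pseudo-destructor call of this shape, which clang accepts:
    //   v_.~variant<T, int, const char*>();  // error on g++
    // Naming the type through an alias sidesteps the parse problem:
    using VariantType = std::variant<T, int, const char*>;
    v_.~VariantType();                                  // destroy in place
    new (&v_) VariantType(std::in_place_type<int>, 0);  // re-construct
  }
};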


Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-22 Thread Antoine Pitrou
On Mon, 22 Jul 2019 11:07:43 -0500
Wes McKinney  wrote:
> 
> Right, which is why I'm suggesting a simple model to allow threads
> that are waiting on IO to allow other threads to execute.

If you are doing memory-mapped IO, how do you plan to tell whether and
when you'll have to wait for IO?

Regards

Antoine.




[jira] [Created] (ARROW-5999) [C++] Required header files missing when built with -DARROW_DATASET=OFF

2019-07-22 Thread Steven Fackler (JIRA)
Steven Fackler created ARROW-5999:
-

 Summary: [C++] Required header files missing when built with 
-DARROW_DATASET=OFF
 Key: ARROW-5999
 URL: https://issues.apache.org/jira/browse/ARROW-5999
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.0
Reporter: Steven Fackler


 
{noformat}
In file included from /opt/arrow/include/arrow/type_fwd.h:23:0,
 from /opt/arrow/include/arrow/type.h:29,
 from /opt/arrow/include/arrow/array.h:32,
 from /opt/arrow/include/arrow/api.h:23,
 from src/bindings.cc:1:
/opt/arrow/include/arrow/util/iterator.h:20:10: fatal error: 
arrow/dataset/visibility.h: No such file or directory
 #include "arrow/dataset/visibility.h"
  ^~~~{noformat}
 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-22 Thread Wes McKinney
On Mon, Jul 22, 2019 at 10:49 AM Antoine Pitrou  wrote:
>
>
> > On 18/07/2019 00:25, Wes McKinney wrote:
> >
> > * We look forward in the stream until we find a complete Thrift data
> > page header. This may trigger 0 or more (possibly multiple) Read calls
> > to the underlying "file" handle. In the default case, the data is all
> > actually in memory so the reads are zero copy buffer slices.
>
> If the file is memory-mapped, it doesn't mean everything is in RAM.
> Starting to read a page may incur a page fault and some unexpected
> blocking I/O.
>
> The solution to hide I/O costs could be to use madvise() (in which case
> the background read is done by the kernel without any need for
> user-visible IO threads).  Similarly, on a regular file one can use
> fadvise().  This may mean that the whole issue of "how to hide I/O for a
> given source" may be stream-specific (for example, if a file is
> S3-backed, perhaps you want to issue a HTTP fetch in background?).
>

I think we need to be designing around remote filesystems with
unpredictable latency and throughput. Anyone involved in data
warehousing systems in the cloud is going to be intimately familiar
with these issues -- a system that's designed around local disk and
memory-mapping generally isn't going to adapt well to remote
filesystems.

> > # Model B (CPU and IO work split into tasks that execute on different
> > thread queues)
> >
> > Pros
> > - Not sure
> >
> > Cons
> > - Could cause performance issues if the IO tasks are mostly free (e.g.
> > due to buffering)
>
> In model B, the decision of whether to use a background thread or
> some other means of hiding I/O costs could also be pushed down into the
> stream implementation.
>
> > I think we need to investigate some asynchronous C++ programming libraries 
> > like
> >
> > https://github.com/facebook/folly/tree/master/folly/fibers
> >
> > to see how organizations with mature C++ practices are handling these
> > issues from a programming model standpoint
>
> Well, right now our model is synchronous I/O.  If we want to switch to
> asynchronous I/O we'll have to redesign a lot of APIs.  Also, since C++
> doesn't have a convenient story for asynchronous I/O or coroutines
> (yet), this will make programming significantly more painful,
> which is (IMO) something we'd like to avoid.  And I'm not mentioning the
> problem of mapping the C++ asynchronous I/O model on the corresponding
> Python primitives...
>

Right, which is why I'm suggesting a simple model to allow threads
that are waiting on IO to allow other threads to execute. Currently
they block.

>
> More generally, I'm wary of significantly complicating our I/O handling
> until we have reliable reproducers of I/O-originated performance issues
> with Arrow.
>

If it helps, I can spend some time implementing Model A as it relates
to reading Parquet files in parallel. If you introduce a small amount
of latency into reads (10-50ms per read call -- such as you would
experience using Amazon S3) the current synchronous approach will have
significant IO-wait-related performance issues.
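
A low-effort way to reproduce that effect locally is to wrap a fast source
and inject a fixed delay per read call. This reuses the hypothetical
RandomAccessSource/Buffer sketch from the threading discussion earlier in
this digest (illustrative only, not an Arrow API):

#include <chrono>
#include <cstdint>
#include <memory>
#include <thread>
#include <utility>

// Adds a fixed delay to every read, roughly approximating the per-request
// round trip of an object store such as Amazon S3.
class HighLatencySource : public RandomAccessSource {
 public:
  HighLatencySource(std::shared_ptr<RandomAccessSource> wrapped,
                    std::chrono::milliseconds latency)
      : wrapped_(std::move(wrapped)), latency_(latency) {}

  std::shared_ptr<Buffer> ReadAt(int64_t offset, int64_t nbytes) override {
    std::this_thread::sleep_for(latency_);  // simulated network round trip
    return wrapped_->ReadAt(offset, nbytes);
  }

 private:
  std::shared_ptr<RandomAccessSource> wrapped_;
  std::chrono::milliseconds latency_;
};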

> Regards
>
> Antoine.


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-22 Thread Antoine Pitrou
On Mon, 22 Jul 2019 08:40:08 -0700
Brian Hulette  wrote:
> To me, the most important aspect of this proposal is the addition of sparse
> encodings, and I'm curious if there are any more objections to that
> specifically. So far I believe the only one is that it will make
> computation libraries more complicated. This is absolutely true, but I
> think it's worth that cost.

It's not just computation libraries, it's any library peeking inside
Arrow data.  Currently, the Arrow data types are simple, which makes it
easy and non-intimidating to build data processing utilities around
them.  If we start adding sophisticated encodings, we also raise the
cost of supporting Arrow for third-party libraries.

Regards

Antoine.




Re: [DISCUSS][C++][Proposal] Threading engine for Arrow

2019-07-22 Thread Antoine Pitrou


On 18/07/2019 00:25, Wes McKinney wrote:
> 
> * We look forward in the stream until we find a complete Thrift data
> page header. This may trigger 0 or more (possibly multiple) Read calls
> to the underlying "file" handle. In the default case, the data is all
> actually in memory so the reads are zero copy buffer slices.

If the file is memory-mapped, it doesn't mean everything is in RAM.
Starting to read a page may incur a page fault and some unexpected
blocking I/O.

The solution to hide I/O costs could be to use madvise() (in which case
the background read is done by the kernel without any need for
user-visible IO threads).  Similarly, on a regular file one can use
fadvise().  This may mean that the whole issue of "how to hide I/O for a
given source" may be stream-specific (for example, if a file is
S3-backed, perhaps you want to issue a HTTP fetch in background?).
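
A minimal POSIX sketch of that approach (Linux-flavored; error handling is
omitted and the helper name is illustrative, not an Arrow API):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only and hint the kernel to start readahead, so that
// page faults during a later sequential scan are less likely to block.
void* MapWithReadahead(const char* path, size_t* length) {
  int fd = open(path, O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  *length = static_cast<size_t>(st.st_size);
  posix_fadvise(fd, 0, st.st_size, POSIX_FADV_SEQUENTIAL);  // fd-level hint
  void* addr = mmap(nullptr, *length, PROT_READ, MAP_PRIVATE, fd, 0);
  posix_madvise(addr, *length, POSIX_MADV_WILLNEED);  // prefetch the mapping
  close(fd);
  return addr;
}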

> # Model B (CPU and IO work split into tasks that execute on different
> thread queues)
> 
> Pros
> - Not sure
> 
> Cons
> - Could cause performance issues if the IO tasks are mostly free (e.g.
> due to buffering)

In model B, the decision of whether to use a background thread or
some other means of hiding I/O costs could also be pushed down into the
stream implementation.

> I think we need to investigate some asynchronous C++ programming libraries 
> like
> 
> https://github.com/facebook/folly/tree/master/folly/fibers
> 
> to see how organizations with mature C++ practices are handling these
> issues from a programming model standpoint

Well, right now our model is synchronous I/O.  If we want to switch to
asynchronous I/O we'll have to redesign a lot of APIs.  Also, since C++
doesn't have a convenient story for asynchronous I/O or coroutines
(yet), this will make programming significantly more painful,
which is (IMO) something we'd like to avoid.  And I'm not mentioning the
problem of mapping the C++ asynchronous I/O model on the corresponding
Python primitives...


More generally, I'm wary of significantly complicating our I/O handling
until we have reliable reproducers of I/O-originated performance issues
with Arrow.

Regards

Antoine.


Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-22 Thread Krisztián Szűcs
Hi,

The 0.14.1 RC0 vote carries with 4 binding +1 (and 1 non-binding +1) votes.
Thanks for helping verify the RC!
I'm moving on to the post-release tasks [1] once GitHub resolves its
partially degraded service issues [2]. Any help is appreciated.

- Krisztian

[1]:
https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-Post-releasetasks
[2]: https://www.githubstatus.com/

On Mon, Jul 22, 2019 at 4:23 PM Krisztián Szűcs  wrote:

> +1 (binding)
>
> Ran both the source and binary verification scripts on macOS Mojave.
> Also tested the wheels in python docker containers and on OSX.
>
> On Thu, Jul 18, 2019 at 11:48 PM Sutou Kouhei  wrote:
>
>> +1 (binding)
>>
>> I ran the followings on Debian GNU/Linux sid:
>>
>>   * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
>> CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh source
>> 0.14.1 0
>>   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
>>
>> with:
>>
>>   * gcc (Debian 8.3.0-7) 8.3.0
>>   * openjdk version "1.8.0_212"
>>   * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741) [x86_64-linux]
>>   * Node.JS v12.1.0
>>   * go version go1.11.6 linux/amd64
>>   * nvidia-cuda-dev 9.2.148-7
>>
>> I sometimes re-run the C# tests with the following command line:
>>
>>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1
>> dev/release/verify-release-candidate.sh source 0.14.1 0
>>
>> But "sourcelink test" is always failed:
>>
>>   + sourcelink test
>> artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
>>   The operation was canceled.
>>
>> I don't think that this is a blocker.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019 04:54:33
>> +0200,
>>   Krisztián Szűcs  wrote:
>>
>> > Hi,
>> >
>> > I would like to propose the following release candidate (RC0) of Apache
>> > Arrow version 0.14.1. This is a patch release consisting of 47 resolved
>> > JIRA issues[1].
>> >
>> > This release candidate is based on commit:
>> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
>> >
>> > The source release rc0 is hosted at [3].
>> > The binary artifacts are hosted at [4][5][6][7].
>> > The changelog is located at [8].
>> >
>> > Please download, verify checksums and signatures, run the unit tests,
>> > and vote on the release. See [9] for how to validate a release
>> candidate.
>> >
>> > The vote will be open for at least 72 hours.
>> >
>> > [ ] +1 Release this as Apache Arrow 0.14.1
>> > [ ] +0
>> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
>> >
>> > [1]:
>> >
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
>> > [2]:
>> >
>> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
>> > [3]:
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
>> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
>> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
>> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
>> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
>> > [8]:
>> >
>> https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
>> > [9]:
>> >
>> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>>
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-22 Thread Brian Hulette
To me, the most important aspect of this proposal is the addition of sparse
encodings, and I'm curious if there are any more objections to that
specifically. So far I believe the only one is that it will make
computation libraries more complicated. This is absolutely true, but I
think it's worth that cost.

It's been suggested on this list and elsewhere [1] that sparse encodings
that can be operated on without fully decompressing should be added to the
Arrow format. The longer we continue to develop computation libraries
without considering those schemes, the harder it will be to add them.

[1]
https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
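
As a concrete example of what "operated on without fully decompressing" can
mean, a sum over a run-length-encoded column touches one run, not one row,
at a time. The RleRun layout below is purely illustrative, not a proposed
Arrow memory format:

#include <cstdint>
#include <vector>

struct RleRun {
  int64_t value;   // the repeated value
  int64_t length;  // number of logical rows the run covers
};

// One multiply-add per run instead of one add per row; the kernel never
// materializes the decoded array.
int64_t SumRle(const std::vector<RleRun>& runs) {
  int64_t total = 0;
  for (const auto& run : runs) {
    total += run.value * run.length;
  }
  return total;
}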


On Sat, Jul 13, 2019 at 9:35 AM Wes McKinney  wrote:

> On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou 
> wrote:
> >
> > On Fri, 12 Jul 2019 20:37:15 -0700
> > Micah Kornfield  wrote:
> > >
> > > If the latter, I wonder why Parquet cannot simply be used instead of
> > > > reinventing something similar but different.
> > >
> > > This is a reasonable point.  However there is a continuum here between
> > > file size and read and write times.  Parquet will likely always be the
> > > smallest with the largest times to convert to and from Arrow.  An uncompressed
> > > Feather/Arrow file will likely always take the most space but will have much
> > > faster conversion times.
> >
> > I'm curious whether the Parquet conversion times are inherent to the
> > Parquet format or due to inefficiencies in the implementation.
> >
>
> Parquet is fundamentally more complex to decode. Consider several
> layers of logic that must happen for values to end up in the right
> place
>
> * Data pages are usually compressed, and a column consists of many
> data pages each having a Thrift header that must be deserialized
> * Values are usually dictionary-encoded, dictionary indices are
> encoded using hybrid bit-packed / RLE scheme
> * Null/not-null is encoded in definition levels
> * Only non-null values are stored, so when decoding to Arrow, values
> have to be "moved into place"
>
> The current C++ implementation could certainly be made faster. One
> consideration with Parquet is that the files are much smaller, so when
> you are reading them over the network the effective end-to-end time
> including IO and deserialization will frequently win.
>
> > Regards
> >
> > Antoine.
> >
> >
>
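
To make the last two of Wes's decoding steps concrete, here is a simplified
sketch of "moving values into place" with definition levels (flat schema
with max definition level 1; validity is kept as one byte per value for
brevity, whereas Arrow proper packs it into bits):

#include <cstddef>
#include <cstdint>
#include <vector>

void ScatterWithDefLevels(const std::vector<int16_t>& def_levels,
                          const std::vector<int32_t>& non_null_values,
                          std::vector<int32_t>* out_values,
                          std::vector<uint8_t>* out_valid) {
  out_values->assign(def_levels.size(), 0);
  out_valid->assign(def_levels.size(), 0);
  std::size_t next = 0;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == 1) {                      // value is present
      (*out_values)[i] = non_null_values[next++];  // move into place
      (*out_valid)[i] = 1;                         // mark as valid
    }
  }
}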


Re: [DISCUSS][JAVA] Designs & goals for readers/writers

2019-07-22 Thread Micah Kornfield
Hi Jacques,
Thanks for the clarifications. I think the distinction is useful.

> If people want to write adapters for Arrow, I see that as useful but very
> different than writing native implementations and we should try to create a
> clear delineation between the two.


What do you think about creating a "contrib" directory and moving the JDBC
and AVRO adapters into it? We should also probably provide more description
in pom.xml to make it clear for downstream consumers.

We should probably come up with a name other than adapters for
readers/writer ("converters"?) and use it in the directory structure for
the existing Orc implementation?

Thanks,
Micah


On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau  wrote:

> As I read through your responses, I think it might be useful to talk about
> adapters versus native Arrow readers/writers. Adapters are something that
> adapt an existing API to produce and/or consume Arrow data. A native
> reader/writer is something that understand the format directly and does not
> have intermediate representations or APIs the data moves through beyond
> those that needs to be used to complete work.
>
> If people want to write adapters for Arrow, I see that as useful but very
> different than writing native implementations and we should try to create a
> clear delineation between the two.
>
> Further comments inline.
>
>
>> Could you expand on what level of detail you would like to see a design
>> document?
>>
>
> A couple of paragraphs seems sufficient. These are the goals of the
> implementation. We target existing functionality X. It is an adapter. Or it
> is a native impl. These are the expected memory and processing
> characteristics, etc.  I've never been one for huge amounts of design but
> I've seen a number of recent patches appear where there is no upfront
> discussion. Making sure that multiple people buy into a design is the best way to
> ensure long-term maintenance and use.
>
>
>> I think this should be optional (the same arguments below about predicates
>> apply, so I won't repeat them).
>>
>
> Per my comments above, maybe adapter versus native reader clarifies
> things. For example, I've been working on a native avro read
> implementation. It is little more than chicken scratch at this point but
> its goals, vision and design are very different than the adapter that is
> being produced atm.
>
>
>> Can you clarify the intent of this objective?  Is it mainly to tie in with
>> the existing Java arrow memory book keeping?  Performance?  Something
>> else?
>>
>
> Arrow is designed to be off-heap. If you have large variable amounts of
> on-heap memory in an application, it starts to make it very hard to make
> decisions about off-heap versus on-heap memory since those divisions are by
> and large static in nature. It's fine for short lived applications but for
> long lived applications, if you're working with a large amount of data, you
> want to keep most of your memory in one pool. In the context of Arrow, this
> is going to naturally be off-heap memory.
>
>
>> I'm afraid this might lead to a "perfect is the enemy of the good"
>> situation.  Starting off with a known good implementation of conversion to
>> Arrow can allow us both to profile hot-spots and provide a comparison of
>> implementations to verify correctness.
>>
>
> I'm not clear what message we're sending as a community if we produce
> low-performance components. The whole point of Arrow is to increase performance, not
> decrease it. I'm targeting good, not perfect. At the same time, from my
> perspective, Arrow development should not be approached in the same way
> that general Java app development should be. If we hold a high standard,
> we'll have less total integrations initially but I think we'll solve more
> real world problems.
>
> There is also the question of how widely adoptable we want Arrow libraries
>> to be.
>> It isn't surprising to me that Impala's Avro reader is an order of
>> magnitude faster than the stock Java one.  As far as I know Impala's is a
>> C++ implementation that does JIT with LLVM.  We could try to use it as a
>> basis for converting to Arrow but I think this might limit adoption in
>> some
>> circumstances.  Some organizations/people might be hesitant to adopt the
>> technology due to:
>> 1.  Use of JNI.
>> 2.  Use LLVM to do JIT.
>>
>> It seems that as long as we have a reasonably general interface to
>> data-sources we should be able to optimize/refactor aggressively when
>> needed.
>>
>
> This is somewhat the crux of the problem. It goes a little bit to who our
> consuming audience is and what we're trying to deliver. I'll also say that
> trying to build a high-quality implementation on top of a low-quality
> implementation or library-based adapter is worse than starting from
> scratch. I believe this is especially true in Java where developers are
> trained to trust hotspot and that things will be good enough. That is great
> in a web app but not in systems software where we (and I expect othe

Re: [VOTE] Release Apache Arrow 0.14.1 - RC0

2019-07-22 Thread Krisztián Szűcs
+1 (binding)

Ran both the source and binary verification scripts on macOS Mojave.
Also tested the wheels in python docker containers and on OSX.

On Thu, Jul 18, 2019 at 11:48 PM Sutou Kouhei  wrote:

> +1 (binding)
>
> I ran the followings on Debian GNU/Linux sid:
>
>   * TEST_CSHARP=0 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
> CUDA_TOOLKIT_ROOT=/usr dev/release/verify-release-candidate.sh source
> 0.14.1 0
>   * dev/release/verify-release-candidate.sh binaries 0.14.1 0
>
> with:
>
>   * gcc (Debian 8.3.0-7) 8.3.0
>   * openjdk version "1.8.0_212"
>   * ruby 2.7.0dev (2019-07-16T13:03:25Z trunk 6ab95fb741) [x86_64-linux]
>   * Node.JS v12.1.0
>   * go version go1.11.6 linux/amd64
>   * nvidia-cuda-dev 9.2.148-7
>
> I sometimes re-run the C# tests with the following command line:
>
>   TEST_DEFAULT=0 TEST_SOURCE=1 TEST_CSHARP=1
> dev/release/verify-release-candidate.sh source 0.14.1 0
>
> But "sourcelink test" is always failed:
>
>   + sourcelink test
> artifacts/Apache.Arrow/Release/netstandard1.3/Apache.Arrow.pdb
>   The operation was canceled.
>
> I don't think that this is a blocker.
>
>
> Thanks,
> --
> kou
>
> In 
>   "[VOTE] Release Apache Arrow 0.14.1 - RC0" on Wed, 17 Jul 2019 04:54:33
> +0200,
>   Krisztián Szűcs  wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> > Arrow version 0.14.1. This is a patch release consisting of 47 resolved
> > JIRA issues[1].
> >
> > This release candidate is based on commit:
> > 5f564424c71cef12619522cdde59be5f69b31b68 [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7].
> > The changelog is located at [8].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [9] for how to validate a release candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 0.14.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 0.14.1 because...
> >
> > [1]:
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.14.1
> > [2]:
> >
> https://github.com/apache/arrow/tree/5f564424c71cef12619522cdde59be5f69b31b68
> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.14.1-rc0
> > [4]: https://bintray.com/apache/arrow/centos-rc/0.14.1-rc0
> > [5]: https://bintray.com/apache/arrow/debian-rc/0.14.1-rc0
> > [6]: https://bintray.com/apache/arrow/python-rc/0.14.1-rc0
> > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.14.1-rc0
> > [8]:
> >
> https://github.com/apache/arrow/blob/5f564424c71cef12619522cdde59be5f69b31b68/CHANGELOG.md
> > [9]:
> >
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>


Re: [DISCUSS] [Gandiva] Adding query plan to Gandiva protobuf definition

2019-07-22 Thread Wes McKinney
I agree that I'd also like to see a design / goals document to clarify
the scope (and the non-goals, too).

In general, I would hesitate to add anything higher level to the
Gandiva protos -- there is already confusion from people who believe
that Gandiva is a "query engine" where it is actually a query engine
subsystem (execution kernel compiler/generator). See for example the
thread just a week ago [1]

If you add higher level query plan structures to the proto file, I
fear it will generate more confusion. If the plan ends up being to
have a larger proto file, it would be good to move it someplace that
isn't Gandiva-specific and clearly indicate that Gandiva is
responsible for code generation for certain structures in the proto.
We can also address some of these issues through better project
documentation and READMEs.

[1]: 
https://lists.apache.org/thread.html/212db05e98549f5938f3af41dade51d7a3e47255178a6c76652adc79@%3Cdev.arrow.apache.org%3E

On Sun, Jul 21, 2019 at 4:23 PM Jacques Nadeau  wrote:
>
> Some thoughts:
>
>1. I think it would make sense to start with a design
>discussion/document about the goals and what we think is implementation
>specific versus generally applicable. In general, a distributed execution
>plan seems pretty implementation specific. My sense is that you'd never run
>a distributed execution plan outside of the knowledge of the particular
>execution environment it is running within. Part of that is usually
>distributed execution also includes lifecycle management. For example, if
>you're going to have work-stealing  or early termination in your execution
>engine, those are operations that stitch into execution coordination (and
>thus a specific impl). If distributed execution is always engine specific,
>why try to create a general one for multiple engines?
>2. With regards to making Gandiva protos more generic: I'd like to see
>more clarity on #1. On one hand, extending things so they are reused is
>good. On the other hand, the more consumers of an interface, the more
>overloads/non-impls you have for each consumer of it.
>
>
> On Sat, Jul 20, 2019 at 10:18 AM Andy Grove  wrote:
>
> > I recently created a small PoC of distributed query execution on Kubernetes
> > using the Rust implementation of Apache Arrow and the DataFusion query
> > engine [1].
> >
> > This PoC uses gRPC to pass query plans to executor nodes and the proto file
> > [2] is largely based on the Gandiva proto file [3]. The PoC is very basic
> > but I think it demonstrates the power of having query plans as part of the
> > proto file. This would allow distributed applications to be built based on
> > Arrow standards in a way that is not dependent on any particular
> > implementation of Arrow and would even allow mixing and matching query
> > engines.
> >
> > I wanted to start this discussion to see what the appetite is here for
> > accepting PRs to add query plan structures to the Gandiva proto file and
> > also whether we can consider making this an Arrow proto file rather than
> > being Gandiva-specific, over time.
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/andygrove/ballista
> >
> > [2]
> >
> > https://github.com/andygrove/ballista/blob/master/proto/ballista/ballista.proto
> >
> > [3]
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> >


Re: [Memo] API Behavior changes

2019-07-22 Thread Wes McKinney
You could also use labels in JIRA to mark issues that introduce API changes

On Mon, Jul 22, 2019 at 4:42 AM Fan Liya  wrote:
>
> @Uwe L. Korn
>
> Thanks a lot for the good suggestion.
> I will create a new file to track the changes.
>
> Best,
> Liya Fan
>
> On Mon, Jul 22, 2019 at 5:03 PM Uwe L. Korn  wrote:
>
> > Hallo Liya,
> >
> > what about having this as part of the repository, e.g.
> > java/api-changes.md? We have an auto-generated changelog that is quite
> > verbose but having such documentation for consumers of the Java library
> > would be really helpful, as it gives denser-packed information on
> > upgrading versions.
> >
> > Cheers
> > Uwe
> >
> > On Mon, Jul 22, 2019, at 4:54 AM, Fan Liya wrote:
> > > Hi all,
> > >
> > > Let's track the API behavior changes in this email thread, so as not to
> > > forget about them for the next release.
> > >
> > > ARROW-5842 : the
> > > semantics of lastSet in ListVector changed. In the past, it referred to the
> > > next index that will be set; now it points to the last index that is
> > > actually set.
> > >
> > > ARROW-5973 : The
> > > semantics of the get methods for VarCharVector, VarBinaryVector, and
> > > FixedSizeBinaryVector changed. In the past, if the validity bit was clear,
> > > the methods threw an IllegalStateException when
> > > NULL_CHECKING_ENABLED was set, or returned an empty object when the flag was
> > > not set. Now, the get methods return null if the validity bit is clear.
> > >
> > > Best,
> > > Liya Fan
> > >
> >


[jira] [Created] (ARROW-5998) [Java] Open a document to track the API changes

2019-07-22 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5998:
---

 Summary: [Java] Open a document to track the API changes
 Key: ARROW-5998
 URL: https://issues.apache.org/jira/browse/ARROW-5998
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need a document to track the API behavior changes, so as not to forget
about them for the next release.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Memo] API Behavior changes

2019-07-22 Thread Fan Liya
@Uwe L. Korn

Thanks a lot for the good suggestion.
I will create a new file to track the changes.

Best,
Liya Fan

On Mon, Jul 22, 2019 at 5:03 PM Uwe L. Korn  wrote:

> Hallo Liya,
>
> what about having this as part of the repository, e.g.
> java/api-changes.md? We have an auto-generated changelog that is quite
> verbose but having such documentation for consumers of the Java library
> would be really helpful, as it gives denser-packed information on
> upgrading versions.
>
> Cheers
> Uwe
>
> On Mon, Jul 22, 2019, at 4:54 AM, Fan Liya wrote:
> > Hi all,
> >
> > Let's track the API behavior changes in this email thread, so as not to
> > forget about them for the next release.
> >
> > ARROW-5842 : the
> > semantics of lastSet in ListVector changed. In the past, it referred to the
> > next index that will be set; now it points to the last index that is
> > actually set.
> >
> > ARROW-5973 : The
> > semantics of the get methods for VarCharVector, VarBinaryVector, and
> > FixedSizeBinaryVector changed. In the past, if the validity bit was clear,
> > the methods threw an IllegalStateException when
> > NULL_CHECKING_ENABLED was set, or returned an empty object when the flag was
> > not set. Now, the get methods return null if the validity bit is clear.
> >
> > Best,
> > Liya Fan
> >
>


Re: [Memo] API Behavior changes

2019-07-22 Thread Uwe L. Korn
Hallo Liya,

what about having this as part of the repository, e.g. java/api-changes.md? We 
have an auto-generated changelog that is quite verbose but having such 
documentation for consumers of the Java library would be really helpful, as it
gives denser-packed information on upgrading versions.

Cheers
Uwe

On Mon, Jul 22, 2019, at 4:54 AM, Fan Liya wrote:
> Hi all,
> 
> Let's track the API behavior changes in this email thread, so as not to
> forget about them for the next release.
> 
> ARROW-5842 : the
> semantics of lastSet in ListVector changed. In the past, it referred to the
> next index that will be set; now it points to the last index that is
> actually set.
> 
> ARROW-5973 : The
> semantics of the get methods for VarCharVector, VarBinaryVector, and
> FixedSizeBinaryVector changed. In the past, if the validity bit was clear,
> the methods threw an IllegalStateException when
> NULL_CHECKING_ENABLED was set, or returned an empty object when the flag was
> not set. Now, the get methods return null if the validity bit is clear.
> 
> Best,
> Liya Fan
>


[jira] [Created] (ARROW-5997) [Java] Support dictionary encoding for Union type

2019-07-22 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5997:
-

 Summary: [Java] Support dictionary encoding for Union type
 Key: ARROW-5997
 URL: https://issues.apache.org/jira/browse/ARROW-5997
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Ji Liu
Assignee: Ji Liu


Currently, Union is the only type not supported in dictionary encoding.

In the last several weeks, we did some refactoring of the encoding code, and
now it's time to support the Union type.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)