Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

2019-10-25 Thread Micah Kornfield
I think at the very least the wording was confusing, since you raised
questions on the PR and Antoine commented here.

I agree with your analysis that it probably would not be hard to support.
But I don't feel too strongly either way on this particular point, aside
from wanting to come to a resolution. If I had to choose, I'd prefer
allowing Delta dictionaries in files.

On Friday, October 25, 2019, Wes McKinney  wrote:

> Can we discuss the delta dictionary issue a bit more? I admit I don't
> share those same concerns.
>
> From the perspective of a file and stream producer, the code paths
> should be nearly identical. The differences with the file format are:
>
> * Magic numbers to detect that it is the "file format"
> * Accumulated metadata at the footer
>
> If a file has any dictionaries at all, then they must all be
> reconstructed before reading a record batch. So let's say we have a
> file like
>
> DICTIONARY ID=0, isDelta=FALSE
> BATCH 0
> BATCH 1
> BATCH 2
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 3
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 4
>
> I do not see any harm in this -- the only downside is that you won't
> know what "state" the dictionary was in for the first 3 batches.
> Viewing dictionary encoding strictly as a data representation method,
> batches 0-2 and batch 3 represent the same data even if their in-memory
> dictionaries are larger than they were at the moment in which they
> were written.
>
> Note that the code path for "processing" the dictionaries as a first
> step will use the same code as the stream path. It should not be a
> great deal of work to write test cases for this.
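>
> To illustrate, the shared dictionary-processing logic could look roughly
> like this (a Python-style sketch; the helper functions are hypothetical,
> not actual pyarrow APIs):
>
> dictionaries = {}
> for message in read_ipc_messages(source):  # hypothetical reader
>     if message.is_dictionary_batch:
>         if message.is_delta:
>             # Delta: append the new values to the accumulated dictionary
>             dictionaries[message.id] = concat_arrays(
>                 dictionaries[message.id], message.values)
>         else:
>             # Replacement (or initial) dictionary for this id
>             dictionaries[message.id] = message.values
>     else:
>         # Record batch: its indices resolve against the current state
>         yield resolve_dictionaries(message, dictionaries)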
>
> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield 
> wrote:
> >
> > Hi Antoine,
> > There is a defined order for dictionaries in metadata.  What isn't well
> > defined is the relative ordering between record batches and Delta
> > dictionaries.
> >
> >  However, this point seems confusing. I can't think of a real-world use
> > case where it would be valuable enough to include, so I will remove Delta
> > dictionaries.
> >
> > So let's cancel this vote and I'll start a new one after the update.
> >
> > Thanks,
> > Micah
> >
> > On Thursday, October 24, 2019, Antoine Pitrou 
> wrote:
> >
> > >
> > > Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
> > > >
> > > > 3.  Clarifies that the file format can only contain 1 "NON-delta"
> > > > dictionary batch and multiple "delta" dictionary batches.
> > >
> > > This is a bit weird.  If the file format can carry delta dictionaries,
> > > it means order is significant, so it may as well contain dictionary
> > > redefinitions.
> > >
> > > If the file format is meant to be truly readable in random order, then
> > > it should also forbid delta dictionaries.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>


[CI] Docker-compose refactor and GitHub Actions

2019-10-25 Thread Krisztián Szűcs
Hi,

During the release of 0.15.1-RC0 I literally had to wait days
to ensure that the Travis, Appveyor, and Crossbow builds
were all passing for the release branch. Additionally, each
newly added patch delayed the process by 8 hrs or so
(it actually felt like 16).

Recently I've been working to incorporate the advantages
of the Buildbot setup into our current docker-compose
configuration, including support for multiple architectures
and platforms, reuse of docker images, and caching of
dependency installation steps. It tries to follow the semantics
of ursabot, but using only docker-compose and tiny shell scripts.

This refactoring also includes GitHub Actions workflows for
Windows and macOS, reusing the same (bash) build scripts.
The docker configuration and the scripts are CI-agnostic.
Last but not least, I've managed to clean up a lot of things,
including all of the Travis builds and three Appveyor builds.
As an example, the ci [3] and dev [4] folders got much cleaner.

The majority of the builds are passing [2], but due to the size
of the pull request [1], reviews of the relevant workflows
(JavaScript, C#, Rust, JNI, etc.) would be much appreciated.
I'll be on vacation until Wednesday, but will try to respond on
both GH and the ML.

Thanks, Krisztian

[1]: https://github.com/apache/arrow/pull/5589
[2]: https://github.com/apache/arrow/runs/275685241
[3]: 
https://github.com/apache/arrow/tree/9c7e7289b9c9486c13a02e7cb5682a0f9f274ec6/ci
[4]: 
https://github.com/apache/arrow/tree/9c7e7289b9c9486c13a02e7cb5682a0f9f274ec6/dev


[jira] [Created] (ARROW-6997) [Packaging] Add support for RHEL

2019-10-25 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-6997:
---

 Summary: [Packaging] Add support for RHEL
 Key: ARROW-6997
 URL: https://issues.apache.org/jira/browse/ARROW-6997
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


We need symbolic links from {{${VERSION}Server}} to {{${VERSION}}}, such as 
{{7Server}} -> {{7}}, because RHEL expands {{$releasever}} to values like 
{{7Server}}. (Is creating such links possible on Bintray?)

We also need to update the install instructions. On RHEL we can't install 
{{epel-release}} with {{yum install -y epel-release}}; we need to specify the 
URL explicitly: {{yum install 
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm}}. See 
https://fedoraproject.org/wiki/EPEL for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6998) Ability to read from URL for pyarrow's read_feather

2019-10-25 Thread Ryan McCarthy (Jira)
Ryan McCarthy created ARROW-6998:


 Summary: Ability to read from URL for pyarrow's read_feather
 Key: ARROW-6998
 URL: https://issues.apache.org/jira/browse/ARROW-6998
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Ryan McCarthy


See this [pandas issue|https://github.com/pandas-dev/pandas/issues/29055] for 
more info. Many of the pandas `read_format()` methods allow you to supply a 
URL, but `read_feather()` does not. This would be a nice feature to have.
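
A sketch of the requested usage (the URL call is the proposed behavior, not 
something pyarrow supports today):
{noformat}
import pyarrow.feather as feather

# Works today: a local path or file-like object
df = feather.read_feather("data.feather")

# Proposed: accept a URL, as most pandas read_*() methods do
df = feather.read_feather("https://example.com/data.feather"){noformat}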



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6996) [Python] Expose boolean filter kernel on Table

2019-10-25 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-6996:
---

 Summary: [Python] Expose boolean filter kernel on Table
 Key: ARROW-6996
 URL: https://issues.apache.org/jira/browse/ARROW-6996
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe Korn


This is currently only implemented for Array but would also be useful on Tables 
and ChunkedArrays.
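
A sketch of the desired usage (the Array call reflects the existing kernel; 
the Table call is the proposal, not an existing API):
{noformat}
import pyarrow as pa

mask = pa.array([True, False, True])

# Existing: the boolean filter kernel is exposed on Array
filtered_arr = pa.array([1, 2, 3]).filter(mask)

# Proposed: the same on Table (and ChunkedArray)
table = pa.Table.from_pydict({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
filtered_table = table.filter(mask){noformat}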



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[VOTE] Release Apache Arrow 0.15.1 - RC0

2019-10-25 Thread Krisztián Szűcs
Hi,

I would like to propose the following release candidate (RC0) of Apache
Arrow version 0.15.1. This is a patch release consisting of 36 resolved
JIRA issues[1].

This release candidate is based on commit:
b789226ccb2124285792107c758bb3b40b3d082a [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4][5][6][7].
The changelog is located at [8].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [9] for how to validate a release candidate.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 0.15.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow 0.15.1 because...

[1]: 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.15.1
[2]: 
https://github.com/apache/arrow/tree/b789226ccb2124285792107c758bb3b40b3d082a
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.15.1-rc0
[4]: https://bintray.com/apache/arrow/centos-rc/0.15.1-rc0
[5]: https://bintray.com/apache/arrow/debian-rc/0.15.1-rc0
[6]: https://bintray.com/apache/arrow/python-rc/0.15.1-rc0
[7]: https://bintray.com/apache/arrow/ubuntu-rc/0.15.1-rc0
[8]: 
https://github.com/apache/arrow/blob/b789226ccb2124285792107c758bb3b40b3d082a/CHANGELOG.md
[9]: 
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates


Re: Installing Apache Arrow RHEL 7

2019-10-25 Thread Wes McKinney
Does

https://dl.bintray.com/apache/arrow/centos/7/x86_64/repodata/repomd.xml

work? I'm copying dev@ in case Kou is not subscribed to user@

On Fri, Oct 25, 2019 at 12:39 PM Brian Klaassens  wrote:
>
> According to https://arrow.apache.org/install/, for CentOS I can add a repo 
> and install from there.
>
>
>
> sudo tee /etc/yum.repos.d/Apache-Arrow.repo <<REPO
> [apache-arrow]
>
> name=Apache Arrow
>
> baseurl=https://dl.bintray.com/apache/arrow/centos/\$releasever/\$basearch/
>
> gpgcheck=1
>
> enabled=1
>
> gpgkey=https://dl.bintray.com/apache/arrow/centos/RPM-GPG-KEY-apache-arrow
>
> REPO
>
> yum install -y epel-release
>
>
>
> The only problem is that the base URL is not constructed correctly (on 
> RHEL, $releasever expands to 7Server):
>
> https://dl.bintray.com/apache/arrow/centos/7Server/x86_64/repodata/repomd.xml
>
>
>
> I tried to cheat and changed the base URL to:
>
> https://dl.bintray.com/apache/arrow/centos/7/x86_64/repodata/repomd.xml
>
> The result was:
>
>
>
> Loaded plugins: amazon-id, rhui-lb, search-disabled-repos
>
> apache-arrow | 2.9 kB  00:00:00
>
> No package epel-release available.
>
> Error: Nothing to do
>
>
>
> Is there a way for me to get this to work?
>
>
>
> This is the OS I’m on.
>
> cat /etc/os-release
>
> NAME="Red Hat Enterprise Linux Server"
>
> VERSION="7.7 (Maipo)"
>
> ID="rhel"
>
> ID_LIKE="fedora"
>
> VARIANT="Server"
>
> VARIANT_ID="server"
>
> VERSION_ID="7.7"
>
> PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
>
> ANSI_COLOR="0;31"
>
> CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
>
> HOME_URL="https://www.redhat.com/"
>
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
>
>
>
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
>
> REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
>
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
>
> REDHAT_SUPPORT_PRODUCT_VERSION="7.7"


[jira] [Created] (ARROW-6995) [Packaging][Crossbow] The windows conda artifacts are not uploaded to GitHub releases

2019-10-25 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6995:
--

 Summary: [Packaging][Crossbow] The windows conda artifacts are not 
uploaded to GitHub releases
 Key: ARROW-6995
 URL: https://issues.apache.org/jira/browse/ARROW-6995
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


The artifacts should be uploaded under the appropriate tag: 
https://github.com/ursa-labs/crossbow/releases/tag/ursabot-289-azure-conda-win-vs2015-py37

Most likely the artifacts are now produced in a different directory than 
before, so the uploading script cannot find them: 
https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=2180



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6994) [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable

2019-10-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6994:
---

 Summary: [C++] Research jemalloc memory page reclamation 
configuration on macOS when background_thread option is unavailable
 Key: ARROW-6994
 URL: https://issues.apache.org/jira/browse/ARROW-6994
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


In ARROW-6977, this was disabled on macOS, but doing so potentially has 
negative performance and memory implications that ARROW-6910 was intended to 
fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Horizontal scaling design suggestion: Apache arrow flight

2019-10-25 Thread Vinay Kesarwani
Hi Ryan,

Thanks for your quick response.

The references you shared align with what I had in mind, and I would like to
discuss further to take this forward.

Thanks,
Vinay

On Fri, Oct 18, 2019 at 11:51 PM Ryan Murray  wrote:

> Hey Vinay,
>
> This Spark source might be of interest [1]. We had discussed the
> possibility of it being moved into Arrow proper as a contrib module when
> more stable.
>
> This is doing something similar to what you are suggesting: talking to a
> cluster of Flight servers from Spark. However, it deals more with the
> client side and less with the server side. It talks to a single Flight
> 'coordinator' and uses getSchema/getFlightInfo to tell the coordinator it
> wants a particular dataset. The coordinator then returns a list of flight
> tickets, each covering a portion of the requested dataset. A client can a)
> ask for the entire dataset from the coordinator, b) iterate serially
> through the tickets and assemble the whole dataset on the client side, or
> c) (as the Spark connector does) fetch tickets in parallel.
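>
> For option c, a client-side sketch with pyarrow.flight might look like
> this (the coordinator location and dataset path are made up, and error
> handling is omitted):
>
> import pyarrow.flight as flight
> from concurrent.futures import ThreadPoolExecutor
>
> client = flight.FlightClient("grpc://coordinator:8815")
> info = client.get_flight_info(
>     flight.FlightDescriptor.for_path("my/dataset"))
>
> def fetch(endpoint):
>     # Connect to a worker holding this portion and read it fully
>     worker = flight.FlightClient(endpoint.locations[0])
>     return worker.do_get(endpoint.ticket).read_all()
>
> with ThreadPoolExecutor() as pool:
>     tables = list(pool.map(fetch, info.endpoints))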
>
> I think the server side as you described above doesn't yet exist in a
> standalone form although the spark connector was developed in conjunction
> with [2] as the server. This is however highly dependent on the
> implementation details of the Dremio engine as it is taking care of the
> coordination between the flight workers. The idea is identical to yours
> however: a coordinator engine, a distributed store for engine meta, worker
> engines which create/serve the Arrow buffers.
>
> Would be happy to discuss further if you are interested in working on this
> stuff!
>
> Best,
> Ryan
>
> [1] https://github.com/rymurr/flight-spark-source
> [2] https://github.com/dremio-hub/dremio-flight-connector
>
> On Fri, Oct 18, 2019 at 3:05 PM Vinay Kesarwani 
> wrote:
>
> > Hi,
> >
> > I am trying to establish the following architecture:
> >
> > My approach for flight horizontal scaling is to launch:
> > 1-An Apache Flight server on each node
> > 2-One node declared as coordinator
> > 3-Publish coordinator info to a shared service [zookeeper]
> > 4-Launch worker node --> get coordinator node info from [zookeeper]
> > 5-Worker publishes its info to [zookeeper] to be consumed by others
> >
> > Client connects to coordinator:
> > 1-Calls getFlightInfo(desc)
> > 2-Here the coordinator node overrides getFlightInfo()
> > 3-The getFlightInfo() method internally gets worker info from zookeeper
> > based on the descriptor
> > 4-Client consumes data from each endpoint in an iterative manner OR in
> > parallel [not sure how]
> > -->getData()
> >
> > PutData()
> > 5-Client calls putData() to put data on different nodes in a flight stream
> > 6-Iterate through the endpoints and match the worker node IP
> > 7-If the worker IP matches an endpoint, the worker puts data into that
> > node's flight server.
> > 8-On putting any new/updated stream, the worker node info is updated in
> > zookeeper
> > 9-In case the worker IP doesn't match any endpoint, we need to put data
> > on some other worker node and publish the info in zookeeper.
> >
> > [in future: distributed client and distributed endpoints] Example: Spark
> > workers to an Apache Arrow Flight cluster.
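> >
> > A minimal sketch of the coordinator piece with pyarrow.flight (the
> > zookeeper lookups are hypothetical helpers; this is a sketch, not a
> > tested implementation):
> >
> > import pyarrow.flight as flight
> >
> > class Coordinator(flight.FlightServerBase):
> >     def get_flight_info(self, context, descriptor):
> >         # Hypothetical: resolve which workers hold this dataset
> >         workers = lookup_workers_in_zookeeper(descriptor)
> >         endpoints = [
> >             flight.FlightEndpoint(
> >                 ticket, [flight.Location.for_grpc_tcp(host, port)])
> >             for ticket, host, port in workers
> >         ]
> >         schema = lookup_schema_in_zookeeper(descriptor)  # hypothetical
> >         return flight.FlightInfo(
> >             schema, descriptor, endpoints, -1, -1)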
> >
> > [image: https://user-images.githubusercontent.com/6141965/67092386-b0012c00-f1cc-11e9-9ce2-d657001a85f7.png]
> >
> > Just wanted to ask whether any PR is in progress for horizontal scaling in
> > Arrow Flight, or whether any design doc is under discussion.
> >
>
>
> --
>
> Ryan Murray  | Principal Consulting Engineer
>
> +447540852009 | rym...@dremio.com
>


[jira] [Created] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC

2019-10-25 Thread Gawain BOLTON (Jira)
Gawain BOLTON created ARROW-6992:


 Summary: [C++]: Undefined Behavior sanitizer build option fails 
with GCC
 Key: ARROW-6992
 URL: https://issues.apache.org/jira/browse/ARROW-6992
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Gawain BOLTON


When building with the "undefined behavior sanitizer" option 
(-DARROW_USE_UBSAN=ON), compilation fails with GCC:
{noformat}
c++: error: unrecognized argument to ‘-fno-sanitize=’ option: 
‘function’{noformat}
It appears that GCC has never had a "-fsanitize=function" option.

I have fixed this issue and will submit a PR. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6991) [Packaging][deb] Add support for Ubuntu 19.10

2019-10-25 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-6991:
---

 Summary: [Packaging][deb] Add support for Ubuntu 19.10
 Key: ARROW-6991
 URL: https://issues.apache.org/jira/browse/ARROW-6991
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)