[pyarrow] Parquet page header size limit

2019-04-16 Thread shyam narayan singh
Hi

While reading a custom parquet file that has extra information embedded
(some custom stats), pyarrow is failing to read it.
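
For reference, a minimal sketch of the read path that fails (the file path is
hypothetical; this just mirrors the call shown in the traceback below):

    import pyarrow.parquet as pq

    # Open the custom Parquet file and materialize it as a Table
    dataset = pq.ParquetDataset("/path/to/custom_stats.parquet")
    table = dataset.read()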


Traceback (most recent call last):
  File "/tmp/pytest.py", line 19, in <module>
    table = dataset.read()
  File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 214, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 737, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.



Looking at the code, I realised that SerializedPageReader throws an exception
if the page header size goes beyond 16 KB (the default maximum). There is a
setter method for the max page header size, but it is used only in tests.


Is there a way to get around the problem?


Regards

Shyam


Re: [VOTE] Proposal accepted: change to Arrow Flight protocol: endpoint URIs

2019-04-16 Thread David Li
Thanks all for the comments! I am on vacation, and will refresh the draft
PR as soon as I return.

Best,
David

On Wed, Apr 17, 2019, 00:49 Antoine Pitrou  wrote:

>
> Hello,
>
> This vote closes with 4 binding approvals (+1) and zero disapprovals.
> There were also several non-binding approvals.  The proposal is
> therefore accepted.
>
> Congrats to David Li and everyone who participated in the discussion.
>
> Now the corresponding PR should be refreshed and reviewed:
> https://github.com/apache/arrow/pull/4047
>
> Regards
>
> Antoine.
>
>
> On Mon, 8 Apr 2019 20:36:26 +0200
> Antoine Pitrou  wrote:
> > Hello,
> >
> > David Li has proposed to make the following change to the Flight gRPC
> > service definition, as explained in this document:
> >
> https://docs.google.com/document/d/1Eps9eHvBc_qM8nRsTVwVCuWwHoEtQ-a-8Lv5dswuQoM/
> >
> > The proposed change is to replace the (host, port) pairs that identify
> > endpoints with RFC 3986-compliant URIs.  This will help describe with
> > much more flexibility how a given Flight stream can be reached, for
> > example by allowing different transport protocols (gRPC over TLS or Unix
> > sockets can be reasonably implemented, but in the future we may also
> > want to implement transport protocols that are not gRPC-based, for
> > example a REST protocol directly over HTTP).
> >
> > An example URI is "grpc+tcp://192.168.0.1:3337".
> >
> > Please vote whether to accept the changes. The vote will be open for at
> > least 72 hours.
> >
> > [ ] +1 Accept this change to the Flight protocol
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> > Best regards
> >
> > Antoine.
> >
>
>
>
>


[jira] [Created] (ARROW-5176) [Python] Automate formatting of python files

2019-04-16 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5176:


 Summary: [Python] Automate formatting of python files
 Key: ARROW-5176
 URL: https://issues.apache.org/jira/browse/ARROW-5176
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Benjamin Kietzman


[Black](https://github.com/ambv/black) is a tool for automatically formatting 
python code in ways which flake8 and our other linters approve of. Adding it to 
the project will give us more reliably formatted python code and fill a role 
similar to {{clang-format}} for C++ and {{cmake-format}} for CMake.
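
As a rough illustration (not part of this issue's description), a lint step
could shell out to black in check mode; the target directory here is an
assumption:

    import subprocess

    # Fail if any file under python/pyarrow would be reformatted by black;
    # --diff prints the changes black would make without writing them.
    result = subprocess.run(["black", "--check", "--diff", "python/pyarrow"])
    if result.returncode != 0:
        raise SystemExit("python files are not black-formatted")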



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] Benchmarking infrastructure

2019-04-16 Thread Francois Saint-Jacques
Hello,

A small status update: I recently implemented archery [1], a tool for comparing
Arrow benchmarks [2]. The documentation ([3] and [4]) is in the pull request.
The primary goal is to compare two commits (and/or build directories) for
performance regressions. For now, it supports C++ benchmarks.
This is accessible via the command `archery benchmark diff`. The end result is
one comparison per line, with a regression indicator.
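
To give a feel for what such a per-line comparison involves, here is a small
illustrative sketch (not the actual archery code) that flags regressions
between a baseline and a contender run, assuming both are plain
name-to-timing mappings and using an arbitrary 5% threshold:

    # baseline/contender map benchmark names to mean runtimes (e.g. nanoseconds)
    def diff(baseline, contender, threshold=0.05):
        for name in sorted(baseline.keys() & contender.keys()):
            old, new = baseline[name], contender[name]
            change = (new - old) / old
            indicator = "REGRESSION" if change > threshold else "OK"
            print(f"{name}: {change:+.1%} {indicator}")

    # Example: a 25% slowdown on a single benchmark is reported as a regression
    diff({"TakeInt64/1024": 120.0}, {"TakeInt64/1024": 150.0})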

Currently, there is no facility to perform a single "run", e.g. running
benchmarks in the current workspace without comparing against a previous
version. This was initially implemented in [5] but depended heavily on ctest
(with no control over execution). Once [1] is merged, I'll re-implement the
single run (ARROW-5071) in terms of archery, since it already executes and
parses C++ benchmarks.

The next goal is to be able to push the results into an upstream database, be
it the one defined in dev/benchmarking or codespeed as Areg proposed. The
steps required for this are:
- ARROW-5071: Run and format benchmark results for upstream consumption
  (ideally under the `archery benchmark run` sub-command)
- ARROW-5175: Make a list of benchmarks to include in regression checks
- ARROW-4716: Collect machine and benchmarks context
- ARROW-TBD: Push benchmark results to upstream database
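
As a rough illustration of the last step above, a minimal sketch of pushing
one result as JSON to a hypothetical HTTP endpoint (the URL and payload shape
are assumptions, not the dev/benchmarking schema):

    import json
    import urllib.request

    payload = {
        "benchmark": "TakeBenchmark/Int64",
        "value": 123.4,                 # e.g. mean runtime or throughput
        "unit": "ns",
        "commit": "512ae64",
        "machine": "my-benchmark-box",
    }
    req = urllib.request.Request(
        "https://example.org/api/benchmarks",   # hypothetical endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)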

In parallel, with ARROW-4827, Krisztian and I are working on two related
buildbot sub-projects enabling some regression detection:
- Triggering on-demand benchmark comparisons via comments in a PR
   (as proposed by Wes)
- Regression checks on merges to master (without database support)

François

P.S.
A side benefit of this PR is that archery is a modular python library that can
be used for other purposes, e.g. it could centralize the orphaned scripts in
dev/ (linting, release, and merge), since it offers utilities for handling
Arrow sources, git, and cmake, and exposes a usable CLI interface (with
documentation).

[1] https://github.com/apache/arrow/pull/4141
[2] https://jira.apache.org/jira/browse/ARROW-4827
[3]
https://github.com/apache/arrow/blob/512ae64bc074a0b620966131f9338d4a1eed2356/docs/source/developers/benchmarks.rst
[4]
https://github.com/apache/arrow/pull/4141/files#diff-7a8805436a6884ddf74fe3eaec697e71R216
[5] https://github.com/apache/arrow/pull/4077

On Fri, Mar 29, 2019 at 3:21 PM Melik-Adamyan, Areg <
areg.melik-adam...@intel.com> wrote:

> >When you say "output is parsed", how is that exactly? We don't have any
> scripts in the repository to do this yet (I have some comments on this
> below). We also have to collect machine information and insert that into
> the database. From my perspective we have quite a bit of engineering work
> on this topic ("benchmark execution and data collection") to do.
> Yes, I wrote one as a test. Then it can POST the JSON structure to the needed
> endpoint. Everything else will be done in the
>
> >My team and I have some physical hardware (including an Aarch64 Jetson
> TX2 machine, might be interesting to see what the ARM64 results look like)
> where we'd like to run benchmarks and upload the results also, so we need
> to write some documentation about how to add a new machine and set up a
> cron job of some kind.
> If it can run Linux, then we can set it up.
>
> >I'd like to eventually have a bot that we can ask to run a benchmark
> comparison versus master. Reporting on all PRs automatically might be quite
> a bit of work (and load on the machines)
> You should be able to choose the comparison between any two points:
> master-PR, master now - master yesterday, etc.
>
> >I thought the idea (based on our past e-mail discussions) was that we
> would implement benchmark collectors (as programs in the Arrow git
> repository) for each benchmarking framework, starting with gbenchmark and
> expanding to include ASV (for Python) and then others
> I'll open a PR and will be happy to put it into Arrow.
>
> >It seems like writing the benchmark collector script that runs the
> benchmarks, collects machine information, and inserts data into an instance
> of the database is the next milestone. Until that's done it seems difficult
> to do much else
> Ok, I will update the Jira 5070 and link it to 5071.
>
> Thanks.
>


[jira] [Created] (ARROW-5175) [Benchmarking] Decide which benchmarks are part of regression checks

2019-04-16 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5175:
-

 Summary: [Benchmarking] Decide which benchmarks are part of 
regression checks
 Key: ARROW-5175
 URL: https://issues.apache.org/jira/browse/ARROW-5175
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Proposal accepted: change to Arrow Flight protocol: endpoint URIs

2019-04-16 Thread Antoine Pitrou


Hello,

This vote closes with 4 binding approvals (+1) and zero disapprovals.
There were also several non-binding approvals.  The proposal is
therefore accepted.

Congrats to David Li and everyone who participated in the discussion.

Now the corresponding PR should be refreshed and reviewed:
https://github.com/apache/arrow/pull/4047

Regards

Antoine.


On Mon, 8 Apr 2019 20:36:26 +0200
Antoine Pitrou  wrote:
> Hello,
> 
> David Li has proposed to make the following change to the Flight gRPC
> service definition, as explained in this document:
> https://docs.google.com/document/d/1Eps9eHvBc_qM8nRsTVwVCuWwHoEtQ-a-8Lv5dswuQoM/
> 
> The proposed change is to replace the (host, port) pairs that identify
> endpoints with RFC 3986-compliant URIs.  This will help describe with
> much more flexibility how a given Flight stream can be reached, for
> example by allowing different transport protocols (gRPC over TLS or Unix
> sockets can be reasonably implemented, but in the future we may also
> want to implement transport protocols that are not gRPC-based, for
> example a REST protocol directly over HTTP).
> 
> An example URI is "grpc+tcp://192.168.0.1:3337".
> 
> Please vote whether to accept the changes. The vote will be open for at
> least 72 hours.
> 
> [ ] +1 Accept this change to the Flight protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
> 
> Best regards
> 
> Antoine.
> 
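
As an aside on the URI format above, a minimal sketch of how a client might
split such an endpoint URI, using only the Python standard library; the values
shown in comments are what urlparse returns for the example URI, and the
scheme-to-transport dispatch itself is left out:

    from urllib.parse import urlparse

    uri = urlparse("grpc+tcp://192.168.0.1:3337")
    print(uri.scheme)    # "grpc+tcp"  -> selects the transport
    print(uri.hostname)  # "192.168.0.1"
    print(uri.port)      # 3337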





[jira] [Created] (ARROW-5174) [Go] implement Stringer for DataTypes

2019-04-16 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5174:
--

 Summary: [Go] implement Stringer for DataTypes
 Key: ARROW-5174
 URL: https://issues.apache.org/jira/browse/ARROW-5174
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Sebastien Binet
Assignee: Sebastien Binet






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5173) [Go] handle multiple concatenated streams back-to-back

2019-04-16 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5173:
--

 Summary: [Go] handle multiple concatenated streams back-to-back
 Key: ARROW-5173
 URL: https://issues.apache.org/jira/browse/ARROW-5173
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Sebastien Binet
Assignee: Sebastien Binet






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5172) [Go] implement reading fixed-size binary arrays from Arrow file

2019-04-16 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5172:
--

 Summary: [Go] implement reading fixed-size binary arrays from 
Arrow file
 Key: ARROW-5172
 URL: https://issues.apache.org/jira/browse/ARROW-5172
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Sebastien Binet
Assignee: Sebastien Binet






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: What's the proper procedure to publish a docker image to dockerhub?

2019-04-16 Thread Alberto Ramón
Then in this scenario the easiest approach is this:

1- Create a repository on Docker Hub (ask Wes about the name) (mine is
albertozgz)

2- Create a local image with the name of your repository
[image: image.png]

3- Upload the image to Docker Hub (this process will require a login/password):

docker push albertozgz/turbodbc_extrator:turbodbc11100

4- Go to Docker Hub and check the new repository:

 [image: image.png]
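
For completeness, the same flow (tag a local image after your repository, then
upload it) sketched with the Docker SDK for Python; it assumes the "docker"
package is installed and a daemon is running, and the local image name and
credentials are placeholders:

    import docker

    client = docker.from_env()
    # step 3 prerequisite: authenticate against Docker Hub
    client.login(username="albertozgz", password="<password>")

    # step 2: name the locally built image after the Docker Hub repository
    image = client.images.get("turbodbc_extrator:turbodbc11100")  # assumed local name
    image.tag("albertozgz/turbodbc_extrator", tag="turbodbc11100")

    # step 3: upload the tagged image
    client.images.push("albertozgz/turbodbc_extrator", tag="turbodbc11100")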

If you have any problems, feel free to let me know directly.





On Tue, 16 Apr 2019 at 05:28, Wes McKinney  wrote:

> It is not in compliance with policy to publish any official Apache artifacts
> unless the PMC has held a release vote on them (or on the source artifact that
> produces them). You are free to publish a Docker image under a non-Apache
> Docker Hub account, of course.
>
> Wes
>
> On Tue, Apr 16, 2019, 10:10 AM Micah Kornfield 
> wrote:
>
> > I'm not sure of the policy here, but I think if this is something official
> > then the PMC would have to set it up and control it. Could someone on the
> > PMC chime in?
> >
> > On Monday, April 15, 2019, Zhiyuan Zheng 
> wrote:
> >
> > > Thanks Alberto!
> > >
> > > If we are able to create an official repository solely for Apache Arrow,
> > > it will be more flexible to publish new images in the future.
> > >
> > > How do we create such a repository?
> > >
> > > 16.04.2019, 01:27, "Alberto Ramón" :
> > > > Hello Zhiyuan
> > > >
> > > > I can help you if you need help with this process.
> > > > The best option is to request an official repository for the Apache Arrow
> > > > Project (those are the ones that start with '_'; see the Redis example).
> > > >
> > > > On Mon, 15 Apr 2019 at 15:21, Zhiyuan Zheng <zhiyuan.zh...@yandex.com>
> > > > wrote:
> > > >
> > > >>  Hi,
> > > >>
> > > >>  DataFusion is a component which is an in-memory query engine using
> > > >>  Apache Arrow as the memory model.
> > > >>
> > > >>  I have created a Dockerfile for DataFusion
> > > >>  (https://issues.apache.org/jira/browse/ARROW-4467).
> > > >>
> > > >>  In order to help users start using DataFusion for some simple
> > > >>  real-world use cases, I would like to publish a docker image with
> > > >>  the tag 'apache/arrow-datafusion' to DockerHub.
> > > >>
> > > >>  What's the procedure to publish a docker image to DockerHub
> > > >>  prefixed with 'apache'?
> > > >>
> > > >>  Cheers,
> > > >>  Zhiyuan
> > >
> >
>