Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-09 Thread Wes McKinney
On Thu, Oct 10, 2019 at 12:22 AM Jacques Nadeau  wrote:
>
> I'm not dismissing that there are issues, but I also don't feel like there
> has been constant discussion for months on the list that INFRA is not being
> responsive to Arrow community requests. It seems like you might be saying
> one of two things (or both?):
>
> 1) The Arrow infrastructure requirements are vastly different than other
> projects. Because of Arrow's specialized requirements, we need things that
> no other project needs.
> 2) There are many projects that want CircleCI, Buildkite and Azure
> pipelines but Infrastructure is not responsive. This is putting a big
> damper on the success of the Arrow project.

Yes, I'm saying both of these things.

1. Yes, Arrow is special -- validating the project requires running a
dozen or more different builds (with dozens more nightly builds) that
test different parts of the project: different language components, a
large and diverse packaging matrix, and inter-project integration tests
and integration with external projects (e.g. Apache Spark and others).

2. Yes, the limited GitHub App availability is hurting us.

I'm OK to place this concern in the "Community Health" section and
spend more time building a comprehensive case about how Infra's
conservatism around Apps is causing us to work with one hand tied
behind our back. I know that I'm not the only one who is unhappy, but
I'll let the others speak for themselves.

> For each of these, if we're asking the board to do something, we should say
> so more clearly. Sure, CI is a pain in the Arrow project's a**. I
> also agree that community health is impacted by the challenge to merge
> things. I also share the perspective that the foundation has been slow to
> adopt new technologies and has been way too religious about svn. However, if
> we're asking the board to do something, what is it?

Allow GitHub Apps that do not require write access to the code itself,
and set up appropriate checks and balances to ensure that the Foundation's
IP provenance webhooks are preserved.

> Looking at the two things you might be saying...
> If 1, are we confident in that? Many other projects have pretty complex
> build matrices, I think. (I haven't thought about this and evaluated the
> other projects... maybe it is true.) If 1, we should clarify why we think
> we're different. If that is the case, what are we asking for from the board?
>
> If 2, and you are proposing throwing stones at INFRA, we should back it up
> with INFRA tickets and numbers (e.g. how many projects have wanted these
> things and for how long). We should reference multiple threads on the INFRA
> mailing list where we voiced certain concerns and many other people voiced
> similar concerns and INFRA turned a deaf ear or blind eye (maybe these
> exist, I haven't spent much time on the INFRA list lately). As it stands,
> the one ticket referenced in this thread is a ticket that has only one
> project asking for a new integration that has been open for less than a
> week. That may be annoying but it doesn't seem like something that has
> gotten to the level that we need to get the board's help.
>
> In a nutshell, I agree that this is impacting the health and growth of the
> project but think we should cover that in the community health section of
> the report. I'm less a fan of saying this is an issue the board needs to
> help us solve unless it has been a constant point of pain that we've
> attempted to elevate multiple times in infra forums and experienced
> unreasonable responses. The board is a blunt instrument and should only be
> used when we have depleted every other avenue for resolution.
>

Yes, I'm happy to spend more time building a comprehensive case before
escalating it to the board level. However, Apache Arrow is a high
profile project, and it is not a good look to have the PMC of a
fast-growing project growing disgruntled with the Foundation's
policies in this way. We've been struggling visibly for a long time
with our CI scalability, and I think we should have all the options on
the table to utilize GitHub-integrated tools to help us find a way out
of the mess that we are in.

>
> On Wed, Oct 9, 2019 at 9:44 PM Wes McKinney  wrote:
>
> > hi Jacques,
> >
> > I think we need to share the concerns that many PMC members have over
> > the constraints that INFRA is placing on us. Can we rephrase the
> > concern in a way that is more helpful?
> >
> > Firstly, I respect and appreciate the ASF's desire to limit write
> > access to committers only from an IP provenance perspective. I
> > understand that GitHub webhooks are used to log actions taken in
> > repositories to secure IP provenance. I do not think a third party
> > application should be given the ability to commit or modify a
> > repository -- all write operations on the .git repository should be
> > initiated by committers.
> >
> > However, GitHub is the main platform for producing open source
> > software, and tools a

Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-09 Thread Jacques Nadeau
I'm not dismissing that there are issues, but I also don't feel like there
has been constant discussion for months on the list that INFRA is not being
responsive to Arrow community requests. It seems like you might be saying
one of two things (or both?):

1) The Arrow infrastructure requirements are vastly different than other
projects. Because of Arrow's specialized requirements, we need things that
no other project needs.
2) There are many projects that want CircleCI, Buildkite and Azure
pipelines but Infrastructure is not responsive. This is putting a big
damper on the success of the Arrow project.

For each of these, if we're asking the board to do something, we should say
so more clearly. Sure, CI is a pain in the Arrow project's a**. I
also agree that community health is impacted by the challenge to merge
things. I also share the perspective that the foundation has been slow to
adopt new technologies and has been way too religious about svn. However, if
we're asking the board to do something, what is it?

Looking at the two things you might be saying...
If 1, are we confident in that? Many other projects have pretty complex
build matrices, I think. (I haven't thought about this and evaluated the
other projects... maybe it is true.) If 1, we should clarify why we think
we're different. If that is the case, what are we asking for from the board?

If 2, and you are proposing throwing stones at INFRA, we should back it up
with INFRA tickets and numbers (e.g. how many projects have wanted these
things and for how long). We should reference multiple threads on the INFRA
mailing list where we voiced certain concerns and many other people voiced
similar concerns and INFRA turned a deaf ear or blind eye (maybe these
exist, I haven't spent much time on the INFRA list lately). As it stands,
the one ticket referenced in this thread is a ticket that has only one
project asking for a new integration that has been open for less than a
week. That may be annoying but it doesn't seem like something that has
gotten to the level that we need to get the board's help.

In a nutshell, I agree that this is impacting the health and growth of the
project but think we should cover that in the community health section of
the report. I'm less a fan of saying this is an issue the board needs to
help us solve unless it has been a constant point of pain that we've
attempted to elevate multiple times in infra forums and experienced
unreasonable responses. The board is a blunt instrument and should only be
used when we have depleted every other avenue for resolution.




On Wed, Oct 9, 2019 at 9:44 PM Wes McKinney  wrote:

> hi Jacques,
>
> I think we need to share the concerns that many PMC members have over
> the constraints that INFRA is placing on us. Can we rephrase the
> concern in a way that is more helpful?
>
> Firstly, I respect and appreciate the ASF's desire to limit write
> access to committers only from an IP provenance perspective. I
> understand that GitHub webhooks are used to log actions taken in
> repositories to secure IP provenance. I do not think a third party
> application should be given the ability to commit or modify a
> repository -- all write operations on the .git repository should be
> initiated by committers.
>
> However, GitHub is the main platform for producing open source
> software, and tools are being created to help produce open source more
> efficiently. It is frustrating for us to not be able to take advantage
> of the tools that are available to everyone else on GitHub. I brought
> up the recent request about Buildkite as being representative of this
> (after learning that Google has been making a lot of use of it), but
> we have previously been denied use of CircleCI and Azure Pipelines
> since those services require even more permissions (AFAIK) than in the
> case of Buildkite. From our use in
> https://github.com/ursa-labs/crossbow CircleCI and Azure seem to be a
> lot better than Travis CI and Appveyor
>
> I think the ASF is going to face an existential crisis in the near
> future whether it wants to live in 2020 or 2000. It feels like GitHub
> is treated somewhat as ersatz SVN "because people want to use git +
> GitHub instead of SVN"
>
> In the same way that the cloud revolutionized software startups,
> enabling small groups of developers to build large SaaS applications,
> the same kind of leverage is becoming available to open source
> developers to set up infrastructure to automate and scale open source
> projects. I think projects considering joining the Foundation are
> going to look at these issues around App usage and decide that they
> would rather be in control of their own infrastructure.
>
> I can set aside even more time and money from my non-profit
> organization's modest budget to do CI work for Apache Arrow. The
> amount that we have invested already is very large, and continues to
> grow. I'm raising these issues because as Member of the Foundation I'

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Renjie Liu
It would be fine in that case.

Wes McKinney  wrote on Thu, Oct 10, 2019 at 12:58 PM:

> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu 
> wrote:
> >
> > 1. There already exists a low level parquet writer which can produce
> > parquet files, so unit tests should be fine. But a writer from arrow to
> parquet
> > doesn't exist yet, and it may take some period of time to finish it.
> > 2. In fact my data are randomly generated and it's definitely
> reproducible.
> > However, I don't think it would be a good idea to randomly generate data
> > every time we run CI because it would be difficult to debug. For example,
> if PR
> > A introduces a bug which is triggered in another PR's build, it would be
> > confusing for contributors.
>
> Presumably any random data generation would use a fixed seed precisely
> to be reproducible.
>
> > 3. I think it would be a good idea to spend effort on integration tests with
> > parquet because it's an important use case of arrow. Also a similar
> approach
> > could be extended to other languages and other file formats (Avro, ORC).
> >
> >
> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney 
> wrote:
> >
> > > There are a number of issues worth discussion.
> > >
> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> > > It's OK to be reliant on other libraries in the short term to produce
> > > files to test against, but does not strike me as a sustainable
> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
> > > to be if you can't write targeted "endogenous" unit tests
> > >
> > > 2. Reproducible data generation
> > >
> > > I think if you're going to test against a pre-generated corpus, you
> > > should make sure that generating the corpus is reproducible for other
> > > developers (i.e. with a Dockerfile), and can be extended by adding new
> > > files or random data generation.
> > >
> > > I additionally would prefer generating the test corpus at test time
> > > rather than checking in binary files. If this isn't viable right now
> > > we can create an "arrow-rust-crutch" git repository for you to stash
> > > binary files until some of these testing scalability issues are
> > > addressed.
> > >
> > > If we're going to spend energy on Parquet integration testing with
> > > Java, this would be a good opportunity to do the work in a way where
> > > the C++ Parquet library can also participate (since we ought to be
> > > doing integration tests with Java, and we can also read JSON files to
> > > Arrow).
> > >
> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu 
> > > wrote:
> > > >
> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove 
> > > wrote:
> > > >
> > > > > I'm very interested in helping to find a solution to this because
> we
> > > really
> > > > > do need integration tests for Rust to make sure we're compatible
> with
> > > other
> > > > > implementations... there is also the ongoing CI dockerization work
> > > that I
> > > > > feel is related.
> > > > >
> > > > > I haven't looked at the current integration tests yet and would
> > > appreciate
> > > > > some pointers on how all of this works (do we have docs?) or where
> to
> > > start
> > > > > looking.
> > > > >
> > > > I have a test in my latest PR:
> https://github.com/apache/arrow/pull/5523
> > > > And here is the generated data:
> > > > https://github.com/apache/arrow-testing/pull/11
> > > > As for the program that generates this data, it's just a simple Java
> program.
> > > > I'm not sure whether we need to integrate it into arrow.
> > > >
> > > > >
> > > > > I imagine the integration test could follow the approach that
> Renjie is
> > > > > outlining where we call Java to generate some files and then call
> Rust
> > > to
> > > > > parse them?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu  >
> > > wrote:
> > > > >
> > > > > > Hi:
> > > > > >
> > > > > > I'm developing a Rust version of a reader which reads parquet into
> arrow
> > > > > array.
> > > > > > To verify the correctness of this reader, I use the following
> approach:
> > > > > >
> > > > > >
> > > > > >1. Define schema with protobuf.
> > > > > >2. Generate json data of this schema using other language with
> > > more
> > > > > >sophisticated implementation (e.g. java)
> > > > > >3. Generate parquet data of this schema using other language
> with
> > > more
> > > > > >sophisticated implementation (e.g. java)
> > > > > >4. Write tests to read json file, and parquet file into memory
> > > (arrow
> > > > > >array), then compare json data with arrow data.
> > > > > >
> > > > > >  I think with this method we can guarantee the correctness of the
> arrow
> > > > > reader
> > > > > > because the JSON format is ubiquitous and its implementations are
> more
> > > > > stable.
> > > > > >
> > > > > > Any comment is appreciated.
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Renjie Liu
> > > > Software Engineer, MVAD
>

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Wes McKinney
On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu  wrote:
>
> 1. There already exists a low level parquet writer which can produce
> parquet files, so unit tests should be fine. But a writer from arrow to parquet
> doesn't exist yet, and it may take some period of time to finish it.
> 2. In fact my data are randomly generated and it's definitely reproducible.
> However, I don't think it would be a good idea to randomly generate data
> every time we run CI because it would be difficult to debug. For example, if PR
> A introduces a bug which is triggered in another PR's build, it would be
> confusing for contributors.

Presumably any random data generation would use a fixed seed precisely
to be reproducible.
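A minimal sketch of the fixed-seed idea (function name, column shapes, and seed values below are illustrative, not from any Arrow test suite): seeding the generator makes every CI run produce the same corpus, so a failure on one PR's build reproduces locally.

```python
import random

def make_rows(seed=42, n=1000):
    """Generate a deterministic 'random' dataset: same seed, same rows."""
    rng = random.Random(seed)  # fixed seed so CI failures reproduce locally
    return [(rng.randint(0, 1_000_000), rng.gauss(0.0, 1.0)) for _ in range(n)]

# Two runs with the same seed yield identical data...
assert make_rows() == make_rows()
# ...while a different seed yields a different dataset.
assert make_rows(seed=7) != make_rows()
```

The same principle applies regardless of language: commit the seed alongside the test so the "random" corpus is part of the test's definition.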

> 3. I think it would be a good idea to spend effort on integration tests with
> parquet because it's an important use case of arrow. Also a similar approach
> could be extended to other languages and other file formats (Avro, ORC).
>
>
> On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney  wrote:
>
> > There are a number of issues worth discussion.
> >
> > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> > It's OK to be reliant on other libraries in the short term to produce
> > files to test against, but does not strike me as a sustainable
> > long-term plan. Fixing bugs can be a lot more difficult than it needs
> > to be if you can't write targeted "endogenous" unit tests
> >
> > 2. Reproducible data generation
> >
> > I think if you're going to test against a pre-generated corpus, you
> > should make sure that generating the corpus is reproducible for other
> > developers (i.e. with a Dockerfile), and can be extended by adding new
> > files or random data generation.
> >
> > I additionally would prefer generating the test corpus at test time
> > rather than checking in binary files. If this isn't viable right now
> > we can create an "arrow-rust-crutch" git repository for you to stash
> > binary files until some of these testing scalability issues are
> > addressed.
> >
> > If we're going to spend energy on Parquet integration testing with
> > Java, this would be a good opportunity to do the work in a way where
> > the C++ Parquet library can also participate (since we ought to be
> > doing integration tests with Java, and we can also read JSON files to
> > Arrow).
> >
> > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu 
> > wrote:
> > >
> > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove 
> > wrote:
> > >
> > > > I'm very interested in helping to find a solution to this because we
> > really
> > > > do need integration tests for Rust to make sure we're compatible with
> > other
> > > > implementations... there is also the ongoing CI dockerization work
> > that I
> > > > feel is related.
> > > >
> > > > I haven't looked at the current integration tests yet and would
> > appreciate
> > > > some pointers on how all of this works (do we have docs?) or where to
> > start
> > > > looking.
> > > >
> > > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > > And here is the generated data:
> > > https://github.com/apache/arrow-testing/pull/11
> > > As for the program that generates this data, it's just a simple Java program.
> > > I'm not sure whether we need to integrate it into arrow.
> > >
> > > >
> > > > I imagine the integration test could follow the approach that Renjie is
> > > > outlining where we call Java to generate some files and then call Rust
> > to
> > > > parse them?
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu 
> > wrote:
> > > >
> > > > > Hi:
> > > > >
> > > > > I'm developing a Rust version of a reader which reads parquet into arrow
> > > > array.
> > > > > To verify the correctness of this reader, I use the following approach:
> > > > >
> > > > >
> > > > >1. Define schema with protobuf.
> > > > >2. Generate json data of this schema using other language with
> > more
> > > > >sophisticated implementation (e.g. java)
> > > > >3. Generate parquet data of this schema using other language with
> > more
> > > > >sophisticated implementation (e.g. java)
> > > > >4. Write tests to read json file, and parquet file into memory
> > (arrow
> > > > >array), then compare json data with arrow data.
> > > > >
> > > > >  I think with this method we can guarantee the correctness of the arrow
> > > > reader
> > > > > because the JSON format is ubiquitous and its implementations are more
> > > > stable.
> > > > >
> > > > > Any comment is appreciated.
> > > > >
> > > >
> > >
> > >
> > > --
> > > Renjie Liu
> > > Software Engineer, MVAD
> >
>
>
> --
> Renjie Liu
> Software Engineer, MVAD


Re: Looking ahead to 1.0

2019-10-09 Thread Wes McKinney
Hi John,

Since the 1.0.0 release is focused on Format stability, probably the
only real "blockers" will be ensuring that we have hardened multiple
implementations (in particular C++ and Java) of the columnar format as
specified with integration tests to prove it. The issues you listed
sound more like C++ library changes to me?

If you want to propose Format-related changes, that would need to
happen right away otherwise the ship will sail on that.

- Wes

On Wed, Oct 9, 2019 at 9:08 PM John Muehlhausen  wrote:
>
> ARROW-5916
> ARROW-6836/6837
>
> These are of particular interest to me because they enable recordbatch
> "incrementalism" which is useful for streaming applications:
>
> ARROW-5916 allows a recordbatch to pre-allocate space for future records
> that have not yet been populated, making it safe for readers to consume the
> partial batch.
>
> ARROW-6836/6837 allows a file of record batches to be extended at the end,
> without re-writing the beginning, while including the idea that the
> custom_metadata may change with each update.  (custom_metadata in the
> Schema is not a good candidate because Schema also appears at the beginning
> of the file.)
>
> While these are not blockers for me quite yet, they soon will be!  If I
> wanted to ensure that these are in 1.0, what is my deadline for
> implementation and test cases?  Can such a note be made on the wiki?
> Should I change the priority in Jira?
>
> Thanks,
> John
>
> On Wed, Oct 9, 2019 at 2:57 PM Neal Richardson 
> wrote:
>
> > Congratulations everyone on 0.15! I know a lot of hard work went into
> > it, not only in the software itself but also in the build and release
> > process.
> >
> > Once you've caught your breath from the release, we should start
> > thinking about what's in scope for our next release, the big 1.0. To
> > get us started (or restarted, since we did discuss 1.0 before the
> > flatbuffer alignment issue came up), I've created
> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
> > based on our past release wiki pages.
> >
> > A good place to begin would be to list, either in "blocker" Jiras or
> > bullet points on the document, the key features and tasks we must
> > resolve before 1.0. For example, I get the sense that we need to
> > overhaul the documentation, but that should be expressed in a more
> > concrete, actionable way.
> >
> > Neal
> >


Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-09 Thread Wes McKinney
hi Jacques,

I think we need to share the concerns that many PMC members have over
the constraints that INFRA is placing on us. Can we rephrase the
concern in a way that is more helpful?

Firstly, I respect and appreciate the ASF's desire to limit write
access to committers only from an IP provenance perspective. I
understand that GitHub webhooks are used to log actions taken in
repositories to secure IP provenance. I do not think a third party
application should be given the ability to commit or modify a
repository -- all write operations on the .git repository should be
initiated by committers.

However, GitHub is the main platform for producing open source
software, and tools are being created to help produce open source more
efficiently. It is frustrating for us to not be able to take advantage
of the tools that are available to everyone else on GitHub. I brought
up the recent request about Buildkite as being representative of this
(after learning that Google has been making a lot of use of it), but
we have previously been denied use of CircleCI and Azure Pipelines
since those services require even more permissions (AFAIK) than in the
case of Buildkite. From our use in
https://github.com/ursa-labs/crossbow, CircleCI and Azure seem to be a
lot better than Travis CI and Appveyor.

I think the ASF is going to face an existential crisis in the near
future over whether it wants to live in 2020 or 2000. It feels like GitHub
is treated somewhat as ersatz SVN "because people want to use git +
GitHub instead of SVN".

In the same way that the cloud revolutionized software startups,
enabling small groups of developers to build large SaaS applications,
the same kind of leverage is becoming available to open source
developers to set up infrastructure to automate and scale open source
projects. I think projects considering joining the Foundation are
going to look at these issues around App usage and decide that they
would rather be in control of their own infrastructure.

I can set aside even more time and money from my non-profit
organization's modest budget to do CI work for Apache Arrow. The
amount that we have invested already is very large, and continues to
grow. I'm raising these issues because as Member of the Foundation I'm
concerned that fast-growing projects like ours are not being
adequately served by INFRA, and we probably aren't the only project
that will face these issues. All that is needed is for INFRA to let us
use third party GitHub Apps and monitor any potentially destructive
actions that they may take, such as modifying unrelated repository
webhooks related to IP provenance.

- Wes

On Wed, Oct 9, 2019 at 9:33 PM Jacques Nadeau  wrote:
>
> I think we need to be more direct in listing issues for the board.
>
> What have we done? What do we want them to do?
>
> In general, any large org is going to be slow to add new deep integrations
> into GitHub. I don't think we should expect Apache to be any different (it
> took several years before we could merge things through github for
> example). If I were on the INFRA side, I think I would look and see how
> many different people are asking for BuildKite before considering
> integration. It seems like we only opened the JIRA 6 days ago and no other
> projects have requested access to this?
>
> I'm not clear why this is a board issue. What do we think the board can do
> for us that we can't solve ourselves and need them to solve? Remember, a
> board solution to a problem is typically very removed from what matters to
> individuals on a project.
>
>
>
>
>
>
> On Tue, Oct 8, 2019 at 7:03 AM Wes McKinney  wrote:
>
> > New draft
> >
> > ## Description:
> > The mission of Apache Arrow is the creation and maintenance of software
> > related
> > to columnar in-memory processing and data interchange
> >
> > ## Issues:
> >
> > * We are struggling with Continuous Integration scalability as the project
> > has
> >   definitely outgrown what Travis CI and Appveyor can do for us. Some
> >   contributors have shown reluctance to submit patches they aren't sure
> > about
> >   because they don't want to pile on the build queue. We are exploring
> >   alternative solutions such as Buildbot, Buildkite, and GitHub Actions to
> >   provide a path to migrate away from Travis CI / Appveyor. In our request
> > to
> >   Infrastructure INFRA-19217, some of us were alarmed to find that a CI/CD
> >   service like Buildkite may not be able to be connected to the @apache
> >   GitHub account because it requires admin access to repository webhooks
> >   but no ability to modify source code. There are workarounds (building custom
> > OAuth
> >   bots) that could enable us to use Buildkite, but it would require extra
> >   development and result in a less refined experience for community
> > members.
> >
> > ## Membership Data:
> > * Apache Arrow was founded 2016-01-19 (4 years ago)
> > * There are currently 48 committers and 28 PMC members in this project.
> > * The Committer-to-P

[jira] [Created] (ARROW-6844) List columns read broken with 0.15.0

2019-10-09 Thread Benoit Rostykus (Jira)
Benoit Rostykus created ARROW-6844:
--

 Summary: List columns read broken with 0.15.0
 Key: ARROW-6844
 URL: https://issues.apache.org/jira/browse/ARROW-6844
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.0
Reporter: Benoit Rostykus


Columns of type `array` (such as `array`, 
`array`...) are not readable anymore using `pyarrow == 0.15.0` (but were 
with `pyarrow == 0.14.1`) when the original writer of the parquet file is 
`parquet-mr 1.9.1`.

```
import pyarrow.parquet as pq

pf = pq.ParquetFile('sample.gz.parquet')

print(pf.read(columns=['profile_ids']))
```
with 0.14.1:
```
pyarrow.Table
profile_ids: list
 child 0, element: int64

...
```
with 0.15.0:

```

Traceback (most recent call last):
 File "", line 1, in 
 File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
 line 253, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1131, in 
pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 0 with type list 
is inconsistent with schema list

```

I've tested parquet files coming from multiple tables (with various schemas) 
created with `parquet-mr`, couldn't read any `array` column 
anymore.

 

I _think_ the bug was introduced with this commit:
https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5

I think the root of the issue comes from the fact that `parquet-mr` writes the
inner struct name as `"element"` by default (see
https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33),
whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example this test:
https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74).
The round-tripping write/read tests in pyarrow alone obviously won't catch this.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Renjie Liu
1. There already exists a low level parquet writer which can produce
parquet files, so unit tests should be fine. But a writer from arrow to parquet
doesn't exist yet, and it may take some period of time to finish it.
2. In fact my data are randomly generated and it's definitely reproducible.
However, I don't think it would be a good idea to randomly generate data
every time we run CI because it would be difficult to debug. For example, if PR
A introduces a bug which is triggered in another PR's build, it would be
confusing for contributors.
3. I think it would be a good idea to spend effort on integration tests with
parquet because it's an important use case of arrow. Also a similar approach
could be extended to other languages and other file formats (Avro, ORC).


On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney  wrote:

> There are a number of issues worth discussion.
>
> 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> It's OK to be reliant on other libraries in the short term to produce
> files to test against, but does not strike me as a sustainable
> long-term plan. Fixing bugs can be a lot more difficult than it needs
> to be if you can't write targeted "endogenous" unit tests
>
> 2. Reproducible data generation
>
> I think if you're going to test against a pre-generated corpus, you
> should make sure that generating the corpus is reproducible for other
> developers (i.e. with a Dockerfile), and can be extended by adding new
> files or random data generation.
>
> I additionally would prefer generating the test corpus at test time
> rather than checking in binary files. If this isn't viable right now
> we can create an "arrow-rust-crutch" git repository for you to stash
> binary files until some of these testing scalability issues are
> addressed.
>
> If we're going to spend energy on Parquet integration testing with
> Java, this would be a good opportunity to do the work in a way where
> the C++ Parquet library can also participate (since we ought to be
> doing integration tests with Java, and we can also read JSON files to
> Arrow).
>
> On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu 
> wrote:
> >
> > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove 
> wrote:
> >
> > > I'm very interested in helping to find a solution to this because we
> really
> > > do need integration tests for Rust to make sure we're compatible with
> other
> > > implementations... there is also the ongoing CI dockerization work
> that I
> > > feel is related.
> > >
> > > I haven't looked at the current integration tests yet and would
> appreciate
> > > some pointers on how all of this works (do we have docs?) or where to
> start
> > > looking.
> > >
> > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > And here is the generated data:
> > https://github.com/apache/arrow-testing/pull/11
> > As for the program that generates these data, it's just a simple Java
> > program. I'm not sure whether we need to integrate it into Arrow.
> >
> > >
> > > I imagine the integration test could follow the approach that Renjie is
> > > outlining where we call Java to generate some files and then call Rust
> to
> > > parse them?
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > >
> > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu 
> wrote:
> > >
> > > > Hi:
> > > >
> > > > I'm developing a Rust version of a reader that reads Parquet into
> > > > Arrow arrays.
> > > > To verify the correctness of this reader, I use the following approach:
> > > >
> > > >
> > > >    1. Define the schema with protobuf.
> > > >    2. Generate JSON data for this schema using another language with
> > > >    a more sophisticated implementation (e.g. Java).
> > > >    3. Generate Parquet data for this schema using another language
> > > >    with a more sophisticated implementation (e.g. Java).
> > > >    4. Write tests that read the JSON file and the Parquet file into
> > > >    memory (as Arrow arrays), then compare the JSON data with the
> > > >    Arrow data.
> > > >
> > > >  I think with this method we can guarantee the correctness of the
> > > > Arrow reader, because the JSON format is ubiquitous and its
> > > > implementations are more stable.
> > > >
> > > > Any comment is appreciated.
> > > >
> > >
> >
> >
> > --
> > Renjie Liu
> > Software Engineer, MVAD
>


-- 
Renjie Liu
Software Engineer, MVAD


Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-09 Thread Jacques Nadeau
I think we need to be more direct in listing issues for the board.

What have we done? What do we want them to do?

In general, any large org is going to be slow to add new deep integrations
into GitHub. I don't think we should expect Apache to be any different (it
took several years before we could merge things through github for
example). If I were on the INFRA side, I think I would look and see how
many different people are asking for BuildKite before considering
integration. It seems like we only opened the JIRA 6 days ago and no other
projects have requested access to this?

I'm not clear why this is a board issue. What do we think the board can do
for us that we can't solve ourselves and need them to solve? Remember, a
board solution to a problem is typically very removed from what matters to
individuals on a project.






On Tue, Oct 8, 2019 at 7:03 AM Wes McKinney  wrote:

> New draft
>
> ## Description:
> The mission of Apache Arrow is the creation and maintenance of software
> related
> to columnar in-memory processing and data interchange
>
> ## Issues:
>
> * We are struggling with Continuous Integration scalability as the project
> has
>   definitely outgrown what Travis CI and Appveyor can do for us. Some
>   contributors have shown reluctance to submit patches they aren't sure
> about
>   because they don't want to pile on the build queue. We are exploring
>   alternative solutions such as Buildbot, Buildkite, and GitHub Actions to
>   provide a path to migrate away from Travis CI / Appveyor. In our request
> to
>   Infrastructure INFRA-19217, some of us were alarmed to find that a CI/CD
>   service like Buildkite may not be allowed to connect to the @apache GitHub
>   account, because it requires admin access to repository webhooks despite
>   needing no ability to modify source code. There are workarounds (building
>   custom OAuth bots) that could enable us to use Buildkite, but they would
>   require extra development and result in a less refined experience for
>   community members.
>
> ## Membership Data:
> * Apache Arrow was founded 2016-01-19 (4 years ago)
> * There are currently 48 committers and 28 PMC members in this project.
> * The Committer-to-PMC ratio is roughly 3:2.
>
> Community changes, past quarter:
> - Micah Kornfield was added to the PMC on 2019-08-21
> - Sebastien Binet was added to the PMC on 2019-08-21
> - Ben Kietzman was added as committer on 2019-09-07
> - David Li was added as committer on 2019-08-30
> - Kenta Murata was added as committer on 2019-09-05
> - Neal Richardson was added as committer on 2019-09-05
> - Praveen Kumar was added as committer on 2019-07-14
>
> ## Project Activity:
>
> * The project has just made a 0.15.0 release.
> * We are discussing ways to make the Arrow libraries as accessible as
> possible
>   to downstream projects for minimal use cases while allowing the
> development
>   of more comprehensive "standard libraries" with larger dependency stacks
> in
>   the project
> * We plan to make a 1.0.0 release as our next major release, at which time
> we
>   will declare that the Arrow binary protocol is stable with forward and
>   backward compatibility guarantees
>
> ## Community Health:
>
> * The community is overall healthy, with the aforementioned concerns
> around CI
>   scalability. New contributors frequently take notice of the long build
> queue
>   times when submitting pull requests.
>
> On Tue, Oct 8, 2019 at 8:58 AM Wes McKinney  wrote:
> >
> > Yes, I agree with raising the issue to the board.
> >
> > On Tue, Oct 8, 2019 at 8:31 AM Antoine Pitrou 
> wrote:
> > >
> > >
> > > I agree.  Especially given that the constraints imposed by Infra don't
> > > help solving the problem.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 08/10/2019 à 15:02, Uwe L. Korn a écrit :
> > > > I'm not sure what qualifies for "board attention" but it seems that
> CI is a critical problem in Apache projects, not just Arrow. Should we
> raise that?
> > > >
> > > > Uwe
> > > >
> > > > On Tue, Oct 8, 2019, at 12:00 AM, Wes McKinney wrote:
> > > >> Here is a start for our Q3 board report
> > > >>
> > > >> ## Description:
> > > >> The mission of Apache Arrow is the creation and maintenance of
> software related
> > > >> to columnar in-memory processing and data interchange
> > > >>
> > > >> ## Issues:
> > > >> There are no issues requiring board attention at this time
> > > >>
> > > >> ## Membership Data:
> > > >> * Apache Arrow was founded 2016-01-19 (4 years ago)
> > > >> * There are currently 48 committers and 28 PMC members in this
> project.
> > > >> * The Committer-to-PMC ratio is roughly 3:2.
> > > >>
> > > >> Community changes, past quarter:
> > > >> - Micah Kornfield was added to the PMC on 2019-08-21
> > > >> - Sebastien Binet was added to the PMC on 2019-08-21
> > > >> - Ben Kietzman was added as committer on 2019-09-07
> > > >> - David Li was added as committer on 2019-08-30
> > > >> - Kenta Murata was added as committer on 2019-09-05
> > 

Re: Looking ahead to 1.0

2019-10-09 Thread John Muehlhausen
ARROW-5916
ARROW-6836/6837

These are of particular interest to me because they enable recordbatch
"incrementalism" which is useful for streaming applications:

ARROW-5916 allows a recordbatch to pre-allocate space for future records
that have not yet been populated, making it safe for readers to consume the
partial batch.

ARROW-6836/6837 allows a file of record batches to be extended at the end,
without re-writing the beginning, while including the idea that the
custom_metadata may change with each update.  (custom_metadata in the
Schema is not a good candidate because Schema also appears at the beginning
of the file.)

While these are not blockers for me quite yet, they soon will be!  If I
wanted to ensure that these are in 1.0, what is my deadline for
implementation and test cases?  Can such a note be made on the wiki?
Should I change the priority in Jira?

Thanks,
John

On Wed, Oct 9, 2019 at 2:57 PM Neal Richardson 
wrote:

> Congratulations everyone on 0.15! I know a lot of hard work went into
> it, not only in the software itself but also in the build and release
> process.
>
> Once you've caught your breath from the release, we should start
> thinking about what's in scope for our next release, the big 1.0. To
> get us started (or restarted, since we did discuss 1.0 before the
> flatbuffer alignment issue came up), I've created
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
> based on our past release wiki pages.
>
> A good place to begin would be to list, either in "blocker" Jiras or
> bullet points on the document, the key features and tasks we must
> resolve before 1.0. For example, I get the sense that we need to
> overhaul the documentation, but that should be expressed in a more
> concrete, actionable way.
>
> Neal
>


[jira] [Created] (ARROW-6843) [Website] Disable deploy on pull request

2019-10-09 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-6843:
---

 Summary: [Website] Disable deploy on pull request
 Key: ARROW-6843
 URL: https://issues.apache.org/jira/browse/ARROW-6843
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Can't find myself in contributor list

2019-10-09 Thread Hengruo Zhang
Got it. 6408 was reverted. That makes sense.

On Wed, Oct 9, 2019 at 3:19 PM Wes McKinney  wrote:

> I'm seeing
>
> $ git hist | grep Hengruo
> * f9cd2958a 2019-10-09 | ARROW-6274: [Rust] [DataFusion] Add support
> for writing results to CSV [Hengruo Zhang]
> * 3145e9bef 2019-09-08 | ARROW-6408: [Rust] use "if cfg!" pattern
> [Hengruo Zhang]
>
> So there's only 1 commit in the last 1 month. This doesn't appear to
> be enough to be guaranteed to show up in the Pulse view (1 week or 1
> month views)
>
> On Wed, Oct 9, 2019 at 4:53 PM paddy horan  wrote:
> >
> > It might also be due to our merge tool.  PRs are merged locally and
> pushed to master (with the corresponding PR on github being “closed” rather
> than “merged”).  This might not be reflected in the pulse view.
> >
> > P
> >
> > 
> > From: Wes McKinney 
> > Sent: Wednesday, October 9, 2019 4:06:59 PM
> > To: dev 
> > Subject: Re: Can't find myself in contributor list
> >
> > GitHub only shows the top 100 contributors to the project in
> >
> > https://github.com/apache/arrow/graphs/contributors
> >
> > Similarly I think you need more commits to show up in the Pulse view
> >
> > On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang  wrote:
> > >
> > > Hi,
> > >
> > > My two PRs have been already merged to the master branch, but I cannot
> > > see me in the contributor list of GitHub, even if I narrowed down the
> > > time span so that there are only less than 50 people. And I can't even
> > > find my merging in https://github.com/apache/arrow/pulse .
> > >
> > > Could you please provide some possible reasons for this?
> > >
> > > PRs:
> > > https://github.com/apache/arrow/pull/5577
> > > https://github.com/apache/arrow/pull/5303
> > >
> > > Thanks,
> > > Hengruo
>


[jira] [Created] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6842:
---

 Summary: [Website] Jekyll error building website
 Key: ARROW-6842
 URL: https://issues.apache.org/jira/browse/ARROW-6842
 Project: Apache Arrow
  Issue Type: Bug
  Components: Website
Reporter: Wes McKinney
 Fix For: 1.0.0


I'm getting the following error locally on a fresh checkout and {{bundle 
install --path vendor/bundle}}

{code}
$ bundle exec jekyll serve
Configuration file: /home/wesm/code/arrow-site/_config.yml
Source: /home/wesm/code/arrow-site
   Destination: build
 Incremental build: disabled. Enable with --incremental
  Generating... 
jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
{code}

Never seen this so not sure how to debug



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Can't find myself in contributor list

2019-10-09 Thread Wes McKinney
I'm seeing

$ git hist | grep Hengruo
* f9cd2958a 2019-10-09 | ARROW-6274: [Rust] [DataFusion] Add support
for writing results to CSV [Hengruo Zhang]
* 3145e9bef 2019-09-08 | ARROW-6408: [Rust] use "if cfg!" pattern
[Hengruo Zhang]

So there's only 1 commit in the last 1 month. This doesn't appear to
be enough to be guaranteed to show up in the Pulse view (1 week or 1
month views)

On Wed, Oct 9, 2019 at 4:53 PM paddy horan  wrote:
>
> It might also be due to our merge tool.  PRs are merged locally and pushed to 
> master (with the corresponding PR on github being “closed” rather than 
> “merged”).  This might not be reflected in the pulse view.
>
> P
>
> 
> From: Wes McKinney 
> Sent: Wednesday, October 9, 2019 4:06:59 PM
> To: dev 
> Subject: Re: Can't find myself in contributor list
>
> GitHub only shows the top 100 contributors to the project in
>
> https://github.com/apache/arrow/graphs/contributors
>
> Similarly I think you need more commits to show up in the Pulse view
>
> On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang  wrote:
> >
> > Hi,
> >
> > My two PRs have been already merged to the master branch, but I cannot
> > see me in the contributor list of GitHub, even if I narrowed down the
> > time span so that there are only less than 50 people. And I can't even
> > find my merging in https://github.com/apache/arrow/pulse .
> >
> > Could you please provide some possible reasons for this?
> >
> > PRs:
> > https://github.com/apache/arrow/pull/5577
> > https://github.com/apache/arrow/pull/5303
> >
> > Thanks,
> > Hengruo


Re: Can't find myself in contributor list

2019-10-09 Thread paddy horan
It might also be due to our merge tool.  PRs are merged locally and pushed to 
master (with the corresponding PR on github being “closed” rather than 
“merged”).  This might not be reflected in the pulse view.

P


From: Wes McKinney 
Sent: Wednesday, October 9, 2019 4:06:59 PM
To: dev 
Subject: Re: Can't find myself in contributor list

GitHub only shows the top 100 contributors to the project in

https://github.com/apache/arrow/graphs/contributors

Similarly I think you need more commits to show up in the Pulse view

On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang  wrote:
>
> Hi,
>
> My two PRs have been already merged to the master branch, but I cannot
> see me in the contributor list of GitHub, even if I narrowed down the
> time span so that there are only less than 50 people. And I can't even
> find my merging in https://github.com/apache/arrow/pulse .
>
> Could you please provide some possible reasons for this?
>
> PRs:
> https://github.com/apache/arrow/pull/5577
> https://github.com/apache/arrow/pull/5303
>
> Thanks,
> Hengruo


[jira] [Created] (ARROW-6841) [C++] Upgrade to LLVM 8

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6841:
---

 Summary: [C++] Upgrade to LLVM 8
 Key: ARROW-6841
 URL: https://issues.apache.org/jira/browse/ARROW-6841
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Now that LLVM 9 has been released, LLVM 8 has been promoted to stable according 
to 

http://apt.llvm.org/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6840) [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6840:
---

 Summary: [C++/Python] retrieve fd of open memory mapped file and 
Open() memory mapped file by fd
 Key: ARROW-6840
 URL: https://issues.apache.org/jira/browse/ARROW-6840
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: John Muehlhausen


We want to retrieve the file descriptor of a memory mapped file for the purpose 
of transferring it across process boundaries.  On the receiving end, we want to 
be able to map a file based on the file descriptor rather than the path.

This helps with race conditions when the path may have been unlinked.


cf 
[https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6839) [Java] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6839:
---

 Summary: [Java] access File Footer custom_metadata
 Key: ARROW-6839
 URL: https://issues.apache.org/jira/browse/ARROW-6839
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6838) [JS] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6838:
---

 Summary: [JS] access File Footer custom_metadata
 Key: ARROW-6838
 URL: https://issues.apache.org/jira/browse/ARROW-6838
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6837:
---

 Summary: [C++/Python] access File Footer custom_metadata
 Key: ARROW-6837
 URL: https://issues.apache.org/jira/browse/ARROW-6837
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6836) [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6836:
---

 Summary: [Format] add a custom_metadata:[KeyValue] field to the 
Footer table in File.fbs
 Key: ARROW-6836
 URL: https://issues.apache.org/jira/browse/ARROW-6836
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Format
Reporter: John Muehlhausen


add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

Use case:

If a file is expanded with additional record batches and the custom_metadata 
changes, Schema is no longer an appropriate place to record this change, since 
the two copies of Schema (at the beginning and end of the file) would then be 
ambiguous.

cf 
https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6835:
-

 Summary: [Archery][CMake] Restore ARROW_LINT_ONLY  
 Key: ARROW-6835
 URL: https://issues.apache.org/jira/browse/ARROW-6835
 Project: Apache Arrow
  Issue Type: Bug
  Components: Archery
Reporter: Francois Saint-Jacques


This is used by developers to speed up CMake build generation and to relax the 
required installed toolchains (notably libraries). It was yanked because 
ARROW_LINT_ONLY effectively exits early and doesn't generate 
`compile_commands.json`.

Restore this option, but ensure that archery toggles it according to whether 
iwyu or clang-tidy is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Can't find myself in contributor list

2019-10-09 Thread Wes McKinney
GitHub only shows the top 100 contributors to the project in

https://github.com/apache/arrow/graphs/contributors

Similarly I think you need more commits to show up in the Pulse view

On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang  wrote:
>
> Hi,
>
> My two PRs have been already merged to the master branch, but I cannot
> see me in the contributor list of GitHub, even if I narrowed down the
> time span so that there are only less than 50 people. And I can't even
> find my merging in https://github.com/apache/arrow/pulse .
>
> Could you please provide some possible reasons for this?
>
> PRs:
> https://github.com/apache/arrow/pull/5577
> https://github.com/apache/arrow/pull/5303
>
> Thanks,
> Hengruo


Can't find myself in contributor list

2019-10-09 Thread Hengruo Zhang
Hi,

My two PRs have already been merged to the master branch, but I cannot
see myself in the GitHub contributor list, even when I narrow the time
span so that fewer than 50 people are shown. I also can't find my
merges in https://github.com/apache/arrow/pulse .

Could you please provide some possible reasons for this?

PRs:
https://github.com/apache/arrow/pull/5577
https://github.com/apache/arrow/pull/5303

Thanks,
Hengruo


[jira] [Created] (ARROW-6834) [C++] Appveyor build failing on master

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6834:
---

 Summary: [C++] Appveyor build failing on master
 Key: ARROW-6834
 URL: https://issues.apache.org/jira/browse/ARROW-6834
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Not sure what introduced this

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27992011/job/cj247lfl0s48xrsl

{code}
LINK: command "C:\PROGRA~2\MI0E91~1.0\VC\bin\amd64\link.exe /nologo 
src\arrow\CMakeFiles\arrow-public-api-test.dir\public_api_test.cc.obj 
/out:release\arrow-public-api-test.exe 
/implib:release\arrow-public-api-test.lib 
/pdb:release\arrow-public-api-test.pdb /version:0.0 /machine:x64 
/NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
release\arrow_testing.lib release\arrow.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libssl.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\brotlienc-static.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\brotlidec-static.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\brotlicommon-static.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-config.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-transfer.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-s3.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-core.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_filesystem.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_system.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-event-stream.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-common.lib BCrypt.lib 
Kernel32.lib Ws2_32.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\aws-checksums.lib 
mimalloc_ep\src\mimalloc_ep\lib\mimalloc-1.0\mimalloc-static-release.lib 
Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib 
oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
/MANIFESTFILE:release\arrow-public-api-test.exe.manifest" failed (exit code 
1120) with the following output:
public_api_test.cc.obj : error LNK2019: unresolved external symbol 
"__declspec(dllimport) public: static void __cdecl 
testing::Test::SetUpTestSuite(void)" 
(__imp_?SetUpTestSuite@Test@testing@@SAXXZ) referenced in function "public: 
static void (__cdecl*__cdecl testing::internal::SuiteApiResolver::GetSetUpCaseOrSuite(char const *,int))(void)" 
(?GetSetUpCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
public_api_test.cc.obj : error LNK2019: unresolved external symbol 
"__declspec(dllimport) public: static void __cdecl 
testing::Test::TearDownTestSuite(void)" 
(__imp_?TearDownTestSuite@Test@testing@@SAXXZ) referenced in function "public: 
static void (__cdecl*__cdecl testing::internal::SuiteApiResolver::GetTearDownCaseOrSuite(char const *,int))(void)" 
(?GetTearDownCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
release\arrow-public-api-test.exe : fatal error LNK1120: 2 unresolved externals
[205/515] Building CXX object 
src\arrow\CMakeFiles\arrow-array-test.dir\array_test.cc.obj
[206/515] Building CXX object 
src\arrow\CMakeFiles\arrow-array-test.dir\array_dict_test.cc.obj
ninja: build stopped: subcommand failed.
(arrow) C:\projects\arrow\cpp\build>goto scriptexit 
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Question about timestamps ...

2019-10-09 Thread David Boles
The following code dies with pyarrow 0.14.2:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([('timestamp', pa.timestamp('ns', tz='UTC')),])
writer = pq.ParquetWriter('foo.parquet', schema, coerce_timestamps='ns')

ts_array = pa.array([ int(1234567893141) ], type=pa.timestamp('ns',
tz='UTC'))
table = pa.Table.from_arrays([ ts_array ], names=['timestamp'])

writer.write_table(table)
writer.close()

with the message:

ValueError: Invalid value for coerce_timestamps: ns

That appears to be because of this code in _parquet.pxi:

    cdef int _set_coerce_timestamps(
            self, ArrowWriterProperties.Builder* props) except -1:
        if self.coerce_timestamps == 'ms':
            props.coerce_timestamps(TimeUnit_MILLI)
        elif self.coerce_timestamps == 'us':
            props.coerce_timestamps(TimeUnit_MICRO)
        elif self.coerce_timestamps is not None:
            raise ValueError('Invalid value for coerce_timestamps: {0}'
                             .format(self.coerce_timestamps))

which restricts the choice to 'ms' or 'us', even though AFAICT everywhere
else also allows 'ns' (and there is a TimeUnit_NANO defined). Is this
intentional, or a bug?
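For reference, the check amounts to a simple unit-string dispatch. Re-expressed in plain Python (illustrative only, not the actual Cython source), supporting 'ns' would seemingly just mean adding one entry to the mapping:

```python
# Plain-Python restatement of the dispatch in _parquet.pxi (illustrative;
# the real code is Cython). The strings on the right stand in for the
# C++ time-unit enum values.
_COERCE_UNITS = {
    'ms': 'TimeUnit_MILLI',
    'us': 'TimeUnit_MICRO',
    # Adding 'ns': 'TimeUnit_NANO' here is what the question is about.
}

def resolve_coerce_timestamps(coerce_timestamps):
    if coerce_timestamps is None:
        return None  # no coercion requested
    if coerce_timestamps not in _COERCE_UNITS:
        # This is the branch that rejects 'ns' today.
        raise ValueError('Invalid value for coerce_timestamps: {0}'
                         .format(coerce_timestamps))
    return _COERCE_UNITS[coerce_timestamps]
```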

Thanks,

 - db


Looking ahead to 1.0

2019-10-09 Thread Neal Richardson
Congratulations everyone on 0.15! I know a lot of hard work went into
it, not only in the software itself but also in the build and release
process.

Once you've caught your breath from the release, we should start
thinking about what's in scope for our next release, the big 1.0. To
get us started (or restarted, since we did discuss 1.0 before the
flatbuffer alignment issue came up), I've created
https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
based on our past release wiki pages.

A good place to begin would be to list, either in "blocker" Jiras or
bullet points on the document, the key features and tasks we must
resolve before 1.0. For example, I get the sense that we need to
overhaul the documentation, but that should be expressed in a more
concrete, actionable way.

Neal


[jira] [Created] (ARROW-6833) [R][CI] Add crossbow job for full R autobrew macOS build

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6833:
--

 Summary: [R][CI] Add crossbow job for full R autobrew macOS build
 Key: ARROW-6833
 URL: https://issues.apache.org/jira/browse/ARROW-6833
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson


I have a separate nightly job that runs this on multiple R versions, but it 
would be nice to be able to have crossbow check this on a PR. As it turns out, 
the ARROW_S3 feature doesn't work with autobrew in practice--aws-sdk-cpp 
doesn't seem to ship static libs via Homebrew, so the autobrew packaging 
doesn't work, even though the formula builds and {{brew audit}} is clean.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6832) [R] Implement Codec::IsAvailable

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6832:
--

 Summary: [R] Implement Codec::IsAvailable
 Key: ARROW-6832
 URL: https://issues.apache.org/jira/browse/ARROW-6832
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


New in ARROW-6631



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6831) [R] Update R macOS/Windows builds for change in cmake compression defaults

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6831:
--

 Summary: [R] Update R macOS/Windows builds for change in cmake 
compression defaults
 Key: ARROW-6831
 URL: https://issues.apache.org/jira/browse/ARROW-6831
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


ARROW-6631 changed the defaults for including compressions but did not update 
these build scripts. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-09-0

2019-10-09 Thread Neal Richardson
FWIW there appears to have been a recent update to grpc on Homebrew
involving protobuf:
https://github.com/Homebrew/homebrew-core/commits/master/Formula/grpc.rb

Last time we had a Homebrew grpc issue, I made this at Kou's
suggestion: https://github.com/Homebrew/homebrew-core/pull/44198

I think it's fair to report an issue there and show some log output
that we think is unexpected and see what they say. Maybe they can
rebuild the bottles again and that will magically fix it, like last
time.

Neal

On Wed, Oct 9, 2019 at 7:16 AM Wes McKinney  wrote:
>
> It looks like protobuf and other gRPC dependencies are being built
> from source when doing `brew install grpc`. This is probably an issue
> with the Homebrew stack, do we know how to address this situation now
> and in the future (probably requires asking the Homebrew community
> about grpc "bottles")?
>
> On Wed, Oct 9, 2019 at 7:26 AM Crossbow  wrote:
> >
> >
> > Arrow Build Report for Job nightly-2019-10-09-0
> >
> > All tasks: 
Re: [DISCUSS] Proposal about integration test of arrow parquet reader

2019-10-09 Thread Wes McKinney
There are a number of issues worth discussing here.

1. What is the timeline/plan for Rust implementing a Parquet _writer_?
It's OK to rely on other libraries in the short term to produce files
to test against, but that does not strike me as a sustainable
long-term plan. Fixing bugs can be much more difficult than it needs
to be if you can't write targeted "endogenous" unit tests.

2. Reproducible data generation

I think if you're going to test against a pre-generated corpus, you
should make sure that generating the corpus is reproducible for other
developers (e.g. with a Dockerfile), and that it can be extended by
adding new files or random data generation.

I would additionally prefer generating the test corpus at test time
rather than checking in binary files. If this isn't viable right now
we can create an "arrow-rust-crutch" git repository for you to stash
binary files until some of these testing scalability issues are
addressed.

If we're going to spend energy on Parquet integration testing with
Java, this would be a good opportunity to do the work in a way where
the C++ Parquet library can also participate (since we ought to be
doing integration tests with Java, and we can also read JSON files to
Arrow).

On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu  wrote:
>
> On Wed, Oct 9, 2019 at 12:11 PM Andy Grove  wrote:
>
> > I'm very interested in helping to find a solution to this because we really
> > do need integration tests for Rust to make sure we're compatible with other
> > implementations... there is also the ongoing CI dockerization work that I
> > feel is related.
> >
> > I haven't looked at the current integration tests yet and would appreciate
> > some pointers on how all of this works (do we have docs?) or where to start
> > looking.
> >
> I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> And here is the generated data:
> https://github.com/apache/arrow-testing/pull/11
> As for the program used to generate the data, it's just a simple Java
> program. I'm not sure whether we need to integrate it into Arrow.
>
> >
> > I imagine the integration test could follow the approach that Renjie is
> > outlining where we call Java to generate some files and then call Rust to
> > parse them?
> >
> > Thanks,
> >
> > Andy.
> >
> >
> > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu  wrote:
> >
> > > Hi:
> > >
> > > I'm developing rust version of reader which reads parquet into arrow
> > array.
> > > To verify the correctness of this reader, I use the following approach:
> > >
> > >
> > >1. Define schema with protobuf.
> > >2. Generate JSON data of this schema using another language with a more
> > >sophisticated implementation (e.g. Java).
> > >3. Generate Parquet data of this schema using another language with a more
> > >sophisticated implementation (e.g. Java).
> > >4. Write tests to read json file, and parquet file into memory (arrow
> > >array), then compare json data with arrow data.
> > >
> > >  I think with this method we can guarantee the correctness of the Arrow
> > > reader, because the JSON format is ubiquitous and its implementations
> > > are more stable.
> > >
> > > Any comment is appreciated.
> > >
> >
>
>
> --
> Renjie Liu
> Software Engineer, MVAD


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-09-0

2019-10-09 Thread Wes McKinney
It looks like protobuf and the other gRPC dependencies are being built
from source when doing `brew install grpc`. This is probably an issue
with the Homebrew stack; do we know how to address this situation now
and in the future (it probably requires asking the Homebrew community
about grpc "bottles")?

On Wed, Oct 9, 2019 at 7:26 AM Crossbow  wrote:
>
>
> Arrow Build Report for Job nightly-2019-10-09-0
>
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0
>
> Failed Tasks:
> - gandiva-jar-trusty:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-trusty
> - docker-clang-format:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-clang-format
> - wheel-osx-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp36m
> - wheel-osx-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp35m
> - wheel-osx-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp37m
> - wheel-manylinux2010-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp35m
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-osx
> - wheel-osx-cp27m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp27m
>
> Succeeded Tasks: [list trimmed; see the full report below]

[jira] [Created] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-6830:


 Summary: Question / Feature Request- Select Subset of Columns in 
read_arrow
 Key: ARROW-6830
 URL: https://issues.apache.org/jira/browse/ARROW-6830
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, R
Reporter: Anthony Abate


*Note:* Not sure if this is a limitation of the R library or the underlying
C++ code.

I have a ~30 GB Arrow file with almost 1000 columns; it has 12,000 record
batches of varying row sizes.

1. Is it possible to use *read_arrow* to select a subset of columns (similar
to how *read_feather* has a col_select = ... argument)?

2. Or is it possible to filter columns using *RecordBatchFileReader*?

The only thing I seem to be able to do (please confirm if this is my only
option) is loop over all record batches, select a single column at a time, and
construct the data I need manually, i.e. something like the following:

data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

for (i in seq_len(data_rbfr$num_record_batches)) {
  batch <- data_rbfr$get_batch(i - 1)  # get_batch() is zero-indexed
  col4 <- batch$column(4)
  col5 <- batch$column(7)
}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6829) [Docs] Migrate integration test docs to Sphinx, fix instructions after ARROW-6466

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6829:
---

 Summary: [Docs] Migrate integration test docs to Sphinx, fix 
instructions after ARROW-6466
 Key: ARROW-6829
 URL: https://issues.apache.org/jira/browse/ARROW-6829
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Wes McKinney
 Fix For: 1.0.0


Follow up to ARROW-6466.

Also, the README uses out-of-date archery flags:

https://github.com/apache/arrow/blob/master/integration/README.md





[jira] [Created] (ARROW-6827) [Archery] lint sub-command should provide a --fail-fast option

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6827:
-

 Summary: [Archery] lint sub-command should provide a --fail-fast 
option
 Key: ARROW-6827
 URL: https://issues.apache.org/jira/browse/ARROW-6827
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Archery
Reporter: Francois Saint-Jacques








[jira] [Created] (ARROW-6828) [Archery] Benchmark diff should provide a TUI friendly output

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6828:
-

 Summary: [Archery] Benchmark diff should provide a TUI friendly 
output
 Key: ARROW-6828
 URL: https://issues.apache.org/jira/browse/ARROW-6828
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Archery
Reporter: Francois Saint-Jacques








[NIGHTLY] Arrow Build Report for Job nightly-2019-10-09-0

2019-10-09 Thread Crossbow


Arrow Build Report for Job nightly-2019-10-09-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0

Failed Tasks:
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-trusty
- docker-clang-format:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-clang-format
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp36m
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp35m
- wheel-osx-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp37m
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp35m
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-osx
- wheel-osx-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp27m

Succeeded Tasks:
- homebrew-cpp-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-homebrew-cpp-autobrew
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux1-cp27mu
- docker-hdfs-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-hdfs-integration
- docker-lint:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-lint
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-pandas-master
- docker-cpp-static-only:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-static-only
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp27mu
- docker-cpp-cmake32:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-cmake32
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-centos-7
- docker-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-release
- docker-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-python-2.7
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-r
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-spark-integration
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-debian-stretch
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-conda-osx-clang-py36
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-ubuntu-xenial
- ubuntu-disco:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-ubuntu-disco
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux1-cp27m
- docker-iwyu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-iwyu
- docker-js:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-js
- docker-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-python-3.7
- docker-go:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-go
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-homebrew-cpp
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-conda-linux-gcc-py37
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux1-cp36m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp37m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-appveyor-wheel-win-cp37m
- docker-python-3.6-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?que

[jira] [Created] (ARROW-6826) [Archery] Default build should be minimal

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6826:
-

 Summary: [Archery] Default build should be minimal
 Key: ARROW-6826
 URL: https://issues.apache.org/jira/browse/ARROW-6826
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Archery
Reporter: Francois Saint-Jacques


Follow-up of https://github.com/apache/arrow/pull/5600/files#r332655141





[jira] [Created] (ARROW-6825) [C++] Rework CSV reader IO around readahead iterator

2019-10-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6825:
-

 Summary: [C++] Rework CSV reader IO around readahead iterator
 Key: ARROW-6825
 URL: https://issues.apache.org/jira/browse/ARROW-6825
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Following ARROW-6764, we should try to remove the custom ReadaheadSpooler and 
use the generic readahead iteration facility instead. This will require 
reworking the blocking / chunking logic to mimic what is done in the JSON 
reader.
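The generic readahead-iterator idea can be illustrated with a small Python stand-in (not Arrow's actual C++ facility; error propagation from the producer thread is omitted for brevity):

```python
import queue
import threading

_SENTINEL = object()

def readahead(iterable, capacity=4):
    """Yield items from `iterable`, producing up to `capacity` items
    ahead of the consumer on a background thread."""
    q = queue.Queue(maxsize=capacity)

    def producer():
        for item in iterable:
            q.put(item)  # blocks once `capacity` items are queued
        q.put(_SENTINEL)  # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

# Example: stream fixed-size "blocks" through the readahead wrapper,
# much as a CSV reader would stream file chunks ahead of parsing.
blocks = (bytes(16) for _ in range(8))
total = sum(len(b) for b in readahead(blocks, capacity=2))
```

The bounded queue is what gives readahead its backpressure: the producer runs ahead of the consumer, but only by `capacity` blocks.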





[jira] [Created] (ARROW-6824) [Plasma] Support batched create and seal requests for small objects

2019-10-09 Thread Philipp Moritz (Jira)
Philipp Moritz created ARROW-6824:
-

 Summary: [Plasma] Support batched create and seal requests for 
small objects
 Key: ARROW-6824
 URL: https://issues.apache.org/jira/browse/ARROW-6824
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Affects Versions: 0.15.0
Reporter: Philipp Moritz


Currently the Plasma create API supports creating and sealing a single 
object. This makes sense for large objects, because their creation throughput 
is limited by the memory throughput of the client as the data is filled into 
the buffer. However, sometimes we want to create lots of small objects, in 
which case throughput is limited by the number of IPCs to the store that we 
can perform while creating new objects. This can be addressed by offering a 
version of CreateAndSeal that allows us to create multiple objects at the 
same time.
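A back-of-the-envelope cost model shows why batching helps here (the numbers below are illustrative assumptions, not Plasma measurements, and the function names are made up for this sketch):

```python
import math

IPC_OVERHEAD_US = 50.0    # hypothetical fixed cost per IPC round-trip
COPY_US_PER_OBJECT = 1.0  # hypothetical per-object payload cost

def create_one_per_ipc(n_objects):
    # One create-and-seal IPC per object: for small objects the
    # fixed round-trip overhead dominates the total cost.
    return n_objects * (IPC_OVERHEAD_US + COPY_US_PER_OBJECT)

def create_batched(n_objects, batch_size):
    # One IPC per batch: the fixed overhead is amortized over the batch.
    n_batches = math.ceil(n_objects / batch_size)
    return n_batches * IPC_OVERHEAD_US + n_objects * COPY_US_PER_OBJECT

unbatched_us = create_one_per_ipc(10_000)
batched_us = create_batched(10_000, batch_size=256)
```

Under these assumptions, the per-object copy cost is unchanged but the round-trip overhead drops by roughly the batch size, which is the effect a batched CreateAndSeal is after.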


