Re: [DRAFT] Apache Arrow Board Report - October 2019
On Thu, Oct 10, 2019 at 12:22 AM Jacques Nadeau wrote:

> I'm not dismissing that there are issues, but I also don't feel like there
> has been constant discussion for months on the list that INFRA is not being
> responsive to Arrow community requests. It seems like you might be saying
> one of two things (or both?):
>
> 1) The Arrow infrastructure requirements are vastly different than other
> projects. Because of Arrow's specialized requirements, we need things that
> no other project needs.
> 2) There are many projects that want CircleCI, Buildkite, and Azure
> Pipelines but Infrastructure is not responsive. This is putting a big
> damper on the success of the Arrow project.

Yes, I'm saying both of these things.

1. Yes, Arrow is special -- validating the project requires running a dozen or more different builds (with dozens more nightly builds) that test different parts of the project: different language components, a large and diverse packaging matrix, and integration tests both across Arrow implementations and with external projects (e.g. Apache Spark and others).

2. Yes, the limited GitHub App availability is hurting us. I'm OK with placing this concern in the "Community Health" section and spending more time building a comprehensive case about how Infra's conservatism around Apps is causing us to work with one hand tied behind our back. I know that I'm not the only one who is unhappy, but I'll let the others speak for themselves.

> For each of these, if we're asking the board to do something, we should say
> more and more clearly. Sure, CI is a pain in the Arrow project's a**. I
> also agree that community health is impacted by the challenge to merge
> things. I also share the perspective that the foundation has been slow to
> adopt new technologies and has been way too religious about svn. However, if
> we're asking the board to do something, what is it?
Allow GitHub Apps that do not require write access to the code itself, and set up appropriate checks and balances to ensure that the Foundation's IP provenance webhooks are preserved.

> Looking at the two things you might be saying...
> If 1, are we confident in that? Many other projects have pretty complex
> build matrices I think. (I haven't thought about this and evaluated the
> other projects... maybe it is true.) If 1, we should clarify why we think
> we're different. If that is the case, what are we asking for from the board?
>
> If 2, and you are proposing throwing stones at INFRA, we should back it up
> with INFRA tickets and numbers (e.g. how many projects have wanted these
> things and for how long). We should reference multiple threads on the INFRA
> mailing list where we voiced certain concerns and many other people voiced
> similar concerns and INFRA turned a deaf ear or blind eye (maybe these
> exist, I haven't spent much time on the INFRA list lately). As it stands,
> the one ticket referenced in this thread is a ticket that has only one
> project asking for a new integration and that has been open for less than a
> week. That may be annoying but it doesn't seem like something that has
> gotten to the level that we need to get the board's help.
>
> In a nutshell, I agree that this is impacting the health and growth of the
> project but think we should cover that in the community health section of
> the report. I'm less a fan of saying this is an issue the board needs to
> help us solve unless it has been a constant point of pain that we've
> attempted to elevate multiple times in infra forums and experienced
> unreasonable responses. The board is a blunt instrument and should only be
> used when we have depleted every other avenue for resolution.

Yes, I'm happy to spend more time building a comprehensive case before escalating it to the board level.
However, Apache Arrow is a high-profile project and it is not a good look to have a PMC in a fast-growing project growing disgruntled with the Foundation's policies in this way. We've been struggling visibly for a long time with our CI scalability, and I think we should have all the options on the table to utilize GitHub-integrated tools to help us find a way out of the mess that we are in.

> On Wed, Oct 9, 2019 at 9:44 PM Wes McKinney wrote:
>
> > hi Jacques,
> >
> > I think we need to share the concerns that many PMC members have over
> > the constraints that INFRA is placing on us. Can we rephrase the
> > concern in a way that is more helpful?
> >
> > Firstly, I respect and appreciate the ASF's desire to limit write
> > access to committers only from an IP provenance perspective. I
> > understand that GitHub webhooks are used to log actions taken in
> > repositories to secure IP provenance. I do not think a third party
> > application should be given the ability to commit or modify a
> > repository -- all write operations on the .git repository should be
> > initiated by committers.
> >
> > However, GitHub is the main platform for producing open source
> > software, and tools are being created to help produce open source more
> > efficiently.
Re: [DRAFT] Apache Arrow Board Report - October 2019
I'm not dismissing that there are issues, but I also don't feel like there has been constant discussion for months on the list that INFRA is not being responsive to Arrow community requests. It seems like you might be saying one of two things (or both?):

1) The Arrow infrastructure requirements are vastly different than other projects. Because of Arrow's specialized requirements, we need things that no other project needs.
2) There are many projects that want CircleCI, Buildkite, and Azure Pipelines but Infrastructure is not responsive. This is putting a big damper on the success of the Arrow project.

For each of these, if we're asking the board to do something, we should say more and more clearly. Sure, CI is a pain in the Arrow project's a**. I also agree that community health is impacted by the challenge to merge things. I also share the perspective that the foundation has been slow to adopt new technologies and has been way too religious about svn. However, if we're asking the board to do something, what is it?

Looking at the two things you might be saying...

If 1, are we confident in that? Many other projects have pretty complex build matrices I think. (I haven't thought about this and evaluated the other projects... maybe it is true.) If 1, we should clarify why we think we're different. If that is the case, what are we asking for from the board?

If 2, and you are proposing throwing stones at INFRA, we should back it up with INFRA tickets and numbers (e.g. how many projects have wanted these things and for how long). We should reference multiple threads on the INFRA mailing list where we voiced certain concerns and many other people voiced similar concerns and INFRA turned a deaf ear or blind eye (maybe these exist, I haven't spent much time on the INFRA list lately). As it stands, the one ticket referenced in this thread is a ticket that has only one project asking for a new integration and that has been open for less than a week.
That may be annoying but it doesn't seem like something that has gotten to the level that we need to get the board's help.

In a nutshell, I agree that this is impacting the health and growth of the project but think we should cover that in the community health section of the report. I'm less a fan of saying this is an issue the board needs to help us solve unless it has been a constant point of pain that we've attempted to elevate multiple times in infra forums and experienced unreasonable responses. The board is a blunt instrument and should only be used when we have depleted every other avenue for resolution.

On Wed, Oct 9, 2019 at 9:44 PM Wes McKinney wrote:

> hi Jacques,
>
> I think we need to share the concerns that many PMC members have over
> the constraints that INFRA is placing on us. Can we rephrase the
> concern in a way that is more helpful?
>
> Firstly, I respect and appreciate the ASF's desire to limit write
> access to committers only from an IP provenance perspective. I
> understand that GitHub webhooks are used to log actions taken in
> repositories to secure IP provenance. I do not think a third party
> application should be given the ability to commit or modify a
> repository -- all write operations on the .git repository should be
> initiated by committers.
>
> However, GitHub is the main platform for producing open source
> software, and tools are being created to help produce open source more
> efficiently. It is frustrating for us to not be able to take advantage
> of the tools that are available to everyone else on GitHub. I brought
> up the recent request about Buildkite as being representative of this
> (after learning that Google has been making a lot of use of it), but
> we have previously been denied use of CircleCI and Azure Pipelines
> since those services require even more permissions (AFAIK) than in the
> case of Buildkite.
> From our use in https://github.com/ursa-labs/crossbow, CircleCI and
> Azure seem to be a lot better than Travis CI and Appveyor.
>
> I think the ASF is going to face an existential crisis in the near
> future about whether it wants to live in 2020 or 2000. It feels like
> GitHub is treated somewhat as ersatz SVN "because people want to use
> git + GitHub instead of SVN".
>
> In the same way that the cloud revolutionized software startups,
> enabling small groups of developers to build large SaaS applications,
> the same kind of leverage is becoming available to open source
> developers to set up infrastructure to automate and scale open source
> projects. I think projects considering joining the Foundation are
> going to look at these issues around App usage and decide that they
> would rather be in control of their own infrastructure.
>
> I can set aside even more time and money from my non-profit
> organization's modest budget to do CI work for Apache Arrow. The
> amount that we have invested already is very large, and continues to
> grow. I'm raising these issues because as a Member of the Foundation
> I'm concerned that fast-growing projects like ours are not being
> adequately served by INFRA.
Re: [DISCUSS] Proposal about integration test of arrow parquet reader
It would be fine in that case.

Wes McKinney wrote on Thu, Oct 10, 2019 at 12:58 PM:

> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu wrote:
> >
> > 1. There already exists a low level parquet writer which can produce a
> > parquet file, so unit tests should be fine. But the writer from arrow to
> > parquet doesn't exist yet, and it may take some period of time to finish
> > it.
> > 2. In fact my data are randomly generated and it's definitely
> > reproducible. However, I don't think it would be a good idea to randomly
> > generate data every time we run CI because it would be difficult to
> > debug. For example, if PR a introduced a bug which is triggered in
> > another PR's build, it would be confusing for contributors.
>
> Presumably any random data generation would use a fixed seed precisely
> to be reproducible.
>
> > 3. I think it would be a good idea to spend effort on integration tests
> > with parquet because it's an important use case of arrow. Also a similar
> > approach could be extended to other languages and other file formats
> > (avro, orc).
> >
> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney wrote:
> >
> > > There are a number of issues worth discussion.
> > >
> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> > > It's OK to be reliant on other libraries in the short term to produce
> > > files to test against, but that does not strike me as a sustainable
> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
> > > to be if you can't write targeted "endogenous" unit tests.
> > >
> > > 2. Reproducible data generation
> > >
> > > I think if you're going to test against a pre-generated corpus, you
> > > should make sure that generating the corpus is reproducible for other
> > > developers (i.e. with a Dockerfile), and can be extended by adding new
> > > files or random data generation.
> > >
> > > I additionally would prefer generating the test corpus at test time
> > > rather than checking in binary files.
> > > If this isn't viable right now we can create an "arrow-rust-crutch"
> > > git repository for you to stash binary files until some of these
> > > testing scalability issues are addressed.
> > >
> > > If we're going to spend energy on Parquet integration testing with
> > > Java, this would be a good opportunity to do the work in a way where
> > > the C++ Parquet library can also participate (since we ought to be
> > > doing integration tests with Java, and we can also read JSON files to
> > > Arrow).
> > >
> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu wrote:
> > > >
> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove wrote:
> > > >
> > > > > I'm very interested in helping to find a solution to this because
> > > > > we really do need integration tests for Rust to make sure we're
> > > > > compatible with other implementations... there is also the ongoing
> > > > > CI dockerization work that I feel is related.
> > > > >
> > > > > I haven't looked at the current integration tests yet and would
> > > > > appreciate some pointers on how all of this works (do we have
> > > > > docs?) or where to start looking.
> > > >
> > > > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > > > And here is the generated data:
> > > > https://github.com/apache/arrow-testing/pull/11
> > > > As for the program to generate these data, it's just a simple java
> > > > program. I'm not sure whether we need to integrate it into arrow.
> > > >
> > > > > I imagine the integration test could follow the approach that
> > > > > Renjie is outlining where we call Java to generate some files and
> > > > > then call Rust to parse them?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu wrote:
> > > > >
> > > > > > Hi:
> > > > > >
> > > > > > I'm developing a rust version of the reader which reads parquet
> > > > > > into an arrow array. To verify the correctness of this reader, I
> > > > > > use the following approach:
> > > > > >
> > > > > > 1. Define schema with protobuf.
> > > > > > 2. Generate json data of this schema using another language with
> > > > > >    a more sophisticated implementation (e.g. java).
> > > > > > 3. Generate parquet data of this schema using another language
> > > > > >    with a more sophisticated implementation (e.g. java).
> > > > > > 4. Write tests to read the json file and parquet file into
> > > > > >    memory (arrow array), then compare the json data with the
> > > > > >    arrow data.
> > > > > >
> > > > > > I think with this method we can guarantee the correctness of the
> > > > > > arrow reader because the json format is ubiquitous and its
> > > > > > implementations are more stable.
> > > > > >
> > > > > > Any comment is appreciated.
> > > >
> > > > --
> > > > Renjie Liu
> > > > Software Engineer, MVAD
Re: [DISCUSS] Proposal about integration test of arrow parquet reader
On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu wrote:
>
> 1. There already exists a low level parquet writer which can produce a
> parquet file, so unit tests should be fine. But the writer from arrow to
> parquet doesn't exist yet, and it may take some period of time to finish
> it.
> 2. In fact my data are randomly generated and it's definitely
> reproducible. However, I don't think it would be a good idea to randomly
> generate data every time we run CI because it would be difficult to debug.
> For example, if PR a introduced a bug which is triggered in another PR's
> build, it would be confusing for contributors.

Presumably any random data generation would use a fixed seed precisely to be reproducible.

> 3. I think it would be a good idea to spend effort on integration tests
> with parquet because it's an important use case of arrow. Also a similar
> approach could be extended to other languages and other file formats
> (avro, orc).
>
> On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney wrote:
>
> > There are a number of issues worth discussion.
> >
> > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> > It's OK to be reliant on other libraries in the short term to produce
> > files to test against, but that does not strike me as a sustainable
> > long-term plan. Fixing bugs can be a lot more difficult than it needs
> > to be if you can't write targeted "endogenous" unit tests.
> >
> > 2. Reproducible data generation
> >
> > I think if you're going to test against a pre-generated corpus, you
> > should make sure that generating the corpus is reproducible for other
> > developers (i.e. with a Dockerfile), and can be extended by adding new
> > files or random data generation.
> >
> > I additionally would prefer generating the test corpus at test time
> > rather than checking in binary files. If this isn't viable right now
> > we can create an "arrow-rust-crutch" git repository for you to stash
> > binary files until some of these testing scalability issues are
> > addressed.
> > If we're going to spend energy on Parquet integration testing with
> > Java, this would be a good opportunity to do the work in a way where
> > the C++ Parquet library can also participate (since we ought to be
> > doing integration tests with Java, and we can also read JSON files to
> > Arrow).
> >
> > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu wrote:
> > >
> > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove wrote:
> > >
> > > > I'm very interested in helping to find a solution to this because
> > > > we really do need integration tests for Rust to make sure we're
> > > > compatible with other implementations... there is also the ongoing
> > > > CI dockerization work that I feel is related.
> > > >
> > > > I haven't looked at the current integration tests yet and would
> > > > appreciate some pointers on how all of this works (do we have
> > > > docs?) or where to start looking.
> > >
> > > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > > And here is the generated data:
> > > https://github.com/apache/arrow-testing/pull/11
> > > As for the program to generate these data, it's just a simple java
> > > program. I'm not sure whether we need to integrate it into arrow.
> > >
> > > > I imagine the integration test could follow the approach that
> > > > Renjie is outlining where we call Java to generate some files and
> > > > then call Rust to parse them?
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu wrote:
> > > >
> > > > > Hi:
> > > > >
> > > > > I'm developing a rust version of the reader which reads parquet
> > > > > into an arrow array. To verify the correctness of this reader, I
> > > > > use the following approach:
> > > > >
> > > > > 1. Define schema with protobuf.
> > > > > 2. Generate json data of this schema using another language with
> > > > >    a more sophisticated implementation (e.g. java).
> > > > > 3. Generate parquet data of this schema using another language
> > > > >    with a more sophisticated implementation (e.g. java).
> > > > > 4. Write tests to read the json file and parquet file into memory
> > > > >    (arrow array), then compare the json data with the arrow data.
> > > > >
> > > > > I think with this method we can guarantee the correctness of the
> > > > > arrow reader because the json format is ubiquitous and its
> > > > > implementations are more stable.
> > > > >
> > > > > Any comment is appreciated.
> > >
> > > --
> > > Renjie Liu
> > > Software Engineer, MVAD
>
> --
> Renjie Liu
> Software Engineer, MVAD
Re: Looking ahead to 1.0
Hi John,

Since the 1.0.0 release is focused on Format stability, probably the only real "blockers" will be ensuring that we have hardened multiple implementations (in particular C++ and Java) of the columnar format as specified, with integration tests to prove it. The issues you listed sound more like C++ library changes to me. If you want to propose Format-related changes, that would need to happen right away; otherwise the ship will sail on that.

- Wes

On Wed, Oct 9, 2019 at 9:08 PM John Muehlhausen wrote:
>
> ARROW-5916
> ARROW-6836/6837
>
> These are of particular interest to me because they enable recordbatch
> "incrementalism" which is useful for streaming applications:
>
> ARROW-5916 allows a recordbatch to pre-allocate space for future records
> that have not yet been populated, making it safe for readers to consume
> the partial batch.
>
> ARROW-6836/6837 allows a file of record batches to be extended at the
> end, without re-writing the beginning, while including the idea that the
> custom_metadata may change with each update. (custom_metadata in the
> Schema is not a good candidate because Schema also appears at the
> beginning of the file.)
>
> While these are not blockers for me quite yet, they soon will be! If I
> wanted to ensure that these are in 1.0, what is my deadline for
> implementation and test cases? Can such a note be made on the wiki?
> Should I change the priority in Jira?
>
> Thanks,
> John
>
> On Wed, Oct 9, 2019 at 2:57 PM Neal Richardson wrote:
> >
> > Congratulations everyone on 0.15! I know a lot of hard work went into
> > it, not only in the software itself but also in the build and release
> > process.
> >
> > Once you've caught your breath from the release, we should start
> > thinking about what's in scope for our next release, the big 1.0.
> > To get us started (or restarted, since we did discuss 1.0 before the
> > flatbuffer alignment issue came up), I've created
> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
> > based on our past release wiki pages.
> >
> > A good place to begin would be to list, either in "blocker" Jiras or
> > bullet points on the document, the key features and tasks we must
> > resolve before 1.0. For example, I get the sense that we need to
> > overhaul the documentation, but that should be expressed in a more
> > concrete, actionable way.
> >
> > Neal
Re: [DRAFT] Apache Arrow Board Report - October 2019
hi Jacques,

I think we need to share the concerns that many PMC members have over the constraints that INFRA is placing on us. Can we rephrase the concern in a way that is more helpful?

Firstly, I respect and appreciate the ASF's desire to limit write access to committers only from an IP provenance perspective. I understand that GitHub webhooks are used to log actions taken in repositories to secure IP provenance. I do not think a third party application should be given the ability to commit or modify a repository -- all write operations on the .git repository should be initiated by committers.

However, GitHub is the main platform for producing open source software, and tools are being created to help produce open source more efficiently. It is frustrating for us to not be able to take advantage of the tools that are available to everyone else on GitHub. I brought up the recent request about Buildkite as being representative of this (after learning that Google has been making a lot of use of it), but we have previously been denied use of CircleCI and Azure Pipelines since those services require even more permissions (AFAIK) than in the case of Buildkite. From our use in https://github.com/ursa-labs/crossbow, CircleCI and Azure seem to be a lot better than Travis CI and Appveyor.

I think the ASF is going to face an existential crisis in the near future about whether it wants to live in 2020 or 2000. It feels like GitHub is treated somewhat as ersatz SVN "because people want to use git + GitHub instead of SVN".

In the same way that the cloud revolutionized software startups, enabling small groups of developers to build large SaaS applications, the same kind of leverage is becoming available to open source developers to set up infrastructure to automate and scale open source projects. I think projects considering joining the Foundation are going to look at these issues around App usage and decide that they would rather be in control of their own infrastructure.
I can set aside even more time and money from my non-profit organization's modest budget to do CI work for Apache Arrow. The amount that we have invested already is very large, and continues to grow. I'm raising these issues because as a Member of the Foundation I'm concerned that fast-growing projects like ours are not being adequately served by INFRA, and we probably aren't the only project that will face these issues. All that is needed is for INFRA to let us use third party GitHub Apps and monitor any potentially destructive actions that they may take, such as modifying unrelated repository webhooks related to IP provenance.

- Wes

On Wed, Oct 9, 2019 at 9:33 PM Jacques Nadeau wrote:
>
> I think we need to be more direct in listing issues for the board.
> What have we done? What do we want them to do?
>
> In general, any large org is going to be slow to add new deep
> integrations into GitHub. I don't think we should expect Apache to be any
> different (it took several years before we could merge things through
> GitHub, for example). If I were on the INFRA side, I think I would look
> and see how many different people are asking for Buildkite before
> considering integration. It seems like we only opened the JIRA 6 days ago
> and no other projects have requested access to this?
>
> I'm not clear why this is a board issue. What do we think the board can
> do for us that we can't solve ourselves and need them to solve? Remember,
> a board solution to a problem is typically very removed from what matters
> to individuals on a project.
>
> On Tue, Oct 8, 2019 at 7:03 AM Wes McKinney wrote:
>
> > New draft
> >
> > ## Description:
> > The mission of Apache Arrow is the creation and maintenance of software
> > related to columnar in-memory processing and data interchange.
> >
> > ## Issues:
> >
> > * We are struggling with Continuous Integration scalability as the
> >   project has definitely outgrown what Travis CI and Appveyor can do
> >   for us.
> > Some contributors have shown reluctance to submit patches they aren't
> >   sure about because they don't want to pile on the build queue. We are
> >   exploring alternative solutions such as Buildbot, Buildkite, and
> >   GitHub Actions to provide a path to migrate away from Travis CI /
> >   Appveyor. In our request to Infrastructure (INFRA-19217), some of us
> >   were alarmed to find that a CI/CD service like Buildkite may not be
> >   able to be connected to the @apache GitHub account on account of
> >   requiring admin access to repository webhooks, despite having no
> >   ability to modify source code. There are workarounds (building custom
> >   OAuth bots) that could enable us to use Buildkite, but it would
> >   require extra development and result in a less refined experience for
> >   community members.
> >
> > ## Membership Data:
> > * Apache Arrow was founded 2016-01-19 (4 years ago)
> > * There are currently 48 committers and 28 PMC members in this project.
> > * The Committer-to-P
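The "monitor any potentially destructive actions" idea in the email above could, in principle, look something like the following rough Python sketch. This is not an INFRA-endorsed mechanism; it assumes a token with the `admin:repo_hook` scope, uses GitHub's real `GET /repos/{owner}/{repo}/hooks` REST endpoint, and the `hook_drift` helper name is invented here. The point is only that webhook configuration can be snapshotted and audited for drift without granting anything write access to source code:

```python
import json
import urllib.request

API = "https://api.github.com"

def fetch_hooks(owner: str, repo: str, token: str) -> list:
    """List a repository's webhooks via the GitHub REST API.

    Requires a token with the admin:repo_hook scope.
    """
    req = urllib.request.Request(
        f"{API}/repos/{owner}/{repo}/hooks",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def hook_drift(baseline: list, current: list) -> list:
    """Describe webhook changes relative to a trusted baseline snapshot."""
    def key(hook):
        # Identity of a hook: its id, delivery URL, and subscribed events
        return (
            hook["id"],
            hook.get("config", {}).get("url"),
            tuple(sorted(hook.get("events", []))),
        )
    before = {key(h) for h in baseline}
    after = {key(h) for h in current}
    changes = []
    for hook_id, url, _ in sorted(before - after):
        changes.append(f"removed or modified: hook {hook_id} -> {url}")
    for hook_id, url, _ in sorted(after - before):
        changes.append(f"added or modified: hook {hook_id} -> {url}")
    return changes
```

A periodic job could call `fetch_hooks`, diff against a committed baseline with `hook_drift`, and alert INFRA on any change -- which is roughly the "checks and balances" shape proposed earlier in the thread.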
[jira] [Created] (ARROW-6844) List columns read broken with 0.15.0
Benoit Rostykus created ARROW-6844:
--
Summary: List columns read broken with 0.15.0
Key: ARROW-6844
URL: https://issues.apache.org/jira/browse/ARROW-6844
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.15.0
Reporter: Benoit Rostykus

Columns of type `array` (such as `array`, `array`...) are not readable anymore using `pyarrow == 0.15.0` (but were with `pyarrow == 0.14.1`) when the original writer of the parquet file is `parquet-mr 1.9.1`.

```
import pyarrow.parquet as pq
pf = pq.ParquetFile('sample.gz.parquet')
print(pf.read(columns=['profile_ids']))
```

with 0.14.1:

```
pyarrow.Table
profile_ids: list
  child 0, element: int64
...
```

with 0.15.0:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1131, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 0 with type list is inconsistent with schema list
```

I've tested parquet files coming from multiple tables (with various schemas) created with `parquet-mr`, and couldn't read any `array` column anymore.

I _think_ the bug was introduced with [this commit|https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5]. I think the root of the issue comes from the fact that `parquet-mr` writes the inner struct name as `"element"` by default (see [here|https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]), whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example [this test|https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]).
The round-tripping tests that write and read in pyarrow only obviously won't catch this.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Re: [DISCUSS] Proposal about integration test of arrow parquet reader
1. There already exists a low level parquet writer which can produce a parquet file, so unit tests should be fine. But the writer from arrow to parquet doesn't exist yet, and it may take some period of time to finish it.
2. In fact my data are randomly generated and it's definitely reproducible. However, I don't think it would be a good idea to randomly generate data every time we run CI because it would be difficult to debug. For example, if PR a introduced a bug which is triggered in another PR's build, it would be confusing for contributors.
3. I think it would be a good idea to spend effort on integration tests with parquet because it's an important use case of arrow. Also a similar approach could be extended to other languages and other file formats (avro, orc).

On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney wrote:

> There are a number of issues worth discussion.
>
> 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> It's OK to be reliant on other libraries in the short term to produce
> files to test against, but that does not strike me as a sustainable
> long-term plan. Fixing bugs can be a lot more difficult than it needs
> to be if you can't write targeted "endogenous" unit tests.
>
> 2. Reproducible data generation
>
> I think if you're going to test against a pre-generated corpus, you
> should make sure that generating the corpus is reproducible for other
> developers (i.e. with a Dockerfile), and can be extended by adding new
> files or random data generation.
>
> I additionally would prefer generating the test corpus at test time
> rather than checking in binary files. If this isn't viable right now
> we can create an "arrow-rust-crutch" git repository for you to stash
> binary files until some of these testing scalability issues are
> addressed.
> If we're going to spend energy on Parquet integration testing with
> Java, this would be a good opportunity to do the work in a way where
> the C++ Parquet library can also participate (since we ought to be
> doing integration tests with Java, and we can also read JSON files to
> Arrow).
>
> On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu wrote:
> >
> > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove wrote:
> >
> > > I'm very interested in helping to find a solution to this because we
> > > really do need integration tests for Rust to make sure we're
> > > compatible with other implementations... there is also the ongoing
> > > CI dockerization work that I feel is related.
> > >
> > > I haven't looked at the current integration tests yet and would
> > > appreciate some pointers on how all of this works (do we have docs?)
> > > or where to start looking.
> >
> > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > And here is the generated data:
> > https://github.com/apache/arrow-testing/pull/11
> > As for the program to generate these data, it's just a simple java
> > program. I'm not sure whether we need to integrate it into arrow.
> >
> > > I imagine the integration test could follow the approach that Renjie
> > > is outlining where we call Java to generate some files and then call
> > > Rust to parse them?
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu wrote:
> > >
> > > > Hi:
> > > >
> > > > I'm developing a rust version of the reader which reads parquet
> > > > into an arrow array. To verify the correctness of this reader, I
> > > > use the following approach:
> > > >
> > > > 1. Define schema with protobuf.
> > > > 2. Generate json data of this schema using another language with a
> > > >    more sophisticated implementation (e.g. java).
> > > > 3. Generate parquet data of this schema using another language
> > > >    with a more sophisticated implementation (e.g. java).
> > > > 4. Write tests to read the json file and parquet file into memory
> > > >    (arrow array), then compare the json data with the arrow data.
> > > >
> > > > I think with this method we can guarantee the correctness of the
> > > > arrow reader because the json format is ubiquitous and its
> > > > implementations are more stable.
> > > >
> > > > Any comment is appreciated.
> >
> > --
> > Renjie Liu
> > Software Engineer, MVAD

--
Renjie Liu
Software Engineer, MVAD
Re: [DRAFT] Apache Arrow Board Report - October 2019
I think we need to be more direct in listing issues for the board. What have we done? What do we want them to do? In general, any large org is going to be slow to add new deep integrations into GitHub. I don't think we should expect Apache to be any different (it took several years before we could merge things through github for example). If I were on the INFRA side, I think I would look and see how many different people are asking for BuildKite before considering integration. It seems like we only opened the JIRA 6 days ago and no other projects have requested access to this? I'm not clear why this is a board issue. What do we think the board can do for us that we can't solve ourselves and need them to solve? Remember, a board solution to a problem is typically very removed from what matters to individuals on a project. On Tue, Oct 8, 2019 at 7:03 AM Wes McKinney wrote: > New draft > > ## Description: > The mission of Apache Arrow is the creation and maintenance of software > related > to columnar in-memory processing and data interchange > > ## Issues: > > * We are struggling with Continuous Integration scalability as the project > has > definitely outgrown what Travis CI and Appveyor can do for us. Some > contributors have shown reluctance to submit patches they aren't sure > about > because they don't want to pile on the build queue. We are exploring > alternative solutions such as Buildbot, Buildkite, and GitHub Actions to > provide a path to migrate away from Travis CI / Appveyor. In our request > to > Infrastructure INFRA-19217, some of us were alarmed to find that a CI/CD > service like Buildkite may not be able to be connected to the @apache > GitHub > account on account of requiring admin access to repository webhooks, but > no > ability to modify source code. There are workarounds (building custom > OAuth > bots) that could enable us to use Buildkite, but it would require extra > development and result in a less refined experience for community > members. 
> > ## Membership Data: > * Apache Arrow was founded 2016-01-19 (4 years ago) > * There are currently 48 committers and 28 PMC members in this project. > * The Committer-to-PMC ratio is roughly 3:2. > > Community changes, past quarter: > - Micah Kornfield was added to the PMC on 2019-08-21 > - Sebastien Binet was added to the PMC on 2019-08-21 > - Ben Kietzman was added as committer on 2019-09-07 > - David Li was added as committer on 2019-08-30 > - Kenta Murata was added as committer on 2019-09-05 > - Neal Richardson was added as committer on 2019-09-05 > - Praveen Kumar was added as committer on 2019-07-14 > > ## Project Activity: > > * The project has just made a 0.15.0 release. > * We are discussing ways to make the Arrow libraries as accessible as > possible > to downstream projects for minimal use cases while allowing the > development > of more comprehensive "standard libraries" with larger dependency stacks > in > the project > * We plan to make a 1.0.0 release as our next major release, at which time > we > will declare that the Arrow binary protocol is stable with forward and > backward compatibility guarantees > > ## Community Health: > > * The community is overall healthy, with the aforementioned concerns > around CI > scalability. New contributors frequently take notice of the long build > queue > times when submitting pull requests. > > On Tue, Oct 8, 2019 at 8:58 AM Wes McKinney wrote: > > > > Yes, I agree with raising the issue to the board. > > > > On Tue, Oct 8, 2019 at 8:31 AM Antoine Pitrou > wrote: > > > > > > > > > I agree. Especially given that the constraints imposed by Infra don't > > > help solving the problem. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > Le 08/10/2019 à 15:02, Uwe L. Korn a écrit : > > > > I'm not sure what qualifies for "board attention" but it seems that > CI is a critical problem in Apache projects, not just Arrow. Should we > raise that? 
> > > > > > > > Uwe > > > > > > > > On Tue, Oct 8, 2019, at 12:00 AM, Wes McKinney wrote: > > > >> Here is a start for our Q3 board report > > > >> > > > >> ## Description: > > > >> The mission of Apache Arrow is the creation and maintenance of > software related > > > >> to columnar in-memory processing and data interchange > > > >> > > > >> ## Issues: > > > >> There are no issues requiring board attention at this time > > > >> > > > >> ## Membership Data: > > > >> * Apache Arrow was founded 2016-01-19 (4 years ago) > > > >> * There are currently 48 committers and 28 PMC members in this > project. > > > >> * The Committer-to-PMC ratio is roughly 3:2. > > > >> > > > >> Community changes, past quarter: > > > >> - Micah Kornfield was added to the PMC on 2019-08-21 > > > >> - Sebastien Binet was added to the PMC on 2019-08-21 > > > >> - Ben Kietzman was added as committer on 2019-09-07 > > > >> - David Li was added as committer on 2019-08-30 > > > >> - Kenta Murata was added as committer on 2019-09-05 > >
Re: Looking ahead to 1.0
ARROW-5916 and ARROW-6836/6837 are of particular interest to me because they enable record batch "incrementalism", which is useful for streaming applications: ARROW-5916 allows a record batch to pre-allocate space for future records that have not yet been populated, making it safe for readers to consume the partial batch. ARROW-6836/6837 allow a file of record batches to be extended at the end, without re-writing the beginning, while allowing the custom_metadata to change with each update. (custom_metadata in the Schema is not a good candidate because the Schema also appears at the beginning of the file.) While these are not blockers for me quite yet, they soon will be! If I wanted to ensure that these are in 1.0, what is my deadline for implementation and test cases? Can such a note be made on the wiki? Should I change the priority in Jira? Thanks, John On Wed, Oct 9, 2019 at 2:57 PM Neal Richardson wrote: > Congratulations everyone on 0.15! I know a lot of hard work went into > it, not only in the software itself but also in the build and release > process. > > Once you've caught your breath from the release, we should start > thinking about what's in scope for our next release, the big 1.0. To > get us started (or restarted, since we did discuss 1.0 before the > flatbuffer alignment issue came up), I've created > https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release > based on our past release wiki pages. > > A good place to begin would be to list, either in "blocker" Jiras or > bullet points on the document, the key features and tasks we must > resolve before 1.0. For example, I get the sense that we need to > overhaul the documentation, but that should be expressed in a more > concrete, actionable way. > > Neal >
[jira] [Created] (ARROW-6843) [Website] Disable deploy on pull request
Kouhei Sutou created ARROW-6843: --- Summary: [Website] Disable deploy on pull request Key: ARROW-6843 URL: https://issues.apache.org/jira/browse/ARROW-6843 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Can't find myself in contributor list
Got it. 6408 was reverted. That makes sense. On Wed, Oct 9, 2019 at 3:19 PM Wes McKinney wrote: > I'm seeing > > $ git hist | grep Hengruo > * f9cd2958a 2019-10-09 | ARROW-6274: [Rust] [DataFusion] Add support > for writing results to CSV [Hengruo Zhang] > * 3145e9bef 2019-09-08 | ARROW-6408: [Rust] use "if cfg!" pattern > [Hengruo Zhang] > > So there's only 1 commit in the last 1 month. This doesn't appear to > be enough to be guaranteed to show up in the Pulse view (1 week or 1 > month views) > > On Wed, Oct 9, 2019 at 4:53 PM paddy horan wrote: > > > > It might also be due to our merge tool. PRs are merged locally and > pushed to master (with the corresponding PR on github being “closed” rather > than “merged”). This might not be reflected in the pulse view. > > > > P > > > > > > From: Wes McKinney > > Sent: Wednesday, October 9, 2019 4:06:59 PM > > To: dev > > Subject: Re: Can't find myself in contributor list > > > > GitHub only shows the top 100 contributors to the project in > > > > https://github.com/apache/arrow/graphs/contributors > > > > Similarly I think you need more commits to show up in the Pulse view > > > > On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang wrote: > > > > > > Hi, > > > > > > My two PRs have been already merged to the master branch, but I cannot > > > see me in the contributor list of GitHub, even if I narrowed down the > > > time span so that there are only less than 50 people. And I can't even > > > find my merging in https://github.com/apache/arrow/pulse . > > > > > > Could you please provide some possible reasons for this? > > > > > > PRs: > > > https://github.com/apache/arrow/pull/5577 > > > https://github.com/apache/arrow/pull/5303 > > > > > > Thanks, > > > Hengruo >
[jira] [Created] (ARROW-6842) [Website] Jekyll error building website
Wes McKinney created ARROW-6842: --- Summary: [Website] Jekyll error building website Key: ARROW-6842 URL: https://issues.apache.org/jira/browse/ARROW-6842 Project: Apache Arrow Issue Type: Bug Components: Website Reporter: Wes McKinney Fix For: 1.0.0 I'm getting the following error locally on a fresh checkout and {{bundle install --path vendor/bundle}}

{code}
$ bundle exec jekyll serve
Configuration file: /home/wesm/code/arrow-site/_config.yml
            Source: /home/wesm/code/arrow-site
       Destination: build
 Incremental build: disabled. Enable with --incremental
      Generating...
jekyll 3.8.4 | Error: wrong number of arguments (given 2, expected 1)
{code}

Never seen this so not sure how to debug -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Can't find myself in contributor list
I'm seeing $ git hist | grep Hengruo * f9cd2958a 2019-10-09 | ARROW-6274: [Rust] [DataFusion] Add support for writing results to CSV [Hengruo Zhang] * 3145e9bef 2019-09-08 | ARROW-6408: [Rust] use "if cfg!" pattern [Hengruo Zhang] So there's only 1 commit in the last 1 month. This doesn't appear to be enough to be guaranteed to show up in the Pulse view (1 week or 1 month views) On Wed, Oct 9, 2019 at 4:53 PM paddy horan wrote: > > It might also be due to our merge tool. PRs are merged locally and pushed to > master (with the corresponding PR on github being “closed” rather than > “merged”). This might not be reflected in the pulse view. > > P > > > From: Wes McKinney > Sent: Wednesday, October 9, 2019 4:06:59 PM > To: dev > Subject: Re: Can't find myself in contributor list > > GitHub only shows the top 100 contributors to the project in > > https://github.com/apache/arrow/graphs/contributors > > Similarly I think you need more commits to show up in the Pulse view > > On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang wrote: > > > > Hi, > > > > My two PRs have been already merged to the master branch, but I cannot > > see me in the contributor list of GitHub, even if I narrowed down the > > time span so that there are only less than 50 people. And I can't even > > find my merging in https://github.com/apache/arrow/pulse . > > > > Could you please provide some possible reasons for this? > > > > PRs: > > https://github.com/apache/arrow/pull/5577 > > https://github.com/apache/arrow/pull/5303 > > > > Thanks, > > Hengruo
Re: Can't find myself in contributor list
It might also be due to our merge tool. PRs are merged locally and pushed to master (with the corresponding PR on github being “closed” rather than “merged”). This might not be reflected in the pulse view. P From: Wes McKinney Sent: Wednesday, October 9, 2019 4:06:59 PM To: dev Subject: Re: Can't find myself in contributor list GitHub only shows the top 100 contributors to the project in https://github.com/apache/arrow/graphs/contributors Similarly I think you need more commits to show up in the Pulse view On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang wrote: > > Hi, > > My two PRs have been already merged to the master branch, but I cannot > see me in the contributor list of GitHub, even if I narrowed down the > time span so that there are only less than 50 people. And I can't even > find my merging in https://github.com/apache/arrow/pulse . > > Could you please provide some possible reasons for this? > > PRs: > https://github.com/apache/arrow/pull/5577 > https://github.com/apache/arrow/pull/5303 > > Thanks, > Hengruo
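The commit counting Wes does above (`git hist` appears to be a local alias) can be reproduced with plain git. The sketch below builds a throwaway repository so it is self-contained; the author name, email, and commit message are made up:

```shell
# Throwaway repo so the commands are reproducible anywhere git is installed
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.name="Hengruo Zhang" -c user.email="hengruo@example.com" \
    commit -q --allow-empty -m "ARROW-6274: example commit"

# This is the ground truth that GitHub's contributor/Pulse views summarize
# (with their own cutoffs and time windows):
git log --author="Hengruo" --oneline   # list matching commits
git shortlog -sn HEAD                  # commit counts per author
```

Because the merge tool pushes locally-created commits, the commits are always visible to `git log` even when GitHub's UI views lag behind or truncate.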
[jira] [Created] (ARROW-6841) [C++] Upgrade to LLVM 8
Wes McKinney created ARROW-6841: --- Summary: [C++] Upgrade to LLVM 8 Key: ARROW-6841 URL: https://issues.apache.org/jira/browse/ARROW-6841 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Now that LLVM 9 has been released, LLVM 8 has been promoted to stable according to http://apt.llvm.org/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6840) [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd
John Muehlhausen created ARROW-6840: --- Summary: [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd Key: ARROW-6840 URL: https://issues.apache.org/jira/browse/ARROW-6840 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: John Muehlhausen We want to retrieve the file descriptor of a memory mapped file for the purpose of transferring it across process boundaries. On the receiving end, we want to be able to map a file based on the file descriptor rather than the path. This helps with race conditions when the path may have been unlinked. cf [https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6839) [Java] access File Footer custom_metadata
John Muehlhausen created ARROW-6839: --- Summary: [Java] access File Footer custom_metadata Key: ARROW-6839 URL: https://issues.apache.org/jira/browse/ARROW-6839 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: John Muehlhausen Access custom_metadata from ARROW-6836 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6838) [JS] access File Footer custom_metadata
John Muehlhausen created ARROW-6838: --- Summary: [JS] access File Footer custom_metadata Key: ARROW-6838 URL: https://issues.apache.org/jira/browse/ARROW-6838 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Reporter: John Muehlhausen Access custom_metadata from ARROW-6836 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6837) [C++/Python] access File Footer custom_metadata
John Muehlhausen created ARROW-6837: --- Summary: [C++/Python] access File Footer custom_metadata Key: ARROW-6837 URL: https://issues.apache.org/jira/browse/ARROW-6837 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: John Muehlhausen Access custom_metadata from ARROW-6836 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6836) [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs
John Muehlhausen created ARROW-6836: --- Summary: [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs Key: ARROW-6836 URL: https://issues.apache.org/jira/browse/ARROW-6836 Project: Apache Arrow Issue Type: New Feature Components: Format Reporter: John Muehlhausen add a custom_metadata:[KeyValue] field to the Footer table in File.fbs Use case: If a file is expanded with additional recordbatches and the custom_metadata changes, Schema is no longer an appropriate place to make this change since the two copies of Schema (at the beginning and end of the file) would then be ambiguous cf https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY
Francois Saint-Jacques created ARROW-6835: - Summary: [Archery][CMake] Restore ARROW_LINT_ONLY Key: ARROW-6835 URL: https://issues.apache.org/jira/browse/ARROW-6835 Project: Apache Arrow Issue Type: Bug Components: Archery Reporter: Francois Saint-Jacques This is used by developers to speed up CMake build generation and relax the required installed toolchains (notably libraries). This was yanked because ARROW_LINT_ONLY effectively exits early and doesn't generate `compile_commands.json`. Restore this option, but ensure that archery toggles it according to the usage of iwyu or clang-tidy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Can't find myself in contributor list
GitHub only shows the top 100 contributors to the project in https://github.com/apache/arrow/graphs/contributors Similarly I think you need more commits to show up in the Pulse view On Wed, Oct 9, 2019 at 2:58 PM Hengruo Zhang wrote: > > Hi, > > My two PRs have been already merged to the master branch, but I cannot > see me in the contributor list of GitHub, even if I narrowed down the > time span so that there are only less than 50 people. And I can't even > find my merging in https://github.com/apache/arrow/pulse . > > Could you please provide some possible reasons for this? > > PRs: > https://github.com/apache/arrow/pull/5577 > https://github.com/apache/arrow/pull/5303 > > Thanks, > Hengruo
Can't find myself in contributor list
Hi, My two PRs have been already merged to the master branch, but I cannot see me in the contributor list of GitHub, even if I narrowed down the time span so that there are only less than 50 people. And I can't even find my merging in https://github.com/apache/arrow/pulse . Could you please provide some possible reasons for this? PRs: https://github.com/apache/arrow/pull/5577 https://github.com/apache/arrow/pull/5303 Thanks, Hengruo
[jira] [Created] (ARROW-6834) [C++] Appveyor build failing on master
Wes McKinney created ARROW-6834: --- Summary: [C++] Appveyor build failing on master Key: ARROW-6834 URL: https://issues.apache.org/jira/browse/ARROW-6834 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 Not sure what introduced this https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27992011/job/cj247lfl0s48xrsl {code} LINK: command "C:\PROGRA~2\MI0E91~1.0\VC\bin\amd64\link.exe /nologo src\arrow\CMakeFiles\arrow-public-api-test.dir\public_api_test.cc.obj /out:release\arrow-public-api-test.exe /implib:release\arrow-public-api-test.lib /pdb:release\arrow-public-api-test.pdb /version:0.0 /machine:x64 /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console release\arrow_testing.lib release\arrow.lib C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib C:\Miniconda36-x64\envs\arrow\Library\lib\libssl.lib C:\Miniconda36-x64\envs\arrow\Library\lib\brotlienc-static.lib C:\Miniconda36-x64\envs\arrow\Library\lib\brotlidec-static.lib C:\Miniconda36-x64\envs\arrow\Library\lib\brotlicommon-static.lib C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-config.lib C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-transfer.lib C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-s3.lib C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-core.lib C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_filesystem.lib C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_system.lib googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib googletest_ep-prefix\src\googletest_ep\lib\gtest.lib googletest_ep-prefix\src\googletest_ep\lib\gmock.lib C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-event-stream.lib C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-common.lib BCrypt.lib Kernel32.lib Ws2_32.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\aws-checksums.lib mimalloc_ep\src\mimalloc_ep\lib\mimalloc-1.0\mimalloc-static-release.lib Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:release\arrow-public-api-test.exe.manifest" failed (exit code 1120) with the following output: public_api_test.cc.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: static void __cdecl testing::Test::SetUpTestSuite(void)" (__imp_?SetUpTestSuite@Test@testing@@SAXXZ) referenced in function "public: static void (__cdecl*__cdecl testing::internal::SuiteApiResolver::GetSetUpCaseOrSuite(char const *,int))(void)" (?GetSetUpCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z) public_api_test.cc.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: static void __cdecl testing::Test::TearDownTestSuite(void)" (__imp_?TearDownTestSuite@Test@testing@@SAXXZ) referenced in function "public: static void (__cdecl*__cdecl testing::internal::SuiteApiResolver::GetTearDownCaseOrSuite(char const *,int))(void)" (?GetTearDownCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z) release\arrow-public-api-test.exe : fatal error LNK1120: 2 unresolved externals [205/515] Building CXX object src\arrow\CMakeFiles\arrow-array-test.dir\array_test.cc.obj [206/515] Building CXX object src\arrow\CMakeFiles\arrow-array-test.dir\array_dict_test.cc.obj ninja: build stopped: subcommand failed. (arrow) C:\projects\arrow\cpp\build>goto scriptexit {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
Question about timestamps ...
The following code dies with pyarrow 0.14.2:

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([('timestamp', pa.timestamp('ns', tz='UTC')),])
    writer = pq.ParquetWriter('foo.parquet', schema, coerce_timestamps='ns')
    ts_array = pa.array([ int(1234567893141) ], type=pa.timestamp('ns', tz='UTC'))
    table = pa.Table.from_arrays([ ts_array ], names=['timestamp'])
    writer.write_table(table)
    writer.close()

with the message:

    ValueError: Invalid value for coerce_timestamps: ns

That appears to be because of this code in _parquet.pxi:

    cdef int _set_coerce_timestamps(
            self, ArrowWriterProperties.Builder* props) except -1:
        if self.coerce_timestamps == 'ms':
            props.coerce_timestamps(TimeUnit_MILLI)
        elif self.coerce_timestamps == 'us':
            props.coerce_timestamps(TimeUnit_MICRO)
        elif self.coerce_timestamps is not None:
            raise ValueError('Invalid value for coerce_timestamps: {0}'
                             .format(self.coerce_timestamps))

which restricts the choice to 'ms' or 'us', even though AFAICT everywhere else also allows 'ns' (and there is a TimeUnit_NANO defined). Is this intentional, or a bug? Thanks, - db
Looking ahead to 1.0
Congratulations everyone on 0.15! I know a lot of hard work went into it, not only in the software itself but also in the build and release process. Once you've caught your breath from the release, we should start thinking about what's in scope for our next release, the big 1.0. To get us started (or restarted, since we did discuss 1.0 before the flatbuffer alignment issue came up), I've created https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release based on our past release wiki pages. A good place to begin would be to list, either in "blocker" Jiras or bullet points on the document, the key features and tasks we must resolve before 1.0. For example, I get the sense that we need to overhaul the documentation, but that should be expressed in a more concrete, actionable way. Neal
[jira] [Created] (ARROW-6833) [R][CI] Add crossbow job for full R autobrew macOS build
Neal Richardson created ARROW-6833: -- Summary: [R][CI] Add crossbow job for full R autobrew macOS build Key: ARROW-6833 URL: https://issues.apache.org/jira/browse/ARROW-6833 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Neal Richardson Assignee: Neal Richardson I have a separate nightly job that runs this on multiple R versions, but it would be nice to be able to have crossbow check this on a PR. As it turns out, the ARROW_S3 feature doesn't work with autobrew in practice--aws-sdk-cpp doesn't seem to ship static libs via Homebrew, so the autobrew packaging doesn't work, even though the formula builds and {{brew audit}} is clean. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6832) [R] Implement Codec::IsAvailable
Neal Richardson created ARROW-6832: -- Summary: [R] Implement Codec::IsAvailable Key: ARROW-6832 URL: https://issues.apache.org/jira/browse/ARROW-6832 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Fix For: 1.0.0 New in ARROW-6631 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6831) [R] Update R macOS/Windows builds for change in cmake compression defaults
Neal Richardson created ARROW-6831: -- Summary: [R] Update R macOS/Windows builds for change in cmake compression defaults Key: ARROW-6831 URL: https://issues.apache.org/jira/browse/ARROW-6831 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson ARROW-6631 changed the defaults for including compressions but did not update these build scripts. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-09-0
FWIW there appears to have been a recent update to grpc on Homebrew involving protobuf: https://github.com/Homebrew/homebrew-core/commits/master/Formula/grpc.rb Last time we had a Homebrew grpc issue, I made this at Kou's suggestion: https://github.com/Homebrew/homebrew-core/pull/44198 I think it's fair to report an issue there and show some log output that we think is unexpected and see what they say. Maybe they can rebuild the bottles again and that will magically fix it, like last time. Neal On Wed, Oct 9, 2019 at 7:16 AM Wes McKinney wrote: > > It looks like protobuf and other gRPC dependencies are being built > from source when doing `brew install grpc`. This is probably an issue > with the Homebrew stack, do we know how to address this situation now > and in the future (probably requires asking the Homebrew community > about grpc "bottles")? > > On Wed, Oct 9, 2019 at 7:26 AM Crossbow wrote: > > > > > > Arrow Build Report for Job nightly-2019-10-09-0 > > > > All tasks: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0 > > > > Failed Tasks: > > - gandiva-jar-trusty: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-trusty > > - docker-clang-format: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-clang-format > > - wheel-osx-cp36m: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp36m > > - wheel-osx-cp35m: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp35m > > - wheel-osx-cp37m: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp37m > > - wheel-manylinux2010-cp35m: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp35m > > - gandiva-jar-osx: > > URL: > > 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-osx > > - wheel-osx-cp27m: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp27m > > > > Succeeded Tasks: > > - homebrew-cpp-autobrew: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-homebrew-cpp-autobrew > > - wheel-manylinux1-cp27mu: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux1-cp27mu > > - docker-hdfs-integration: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-hdfs-integration > > - docker-lint: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-lint > > - docker-pandas-master: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-pandas-master > > - docker-cpp-static-only: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-static-only > > - wheel-manylinux2010-cp27mu: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp27mu > > - docker-cpp-cmake32: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-cmake32 > > - centos-6: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-centos-6 > > - centos-7: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-centos-7 > > - docker-cpp-release: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-release > > - docker-python-2.7: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-python-2.7 > > - docker-r: > > URL: 
> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-r > > - docker-spark-integration: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-spark-integration > > - debian-stretch: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-debian-stretch > > - conda-osx-clang-py36: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-conda-osx-clang-py36 > > - ubuntu-xenial: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-ubuntu-xenial > > - ubuntu-disco: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-ubuntu-disco > > - wheel-manylinux1-cp27m: > > URL: > > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wh
Re: [DISCUSS] Proposal about integration test of arrow parquet reader
There are a number of issues worth discussion. 1. What is the timeline/plan for Rust implementing a Parquet _writer_? It's OK to be reliant on other libraries in the short term to produce files to test against, but does not strike me as a sustainable long-term plan. Fixing bugs can be a lot more difficult than it needs to be if you can't write targeted "endogenous" unit tests 2. Reproducible data generation I think if you're going to test against a pre-generated corpus, you should make sure that generating the corpus is reproducible for other developers (i.e. with a Dockerfile), and can be extended by adding new files or random data generation. I additionally would prefer generating the test corpus at test time rather than checking in binary files. If this isn't viable right now we can create an "arrow-rust-crutch" git repository for you to stash binary files until some of these testing scalability issues are addressed. If we're going to spend energy on Parquet integration testing with Java, this would be a good opportunity to do the work in a way where the C++ Parquet library can also participate (since we ought to be doing integration tests with Java, and we can also read JSON files to Arrow). On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu wrote: > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove wrote: > > > I'm very interested in helping to find a solution to this because we really > > do need integration tests for Rust to make sure we're compatible with other > > implementations... there is also the ongoing CI dockerization work that I > > feel is related. > > > > I haven't looked at the current integration tests yet and would appreciate > > some pointers on how all of this works (do we have docs?) or where to start > > looking. > > > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523 > And here is the generated data: > https://github.com/apache/arrow-testing/pull/11 > As with program to generate these data, it's just a simple java program. 
> I'm not sure whether we need to integrate it into arrow.
>
> > I imagine the integration test could follow the approach that Renjie is
> > outlining where we call Java to generate some files and then call Rust to
> > parse them?
> >
> > Thanks,
> >
> > Andy.
> >
> > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu wrote:
> >
> > > Hi:
> > >
> > > I'm developing the Rust version of the reader which reads Parquet into
> > > Arrow arrays. To verify the correctness of this reader, I use the
> > > following approach:
> > >
> > > 1. Define the schema with protobuf.
> > > 2. Generate JSON data for this schema using another language with a more
> > >    sophisticated implementation (e.g. Java).
> > > 3. Generate Parquet data for this schema using another language with a
> > >    more sophisticated implementation (e.g. Java).
> > > 4. Write tests that read the JSON file and the Parquet file into memory
> > >    (Arrow arrays), then compare the JSON data with the Arrow data.
> > >
> > > I think with this method we can guarantee the correctness of the Arrow
> > > reader, because the JSON format is ubiquitous and its implementations
> > > are more stable.
> > >
> > > Any comment is appreciated.
>
> --
> Renjie Liu
> Software Engineer, MVAD
Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-10-09-0
It looks like protobuf and other gRPC dependencies are being built from source when doing `brew install grpc`. This is probably an issue with the Homebrew stack. Do we know how to address this situation now and in the future (it probably requires asking the Homebrew community about grpc "bottles")?

On Wed, Oct 9, 2019 at 7:26 AM Crossbow wrote:
>
> Arrow Build Report for Job nightly-2019-10-09-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0
>
> Failed Tasks:
> - gandiva-jar-trusty:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-trusty
> - docker-clang-format:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-clang-format
> - wheel-osx-cp36m:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp36m
> - wheel-osx-cp35m:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp35m
> - wheel-osx-cp37m:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp37m
> - wheel-manylinux2010-cp35m:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp35m
> - gandiva-jar-osx:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-osx
> - wheel-osx-cp27m:
>   URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp27m
[jira] [Created] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow
Anthony Abate created ARROW-6830: Summary: Question / Feature Request - Select Subset of Columns in read_arrow Key: ARROW-6830 URL: https://issues.apache.org/jira/browse/ARROW-6830 Project: Apache Arrow Issue Type: New Feature Components: C++, R Reporter: Anthony Abate

*Note:* Not sure if this is a limitation of the R library or the underlying C++ code.

I have a ~30 GB Arrow file with almost 1000 columns - it has 12,000 record batches of varying row sizes.

1. Is it possible to use *read_arrow* to filter out columns? (similar to how *read_feather* has a col_select = ... argument)
2. Or is it possible to filter columns using *RecordBatchFileReader*?

The only thing I seem to be able to do (please confirm if this is my only option) is to loop over all record batches, select a single column at a time, and construct the data I need to pull out manually, i.e. like the following:

data_rbfr <- arrow::RecordBatchFileReader("arrowfile")
# for each record batch i:
batch <- data_rbfr$get_batch(i)
col4 <- batch$column(4)
col7 <- batch$column(7)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6829) [Docs] Migrate integration test docs to Sphinx, fix instructions after ARROW-6466
Wes McKinney created ARROW-6829: --- Summary: [Docs] Migrate integration test docs to Sphinx, fix instructions after ARROW-6466 Key: ARROW-6829 URL: https://issues.apache.org/jira/browse/ARROW-6829 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Wes McKinney Fix For: 1.0.0 Follow-up to ARROW-6466. Also, the README uses out-of-date archery flags: https://github.com/apache/arrow/blob/master/integration/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6827) [Archery] lint sub-command should provide a --fail-fast option
Francois Saint-Jacques created ARROW-6827: - Summary: [Archery] lint sub-command should provide a --fail-fast option Key: ARROW-6827 URL: https://issues.apache.org/jira/browse/ARROW-6827 Project: Apache Arrow Issue Type: New Feature Components: Archery Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6828) [Archery] Benchmark diff should provide a TUI friendly output
Francois Saint-Jacques created ARROW-6828: - Summary: [Archery] Benchmark diff should provide a TUI friendly output Key: ARROW-6828 URL: https://issues.apache.org/jira/browse/ARROW-6828 Project: Apache Arrow Issue Type: New Feature Components: Archery Reporter: Francois Saint-Jacques -- This message was sent by Atlassian Jira (v8.3.4#803005)
[NIGHTLY] Arrow Build Report for Job nightly-2019-10-09-0
Arrow Build Report for Job nightly-2019-10-09-0

All tasks:
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0

Failed Tasks:
- gandiva-jar-trusty:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-trusty
- docker-clang-format:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-clang-format
- wheel-osx-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp36m
- wheel-osx-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp35m
- wheel-osx-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp37m
- wheel-manylinux2010-cp35m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp35m
- gandiva-jar-osx:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-gandiva-jar-osx
- wheel-osx-cp27m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-osx-cp27m

Succeeded Tasks:
- homebrew-cpp-autobrew:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-homebrew-cpp-autobrew
- wheel-manylinux1-cp27mu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux1-cp27mu
- docker-hdfs-integration:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-hdfs-integration
- docker-lint:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-lint
- docker-pandas-master:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-pandas-master
- docker-cpp-static-only:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-static-only
- wheel-manylinux2010-cp27mu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp27mu
- docker-cpp-cmake32:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-cmake32
- centos-6:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-centos-6
- centos-7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-centos-7
- docker-cpp-release:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-cpp-release
- docker-python-2.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-python-2.7
- docker-r:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-r
- docker-spark-integration:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-spark-integration
- debian-stretch:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-debian-stretch
- conda-osx-clang-py36:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-conda-osx-clang-py36
- ubuntu-xenial:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-ubuntu-xenial
- ubuntu-disco:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-ubuntu-disco
- wheel-manylinux1-cp27m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux1-cp27m
- docker-iwyu:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-iwyu
- docker-js:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-js
- docker-python-3.7:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-python-3.7
- docker-go:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-circle-docker-go
- homebrew-cpp:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-homebrew-cpp
- conda-linux-gcc-py37:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-azure-conda-linux-gcc-py37
- wheel-manylinux1-cp36m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux1-cp36m
- wheel-manylinux2010-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-travis-wheel-manylinux2010-cp37m
- wheel-win-cp37m:
  URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-10-09-0-appveyor-wheel-win-cp37m
- docker-python-3.6-nopandas:
  URL: https://github.com/ursa-labs/crossbow/branches/all?que
[jira] [Created] (ARROW-6826) [Archery] Default build should be minimal
Francois Saint-Jacques created ARROW-6826: - Summary: [Archery] Default build should be minimal Key: ARROW-6826 URL: https://issues.apache.org/jira/browse/ARROW-6826 Project: Apache Arrow Issue Type: New Feature Components: Archery Reporter: Francois Saint-Jacques Follow-up of https://github.com/apache/arrow/pull/5600/files#r332655141 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6825) [C++] Rework CSV reader IO around readahead iterator
Antoine Pitrou created ARROW-6825: - Summary: [C++] Rework CSV reader IO around readahead iterator Key: ARROW-6825 URL: https://issues.apache.org/jira/browse/ARROW-6825 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Following ARROW-6764, we should try to remove the custom ReadaheadSpooler and use the generic readahead iteration facility instead. This will require reworking the blocking / chunking logic to mimic what is done in the JSON reader. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6824) [Plasma] Support batched create and seal requests for small objects
Philipp Moritz created ARROW-6824: - Summary: [Plasma] Support batched create and seal requests for small objects Key: ARROW-6824 URL: https://issues.apache.org/jira/browse/ARROW-6824 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Affects Versions: 0.15.0 Reporter: Philipp Moritz Currently the Plasma create API supports creating and sealing a single object. This makes sense for large objects, because their creation throughput is limited by the memory throughput of the client as the data is filled into the buffer. However, sometimes we want to create lots of small objects, in which case throughput is limited by the number of IPCs to the store we can do when creating new objects. This can be fixed by offering a version of CreateAndSeal that allows us to create multiple objects at the same time. -- This message was sent by Atlassian Jira (v8.3.4#803005)