[DISCUSS] Improving Contributor Guidelines
Hi Everyone,

I am writing to give a bump to some of what was written in reply to Andrew's thread on auto-creating JIRAs. I would like to focus on small, (hopefully) short-term achievable items to make the community friendlier to newcomers and reduce toil for regular contributors.

1. I think creating a GitHub action that can automatically copy a GitHub issue to a JIRA, close the issue, and leave a note [1] would be useful. The intent is to be friendlier to people interacting with the project for the first time, letting them decide how invested they are in a bug before creating the credentials necessary to track it.

2. Guidelines for trivial/minor patches (those not requiring a JIRA), and updating the PR tool to accept a title indicating them as such. I would propose the following fall under the trivial guideline:
   a. Grammar, usage, and spelling fixes that affect no more than N files.
   b. Documentation updates affecting no more than N files and no more than M words.

3. Guidelines for when to use the auto-create JIRA tool:
   a. Refactors (no functionality change) affecting no more than N [2] files. If the coding work required is less than 1 or 2 hours, JIRAs can be disruptive enough to one's workflow and don't really contribute to the "openness" of the project in a meaningful way. This can be a slippery slope, but I think we can all be judicious about when to use it.
   b. Small one-off bug fixes by new contributors to the project (ideally accompanied by a note pointing to the contributors guide).

4. IIUC, some of the angst on the thread from regular contributors was about the duplication of effort involved in filling in details in multiple places. However, I do think transparent development is important, and migrating away from our current tooling would be an expensive investment. In my mind, JIRA serves as a useful index of our issue/feature backlog that is tied into our development and release tooling.
I'm not sure it falls within the "Apache Way", but it seems that as long as the necessary discussion has already happened on-list (possibly by way of reviewing PRs/Google Docs and summarizing them back to the list), then minimal JIRAs are sufficient (i.e. Title, Component, and a link to the discussed artifact). So for instance, if we had an RFC process for a major feature, I would imagine the guideline would be something like:
   a. Create a JIRA for writing the RFC (this should be fairly minimal; I imagine most content will actually go into a PR for the RFC). This gives others that might be interested in the area knowledge that one is being written and an opportunity to collaborate ahead of time.
   b. Send the RFC for review and give a heads up to the mailing list.
   c. Gain consensus on the RFC.
   d. Create minimal JIRAs corresponding to the work items needed to complete the RFC, and link back to it.

For less involved features, I think minimal JIRAs are still OK; if other contributors/observers have particular concerns, they can ask for more details. For me the key is expressing intent up-front to enable potential collaboration, discussion, and feedback before a lot of time is invested. After the fact, understanding the rationale that went into decisions is also useful.

Thoughts? Are there any other guidelines or norms we should try to socialize around our development process?

Thanks,
Micah

[1] A note like: "This issue has been copied to JIRA item [ARROW-1]. If you wish to discuss or further track it, please do so there. For more details please see the [contributors guide](https://arrow.apache.org/docs/developers/contributing.html)"

[2] I used N above because I think it is best to treat the number as a guideline rather than a rule. But I would pick N=2 if we wanted to enforce it as a rule.
Re: [Java] IPC stream write with re-stated dictionaries
Hi Joris,

I do believe this is missing. I believe we worked around this for testing by directly writing dictionary batches to the stream [1].

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java#L614

On Thu, Mar 4, 2021 at 4:06 AM Joris Peeters wrote:

> Hello,
>
> For my use case I'm sending an Arrow IPC-stream from a server to a client, with some columns being dictionary-encoded. Dictionary-encoding happens on the fly, though, so the full dictionary isn't known yet at the beginning of the stream, but rather is computed for every batch, and DictionaryBatches are to be emitted prior to every RecordBatch.
>
> However, unless I am mistaken, this is not currently supported in the ArrowStreamWriter. The dictionary provider is passed in at construction time, the dicts are emitted once, and there is no hook for re-emitting these.
>
> I've locally hacked around this by basically copy-pasting ArrowStreamWriter and extending it with a `public void writeBatch(DictionaryProvider provider)` method, that re-emits the dictionaries prior to emitting the record batches.
>
> However, I'd of course much prefer if the provided ArrowStreamWriter supported this. If people agree that it's missing (i.e. maybe I'm overlooking something obvious) and that it would be useful to have, then I'm happy to contribute it myself (not necessarily by using the aforementioned `writeBatch(provider)` approach, but seems reasonable).
>
> Cheers,
> -J
Re: [Flight Extension] Request for Comments
Regarding the BarrageRecordBatch: I have been concatenating them; it’s one batch with two sets of Arrow payloads. They don’t have separate metadata headers; the update is to be applied atomically.

I have only studied the Java Arrow Flight implementation, and I believe it is usable, perhaps with some minor changes. The piece of code in Flight that does the deserialization takes two parallel lists/iterators: a `Buffer` list (these describe the length of a section of the body payload) and a `FieldNode` list (these describe the number of rows and the null_count). Each field node corresponds to 2-3 buffers, depending on the schema type. Buffers are allowed to have a length of 0 to omit their payloads; this, for example, is how you omit the validity buffer when null_count is zero. The proposed Barrage payload keeps this structural pattern (list of buffers, list of field nodes) with the following modifications:
- we only include field nodes/buffers for subscribed columns
- the first set of field nodes is for added rows; these may be omitted if there are no added rows included in the update
- the second set of field nodes is for modified rows; we omit columns that have no modifications included in the update

I believe the only thing that is missing is the ability to control the field types to be deserialized (like a third list/iterator parallel to the field nodes and buffers). Note that BarrageRecordBatch.addedRowsIncluded, BarrageFieldNode.addedRows, BarrageFieldNode.modifiedRows, and BarrageFieldNode.includedRows (all part of the flatbuffer metadata) are intended to be used by code one layer of abstraction higher than the actual wire-format parser. The parser doesn't really need them except to know which columns to expect in the payload. Technically, we could encode the field nodes/buffers as empty, too (but why be wasteful if this information is already encoded?).

Regarding browser Flight support: was this company FactSet by chance?
(I saw they are mentioned in the JS thread that was recently bumped on the dev list.)

I looked at the ticket and wanted to comment on how we are handling bi-directional streams for our web-ui. We use Arrow Flight's concept of a Ticket to allow a client to create and identify temporary state (new tables / views / REPL sessions / etc). Any bidirectional stream we support also has a server-streaming-only variant with the ability for the client to attach a Ticket to reference/identify that stream. The client may then send a message, out-of-band, to the Ticket. They are sequenced by the client (since gRPC doesn't guarantee ordered delivery) and delivered to the piece of code controlling that server-stream. It does require that the server be a bit stateful, but it works =).

On Thu, Mar 4, 2021 at 6:58 AM David Li wrote:

> Re: the multiple batches, that makes sense. In that case, depending on how exactly the two record batches are laid out, I'd suggest considering a Union of Struct columns (where a Struct is essentially interchangeable with a record batch or table) - that would let you encode two distinct record batches inside the same physical batch. Or if the two batches have identical schemas, you could just concatenate them and include indices in your metadata.
>
> As for browser Flight support - there's an existing ticket: https://issues.apache.org/jira/browse/ARROW-9860
>
> I was sure I had seen another organization talking about browser support recently, but now I can't find them. I'll update here if I do figure it out.
>
> Best,
> David
>
> On Wed, Mar 3, 2021, at 21:00, Nate Bauernfeind wrote:
>
>> > if each payload has two batches with different purposes [...]
>>
>> The purposes of the payloads are slightly different, however they are intended to be applied atomically. If there are guarantees by the table operation generating the updates then those guarantees are only valid on each boundary of applying the update to your local state. In a sense, one is relatively useless without the other. Record batches fit well in map-reduce paradigms/algorithms, but what we have is stateful to enable/support incremental updates. For example, sorting a flight of data is best done map-reduce-style and requires one to re-sort the entire data set when it changes. Our approach focuses on producing incremental updates which are used to manipulate your existing client state using a much smaller footprint (in both time and space). You can imagine, in the sort scenario, if you evaluate the table after adding rows but before modifying existing rows, your table won’t be sorted between the two updates. The client would then need to wait until it receives the pair of RecordBatches anyways, so it seems more natural to deliver them together.
>>
>> > As a side note - is said UI browser-based? Another project recently was planning to look at JavaScript support for Flight (using WebSockets as the transport, IIRC) and it might make sense to join
Re: [C++] Generating random Date64 & Timestamp arrays
Agreed, though keep in mind that rather than "some form of reinterpretation at ArrayData level", you can use the Array::View function, so it would look something like:

auto ty = date64();
auto arr = *rag.Int64(...)->View(ty);

On Thu, Mar 4, 2021 at 3:47 AM Antoine Pitrou wrote:

> Hi Ying,
>
> Yes, this approach sounds reasonable. It would be useful at some point to add random date/timestamp generation to RandomArrayGenerator, though.
>
> Regards
>
> Antoine.
>
> On 04/03/2021 at 04:36, Ying Zhou wrote:
> > Hi,
> >
> > I’d like to generate random Date64 & Timestamp arrays with artificial max and mins. RandomArrayGenerator::ArrayOf in arrow/testing/random.h does not help. Currently the approach I’d like to take is using RandomArrayGenerator::Int64 to generate a random int64 array and then convert it to a date64/timestamp array through some form of reinterpretation at ArrayData level. Does that work? If so is it the best approach? Thanks!
> >
> > Ying
Re: [Rust] Arrow in WebAssemby
I just remembered a bigger issue I ran into. I wanted to read from IPC but I don’t have a file; I do have the data as [u8] already. The current API incurs more copies than necessary (I think) and therefore the performance of reading IPC is worse than in JS. (https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11696)

On Mar 1, 2021 at 23:29:18, Dominik Moritz wrote:

> I am looking forward to speaking with you then. I’ll talk about the motivation.
>
> My experience with the library has been good. I ran into a few limitations that I filed Jiras for. I struggled a bit with some of the error handling and Arc types, but that’s probably because I am not very experienced with Rust and wasm-bindgen doesn’t support all Rust features.
>
> I had some bigger issues with the DataFusion and Parquet libraries as they don’t support wasm right now (also filed Jiras for those).
>
> On Feb 27, 2021 at 11:14:27, Andrew Lamb wrote:
>
>> Hi Dominik,
>>
>> That sounds really interesting -- thank you for the offer.
>>
>> I for one would enjoy seeing a demo and suggest that 10 minutes might be a good length. The next call (details are also on the announcement [1]) is scheduled for Wednesday March 10, 2021 at 09:00 PST / 12:00 EST / 17:00 UTC. The link is https://meet.google.com/ctp-yujs-aee
>>
>> I would personally be interested in hearing about your experience as a user of the Rust library (what was good, what was challenging, how can we improve).
>>
>> Thanks!
>> Andrew
>>
>> [1] https://lists.apache.org/thread.html/raa72e1a8a3ad5dbb8366e9609a041eccca87f85545c3bc3d85170cfc%40%3Cdev.arrow.apache.org%3E
>>
>> On Fri, Feb 26, 2021 at 4:17 AM Fernando Herrera <fernando.j.herr...@gmail.com> wrote:
>>
>> Hi Dominik,
>>
>> I would be interested in a demo. I'm curious to see your implementation and what advantages you have seen over JavaScript.
>>
>> Thanks,
>> Fernando
>>
>> On Thu, Feb 25, 2021 at 10:39 PM Dominik Moritz wrote:
>>
>>> Hello Rust Arrow Devs,
>>>
>>> I have been working on a wasm version of Arrow using the Rust library (https://github.com/domoritz/arrow-wasm). I was wondering whether you would be interested in having me demo it in the Arrow Rust sync call. If so, when would be the next one and how much time would you want to allocate for it? Also, would you be interested for me to dive into something in particular?
>>>
>>> Cheers,
>>> Dominik
Re: [Flight Extension] Request for Comments
Re: the multiple batches, that makes sense. In that case, depending on how exactly the two record batches are laid out, I'd suggest considering a Union of Struct columns (where a Struct is essentially interchangeable with a record batch or table) - that would let you encode two distinct record batches inside the same physical batch. Or if the two batches have identical schemas, you could just concatenate them and include indices in your metadata.

As for browser Flight support - there's an existing ticket: https://issues.apache.org/jira/browse/ARROW-9860

I was sure I had seen another organization talking about browser support recently, but now I can't find them. I'll update here if I do figure it out.

Best,
David

On Wed, Mar 3, 2021, at 21:00, Nate Bauernfeind wrote:

> > if each payload has two batches with different purposes [...]
>
> The purposes of the payloads are slightly different, however they are intended to be applied atomically. If there are guarantees by the table operation generating the updates then those guarantees are only valid on each boundary of applying the update to your local state. In a sense, one is relatively useless without the other. Record batches fit well in map-reduce paradigms/algorithms, but what we have is stateful to enable/support incremental updates. For example, sorting a flight of data is best done map-reduce-style and requires one to re-sort the entire data set when it changes. Our approach focuses on producing incremental updates which are used to manipulate your existing client state using a much smaller footprint (in both time and space). You can imagine, in the sort scenario, if you evaluate the table after adding rows but before modifying existing rows, your table won’t be sorted between the two updates. The client would then need to wait until it receives the pair of RecordBatches anyways, so it seems more natural to deliver them together.
>
> > As a side note - is said UI browser-based? Another project recently was planning to look at JavaScript support for Flight (using WebSockets as the transport, IIRC) and it might make sense to join forces if that’s a path you were also going to pursue.
>
> Yes, our UI runs in the browser, although table operations themselves run on the server to keep the browser lean and fast. That said, the browser isn’t the only target for the API we’re iterating on. We’re engaged in a rewrite to unify our “first-class” Java API for intra-engine (server, heavyweight client) usage and our cross-language (JavaScript/C++/C#/Python) “open” API. Our existing customers use the engine to drive multi-process data applications, REPL/notebook experiences, and dashboards. We are preserving these capabilities as we make the engine available as open source software. One goal of the OSS effort is to produce a singular modern API that’s more interoperable with the data science and development community as a whole. In the interest of minimizing entry/egress points, we are migrating to gRPC for everything in addition to the data IPC layer, so not just the barrage/arrow-flight piece.
>
> The point of all this is to make the Deephaven engine as accessible as possible for a broad user base, including developers using the API from their language of choice or scripts/code running co-located within an engine process. Our software can be used to explore or build applications and visualizations around static as well as real-time data (imagine joins, aggregations, sorts, filters, time-series joins, etc), perform table operations with code or with a few clicks in a GUI, or as a building block in a multi-stage data pipeline. We think making ourselves as interoperable as possible with tools built on Arrow is an important part of attaining this goal.
>
> That said, we have run into quite a few pain points migrating to gRPC, such as: 1) client-side streaming is not supported by any browser; 2) today, server-side streams require a proxy layer of some sort (such as envoy); 3) flatbuffers' javascript/typescript support is a little weak; and I'm sure there are others that aren't coming to mind at the moment. We have some interesting solutions to these problems, but, today, these issues are a decent chunk of our focus. That said, the UI is usable today by our enterprise clients, but it interacts with the server over websockets and a protocol that is heavily influenced by 10 years of existing proprietary java-to-java IPC (which is NOT friendly to being robust over intermittent failures). Today, we’re just heads-down going the gRPC route and hoping that eventually browsers get around to better support for some of this stuff (so, maybe one day a proxy isn’t required, etc). Some of our RPCs make most sense as bidirectional streams, but to support our web-ui we also have a server-streaming variant that we can pass data to “out-of-band” via a unary call
[Java] IPC stream write with re-stated dictionaries
Hello,

For my use case I'm sending an Arrow IPC-stream from a server to a client, with some columns being dictionary-encoded. Dictionary-encoding happens on the fly, though, so the full dictionary isn't known yet at the beginning of the stream, but rather is computed for every batch, and DictionaryBatches are to be emitted prior to every RecordBatch.

However, unless I am mistaken, this is not currently supported in the ArrowStreamWriter. The dictionary provider is passed in at construction time, the dicts are emitted once, and there is no hook for re-emitting these.

I've locally hacked around this by basically copy-pasting ArrowStreamWriter and extending it with a `public void writeBatch(DictionaryProvider provider)` method, that re-emits the dictionaries prior to emitting the record batches.

However, I'd of course much prefer if the provided ArrowStreamWriter supported this. If people agree that it's missing (i.e. maybe I'm overlooking something obvious) and that it would be useful to have, then I'm happy to contribute it myself (not necessarily by using the aforementioned `writeBatch(provider)` approach, but it seems reasonable).

Cheers,
-J
[NIGHTLY] Arrow Build Report for Job nightly-2021-03-04-0
Arrow Build Report for Job nightly-2021-03-04-0

All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0

Failed Tasks:
- conda-linux-gcc-py37-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py38-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-drone-conda-linux-gcc-py38-aarch64
- test-build-vcpkg-win:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-test-build-vcpkg-win
- test-conda-cpp-valgrind:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-turbodbc-latest:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-jpype:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-test-conda-python-3.8-jpype
- test-r-versions:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-test-r-versions
- test-ubuntu-18.04-docs:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-test-ubuntu-18.04-docs
- test-ubuntu-18.04-r-sanitizer:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-test-ubuntu-18.04-r-sanitizer
- wheel-osx-high-sierra-cp36m:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-high-sierra-cp36m
- wheel-osx-high-sierra-cp37m:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-high-sierra-cp37m
- wheel-osx-high-sierra-cp38:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-high-sierra-cp38
- wheel-osx-high-sierra-cp39:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-high-sierra-cp39
- wheel-osx-mavericks-cp36m:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-mavericks-cp36m
- wheel-osx-mavericks-cp37m:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-mavericks-cp37m
- wheel-osx-mavericks-cp38:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-mavericks-cp38
- wheel-osx-mavericks-cp39:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-wheel-osx-mavericks-cp39

Succeeded Tasks:
- centos-7-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-centos-7-amd64
- centos-8-amd64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-github-centos-8-amd64
- conda-clean:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-clean
- conda-linux-gcc-py36-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py36-cpu-r36:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu-r40:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-aarch64:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-drone-conda-linux-gcc-py39-aarch64
- conda-linux-gcc-py39-cpu:
  URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-03-04-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL:
Re: [C++] Generating random Date64 & Timestamp arrays
Hi Ying,

Yes, this approach sounds reasonable. It would be useful at some point to add random date/timestamp generation to RandomArrayGenerator, though.

Regards

Antoine.

On 04/03/2021 at 04:36, Ying Zhou wrote:

Hi,

I’d like to generate random Date64 & Timestamp arrays with artificial max and mins. RandomArrayGenerator::ArrayOf in arrow/testing/random.h does not help. Currently the approach I’d like to take is using RandomArrayGenerator::Int64 to generate a random int64 array and then convert it to a date64/timestamp array through some form of reinterpretation at ArrayData level. Does that work? If so is it the best approach? Thanks!

Ying