Re: [DISCUSS] Updating what are considered reference implementations?
I think this [1] is the thread where the policy was proposed, but it doesn't look like we ever settled on "Java and C++" vs. "any two implementations", or had a vote. I worry that requiring maintainers to add new format features to two "complete" implementations will just lead to fragmentation. People might opt to maintain a fork rather than unblock themselves by implementing a backlog of features they don't need. [1] https://lists.apache.org/thread/9t0pglrvxjhrt4r4xcsc1zmgmbtr8pxj On Fri, Jan 6, 2023 at 12:33 PM Weston Pace wrote: > I think it would be reasonable to state that a reference > implementation must be a complete implementation (i.e. supports all > existing types) that is not derived from another implementation (e.g. > you can't pick pyarrow and arrow-c++). If an implementation does not > plan on ever supporting a new array type then maintainers of that > implementation should be empowered to vote against it. Given that, it > seems like a reasonable burden to ask maintainers to catch up first > before expanding in new directions. > > > On Fri, Jan 6, 2023 at 10:20 AM Micah Kornfield > wrote: > > > > > > > > Note this wording talks about "two reference implementations" not > "*the* > > > two reference implementations". So there can be more than two reference > > > implementations. > > > > > > Maybe reference implementation is the wrong wording here. My main > concern > > is that we try to maintain two "feature complete" implementations at all > > times. I worry if there is a pick 2 from N reference implementations > that > > potentially leads to fragmentation more quickly. But maybe this is > > premature? > > > > Cheers, > > Micah > > > > > > On Fri, Jan 6, 2023 at 10:02 AM Antoine Pitrou > wrote: > > > > > > > > Le 06/01/2023 à 18:58, Micah Kornfield a écrit : > > > > I'm having trouble finding it, but I think we've previously agreed > that > > > new > > > > features needed implementations in 2 reference implementations before > > > > approval (I had thought the community agreed on Java and C++ as the > two > > > > implementations but I can't find the vote thread on it). > > > > > > Note this wording talks about "two reference implementations" not > "*the* > > > two reference implementations". So there can be more than two reference > > > implementations. > > > > > > Regards > > > > > > Antoine. > > > >
Re: [VOTE] Remove compute from Arrow JS
+1 I don't think there's much reason to keep the compute code around when there's a more performant, easier to use alternative. I think the only unique feature of the arrow compute code was the ability to optimize queries on dictionary-encoded columns, but Jeff added this to Arquero almost a year ago now [1]. Brian [1] https://github.com/uwdata/arquero/issues/86 On Wed, Oct 27, 2021 at 4:46 PM Dominik Moritz wrote: > Dear Arrow community, > > We are proposing to remove the compute code from Arrow JS. Right now, the > compute code is encapsulated in a DataFrame class that extends Table. The > DataFrame implements a few functions such as filtering and counting with > expressions. However, the predicate code is not very efficient (it’s > interpreted) and most people only use Arrow to read data but don’t need > compute. There are also more complete alternatives for doing compute on > Arrow data structures such as Arquero (https://github.com/uwdata/arquero). > By removing the compute code, we can focus on the IPC reading/writing and > primitive types. > > The vote will be open for at least 72 hours. > > [ ] +1 Remove compute from Arrow JS > [ ] +0 > [ ] -1 Do not remove compute because… > > Thank you, > Dominik >
Re: Improving PR workload management for Arrow maintainers
I review a decent number of PRs for Apache Beam, and I've built some of my own tooling to help keep track of open PRs. I wrote a script that pulls metadata about all relevant PRs and uses some heuristics to categorize them into: - incoming review - outgoing review - "CC'd" - where I've been mentioned but am not the reviewer or author In the first two cases I try to highlight the ones that need my attention, simply by detecting if I'm the person who took the most recent action or not. This works reasonably well but gets tripped up on several edge cases: 1) The author might push multiple commits before they're actually ready for more feedback. 2) A PR might need feedback from multiple reviewers (e.g. people with domain knowledge of certain areas). I've been planning to make my script stateful so that I can mark a PR as "not my turn" (i.e. unhighlight this until there is more activity), and maybe "never my turn" (i.e. I've finished reviewing this, it's waiting on someone else), to handle these cases. The idea of an "Addressing Feedback" -> "Waiting on Review" label that is automatically transitioned when there is activity would run into these same edge cases. If a reviewer had the ability to bump the label back to "Addressing Feedback", that would at least address #1. I think Wes's proposal (a read-only web UI) would likely also run into these edge cases since it stores no state of its own to deconflict in those situations. Brian On Tue, Jun 29, 2021 at 6:26 AM Wes McKinney wrote: > On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb wrote: > > > > The thing that would make me more efficient reviewing PRs is figuring out > > which of the open reviews are ready for additional feedback. > > Yes, I think this would be the single most significant quality-of-life > improvement for reviewers. > > > I think the idea of a webapp or something that shows active reviews would > > be helpful (though I get most of that from appropriate email filters). > > > > What about a system involving labels (for which there is already a basic > > GUI in github)? Something low tech like > > > > (Waiting for Review) > > (Addressing Feedback) > > (Approved, waiting for Merge) > > > > With maybe some automation prompting people to add the "Waiting on > Review" > > label when they want feedback > > I think it would have to be a bot that automatically sets the labels. > If it requires contributors to take some action outside of pushing new > work (new commits or a rebased version of the patch) to the PR and > leaving responses to comments on the PR, the system is likely to fail > some non-trivial percentage of the time. > Given the quality of off-the-shelf web app components nowadays (e.g. > https://material-ui.com), throwing together a read-only PR dashboard > that shows what has changed since you last interacted with them (along > with some other helpful things, like whether the build is passing) is > "probably" not a super heavy lift. I haven't done any frontend > development in years so while the backend part (writing Python code to > wrangle data from GitHub's REST API and put it in a SQLite database) > wouldn't take very long I would need some help on the front end > portion and setting it up for deployment on DigitalOcean or somewhere. 
> > > Andrew > > > > On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney > wrote: > > > > > hi folks, > > > > > > I've noted that the volume of PRs for Arrow has been steadily > > > increasing (and will likely continue to increase), and while I've > > > personally had less time for development / maintenance / code reviews > > > over the last year, I would like to have a discussion about what we > > > could do to improve our tooling for maintainers to optimize the > > > efficiency of time spent tending to the PR queue. In my own > > > experience, I have felt that I have wasted a lot of time digging > > > around the queue looking for PRs that are awaiting feedback or need to > > > be merged. > > > > > > I note first of all that around 70 out of 173 open PRs have been > > > updated in the last 7 days, so while there is some PR staleness, to > > > have nearly half of the PRs active is pretty good. That said, ~70 > > > active PRs is a lot of PRs to tend to. > > > > > > I scraped the project's code review comment history, and here are the > > > individuals who have left the most comments on PRs since genesis:
> > >
> > > pitrou              6802
> > > wesm                5023
> > > emkornfield         3032
> > > bkietz              2834
> > > kou                 1489
> > > nealrichardson      1439
> > > fsaintjacques       1356
> > > kszucs              1250
> > > alamb               1133
> > > jorisvandenbossche  1094
> > > liyafan82            831
> > > lidavidm             816
> > > westonpace           794
> > > xhochy               770
> > > nevi-me              643
> > > BryanCutler          639
> > > jorgecarleitao       635
> > > cpcloud              551
> > > sunc
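For reference, a minimal Python sketch of the triage heuristic described above. The repo name, the ME login, and the GITHUB_TOKEN environment variable are illustrative assumptions, only the first page of results is fetched, and mention detection for the "CC'd" bucket is omitted:

import os
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
ME = "some-reviewer"  # hypothetical GitHub login

def open_prs(repo):
    # First page of open PRs only, for brevity.
    resp = requests.get(f"{API}/repos/{repo}/pulls", headers=HEADERS,
                        params={"state": "open", "per_page": 100})
    resp.raise_for_status()
    return resp.json()

def last_actor(repo, number):
    # Login of the most recent commenter, or None if there are no comments.
    resp = requests.get(f"{API}/repos/{repo}/issues/{number}/comments",
                        headers=HEADERS, params={"per_page": 100})
    resp.raise_for_status()
    comments = resp.json()
    return comments[-1]["user"]["login"] if comments else None

for pr in open_prs("apache/beam"):
    author = pr["user"]["login"]
    reviewers = {r["login"] for r in pr["requested_reviewers"]}
    if author == ME:
        bucket = "outgoing"
    elif ME in reviewers:
        bucket = "incoming"
    else:
        continue
    # Highlight the PR when the most recent action was not mine.
    my_turn = last_actor("apache/beam", pr["number"]) != ME
    print(f"{bucket:>8} #{pr['number']}{' <- needs attention' if my_turn else ''}")

Note this sketch trips over exactly the edge cases described above: a push of new commits is not a comment, so commit activity would also have to be folded into the "last actor" check.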
Re: [ANNOUNCE] New Arrow committer: Dominik Moritz
Congratulations Dominik! Well deserved! Really excited to see some momentum in the JavaScript library On Wed, Jun 2, 2021 at 2:44 PM Dominik Moritz wrote: > Thank you for the warm welcome, Wes. > > I look forward to continue working with you all on Arrow and in particular > the Arrow JavaScript library. > > Dominik > > On Jun 2, 2021 at 14:19:51, Wes McKinney wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that Dominik has > > accepted an > > invitation to become a committer on Apache Arrow. Welcome, and thank you > > for your contributions! > > > > Wes > > >
Re: Long title on github page
Thank you for bringing this up, Dominik. I sampled some of the descriptions for other Apache projects I frequent; the ones with a meaningful description have a single sentence:

github.com/apache/spark - Apache Spark - A unified analytics engine for large-scale data processing
github.com/apache/beam - Apache Beam is a unified programming model for Batch and Streaming
github.com/apache/avro - Apache Avro is a data serialization system

Several others (Flink, Hadoop, ...) just have "[Mirror of] Apache <project>" as the description. +1 for Nate's suggestion "Apache Arrow is a cross-language development platform for in-memory data. It enables systems to process and transport data more efficiently." On Mon, May 17, 2021 at 5:23 AM Wes McKinney wrote: > It's probably best for description to limit mentions of specific > features. There are some high level features mentioned in the > description now ("computational libraries and zero-copy streaming > messaging and interprocess communication"), but now in 2021 since the > project has grown so much, it could leave people with a limited view > of what they might find here. > > On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas > wrote: > > > > How about > > 'Apache Arrow is a cross-language development platform for in-memory > data. > > It enables systems to process and transport data efficiently, providing a > > simple and fast library for partitioning of large tables'? > > > > Sorry the delay, long election day > > > > On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind < > natebauernfe...@deephaven.io> > > wrote: > > > > > Suggestion: faster -> more efficiently > > > > > > "Apache Arrow is a cross-language development platform for in-memory > > > data. It enables systems to process and transport data more > efficiently." > > > > > > On Sun, May 16, 2021 at 11:35 AM Wes McKinney > wrote: > > > > > > > Here's what there now: > > > > > > > > "Apache Arrow is a cross-language development platform for in-memory > > > > data. It specifies a standardized language-independent columnar > memory > > > > format for flat and hierarchical data, organized for efficient > > > > analytic operations on modern hardware. It also provides > computational > > > > libraries and zero-copy streaming messaging and interprocess > > > > communication…" > > > > > > > > How about something shorter like > > > > > > > > "Apache Arrow is a cross-language development platform for in-memory > > > > data. It enables systems to process and transport data faster." > > > > > > > > Suggestions / refinements from others welcome > > > > > > > > > > > > On Sat, May 15, 2021 at 9:12 PM Dominik Moritz > wrote: > > > > > > > > > > Super minor issue but could someone make the description on GitHub > > > > shorter? > > > > > > > > > > > > > > > > > > > > GitHub puts the description into the title of the page and makes it > > > hard > > > > to find it in URL autocomplete. > > > > > > > > > > > > > > > > > > -- > > > >
Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)
+1 this looks good to me. My only concern is with criterion #3: "Is the underlying encoding of the type already semantically supported by a type?". I think this is a good criterion, but it's inconsistent with the current spec. By that criterion some existing types (Timestamp, Time, Duration, Date) should be well known extension types, right? Perhaps we should explicitly indicate these types are grandfathered in [1] because they existed before extension types, to avoid tension with this criterion. Brian [1] https://en.wikipedia.org/wiki/Grandfather_clause On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Thanks for writing this. > > I agree. That is a good decision tree. +1 > > Best, > Jorge > > > On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield > wrote: > > > The discussion around adding another interval type to the Schema.fbs > raises > > the issue of when we decide to add a new type to the Schema.fbs vs > using > > other means (primarily extension types [1]). > > > > A few criteria come to mind that could help decide (feedback welcome): > > > > 1. Is the type a new parameterization of an existing type? > > - If Yes, and we believe the parameterization is useful and can be > done > > in a forward/backward compatible manner then we would update Schema.fbs. > > > > 2. Does the type itself have its own specification for processing (e.g. > > JSON, BSON, Thrift, Avro, Protobuf)? > > - If yes, we would NOT add them to Schema.fbs. I think this would > > potentially yield too many new types. > > > > 3. Is the underlying encoding of the type already semantically supported > > by a type? (e.g. if we want to encode physical lengths like meters these > > can be represented by an integer). > > - If yes, we would NOT update the specification. This seems like the > > exact use-case that extension types are meant for. > > > > * How does this apply to Interval? * > > Interval extends an existing type in the specification and multiple > "packed > > fields" cannot be easily communicated with the current version of the > > specification. Hence, I feel comfortable making the addition to > Schema.fbs > > > > * What does this mean for other common types? * > > > > I think as types come up that are very common but we don't want to add to > > the Schema.fbs we should invest in formalizing them as "Well Known" > > Extension types. In this scenario, we would update the specification to > > include how to specify the extension type metadata (and still require at > > least two libraries support the Extension type before inclusion as "Well > > Known"). > > > > * Practical implications * > > > > I think this means the type system in Schema.fbs is mostly closed (i.e. > > there is a high bar for adding new types). One potentially useful type to > > have would be a "packed struct" that supports something similar to the python > > struct library [2]. I think this would likely cover many extension type > > use-cases. > > > > Thoughts? > > > > -Micah > > > > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types > > [2] https://docs.python.org/3/library/struct.html > > >
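For anyone unfamiliar with the "packed struct" idea at the end, a quick illustration using Python's standard struct module (the format string here is just a hypothetical layout): a packed struct type would give each array slot a fixed-width binary value like this, with the layout described in the type's metadata.

import struct

# Little-endian: int32, 8-byte string, float32 -> 16 bytes per value
fmt = "<i8sf"
packed = struct.pack(fmt, 42, b"META-001", 1.5)
assert struct.calcsize(fmt) == 16  # 4 + 8 + 4 bytes
print(struct.unpack(fmt, packed))  # (42, b'META-001', 1.5)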
Re: [DISCUSS] How to describe computation on Arrow data?
I agree this would be a great development. It would also be useful for leveraging compute engines from JS via wasm. I've thought about something like this in the context of multi-language relational workloads in Apache Beam, mostly just leading me to wonder if something like it already exists. But so far I haven't found it. On Thu, Mar 18, 2021 at 7:39 AM Wes McKinney wrote: > I completely agree with developing a common “query protocol” or “physical > execution plan” IR + serialization scheme inside Apache Arrow. It may take > some time to stabilize so we should try to avoid being hasty in closing it > to change until more time has elapsed to allow requirements to percolate. > > On Thu, Mar 18, 2021 at 8:17 AM Andy Grove wrote: > > > Hi Paddy, > > > > Thanks for raising this. > > > > Ballista defines computations using protobuf [1] to describe logical and > > physical query plans, which consist of operators and expressions. It is > > actually based on the Gandiva protobuf [2] for describing expressions. > > > > I see a lot of value in standardizing some of this across > implementations. > > Ballista is essentially becoming a distributed scheduler for Arrow and > can > > work with any implementation that supports this protobuf definition of > > query plans. > > > > It would also make it easier to embed C++ in Rust, or Rust in C++, having > > this common IR, so I would be all for having something like this as an > > Arrow specification. > > > > Thanks, > > > > Andy. > > > > [1] > > > > > https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto > > [2] > > > > > https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto > > > > > > On Thu, Mar 18, 2021 at 7:40 AM paddy horan > > wrote: > > > > > Hi All, > > > > > > I do not have a computer science background so I may not be asking this > > in > > > the correct way or using the correct terminology but I wonder if we can > > > achieve some level of standardization when describing computation over > > > Arrow data. > > > > > > At the moment on the Rust side DataFusion clearly has a way to describe > > > computation, I believe that Ballista adds the ability to serialize this > > to > > > allow distributed computation. On the C++ side work is starting on a > > > similar query engine and we already have Gandiva. Is there an > > opportunity > > > to define a kind of IR for computation over Arrow data that could be > > > adopted across implementations? > > > > > > In this case DataFusion could easily incorporate Gandiva to generate > > > optimized compute kernels if they were using the same IR to describe > > > computation. Applications built on Arrow could "describe" computation > in > > > any language and take advantage of innovations across the community; > > adding > > > this to Arrow's zero copy data sharing could be a game changer in my > > mind. > > > I'm not someone who knows enough to drive this forward but I obviously > > > would like to get involved. For some time I was playing around with > > using > > > TVM's relay IR [1] and applying it to Arrow data. > > > > > > As the Arrow memory format has now matured I feel like this could be > the > > > next step. Is there any plan for this kind of work or are we going to > > > allow sub-projects to "go their own way"? > > > > > > Thanks, > > > Paddy > > > > > > [1] Introduction to Relay IR - tvm 0.8.dev0 documentation: > > > https://tvm.apache.org/docs/dev/relay_intro.html > > > > > >
Arrow JS Meetup (02/13)
Hi all, +Dominik Moritz recently reached out to +Paul Taylor and me to set up an Arrow JS meetup with the goal of re-building some momentum around the Arrow JS library. We've scheduled it for this coming Saturday, 02/13 at 11:30 AM PST. Rough Agenda: - Arrow JS Design Principles, Future Plans, and How to Contribute (Paul and Brian) - Lightning Talks from Arrow JS users - Discussions/breakouts as needed If anyone is interested in joining, please reach out to Dominik at domor...@cmu.edu For anyone who can't join - I will try my best to capture notes and share them with the mailing list afterward. Brian
Re: [javascript] streaming IPC examples?
+Paul Taylor, would your work with whatwg streams be relevant here? Are there any examples that would be useful for Ryan? Brian On Sat, Jan 23, 2021 at 4:52 PM Ryan McKinley wrote: > Hello- > > I am exploring options to support streaming in grafana. We have a golang > websocket server and am exploring options to send data to the browser. > > Are there any good examples of reading IPC data with callbacks for each > block? I see examples for mapd, and for reading whole tables -- but am > hoping for something that lets me read initial header data, then get each > record batch as a callback (rxjs) > https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format > > Thanks for any pointers > Ryan >
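For reference, this is what the per-batch pattern looks like in pyarrow: the reader parses the schema (header) up front and then yields one record batch at a time, which is the same shape as the callback API described above. The socket address is a hypothetical stand-in for the websocket source:

import socket

import pyarrow as pa

sock = socket.create_connection(("localhost", 8815))  # hypothetical server
reader = pa.ipc.open_stream(sock.makefile(mode="rb"))
print(reader.schema)      # the stream header is available immediately
for batch in reader:      # each iteration yields one record batch
    print(batch.num_rows)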
Re: [javascript] cant get timestamps in arrow 2.0
Ah good to know, thanks for the clarifications Neal. Clearly I haven't been keeping up very well. On Fri, Dec 18, 2020, 09:49 Neal Richardson wrote: > A few clarifications: Feather, in its version 2, _is_ the Arrow IPC file > format. We've kept the Feather name as a way of referring to Arrow files. > The original Feather file format, which had differences from the Arrow IPC > format, did not support compression. The Arrow IPC format may include > compression (https://issues.apache.org/jira/browse/ARROW-300), but as > Micah > brought up on the user mailing list thread, it's only the C++ > implementation and libraries using it that have implemented it yet, and the > feature is not well documented yet. > > So all Arrow libraries support Feather v2 (as it is the IPC file format), > but currently only C++ (thus Python, R, and glib/Ruby) supports Feather/IPC > files with compression. > > Neal > > On Fri, Dec 18, 2020 at 8:18 AM Brian Hulette wrote: > > > Hi Andrew, > > I'm glad you got this working! The javascript library only implements the > > arrow IPC spec; it doesn't have any special handling for feather and its > > compression support. It's good to know that you can read uncompressed > > feather files, but I'd only expect it to read an IPC stream or file. This > > is what I did for the Intro to Arrow JS notebook [1]; see scrabble.py here > > [2]. Note that python script was written many versions of arrow ago; I'm > > sure there's less boilerplate required for this in pyarrow 2.0. > > > > Support for feather and compression would certainly be a welcome > > contribution > > > > [1] https://observablehq.com/@theneuralbit/introduction-to-apache-arrow > > [2] > https://gist.github.com/TheNeuralBit/64d8cc13050c9b5743281dcf66059de5 > > > > On Thu, Dec 17, 2020 at 10:10 AM Andrew Clancy wrote: > > > > > So, I figured out the issue here - I had to remove compression from the > > > pyarrow feather.write_feather(compression='uncompressed'). Is there any > > way > > > to read a compressed feather file in arrow js? > > > See the comment under the first answer here: > > > > > > > > > https://stackoverflow.com/questions/64629670/how-to-write-a-pandas-dataframe-to-arrow-file/64648955#64648955 > > > I couldn't find anything in the arrow docs or notebooks on this - I'm > > > assuming that's related to javascript compression libraries being so > > > limited. > > > > > > On Mon, 14 Dec 2020 at 19:02, Andrew Clancy wrote: > > > > > > > Hi, > > > > > > > > I have a simple feather file created via a pandas to_feather with a > > > > datetime64[ns] column, and cannot get timestamps in javascript > > > > apache-arrow@2.0.0 > > > > > > > > See this notebook: > > > > https://observablehq.com/@nite/apache-arrow-timestamp-investigation > > > > > > > > I'm guessing I'm missing something, has anyone got any suggestions, > or > > > > decent examples of reading a file created in pandas? I've seen in > > > examples > > > > of apache-arrow@0.3.1 where dates stored as an array of 2 ints. > > > > > > > > File was created with: > > > > > > > > import pandas as pd > > > > pd.read_parquet('sample.parquet') > > > > df.to_feather('sample-seconds.feather') > > > > > > > > Final Q: I'm assuming this is the best place for this question? Happy > > to > > > > post elsewhere if there's any other forums, or if this should be a > JIRA > > > > ticket? > > > > > > > > Thanks! > > > > Andy > > > > > > > > > >
Re: [javascript] cant get timestamps in arrow 2.0
Hi Andrew, I'm glad you got this working! The javascript library only implements the arrow IPC spec; it doesn't have any special handling for feather and its compression support. It's good to know that you can read uncompressed feather files, but I'd only expect it to read an IPC stream or file. This is what I did for the Intro to Arrow JS notebook [1]; see scrabble.py here [2]. Note that python script was written many versions of arrow ago; I'm sure there's less boilerplate required for this in pyarrow 2.0. Support for feather and compression would certainly be a welcome contribution [1] https://observablehq.com/@theneuralbit/introduction-to-apache-arrow [2] https://gist.github.com/TheNeuralBit/64d8cc13050c9b5743281dcf66059de5 On Thu, Dec 17, 2020 at 10:10 AM Andrew Clancy wrote: > So, I figured out the issue here - I had to remove compression from the > pyarrow feather.write_feather(compression='uncompressed'). Is there any way > to read a compressed feather file in arrow js? > See the comment under the first answer here: > > https://stackoverflow.com/questions/64629670/how-to-write-a-pandas-dataframe-to-arrow-file/64648955#64648955 > I couldn't find anything in the arrow docs or notebooks on this - I'm > assuming that's related to javascript compression libraries being so > limited. > > On Mon, 14 Dec 2020 at 19:02, Andrew Clancy wrote: > > > Hi, > > > > I have a simple feather file created via a pandas to_feather with a > > datetime64[ns] column, and cannot get timestamps in javascript > > apache-arrow@2.0.0 > > > > See this notebook: > > https://observablehq.com/@nite/apache-arrow-timestamp-investigation > > > > I'm guessing I'm missing something, has anyone got any suggestions, or > > decent examples of reading a file created in pandas? I've seen in > examples > > of apache-arrow@0.3.1 where dates stored as an array of 2 ints. > > > > File was created with: > > > > import pandas as pd > > pd.read_parquet('sample.parquet') > > df.to_feather('sample-seconds.feather') > > > > Final Q: I'm assuming this is the best place for this question? Happy to > > post elsewhere if there's any other forums, or if this should be a JIRA > > ticket? > > > > Thanks! > > Andy > > >
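For anyone landing here later, a minimal sketch of the writing side in Python (the column contents are made up): a Feather v2 file written with compression disabled is exactly an Arrow IPC file, so either call below should produce something the JS library can read.

import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

df = pd.DataFrame({"when": pd.date_range("2020-01-01", periods=3, freq="S")})
table = pa.Table.from_pandas(df)

# Feather v2 without compression is the Arrow IPC file format:
feather.write_feather(table, "sample.arrow", compression="uncompressed")

# ...equivalent to writing the IPC file format directly:
with pa.ipc.new_file("sample2.arrow", table.schema) as writer:
    writer.write_table(table)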
Re: [Discuss] [Rust] Looking to add Wasm32 compile target for rust library
On Tue, Jul 14, 2020 at 9:36 AM Micah Kornfield wrote: > Hi Adam, > > > This sounds really interesting, how about adding the wasm build (C++) to > > the releases? > > I think this just needs someone to volunteer to do it and maintain it (at a > minimum if it doesn't already exist we need CI for it). We would also need > to figure out details of publishing and integrating it into the release > process. > Yes, a wasm build for the core C++ library would be a welcome addition as well (as long as C++ maintainers agree whatever we do doesn't add a large maintenance burden). As Micah pointed out folks at JPMC have already done some work on this as part of Perspective, but we don't have any support in Arrow itself. I gave this a shot after being encouraged by [1], but ran into issues that I can't recall and gave up. Probably someone with more knowledge of C++ and cmake could get past it, especially given there's an example in Perspective. As far as release/publishing, for Rust there's wasm-pack [2] which would let us publish build artifacts to the npm registry for use in JS. I'm not sure if this is helpful for integrating with Spark or not. FWIW there was another thread [3] about wasm builds for Rust and C++ a while back. [1] https://github.com/apache/arrow/pull/3350#issuecomment-464517253 [2] https://rustwasm.github.io/wasm-pack/book/introduction.html [3] https://lists.apache.org/thread.html/e15dc80debf9dea1b33581fa6ba95fd84b57c0ccd0162505d5d25079%40%3Cdev.arrow.apache.org%3E > I've done a lot of asm.js work (different from wasm) in the past, but my > > assumption would be that using Rust instead of C++ as source for wasm > > should result in smaller wasm binaries. > > I don't know much about either, but I'm curious why you would expect this > to be the case? > > On Tue, Jul 14, 2020 at 8:07 AM Adam Lippai wrote: > > > This sounds really interesting, how about adding the wasm build (C++) to > > the releases? > > I've done a lot of asm.js work (different from wasm) in the past, but my > > assumption would be that using Rust instead of C++ as source for wasm > > should result in smaller wasm binaries. > > Rust Arrow doesn't really use exotic solutions, eg. simd or tokio > > dependency can be turned off. > > > > Having DataFusion + some performant data access in browsers or even in > > node.js would be useful. > > Not needing to build fancy HTTP/GraphQL API over the Rust/C++ impl. but > > moving the data processing code to the client is viable for "small" > > workloads. > > Ofc if JS Arrow lands Flight support this may become less of an issue, > but > > AFAIK it's gRPC based which would need setting up a gRPC reverse proxy > for > > C++/Rust Arrow. > > Overall both the code-duplication and feature fragmentation would > decrease > > by using a single source (like you don't have a full Python impl. for > > obvious reasons) > > > > Best regards, > > Adam Lippai > > > > On Tue, Jul 14, 2020 at 4:27 PM Micah Kornfield > > wrote: > > > >> Fwiw, I believe at least the core c++ library already can be compiled to > >> wasm. I think perspective does this [1] > >> > >> > >> I'm curious: what are you hoping to achieve with embedded wasm in > Spark? > >> > >> Thanks, > >> Micah > >> > >> [1] https://perspective.finos.org/ > >> > >> On Tuesday, July 14, 2020, Brian Hulette wrote: > >> > >> > That sounds great! I'd like to have some support for using the rust > >> and/or > >> > C++ libraries in the browser via wasm as well. 
> >> > As long as the community is ok with your overall approach "to add > >> compiler > >> > conditionals around any I/O features and libc dependent features of > >> these > >> > two libraries," I think it may be best to start with a PR and discuss > >> > specifics from there. > >> > > >> > Do any rust contributors have objections to this? > >> > > >> > Brian > >> > > >> > On Mon, Jul 13, 2020 at 9:42 PM RJ Atwal wrote: > >> > > >> > > Hi all, > >> > > > >> > > Looking for guidance on how to submit a design and PR to add WASM32 > >> > support > >> > > to apache arrow's rust libraries. > >> > > > >> > > I am looking to use the arrow library to pass data in arrow format > >> > between > >> > > the host spark environment and UDFs defined in WASM . > >> > > > >> > > I created the following JIRA ticket to capture the work > >> > > https://issues.apache.org/jira/browse/ARROW-9453 > >> > > > >> > > Thanks, > >> > > RJ > >> > > > >> > > >> > > >
Re: [Discuss] [Rust] Looking to add Wasm32 compile target for rust library
That sounds great! I'd like to have some support for using the rust and/or C++ libraries in the browser via wasm as well. As long as the community is ok with your overall approach "to add compiler conditionals around any I/O features and libc dependent features of these two libraries," I think it may be best to start with a PR and discuss specifics from there. Do any rust contributors have objections to this? Brian On Mon, Jul 13, 2020 at 9:42 PM RJ Atwal wrote: > Hi all, > > Looking for guidance on how to submit a design and PR to add WASM32 support > to apache arrow's rust libraries. > > I am looking to use the arrow library to pass data in arrow format between > the host spark environment and UDFs defined in WASM . > > I created the following JIRA ticket to capture the work > https://issues.apache.org/jira/browse/ARROW-9453 > > Thanks, > RJ >
Re: [JavaScript] how to set column name after creation?
Hi Ryan, Here or user@arrow.apache.org is a fine place to ask :) The metadata on Table/Column/Field objects is all immutable, so doing this right now would require creating a new instance of Table with the field renamed, which takes quite a lot of boilerplate. A helper for renaming a column (or even better a generalization of select [1] that lets you do a full projection, including column renames) would be a great contribution. Here's an example of creating a renamed column, which should get you most of the way to creating a Table with a renamed column: https://observablehq.com/@theneuralbit/renaming-an-arrow-column Brian [1] https://github.com/apache/arrow/blob/ff7ee06020949daf66ac05090753e1a17736d9fa/js/src/table.ts#L249 On Thu, Jun 25, 2020 at 4:04 PM Ryan McKinley wrote: > Apologies if this is the wrong list or place to ask... > > What is the best way to update a column name for a Table in javascript? > > const col = table.getColumnAt(i); > col.name = 'new name!' > > Currently: Cannot assign to 'name' because it is a read-only property > > Thanks! > > ryan >
Re: Why downloading sources of pyarrow and its requirements takes several minutes?
+1 for a jira to track this. I looked into it a little bit just out of curiosity. I passed --verbose to pip to get insight into what's going on in the "Installing build dependencies..." step. I did this for both 0.15.1 and 0.16. They took 4:10 and 5:57 respectively. It looks like 0.16.0 spent 2:43 installing numpy, which is absent from the 0.15.1 log. I'm not sure what changed to cause this. I collected logs with the following command (note it relies on ts in moreutils for adding timestamps):

python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary :all: --verbose 2>&1 | ts | tee /tmp/0.16.0.log

I found the numpy difference and measured its runtime by grepping for "Running setup.py" in these logs. The logs are uploaded to google drive: https://drive.google.com/drive/folders/1rPoYAsVul3HGdrviiCLGPf_P8dOlBCd1?usp=sharing On Fri, May 29, 2020 at 5:49 AM Wes McKinney wrote: > hi Valentyn, > > This is the first I've ever heard of anyone doing what you are doing, > so safe to say that we've given little to no consideration to this use > case. We have been focused on providing binary packages for pip and > conda. Could you please open a JIRA and provide more detailed > information about what you are seeing? > > Thanks > Wes > > On Thu, May 28, 2020 at 4:47 PM Valentyn Tymofieiev > wrote: > > > > Hi Arrow dev community, > > > > Do you have any insight why > > > > python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary > > :all: > > > > takes several minutes to execute? From the output we can see that pip get > > stuck on: > > > > File was already downloaded /tmp/pyarrow-0.16.0.tar.gz > > Installing build dependencies ... | > > > > There is a significant increase in runtime between 0.15.1 and 0.16.0. I > > suspect some build dependencies need to be installed before pip > > understands the dependencies of pyarrow. Is there some inefficiency in > > pyarrow's setup.py that is causing this? > > > > Thanks, > > Valentyn >
Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads
* What kind of devops tooling would be appropriate to provision and manage the instances, scaling up and down based on need? * What CI/CD platform would be appropriate to dispatch work to the cloud nodes (taking into consideration the high costs of sysadmin, and seeking to minimize nodes sitting unused)? I looked into solutions for running CI/CD workers on GCP a (very) little bit and just wanted to share some findings. Appveyor claims it can auto-scale GCE instances [1] but I don't think it would go beyond 5 concurrent "self-hosted" jobs [2]. Would that be a problem? BuildKite has documentation about running agents on a scalable GKE cluster [3], but unfortunately there's no way to auto-scale based on the backlog. We could maybe roll our own/contribute something based on their AWS scaler [4]. [1] https://www.appveyor.com/docs/byoc/gce/ [2] https://www.appveyor.com/pricing/ [3] https://buildkite.com/docs/agent/v3/gcloud#running-the-agent-on-google-kubernetes-engine [4] https://github.com/buildkite/buildkite-agent-scaler On Wed, Mar 11, 2020 at 7:49 PM Micah Kornfield wrote: > > > > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can > > donate cloud compute credits to the project > > Google has offered a donation of GCP credits based on some estimates I made > last year when we were facing Travis CI issues. I'm happy to try to do some > integration work to help make this happen. > > For the other questions, I'm happy to do some research, but also happy if > someone else would like to take up the work here. I think one blocker in > the past has been restrictions from Apache Infra, is there any > documentation on what is and is not supported on that front? > > Thanks, > Micah > On Wed, Mar 11, 2020 at 3:17 PM Wes McKinney wrote: > > > hi folks, > > > > There has periodically been a discussion about employing dedicated > > compute resources to serve our testing needs beyond what can be > > accomplished in free / public CI services like GitHub Actions, > > Appveyor, etc. For example: > > > > * Workloads requiring a CUDA-capable GPU > > * Tests requiring a lot of memory > > * ARM architecture > > > > While physical machines can be hooked up to some CI/CD services like > > Github Actions and Buildkite, I believe we should not be 100% > > dependent on the availability of such hardware (the recent tornado in > > Nashville is a good example of what can go wrong). > > > > At some point it will make sense to be able to provision cloud hosts > > (either temporary spot instances or persistent nodes) to meet these > > needs. This brings up several questions: > > > > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can > > donate cloud compute credits to the project > > * What kind of devops tooling would be appropriate to provision and > > manage the instances, scaling up and down based on need? > > * What CI/CD platform would be appropriate to dispatch work to the > > cloud nodes (taking into consideration the high costs of sysadmin, and > > seeking to minimize nodes sitting unused)? > > > > This will probably take time to work out and there is significant > > engineering involved in achieving any solution, but it would be good > > to have all the options on the table with a frank analysis of the > > pros/cons and costs (both in money and volunteer time) involved. > > > > Thanks, > > Wes > > >
Re: [DISCUSS][Java] Support non-nullable vectors
> And there is a "nullable" metadata-only flag at the > Field level. Could the same kinds of optimizations be implemented in > Java without introducing a "nullable" concept? Note that in item (1) of the proposed changes, Liya Fan did suggest pulling the nullable flag from the Field when the vector is created. Brian On Wed, Mar 11, 2020 at 5:54 AM Fan Liya wrote: > Hi Micah, > > Thanks a lot for your valuable comments. Please see my comments inline. > > > I'm a little concerned that this will change assumptions for at least > some > > of the clients using the library (some might always rely on the validity > > buffer being present). > > I can understand your concern and I am also concerned. > IMO, the client should not depend on this assumption, as the specification > says "Arrays having a 0 null count may choose to not allocate the validity > bitmap." [1] > That being said, I think it would be safe to provide a global flag to > switch on/off the feature (as you suggested). > > > I think this is a good feature to have for the reasons you mentioned. It > > seems like there would need to be some sort of configuration bit to set > for > > this behavior. > > Good suggestion. We should be able to switch on and off the feature with a > single global flag. > > > But, I'd be worried about code complexity this would > > introduce. > > I agree with you that code complexity is an important factor to consider. > IMO, our proposal should not involve too much code change, or increase code > complexity too much. > To prove this, maybe we need to show some small experimental code change. > > Best, > Liya Fan > > [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types > > On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield > wrote: > > > Hi Liya Fan, > > I'm a little concerned that this will change assumptions for at least > some > > of the clients using the library (some might always rely on the validity > > buffer being present). > > > > I think this is a good feature to have for the reasons you mentioned. It > > seems like there would need to be some sort of configuration bit to set > for > > this behavior. But, I'd be worried about code complexity this would > > introduce. > > > > Thanks, > > Micah > > > > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya wrote: > > > > > Hi Wes, > > > > > > Thanks a lot for your quick reply. > > > I think what you mentioned is almost exactly what we want to do in > > Java. The > > > concept is not important. > > > > > > Maybe there are only some minor differences: > > > 1. In C++, the null_count is mutable, while for Java, once a vector is > > > constructed as non-nullable, its null count can only be 0. > > > 2. In C++, a non-nullable array's validity buffer is null, while in > Java, > > > the buffer is an empty buffer, and cannot be changed. > > > > > > Best, > > > Liya Fan > > > > > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney > > wrote: > > > > > > > hi Liya, > > > > > > > > In C++ we elect certain faster code paths when the null count is 0 or > > > > computed to be zero. When the null count is 0, we do not allocate a > > > > validity bitmap. And there is a "nullable" metadata-only flag at the > > > > Field level. Could the same kinds of optimizations be implemented in > > > > Java without introducing a "nullable" concept? > > > > > > > > - Wes > > > > > > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya > wrote: > > > > > > > > > > Dear all, > > > > > > > > > > A non-nullable vector is one that is guaranteed to contain no > nulls. 
> > We > > > > > want to support non-nullable vectors in Java. > > > > > > > > > > *Motivations:* > > > > > 1. It is widely used in practice. For example, in a database > engine, > > a > > > > > column can be declared as not null, so it cannot contain null > values. > > > > > 2.Non-nullable vectors has significant performance advantages > > compared > > > > with > > > > > their nullable conterparts, such as: > > > > > 1) the memory space of the validity buffer can be saved. > > > > > 2) manipulation of the validity buffer can be bypassed > > > > > 3) some if-else branches can be replaced by sequential > instructions > > > (by > > > > > the JIT compiler), leading to high throughput for the CPU pipeline. > > > > > > > > > > *Potential Cost:* > > > > > For nullable vectors, there can be extra checks against the > > > nullablility > > > > > flag. So we must change the code in a way that minimizes the cost. > > > > > > > > > > *Proposed Changes:* > > > > > 1. There is no need to create new vector classes. We add a final > > > boolean > > > > to > > > > > the vector base classes as the nullability flag. The value of the > > flag > > > > can > > > > > be obtained from the field when creating the vector. > > > > > 2. Add a method "boolean isNullable()" to the root interface > > > ValueVector. > > > > > 3. If a vector is non-nullable, its validity buffer should be an > > empty > > > > > buffer (not null, so much of the existing logic can be left > > unchange
Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)
> It seems we should potentially disallow dictionaries to contain null values? +1 - I've always thought it was odd you could encode null values in two different places for dictionary encoded columns. You could argue it's more efficient to encode the nulls in the dictionary, but I think if we're going to allow that we should go further: we know there should only be _one_ index with the NULL value in a dictionary, so why encode an entire validity buffer? Maybe this is one place where a sentinel value makes sense. The mailing list thread where I brought up the idea of nested dictionaries [1] is useful context for item 2. I still think this is a good idea, but I've changed jobs since then and the use-case I described is no longer motivating me to actually implement it. > It seems simpler to keep dictionary encoding at the leaves of the schema. Do we need to go that far? I think we could still allow dictionary encoding at any level of a hierarchy, and just disallow nested dictionaries. Brian [1] https://lists.apache.org/thread.html/37c0480c4c7a48dd298e8459938444afb901bf01dcebd5f8c5f1dee6%40%3Cdev.arrow.apache.org%3E On Sat, Feb 8, 2020 at 10:53 PM Micah Kornfield wrote: > I'd like to understand if anyone is making use of the following features > and if we should revisit them before 1.0. > > 1. Dictionaries can encode null values. > - This becomes error prone for things like parquet. We seem to be > calculating the definition level solely based on the null bitmap. > > I might have missed something but it appears that we only check if a > dictionary contains nulls on the optimized path [1] but not when converting > the dictionary array back to dense, so I think the values written could get > out of sync with the rep/def levels? > > It seems we should potentially disallow dictionaries to contain null > values? > > 2. Dictionaries can contain nested columns which are in turn dictionary encoded > columns. > > - Again we aren't handling this in Parquet today, and I'm wondering if it's > worth the effort. > There was a PR merged a while ago [2] to add a "skipped" integration test > but it doesn't look like anyone has done follow-up work to enable > this/make it pass. > > It seems simpler to keep dictionary encoding at the leaves of the schema. > > Of the two I'm a little more worried that Option #1 will break people if we > decide to disallow it. > > Thoughts? > > Thanks, > Micah > > > [1] > > https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765 > [2] https://github.com/apache/arrow/pull/1848 >
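To make the "two different places" point concrete, a small pyarrow sketch (any implementation would do; the values are made up): both arrays below decode to the same logical values, with the null encoded either in the indices' validity bitmap or in the dictionary itself.

import pyarrow as pa

dictionary = pa.array(["a", "b", None])

# Null via the validity bitmap on the indices:
via_validity = pa.DictionaryArray.from_arrays(pa.array([0, None, 1]), dictionary)
# Null via an index pointing at the null dictionary slot:
via_dictionary = pa.DictionaryArray.from_arrays(pa.array([0, 2, 1]), dictionary)

print(via_validity.to_pylist())    # ['a', None, 'b']
print(via_dictionary.to_pylist())  # ['a', None, 'b']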
Re: [Java] PR Reviewers
I'm still pretty new to the Java implementation, but I can probably help out with some reviews. On Thu, Jan 23, 2020 at 8:41 PM Micah Kornfield wrote: > I mentioned this elsewhere but my intent is to stop doing java reviews for > the immediate future once I wrap up the few that I have requested change > on. > > I'm happy to try to triage incoming Java PRs, but in order to do this, I > need to know which committers have some bandwidth to do reviews (some of > the existing PRs I've tagged people who never responded). > > Thanks, > Micah >
Re: [DISCUSS][JAVA] Correct the behavior of ListVector isEmpty
What about returning null for a null list? It looks like the function currently returns a primitive boolean, so I guess that would be a substantial change, but null seems more correct to me. On Thu, Jan 23, 2020, 21:38 Micah Kornfield wrote: > I would vote for treating nulls as empty. > > On Fri, Jan 10, 2020 at 12:36 AM Ji Liu > wrote: > > > Hi all, > > > > Currently the isEmpty API always returns false in BaseRepeatedValueVector, > > and its subclass ListVector does not override this method. > > This will lead to incorrect results; for example, a ListVector with data > > [1,2], null, [], [5,6] would get [false, false, false, false] which is > not > > right. > > I opened a PR to fix this [1] and am not sure what the right behavior for > > a null value is: should it return [false, false, true, false] or [false, true, > > true, false] ? > > > > > > Thanks, > > Ji Liu > > > > > > [1] https://github.com/apache/arrow/pull/6044 > > > > >
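The semantics in question, sketched with pyarrow rather than the Java API, just to show why a three-valued result is natural here: a null list and an empty list are distinct values, so an "is empty" derived from list lengths comes out false/null/true.

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([[1, 2], None, [], [5, 6]])
lengths = pc.list_value_length(arr)      # null lists yield null lengths
print(lengths.to_pylist())               # [2, None, 0, 2]
print(pc.equal(lengths, 0).to_pylist())  # [False, None, True, False]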
[jira] [Created] (ARROW-7674) Add helpful message for captcha challenge in merge_arrow_pr.py
Brian Hulette created ARROW-7674: Summary: Add helpful message for captcha challenge in merge_arrow_pr.py Key: ARROW-7674 URL: https://issues.apache.org/jira/browse/ARROW-7674 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Brian Hulette Assignee: Brian Hulette After an incorrect password, Jira starts requiring a captcha challenge. When this happens with merge_arrow_pr.py it's difficult to distinguish from any other failed login attempt. We should print a helpful message when this happens. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: pyarrow and macOS 10.15
Thanks Wes. I'm not sure about static linking but it seems likely; I'll start a discussion on https://issues.apache.org/jira/browse/BEAM-8368. On Fri, Oct 11, 2019 at 10:17 AM Wes McKinney wrote: > Does Apache Beam statically-link Protocol Buffers? > > I opened https://issues.apache.org/jira/browse/ARROW-6860 > > It would be great if the Beam community could work with us to resolve > issues around shipping C++ Protocol Buffers. We don't want you to be > stuck on pyarrow 0.13.0 and have your users be subjected to bugs and > other issues. > > On Thu, Oct 10, 2019 at 3:11 PM Brian Hulette wrote: > > > > In Beam we've had a few users report issues importing Beam Python after > > upgrading to macOS 10.15 Catalina, and it seems like our pyarrow import > is > > the root cause [1]. Given that I don't see any reports of this on the > arrow > > side I suspect that this is an issue just with pyarrow 0.14 (in Beam > we've > > restricted to <0.15 [2]), can anyone confirm that the pypi release of > > pyarrow 0.15 is working on macOS 10.15? > > > > Thanks, > > Brian > > > > [1] https://issues.apache.org/jira/browse/BEAM-8368 > > [2] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L122 >
pyarrow and macOS 10.15
In Beam we've had a few users report issues importing Beam Python after upgrading to macOS 10.15 Catalina, and it seems like our pyarrow import is the root cause [1]. Given that I don't see any reports of this on the arrow side I suspect that this is an issue just with pyarrow 0.14 (in Beam we've restricted to <0.15 [2]), can anyone confirm that the pypi release of pyarrow 0.15 is working on macOS 10.15? Thanks, Brian [1] https://issues.apache.org/jira/browse/BEAM-8368 [2] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L122
Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield
Congratulations Micah! Well deserved :) On Fri, Aug 9, 2019 at 9:02 AM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > Congrats! > > well deserved. > > On Fri, Aug 9, 2019 at 11:12 AM Wes McKinney wrote: > > > > The Project Management Committee (PMC) for Apache Arrow has invited > > Micah Kornfield to become a PMC member and we are pleased to announce > > that Micah has accepted. > > > > Congratulations and welcome! >
Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type
I'm a little confused about the proposal now. If the unknown dimension doesn't have to be the same within a record batch, how would you be able to deduce it with the approach you described (dividing the logical length of the values array by the length of the record batch)? On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney wrote: > I agree this sounds like a good application for ExtensionType. At > minimum, ExtensionType can be used to develop a working version of > what you need to help guide further discussions. > > On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques > wrote: > > > > Hello, > > > > if each record has a different size, then I suggest to just use a > > Struct<List<Dim>> where Dim is a struct (or expand in the outer > > struct). You can probably add your own logic with the recently > > introduced ExtensionType [1]. > > > > François > > [1] > https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h > > > > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper > wrote: > > > > > > The intention is that each individual record could have a different > size. > > > This could be consistent within a given batch, but wouldn't need to be. > > > For example, if I wanted to send a 3-channel image, but the image size > may > > > vary for each record, then I could use > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[3]>[-1]>[-1]. > > > > > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette > wrote: > > > > > > > This isn't really relevant but I feel compelled to point it out - the > > > > FixedSizeList type has actually been in the Arrow spec for a while, > but it > > > > was only implemented in JS and Java initially. It was implemented in > C++ > > > > just a few months ago. > > > > > > > > > > Thanks for the clarification -- I was going based on the blame history > for > > > Layout.rst, but I guess it just didn't get officially documented there > > > until the c++ implementation was added. > > > > > > -Edward > > > > > > > > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper > > > > > wrote: > > > > > > > > > The FixedSizeList type, which was added to Arrow a few months ago, > is an > > > > > array where each slot contains a fixed-size sequence of values. > It is > > > > > specified as FixedSizeList<T>[N], where T is a child type and N is > a > > > > signed > > > > > int32 that specifies the length of each list. > > > > > > > > > > This is useful for encoding fixed-size tensors. E.g., if I have a > > > > 100x8x10 > > > > > tensor, then I can encode it as > > > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[8]>[100]. > > > > > > > > > > But I'm also interested in encoding tensors where some dimension > sizes > > > > are > > > > > not known in advance. It seems to me that FixedSizeList could be > > > > extended > > > > > to support this fairly easily, by simply defining that N=-1 means > "each > > > > > array slot has the same length, but that length is not known in > advance." > > > > > So e.g. we could encode a 100x?x10 tensor as > > > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[-1]>[100]. > > > > > > > > > > Since these N=-1 row-lengths are not encoded in the type, we need > some > > > > way > > > > > to determine what they are. Luckily, every Field in the schema > has a > > > > > corresponding FieldNode in the message; and those FieldNodes can > be used > > > > to > > > > > deduce the row lengths. In particular, the row length must be > equal to > > > > the > > > > > length of the child node divided by the length of the > FixedSizeList. 
> > > > E.g., > > > > > if we have a FixedSizeList[-1] array with the values [[1, > 2], [3, > > > > 4], > > > > > [5, 6]] then the message representation is: > > > > > > > > > > * Length: 3, Null count: 0 > > > > > * Null bitmap buffer: Not required > > > > > * Values array (byte array): > > > > > * Length: 6, Null count: 0 > > > > > * Null bitmap buffer: Not required > > > > > * Value buffer: [1, 2, 3, 4, 5, 6, ] > > > > > > > > > > So we can deduce that the row length is 6/3=2. > > > > > > > > > > It looks to me like it would be fairly easy to add support for > this. > > > > E.g., > > > > > in the FixedSizeListArray constructor in c++, if > list_type()->list_size() > > > > > is -1, then set list_size_ to values.length()/length. There would > be no > > > > > changes to the schema.fbs/message.fbs files -- we would just be > > > > assigning a > > > > > meaning to something that's currently meaningless (having > > > > > FixedSizeList.listSize=-1). > > > > > > > > > > If there's support for adding this to Arrow, then I could put > together a > > > > > PR. > > > > > > > > > > Thanks, > > > > > -Edward > > > > > > > > > > P.S. Apologies if this gets posted twice -- I sent it out a couple > days > > > > ago > > > > > right before subscribing to the mailing list; but I don't see it > on the > > > > > archives, presumably because I wasn't subscribed yet when I sent > it out. > > > > > > > > > >
Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type
I think it may be helpful to clarify what you mean by dimensions that are not known in advance. I believe the intention here is that this unknown dimension is consistent within a record batch, but it is allowed to vary from batch to batch. Otherwise, I would say you could just delay creating the schema until you do know the unknown dimension. This isn't really relevant but I feel compelled to point it out - the FixedSizeList type has actually been in the Arrow spec for a while, but it was only implemented in JS and Java initially. It was implemented in C++ just a few months ago. On Mon, Jul 29, 2019 at 7:01 AM Edward Loper wrote: > The FixedSizeList type, which was added to Arrow a few months ago, is an > array where each slot contains a fixed-size sequence of values. It is > specified as FixedSizeList<T>[N], where T is a child type and N is a signed > int32 that specifies the length of each list. > > This is useful for encoding fixed-size tensors. E.g., if I have a 100x8x10 > tensor, then I can encode it as > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[8]>[100]. > > But I'm also interested in encoding tensors where some dimension sizes are > not known in advance. It seems to me that FixedSizeList could be extended > to support this fairly easily, by simply defining that N=-1 means "each > array slot has the same length, but that length is not known in advance." > So e.g. we could encode a 100x?x10 tensor as > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[-1]>[100]. > > Since these N=-1 row-lengths are not encoded in the type, we need some way > to determine what they are. Luckily, every Field in the schema has a > corresponding FieldNode in the message; and those FieldNodes can be used to > deduce the row lengths. In particular, the row length must be equal to the > length of the child node divided by the length of the FixedSizeList. E.g., > if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3, 4], > [5, 6]] then the message representation is:
>
> * Length: 3, Null count: 0
> * Null bitmap buffer: Not required
> * Values array (byte array):
>     * Length: 6, Null count: 0
>     * Null bitmap buffer: Not required
>     * Value buffer: [1, 2, 3, 4, 5, 6, ]
>
> So we can deduce that the row length is 6/3=2. > > It looks to me like it would be fairly easy to add support for this. E.g., > in the FixedSizeListArray constructor in c++, if list_type()->list_size() > is -1, then set list_size_ to values.length()/length. There would be no > changes to the schema.fbs/message.fbs files -- we would just be assigning a > meaning to something that's currently meaningless (having > FixedSizeList.listSize=-1). > > If there's support for adding this to Arrow, then I could put together a > PR. > > Thanks, > -Edward > > P.S. Apologies if this gets posted twice -- I sent it out a couple days ago > right before subscribing to the mailing list; but I don't see it on the > archives, presumably because I wasn't subscribed yet when I sent it out. >
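The deduction in the example can be checked against pyarrow's existing FixedSizeList support (a sketch; int8 stands in for the byte type above): the row length always equals len(values) / len(array), which is exactly what a reader could recover from the FieldNode lengths if N were -1.

import pyarrow as pa

values = pa.array([1, 2, 3, 4, 5, 6], type=pa.int8())
arr = pa.FixedSizeListArray.from_arrays(values, 2)  # [[1, 2], [3, 4], [5, 6]]

assert len(arr) == 3 and len(arr.values) == 6
row_length = len(arr.values) // len(arr)
print(row_length)  # 2 -- recoverable without N being stored in the type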
Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)
To me, the most important aspect of this proposal is the addition of sparse encodings, and I'm curious if there are any more objections to that specifically. So far I believe the only one is that it will make computation libraries more complicated. This is absolutely true, but I think it's worth that cost. It's been suggested on this list and elsewhere [1] that sparse encodings that can be operated on without fully decompressing should be added to the Arrow format. The longer we continue to develop computation libraries without considering those schemes, the harder it will be to add them. [1] https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html On Sat, Jul 13, 2019 at 9:35 AM Wes McKinney wrote: > On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou > wrote: > > > > On Fri, 12 Jul 2019 20:37:15 -0700 > > Micah Kornfield wrote: > > > > > > If the latter, I wonder why Parquet cannot simply be used instead of > > > > reinventing something similar but different. > > > > > > This is a reasonable point. However, there is a continuum here between file > > > size and read and write times. Parquet will likely always be the smallest > > > with the largest times to convert to and from Arrow. An uncompressed > > > Feather/Arrow file will likely always take the most space but will have much > > > faster conversion times. > > > > I'm curious whether the Parquet conversion times are inherent to the > > Parquet format or due to inefficiencies in the implementation. > > > > Parquet is fundamentally more complex to decode. Consider several > layers of logic that must happen for values to end up in the right > place: > > * Data pages are usually compressed, and a column consists of many > data pages each having a Thrift header that must be deserialized > * Values are usually dictionary-encoded, dictionary indices are > encoded using a hybrid bit-packed / RLE scheme > * Null/not-null is encoded in definition levels > * Only non-null values are stored, so when decoding to Arrow, values > have to be "moved into place" > > The current C++ implementation could certainly be made faster. One > consideration with Parquet is that the files are much smaller, so when > you are reading them over the network the effective end-to-end time > including IO and deserialization will frequently win. > > > Regards > > > > Antoine. > > > >
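To make Wes's "moved into place" step concrete, here is a hedged TypeScript sketch (toy types; a flat, optional, non-dictionary column is assumed; this is not the actual arrow C++ decoder) of scattering Parquet's densely packed non-null values back into Arrow's positional layout using definition levels:

```ts
// For a flat optional column, definition level 1 means "value present" and
// 0 means null. Parquet stores only the present values, densely packed.
function scatterNonNull(
  defLevels: Uint8Array,
  packed: Float64Array,
): { values: Float64Array; validity: Uint8Array } {
  const values = new Float64Array(defLevels.length);
  const validity = new Uint8Array(Math.ceil(defLevels.length / 8));
  let next = 0;
  for (let i = 0; i < defLevels.length; i++) {
    if (defLevels[i] === 1) {
      values[i] = packed[next++];       // move the next packed value into slot i
      validity[i >> 3] |= 1 << (i & 7); // set the Arrow validity bit
    }
  }
  return { values, validity };
}

// A column [10.5, null, 11.5] arrives as defLevels [1, 0, 1] and packed [10.5, 11.5].
scatterNonNull(Uint8Array.from([1, 0, 1]), Float64Array.from([10.5, 11.5]));
```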
[jira] [Created] (ARROW-5741) [JS] Make numeric vector from functions consistent with TypedArray.from
Brian Hulette created ARROW-5741: Summary: [JS] Make numeric vector from functions consistent with TypedArray.from Key: ARROW-5741 URL: https://issues.apache.org/jira/browse/ARROW-5741 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Described in https://lists.apache.org/thread.html/b648a781cba7f10d5a6072ff2e7dab6c03e2d1f12e359d9261891486@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5740) [JS] Add ability to run tests in headless browsers
Brian Hulette created ARROW-5740: Summary: [JS] Add ability to run tests in headless browsers Key: ARROW-5740 URL: https://issues.apache.org/jira/browse/ARROW-5740 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Now that we have a compatibility check that modifies behavior based on the features in a supported browser, we should really be running our tests in various browsers to exercise the various cases. For example right now we don't actually run tests on the non-BigNum code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5714) [JS] Inconsistent behavior in Int64Builder with/without BigNum
Brian Hulette created ARROW-5714: Summary: [JS] Inconsistent behavior in Int64Builder with/without BigNum Key: ARROW-5714 URL: https://issues.apache.org/jira/browse/ARROW-5714 Project: Apache Arrow Issue Type: Bug Reporter: Brian Hulette Assignee: Brian Hulette Fix For: 0.14.0 When the Int64Builder is used in a context without BigNum, appending two numbers combines them into a single Int64:
{noformat}
> v = Arrow.Builder.new({type: new Arrow.Int64()}).append(1).append(2).finish().toVector()
> v.get(0)
Int32Array [ 1, 2 ]
{noformat}
Whereas the same process with BigNum creates two new Int64s. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
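For context, a short sketch (plain TypeScript; it assumes the two-words-per-value layout shown above) of why the two appends collapse into one value: each Int64 slot is two consecutive 32-bit words, so raw 32-bit writes of 1 and 2 form the single value 2*2^32 + 1.

```ts
// The words the non-BigNum builder effectively wrote for .append(1).append(2):
const words = Int32Array.from([1, 2]);

// Read back as one little-endian 64-bit integer, the two 32-bit halves form
// a single value rather than two separate ones.
const asInt64 = (BigInt(words[1]) << 32n) | BigInt(words[0] >>> 0);
console.log(asInt64); // 8589934593n === 2n * 2n ** 32n + 1n
```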
[jira] [Created] (ARROW-5689) [JS] Remove hard-coded Field.nullable
Brian Hulette created ARROW-5689: Summary: [JS] Remove hard-coded Field.nullable Key: ARROW-5689 URL: https://issues.apache.org/jira/browse/ARROW-5689 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Context: https://github.com/apache/arrow/pull/4502#discussion_r296390833 This isn't a huge issue since we can just elide validity buffers when null count is zero, but sometimes it's desirable to be able to assert a Field is _never_ null. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5688) [JS] Add test for EOS in File Format
Brian Hulette created ARROW-5688: Summary: [JS] Add test for EOS in File Format Key: ARROW-5688 URL: https://issues.apache.org/jira/browse/ARROW-5688 Project: Apache Arrow Issue Type: Task Reporter: Brian Hulette Either in a unit test, or in the integration tests -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5491) Remove unnecessary semicolons following MACRO definitions
Brian Hulette created ARROW-5491: Summary: Remove unnecessary semicolons following MACRO definitions Key: ARROW-5491 URL: https://issues.apache.org/jira/browse/ARROW-5491 Project: Apache Arrow Issue Type: Task Components: C++ Affects Versions: 0.13.0 Reporter: Brian Hulette Assignee: Brian Hulette Fix For: 0.14.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[JS] Proposal for numeric vector `from` functions
I think the current behavior of `from` functions on IntVector and FloatVector can be quite confusing for new arrow users. The current behavior can be summarized as: - if the argument is any type of TypedArray (including one of a mismatched type), create a new vector backed by that array's buffer. - otherwise, treat it as an iterable of numbers, and convert them as needed - ... unless we're making an Int64Vector, then treat each input as a 32-bit number and pack pairs together This can give users very unexpected results. For example, you might expect arrow.Int32Vector.from(Float32Array.from([1.0,2.0,3.0])) to yield a vector with the values [1,2,3] - but it doesn't, it gives you the integers that result from re-interpreting that buffer of floating point numbers as integers. I put together a notebook with some more examples of this confusing behavior, compared to TypedArray.from: https://observablehq.com/d/6aa80e43b5a97361 I'd like to propose that we re-write these from functions with the following behavior: - iff the argument is an ArrayBuffer or a TypedArray of the same numeric type, create a new vector backed by that array's buffer. - otherwise, treat it as an iterable of numbers and convert to the appropriate type. - no exceptions for Int64 If users really want to preserve the current behavior and use a TypedArray's memory directly without converting, even when the types are mismatched, they can still just access the underlying ArrayBuffer and pass that in. So arrow.Int32Vector.from(Float32Array.from([1.0,2.0,3.0])) would yield a vector with [1,2,3], but you could still use arrow.Int32Vector.from(Float32Array.from([1.0,2.0,3.0]).buffer) to replicate the current behavior. Removing the special case for Int64 does make it a little easier to shoot yourself in the foot by exceeding JS numbers' 53-bit precision, so maybe we should mitigate that somehow, but I don't think combining pairs of numbers is the right way to do that. Maybe a warning? What do you all think? If there's consensus on this I'd like to make the change prior to 0.14 to minimize the number of releases with the current behavior. Brian
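The surprise is easy to reproduce with plain typed arrays (runnable TypeScript; the printed values assume a little-endian machine, and the arrow.Int32Vector lines restate the proposal above rather than any released API):

```ts
const floats = Float32Array.from([1.0, 2.0, 3.0]);

// What IntVector.from currently does, in effect: reuse the buffer and
// reinterpret the float bit patterns as int32s.
console.log(new Int32Array(floats.buffer)); // Int32Array [1065353216, 1073741824, 1077936128]

// What TypedArray.from, and the proposal, do: convert element-wise.
console.log(Int32Array.from(floats)); // Int32Array [1, 2, 3]

// Under the proposal, arrow.Int32Vector.from(floats) would match the
// element-wise behavior, while arrow.Int32Vector.from(floats.buffer)
// would keep today's zero-copy reinterpretation.
```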
Confluence edit access
Can I get edit access on confluence? I wanted to answer some of the questions about JS here: https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone My username is bhulette Thanks! Brian
[jira] [Created] (ARROW-5313) [Format] Comments on Field table are a bit confusing
Brian Hulette created ARROW-5313: Summary: [Format] Comments on Field table are a bit confusing Key: ARROW-5313 URL: https://issues.apache.org/jira/browse/ARROW-5313 Project: Apache Arrow Issue Type: Task Components: Format Affects Versions: 0.13.0 Reporter: Brian Hulette Assignee: Brian Hulette Currently Schema.fbs has two different explanations of {{Field.children}}. One says "children is only for nested Arrow arrays" and the other says "children apply only to nested data types like Struct, List and Union". I think both are technically correct, but the latter is much more explicit; we should remove the former. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC1
+1 (non-binding) Ran `dev/release/js-verify-release-candidate.sh 0.4.1 1` with Node v11.12.0 On Thu, Mar 21, 2019 at 1:54 PM Krisztián Szűcs wrote: > +1 (binding) > > Ran `dev/release/js-verify-release-candidate.sh 0.4.1 1` > with Node v11.12.0 on OSX 10.14.3 and it looks good. > > On Thu, Mar 21, 2019 at 8:45 PM Krisztián Szűcs > > wrote: > > > Hello all, > > > > I would like to propose the following release candidate (rc1) of Apache > > Arrow JavaScript version 0.4.1. This is the second release candidate, > > including the fix for node version requirement [3]. > > > > The source release rc1 is hosted at [1]. > > > > This release candidate is based on commit > > e9cf83c48b9740d42b5d18158e61c0962fda59c1 > > > > Please download, verify checksums and signatures, run the unit tests, and > > vote > > on the release. The easiest way is to use the JavaScript-specific release > > verification script dev/release/js-verify-release-candidate.sh. > > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1 > > [ ] +0 > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 because... > > > > > > How to validate a release signature: > > https://httpd.apache.org/dev/verification.html > > > > [1]: > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc1/ > > [2]: > > > https://github.com/apache/arrow/tree/e9cf83c48b9740d42b5d18158e61c0962fda59c1 > > [3]: https://github.com/apache/arrow/pull/4006/ > > >
[jira] [Created] (ARROW-4991) [CI] Bump travis node version to 11.12
Brian Hulette created ARROW-4991: Summary: [CI] Bump travis node version to 11.12 Key: ARROW-4991 URL: https://issues.apache.org/jira/browse/ARROW-4991 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Brian Hulette Assignee: Brian Hulette Fix For: JS-0.4.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0
I just merged https://github.com/apache/arrow/pull/4006 that bumps the node requirement to 11.12 to avoid this issue. Krisztian, can you cut an RC1 with that change included? Brian On Thu, Mar 21, 2019 at 10:06 AM Brian Hulette wrote: > It looks like this was an issue with node v11.11 that was resolved in > v11.12 [1,2]. Can you try upgrading and running again? > > [1] > https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V11.md#2019-03-15-version-11120-current-bridgear > [2] https://github.com/nodejs/node/pull/26488 > > On Thu, Mar 21, 2019 at 8:00 AM Uwe L. Korn wrote: > >> This sadly fails locally for me on OSX High Sierra: >> >> ``` >> + npm run test >> >> > apache-arrow@0.4.1 test >> /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1 >> > NODE_NO_WARNINGS=1 gulp test >> >> [15:23:02] Using gulpfile >> /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1/gulpfile.js >> [15:23:02] Starting 'test'... >> [15:23:02] Starting 'test:ts'... >> [15:23:02] Starting 'test:src'... >> [15:23:02] Starting 'test:apache-arrow'... >> >> ● Test suite failed to run >> >> TypeError: Cannot assign to read only property >> 'Symbol(Symbol.toStringTag)' of object '#<process>' >> >> at exports.default >> (node_modules/jest-util/build/create_process_object.js:15:34) >> ``` >> >> This is the same error as in the nightlies but the fix there doesn't help >> for me locally. >> >> Uwe >> >> On Thu, Mar 21, 2019, at 2:41 AM, Brian Hulette wrote: >> > +1 (non-binding) >> > >> > Ran js-verify-release-candidate.sh on Archlinux w/ node v11.12.0 >> > >> > Thanks Krisztian! >> > Brian >> > >> > On Wed, Mar 20, 2019 at 5:40 PM Paul Taylor wrote: >> > >> > > +1 non-binding >> > > >> > > Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High >> > > Sierra w/ node v11.6.0 >> > > >> > > >> > > On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou >> wrote: >> > > >> > > > +1 (binding) >> > > > >> > > > I ran the following on Debian GNU/Linux sid: >> > > > >> > > > * dev/release/js-verify-release-candidate.sh 0.4.1 0 >> > > > >> > > > with: >> > > > >> > > > * Node.js v11.12.0 >> > > > >> > > > Thanks, >> > > > -- >> > > > kou >> > > > >> > > > In > z...@mail.gmail.com> >> > > > "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019 >> > > > 00:09:54 +0100, >> > > > Krisztián Szűcs wrote: >> > > > >> > > > > Hello all, >> > > > > >> > > > > I would like to propose the following release candidate (rc0) of >> Apache >> > > > > Arrow JavaScript version 0.4.1. >> > > > > >> > > > > The source release rc0 is hosted at [1]. >> > > > > >> > > > > This release candidate is based on commit >> > > > > f55542eeb59dde8ff4512c707b9eca1b43b62073 >> > > > > >> > > > > Please download, verify checksums and signatures, run the unit >> tests, >> > > and >> > > > > vote >> > > > > on the release. The easiest way is to use the JavaScript-specific >> > > release >> > > > > verification script dev/release/js-verify-release-candidate.sh. >> > > > > >> > > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1 >> > > > > [ ] +0 >> > > > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 >> because...
>> > > > > >> > > > > >> > > > > How to validate a release signature: >> > > > > https://httpd.apache.org/dev/verification.html >> > > > > >> > > > > [1]: >> > > > >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/ >> > > > > [2]: >> > > > > >> > > > >> > > >> https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073 >> > > > >> > > >> > >> >
[jira] [Created] (ARROW-4988) Bump required node version to 11.12
Brian Hulette created ARROW-4988: Summary: Bump required node version to 11.12 Key: ARROW-4988 URL: https://issues.apache.org/jira/browse/ARROW-4988 Project: Apache Arrow Issue Type: Bug Reporter: Brian Hulette Assignee: Brian Hulette The cause of ARROW-4948 and http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C5ce620e0-0063-4bee-8ad6-a41301ac08c4%40www.fastmail.com%3E was actually a regression in node v11.11, resolved in v11.12; see https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V11.md#2019-03-15-version-11120-current-bridgear and https://github.com/nodejs/node/pull/26488. Bump the requirement up to 11.12. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0
It looks like this was an issue with node v11.11 that was resolved in v11.12 [1,2]. Can you try upgrading and running again? [1] https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V11.md#2019-03-15-version-11120-current-bridgear [2] https://github.com/nodejs/node/pull/26488 On Thu, Mar 21, 2019 at 8:00 AM Uwe L. Korn wrote: > This sadly fails locally for me on OSX High Sierra: > > ``` > + npm run test > > > apache-arrow@0.4.1 test > /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1 > > NODE_NO_WARNINGS=1 gulp test > > [15:23:02] Using gulpfile > /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1/gulpfile.js > [15:23:02] Starting 'test'... > [15:23:02] Starting 'test:ts'... > [15:23:02] Starting 'test:src'... > [15:23:02] Starting 'test:apache-arrow'... > > ● Test suite failed to run > > TypeError: Cannot assign to read only property > 'Symbol(Symbol.toStringTag)' of object '#<process>' > > at exports.default > (node_modules/jest-util/build/create_process_object.js:15:34) > ``` > > This is the same error as in the nightlies but the fix there doesn't help > for me locally. > > Uwe > > On Thu, Mar 21, 2019, at 2:41 AM, Brian Hulette wrote: > > +1 (non-binding) > > > > Ran js-verify-release-candidate.sh on Archlinux w/ node v11.12.0 > > > > Thanks Krisztian! > > Brian > > > > On Wed, Mar 20, 2019 at 5:40 PM Paul Taylor wrote: > > > > > +1 non-binding > > > > > > Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High > > > Sierra w/ node v11.6.0 > > > > > > > > > On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou > wrote: > > > > > > > +1 (binding) > > > > > > > > I ran the following on Debian GNU/Linux sid: > > > > > > > > * dev/release/js-verify-release-candidate.sh 0.4.1 0 > > > > > > > > with: > > > > > > > > * Node.js v11.12.0 > > > > > > > > Thanks, > > > > -- > > > > kou > > > > > > > > In z...@mail.gmail.com> > > > > "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019 > > > > 00:09:54 +0100, > > > > Krisztián Szűcs wrote: > > > > > > > > > Hello all, > > > > > > > > > > I would like to propose the following release candidate (rc0) of Apache > > > > > Arrow JavaScript version 0.4.1. > > > > > > > > > > The source release rc0 is hosted at [1]. > > > > > > > > > > This release candidate is based on commit > > > > > f55542eeb59dde8ff4512c707b9eca1b43b62073 > > > > > > > > > > Please download, verify checksums and signatures, run the unit tests, > > > and > > > > > vote > > > > > on the release. The easiest way is to use the JavaScript-specific > > > release > > > > > verification script dev/release/js-verify-release-candidate.sh. > > > > > > > > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1 > > > > > [ ] +0 > > > > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 > because... > > > > > > > > > > > > > > > How to validate a release signature: > > > > > https://httpd.apache.org/dev/verification.html > > > > > > > > > > [1]: > > > > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/ > > > > > [2]: > > > > > > > > > > > > > https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073 > > > > > > > > > >
Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0
+1 (non-binding) Ran js-verify-release-candidate.sh on Archlinux w/ node v11.12.0 Thanks Krisztian! Brian On Wed, Mar 20, 2019 at 5:40 PM Paul Taylor wrote: > +1 non-binding > > Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High > Sierra w/ node v11.6.0 > > > On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou wrote: > > > +1 (binding) > > > > I ran the following on Debian GNU/Linux sid: > > > > * dev/release/js-verify-release-candidate.sh 0.4.1 0 > > > > with: > > > > * Node.js v11.12.0 > > > > Thanks, > > -- > > kou > > > > In > > "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019 > > 00:09:54 +0100, > > Krisztián Szűcs wrote: > > > > > Hello all, > > > > > > I would like to propose the following release candidate (rc0) of Apache > > > Arrow JavaScript version 0.4.1. > > > > > > The source release rc0 is hosted at [1]. > > > > > > This release candidate is based on commit > > > f55542eeb59dde8ff4512c707b9eca1b43b62073 > > > > > > Please download, verify checksums and signatures, run the unit tests, > and > > > vote > > > on the release. The easiest way is to use the JavaScript-specific > release > > > verification script dev/release/js-verify-release-candidate.sh. > > > > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1 > > > [ ] +0 > > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 because... > > > > > > > > > How to validate a release signature: > > > https://httpd.apache.org/dev/verification.html > > > > > > [1]: > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/ > > > [2]: > > > > > > https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073 > > >
Re: [DISCUSS] Cutting a JavaScript 0.4.1 bugfix release
Thanks Wes. Krisztian - Uwe cut 0.4.0 for us and said he was pretty comfortable with the process, so you may be able to defer to him if you don't have time. On Wed, Mar 20, 2019 at 3:26 PM Wes McKinney wrote: > It seems based on [1] that we are overdue in cutting a bugfix JS > release because of a problem with the 0.4.0 release on NPM > > If there are no objections to this I suggest we call a vote right away > and close the vote as soon as we have requisite PMC votes. Krisztian, > would you be able to help with this since you are set up as an RM from > the 0.12 release? I am traveling until next Tuesday and do not have my > code signing key on the laptop I have with me otherwise I would do it. > > The release can be cut based off of current master version of js/ > > Thanks, > Wes > > [1]: https://github.com/apache/arrow/pull/3630 >
Re: Timeline for 0.13 Arrow release
I think that makes sense. I would really like to make JS part of the mainstream releases, but we already have JS-0.4.1 ready to go [1] with primarily bugfixes for JS-0.4.0. I think we should just cut that and integrate JS in 0.14. [1] https://issues.apache.org/jira/projects/ARROW/versions/12344961 On Wed, Mar 20, 2019 at 8:20 AM Wes McKinney wrote: > In light of the discussion on > https://github.com/apache/arrow/pull/3630 I think we should wait until > we have a "not broken" JavaScript-only release on NPM and have > confidence that we can respond to the community's needs > > On Tue, Mar 19, 2019 at 11:24 PM Paul Taylor wrote: > > > > I agree, the JS has matured a lot in the last few months. I think it's > > ready to join the regular Arrow releases. Let me know if I can help > > integrate the publish scripts :-) > > > > The two main things in progress are docs + Vector Builders, neither of > > which should block this release. > > > > We're going to try to get the docs/recipes ready for a PR this weekend. > > If that lands shortly after 0.13.0 goes out, would it be possible to > > update the website independently, or would that need to wait until 0.14? > > > > Paul > > > > On 3/19/19 10:08 AM, Wes McKinney wrote: > > > I'm in favor of including JS in the 0.13.0 release. > > > > > > I'm going to try to fix a couple of the Python Parquet bugs until the > > > RC is ready to be cut, but none of them need block the release. > > > > > > Seems like we need someone else to volunteer to be the RM for 0.13 if > > > Uwe is unavailable next week. Antoine -- are you possibly up for it > > > (the initial setup will be a bit painful)? I don't have access to a > > > machine with my code signing key on it until next week so I cannot do > > > it > > > > > > - Wes > > > > > > On Tue, Mar 19, 2019 at 9:46 AM Kouhei Sutou > wrote: > > >> Hi, > > >> > > >> There are no blockers on GLib, Ruby and Linux packages. > > >> > > >> Can we include JavaScript into 0.13.0? > > >> If we include JavaScript into 0.13.0, we can remove > > >> the code to release JavaScript separately. For example, we can > > >> remove dev/release/js-*. We can enable version update code > > >> in dev/release/00-prepare.sh: > > >> > https://github.com/apache/arrow/blob/master/dev/release/00-prepare.sh#L67-L74 > > >> > > >> We can merge "JavaScript Releases" document into our release > > >> document: > > >> > https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-JavaScriptReleases > > >> > > >> > > >> Thanks, > > >> -- > > >> kou > > >> > > >> In < cajpuwmbgjzbwrwybwse6bd9lnn_7xozn_aq2job9_mpvmhc...@mail.gmail.com> > > >>"Re: Timeline for 0.13 Arrow release" on Mon, 18 Mar 2019 20:51:12 -0500, > > >>Wes McKinney wrote: > > >> > > >>> hi folks, > > >>> > > >>> I think we're basically at the 0.13 end game here. There's some more > > >>> patches that can get in, but do we all think we can cut an RC by the end of > > >>> the week? What are the blocking issues? > > >>> > > >>> Thanks > > >>> Wes > > >>> > > >>> On Sat, Mar 16, 2019 at 9:57 PM Kouhei Sutou > wrote: > > Hi, > > > > > Submitted the packaging builds: > > > > https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-452 > > I've fixed .deb/.rpm packages: > https://github.com/apache/arrow/pull/3934 > > It has been merged. > > So .deb/.rpm packages are ready for release. 
> > > > Thanks, > > -- > > kou > > > > In < > cahm19a5somzxgcphc6ee-mr2usvvhwb252udgjrvocq-cb2...@mail.gmail.com> > > "Re: Timeline for 0.13 Arrow release" on Thu, 14 Mar 2019 > 16:24:43 +0100, > > Krisztián Szűcs wrote: > > > > > Submitted the packaging builds: > > > > https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-452 > > > > > > On Thu, Mar 14, 2019 at 4:19 PM Wes McKinney > wrote: > > > > > >> The CMake refactor is merged! Kudos to Uwe for 3+ weeks of hard > labor on > > >> this. > > >> > > >> We should run all the packaging tasks and get a full accounting of > > >> what is broken so we aren't surprised during the release process > > >> > > >> On Wed, Mar 13, 2019 at 9:39 AM Krisztián Szűcs > > >> wrote: > > >>> The proof of the pudding is in the eating. You convinced me. > > >>> > > >>> On Wed, Mar 13, 2019 at 3:31 PM Wes McKinney < > wesmck...@gmail.com> > > >> wrote: > > Krisztian -- are you all right with proceeding with merging the > CMake > > refactor? I'm pretty committed to helping fix the problems that > come > > up. Since most consumers of the project don't test until > _after_ a > > release, we won't find out about some problems until we merge > it and > > release it. Thus, IMHO it doesn't make sense to wait another > 8-10 > > weeks since we'd be delaying feedback for that long. There are > also a > > number of
Re: Flaky Travis CI builds on master
Another instance of #1 for the JS builds: https://travis-ci.org/apache/arrow/jobs/498967250#L992 I filed https://issues.apache.org/jira/browse/ARROW-4695 about it before seeing this thread. As noted there I was able to replicate the timeout on my laptop at least once. I didn't think to monitor memory usage to see if that was the cause. On Wed, Feb 27, 2019 at 6:52 AM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > I think we're witnessing multiple issues. > > 1. Travis seems to be slow (is it an OOM issue?) > - https://travis-ci.org/apache/arrow/jobs/499122041#L1019 > - https://travis-ci.org/apache/arrow/jobs/498906118#L3694 > - https://travis-ci.org/apache/arrow/jobs/499146261#L2316 > 2. https://issues.apache.org/jira/browse/ARROW-4694 detect-changes.py is > confused > 3. https://issues.apache.org/jira/browse/ARROW-4684 is failing one python > test consistently > > #2 doesn't help with #1; it could be related to PRs based on "old" > commits and the velocity of our project. I've suggested that we disable the > failing test in #3 until resolved since it affects all C++ PRs. > > On Tue, Feb 26, 2019 at 5:01 PM Wes McKinney wrote: > > > Here's a build that just ran > > > > > > > https://travis-ci.org/apache/arrow/builds/498906102?utm_source=github_status&utm_medium=notification > > > > 2 failed builds > > > > * ARROW-4684 > > * Seemingly a GLib Plasma OOM > > https://travis-ci.org/apache/arrow/jobs/498906118#L3689 > > > > 24 hours ago: > > > https://travis-ci.org/apache/arrow/builds/498501983?utm_source=github_status&utm_medium=notification > > > > * The same GLib Plasma OOM > > * Rust try_from bug that was just fixed > > > > It looks like that GLib test has been failing more than it's been > > succeeding (also failed in the last build on Feb 22). > > > > I think it might be worth setting up some more "annoying" > > notifications when failing builds persist for a long time. > > > > On Tue, Feb 26, 2019 at 3:37 PM Michael Sarahan > > wrote: > > > > > > Yes, please let us know. We definitely see 500's from anaconda.org, > > though > > > I'd expect less of them from CDN-enabled channels. > > > > > > On Tue, Feb 26, 2019 at 3:18 PM Uwe L. Korn wrote: > > > > > > > Hello Wes, > > > > > > > > if there are 500 errors it might be useful to report them somehow > to > > > > Anaconda. They recently migrated conda-forge to a CDN enabled account > > and > > > > this could be one of the results of that. Probably they need to still > > iron > > > > out some things. > > > > > > > > Uwe > > > > > > > > On Tue, Feb 26, 2019, at 8:40 PM, Wes McKinney wrote: > > > > > hi folks, > > > > > > > > > > We haven't had a green build on master for about 5 days now (the > last > > > > > one was February 21). Has anyone else been paying attention to > this? > > > > > It seems we should start cataloging which tests and build > > environments > > > > > are the most flaky and see if there's anything we can do to reduce > > the > > > > > flakiness. Since we are dependent on anaconda.org for build > > toolchain > > > > > packages, it's hard to control for the 500 timeouts that occur > there, > > > > > but I'm seeing other kinds of routine flakiness. > > > > > > > > > > - Wes > > > > > > > > > > > >
[jira] [Created] (ARROW-4695) [JS] Tests timing out on Travis
Brian Hulette created ARROW-4695: Summary: [JS] Tests timing out on Travis Key: ARROW-4695 URL: https://issues.apache.org/jira/browse/ARROW-4695 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Affects Versions: JS-0.4.0 Reporter: Brian Hulette Example build: https://travis-ci.org/apache/arrow/jobs/498967250 JS tests sometimes fail with the following message:
{noformat}
> apache-arrow@ test /home/travis/build/apache/arrow/js
> NODE_NO_WARNINGS=1 gulp test

[22:14:01] Using gulpfile ~/build/apache/arrow/js/gulpfile.js
[22:14:01] Starting 'test'...
[22:14:01] Starting 'test:ts'...
[22:14:49] Finished 'test:ts' after 47 s
[22:14:49] Starting 'test:src'...
[22:15:27] Finished 'test:src' after 38 s
[22:15:27] Starting 'test:apache-arrow'...

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself. Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated
{noformat}
I thought maybe we were just running up against some time limit, but that particular build was terminated at 22:25:27, exactly ten minutes after the last output, at 22:15:27. So it does seem like the build is somehow stalling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4686) Only accept 'y' or 'n' in merge_arrow_pr.py prompts
Brian Hulette created ARROW-4686: Summary: Only accept 'y' or 'n' in merge_arrow_pr.py prompts Key: ARROW-4686 URL: https://issues.apache.org/jira/browse/ARROW-4686 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Brian Hulette Assignee: Brian Hulette The current prompt syntax ("y/n" with neither capitalized) implies there's no default, which I think is the right behavior, but it's not implemented that way. Script should retry until either y or n is received. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Arrow on WebAssembly
Hi Franco, I'm not aware of anyone trying this in Rust, but Tim Paine at JPMC recently contributed a patch [1] to make it possible to compile the C++ implementation with emscripten, so that he could use it in Perspective [2]. Could you use the C++ lib instead? It would be great if either implementation could target WebAssembly though - do any Rust contributors know more about the libc/wasm issue? Maybe the rustwasm community [3] could be of assistance? Brian [1] https://github.com/apache/arrow/pull/3350 [2] https://github.com/jpmorganchase/perspective [3] https://github.com/rustwasm/team On Tue, Feb 19, 2019 at 11:06 AM Franco Nicolas Bellomo wrote: > Hi! > > Actually, Apache Arrow has a really nice implementation in Rust. I > tried to compile this to WebAssembly but I have a problem with libc. I > understand that this is a general problem with libc and wasm. > Is wasm support on the Arrow roadmap? > > Thanks!! >
[jira] [Created] (ARROW-4551) [JS] Investigate using Symbols to access Row columns by index
Brian Hulette created ARROW-4551: Summary: [JS] Investigate using Symbols to access Row columns by index Key: ARROW-4551 URL: https://issues.apache.org/jira/browse/ARROW-4551 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Can we use row[Symbol.for(0)] instead of row[0] in order to avoid collisions? What would the performance impact be? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
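A rough sketch of the idea (the key scheme here is hypothetical, and note that Symbol.for coerces its argument to a string, so Symbol.for(0) is really Symbol.for('0')):

```ts
// Registered Symbols give positional accessors their own namespace, so a
// column literally named "0" can no longer collide with column index 0.
const col = (i: number) => Symbol.for(`arrow.row.${i}`); // hypothetical key scheme

const row: Record<string | symbol, unknown> = {};
row[col(0)] = "value of the first column"; // positional access
row["0"] = "value of a column named '0'"; // name-based access, no collision

console.log(row[col(0)], row["0"]);
```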
[jira] [Created] (ARROW-4524) [JS] only invoke `Object.defineProperty` once per table
Brian Hulette created ARROW-4524: Summary: [JS] only invoke `Object.defineProperty` once per table Key: ARROW-4524 URL: https://issues.apache.org/jira/browse/ARROW-4524 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette Fix For: 0.4.1 See https://github.com/vega/vega-loader-arrow/commit/19c88e130aaeeae9d0166360db467121e5724352#r32253784 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
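The linked comment is about hoisting those property definitions; a hedged sketch of the pattern (toy column-major data, names illustrative, not the actual Arrow JS change):

```ts
// Define the column accessors once, on a prototype shared by every row of a
// table, instead of calling Object.defineProperty per materialized row.
function makeRowPrototype(columns: string[]): object {
  const proto = {};
  columns.forEach((name, i) =>
    Object.defineProperty(proto, name, {
      get(this: { batch: number[][]; index: number }) {
        return this.batch[i][this.index];
      },
    }),
  );
  return proto;
}

const proto = makeRowPrototype(["a", "b"]);
const batch = [[1, 2], [10, 20]]; // toy column-major data

// Per-row cost is now a single Object.create, with no defineProperty calls.
const row: any = Object.create(proto);
row.batch = batch;
row.index = 1;
console.log(row.a, row.b); // 2 20
```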
[jira] [Created] (ARROW-4523) [JS] Add row proxy generation benchmark
Brian Hulette created ARROW-4523: Summary: [JS] Add row proxy generation benchmark Key: ARROW-4523 URL: https://issues.apache.org/jira/browse/ARROW-4523 Project: Apache Arrow Issue Type: Test Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4519) Publish JS API Docs for v0.4.0
Brian Hulette created ARROW-4519: Summary: Publish JS API Docs for v0.4.0 Key: ARROW-4519 URL: https://issues.apache.org/jira/browse/ARROW-4519 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Release Apache Arrow JS 0.4.0 - RC1
+1 verified on Archlinux with Node v11.9.0 Thanks a lot for putting the RC together Uwe! On Thu, Jan 31, 2019 at 8:08 AM Uwe L. Korn wrote: > +1 (binding), > > verified on Ubuntu 16.04 with > `./dev/release/js-verify-release-candidate.sh 0.4.0 1` and Node v11.9.0 via > nvm. > > Uwe > > On Thu, Jan 31, 2019, at 5:07 PM, Uwe L. Korn wrote: > > Hello all, > > > > I would like to propose the following release candidate (rc1) of Apache > > Arrow JavaScript version 0.4.0. > > > > The source release rc1 is hosted at [1]. > > > > This release candidate is based on commit > > 6009eaa49ae29826764eb6e626bf0d12b83f3481 > > > > Please download, verify checksums and signatures, run the unit tests, > and vote > > on the release. The easiest way is to use the JavaScript-specific release > > verification script dev/release/js-verify-release-candidate.sh. > > > > The vote will be open for at least 72 hours. > > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.0 > > [ ] +0 > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.0 because... > > > > > > How to validate a release signature: > > https://httpd.apache.org/dev/verification.html > > > > [1]: > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.0-rc1/ > > [2]: > > > https://github.com/apache/arrow/tree/6009eaa49ae29826764eb6e626bf0d12b83f3481 >
Re: Benchmarking dashboard proposal
We also have some JS benchmarks [1]. Currently they're only really run on an ad-hoc basis to manually test major changes but it would be great to include them in this. [1] https://github.com/apache/arrow/tree/master/js/perf On Fri, Jan 18, 2019 at 12:34 AM Uwe L. Korn wrote: > Hello, > > note that we have(had?) the Python benchmarks continuously running and > reported at https://pandas.pydata.org/speed/arrow/. Seems like this > stopped in July 2018. > > UWe > > On Fri, Jan 18, 2019, at 9:23 AM, Antoine Pitrou wrote: > > > > Hi Areg, > > > > That sounds like a good idea to me. Note our benchmarks are currently > > scattered accross the various implementations. The two that I know of: > > > > - the C++ benchmarks are standalone executables created using the Google > > Benchmark library, aptly named "*-benchmark" (or "*-benchmark.exe" on > > Windows) > > - the Python benchmarks use the ASV utility: > > > https://github.com/apache/arrow/blob/master/docs/source/python/benchmarks.rst > > > > There may be more in the other implementations. > > > > Regards > > > > Antoine. > > > > > > Le 18/01/2019 à 07:13, Melik-Adamyan, Areg a écrit : > > > Hello, > > > > > > I want to restart/attach to the discussions for creating Arrow > benchmarking dashboard. I want to propose performance benchmark run per > commit to track the changes. > > > The proposal includes building infrastructure for per-commit tracking > comprising of the following parts: > > > - Hosted JetBrains for OSS https://teamcity.jetbrains.com/ as a build > system > > > - Agents running in cloud both VM/container (DigitalOcean, or others) > and bare-metal (Packet.net/AWS) and on-premise(Nvidia boxes?) > > > - JFrog artifactory storage and management for OSS projects > https://jfrog.com/open-source/#artifactory2 > > > - Codespeed as a frontend https://github.com/tobami/codespeed > > > > > > I am volunteering to build such system (if needed more Intel folks > will be involved) so we can start tracking performance on various platforms > and understand how changes affect it. > > > > > > Please, let me know your thoughts! > > > > > > Thanks, > > > -Areg. > > > > > > > > > >
Re: Arrow JS 0.4.0 Release
<https://github.com/graphistry/arrow/commits/master> with the > >> latest version of the library that we can build against, which I update > >> when I fix any bugs or add features. > >> > > It is common for software vendors to have "downstream" releases, so > > this is reasonable, so long as this work is not promoted as Apache > > releases > > > >> The JS project is young, and sometimes has to move at a rapid pace. I've > >> felt the turnaround time involved in the vote/prepare/verify/publish > >> release process is slower than would be helpful to me. I'm used to > >> publishing patch releases to npm as soon as possible, possibly multiple > >> times a day. > > Well, surely the recent security problems with NPM demonstrate that > > there is value in giving the community an opportunity to vet a package > > before it is published for the world to use, and that GPG-signing > > packages is an important security measure to ensure that production > > code is coming from a network of trust. It is different if you are > > publishing packages for your own personal or corporate use. > > > >> None of the PMCs contribute to or use the JS version (if that's wrong, > >> hit me up!) so there's been no release pressure from there. None of the > >> JS contributors are PMCs so even if we want to do releases, we have to > >> wait for a PMC. My take is that everyone on the project (especially > >> PMCs) is probably ungodly busy, and since not releasing to npm > >> hasn't been blocking me, I opt not to bother folks. > > I am happy to help release the JS package as often as you like, up to > > multiple times per month. I stated this early on in the process, but > > there has not seemed to be much desire to release. Brian's recent > > request to release caught me at a bad time at the end of the year, but > > there are other active PMCs who should be able to help. If you do > > decide you want to release in the next week or two, please let me know > > and I will make the time to help. > > > > The lack of PMCs with an interest in JavaScript is a bit of a > > self-perpetuating issue. One of the responsibilities of PMC members > > (and what will enable a committer to become a PMC) is to promote the > > growth and development of a healthy community. This includes making > > sure that the project releases. The JS developer community hasn't > > grown much, though. My approach to such a problem is to act as a > > "community of one" until it changes -- drive a project forward and > > ensure a steady cadence of releases. > > > > - Wes > > > >> > >> On 12/13/18 11:52 AM, Wes McKinney wrote: > >>> +1 for synchronizing to the main releases when possible. In the 0.12 > >>> thread we have discussed moving to time-based releases (e.g. every 2 > >>> months). Time-based releases are helpful to create urgency around > >>> getting work completed, and making sure that the project is always > >>> ready to release. > >>> On Thu, Dec 13, 2018 at 10:39 AM Brian Hulette > wrote: > >>>> Sounds great Paul! Really excited that this refactor is wrapping up. My > >>>> only concern with including this in 0.4.0 is that I'm not going to have the > >>>> time to thoroughly review it for a few weeks, so gating on that would > >>>> really delay it. But I can just manually test with some use-cases I care > >>>> about in lieu of a thorough review in the interest of time. > >>>> > >>>> I think in the future (after 0.12?) it may behoove us to tie back in to the > >>>> main Arrow release cycle. 
The idea with the separate JS release was to > >>>> allow us to release faster, but in practice it has done the opposite. > Since > >>>> the fall of 2017 we've cut two major JS releases (0.2, 0.3) while > there > >>>> were four major main releases (0.8 - 0.11). Not to mention the > disjoint > >>>> version numbers can be confusing to users - perhaps not as much of a > >>>> concern now that the format is pretty stable, but it can still be a > >>>> friction point. And finally selfishly - if we had been on the main > release > >>>> cycle, the contributions I made in the summer would have been > released in > >>>> either 0.10 or 0.11 by now. > >>>> > >>>> Brian > >>>> > >>>> On Thu, De
Re: Arrow JS 0.4.0 Release
Sounds great Paul! Really excited that this refactor is wrapping up. My only concern with including this in 0.4.0 is that I'm not going to have the time to thoroughly review it for a few weeks, so gating on that would really delay it. But I can just manually test with some use-cases I care about in lieu of a thorough review in the interest of time. I think in the future (after 0.12?) it may behoove us to tie back in to the main Arrow release cycle. The idea with the separate JS release was to allow us to release faster, but in practice it has done the opposite. Since the fall of 2017 we've cut two major JS releases (0.2, 0.3) while there were four major main releases (0.8 - 0.11). Not to mention the disjoint version numbers can be confusing to users - perhaps not as much of a concern now that the format is pretty stable, but it can still be a friction point. And finally selfishly - if we had been on the main release cycle, the contributions I made in the summer would have been released in either 0.10 or 0.11 by now. Brian On Thu, Dec 13, 2018 at 3:29 AM Paul Taylor wrote: > The ongoing JS refactor/upgrade branch > <https://github.com/trxcllnt/arrow/tree/js-data-refactor/js> is just > about done. It's passing all the integration tests, as well as a hundred > or so new unit tests. I have to update existing tests where the APIs > changed, battle with closure-compiler a bit, then it'll be ready to > merge in and ship out. I think I'll be able to wrap it up in the next > couple hours. > > I started this branch to clean up the Vector Data classes to make it > easier to add higher-level Table and Vector operators, but as the Data > classes are fairly embedded in the core, it led to a larger refactor of > the DataTypes, Vectors, Visitors, and IPC readers and writers. > > While I was updating the IPC readers and writers, I took the opportunity > to back-port all the Node and WhatWG (browser) streams integration that > we've built for Graphistry. Putting it in the Arrow JS library means we > can better ensure zero-copy when possible, empowers library consumers to > easily build streaming applications in both server and browser > environments, and (selfishly) reduces complexity in my code base. It > also advances a longer term personal goal to more closely adhere to the > structure and organization of ArrowCPP when reasonable. > > A non-exhaustive list of updates includes: > > * Updates the Table, Schema, RecordBatch, Visitor, Vector, Data, and > DataTypes to ensure the generic type signatures cascade recursively > through the type declarations > * New io primitives that abstract over the (mutually exclusive) file and > stream APIs in both node and browser environments > * New RecordBatchReaders and RecordBatchWriters that directly use the > zero-copy node and browser io primitives > * A consolidated reflective Visitor implementation that supports late > binding to shortcut traversal, provides an easy API for building higher > level Vector operators > * Fixed bugs/added support for reading and writing DictionaryBatch > deltas (tricky) > * Updated all the dependencies and did some config file gardening to > make debugging tests easier > * Added a bunch of new tests > > I'd be more than happy to help shepherd a 0.4.0 release of what's in > arrow/master if that's what everyone wants to do. 
But in the interest of > cutting a more feature-rich release and preventing customers paying the > cost of updating twice in a short time span, I vote we hold off for > another day or two and merge + release the work in the refactor branch. > > Paul > > On 12/9/18 10:51 AM, Wes McKinney wrote: > > I agree that we should cut a JavaScript release. > > > > With the amount of maintenance work on my plate I have to declare > > bankruptcy on doing any more than I am right now. Can another PMC > > volunteer to be the RM for the 0.4.0 JavaScript release? > > > > Thanks > > Wes > > On Tue, Dec 4, 2018 at 10:07 PM Brian Hulette > wrote: > >> Hi all, > >> It's been quite a while since our last major Arrow JS release (0.3.0 on > >> February 22!), and since then we've added several new features that will > >> make Arrow JS much easier to adopt. We've added convenience functions > for > >> creating Arrow vectors and tables natively in JavaScript, an IPC writer, > >> and a row proxy interface that will make integrating with existing JS > >> libraries much simpler. > >> > >> I think it's time we cut 0.4.0, so I spent some time closing out or > >> postponing the last few JIRAs in JS-0.4.0. I got it down to just one > JIRA > >> which involves documenting the release process - hopefully we can close > >> that out as we go through it again. > >> > >> Please let me know if you think it makes sense to cut JS-0.4.0 now, or > if > >> you have any concerns. > >> > >> Brian >
[jira] [Created] (ARROW-3993) [JS] CI Jobs Failing
Brian Hulette created ARROW-3993: Summary: [JS] CI Jobs Failing Key: ARROW-3993 URL: https://issues.apache.org/jira/browse/ARROW-3993 Project: Apache Arrow Issue Type: Task Components: JavaScript Affects Versions: JS-0.3.1 Reporter: Brian Hulette Assignee: Brian Hulette Fix For: JS-0.4.0 JS Jobs failing with:
{noformat}
npm ERR! code ETARGET
npm ERR! notarget No matching version found for gulp@next
npm ERR! notarget In most cases you or one of your dependencies are requesting
npm ERR! notarget a package version that doesn't exist.
npm ERR! notarget
npm ERR! notarget It was specified as a dependency of 'apache-arrow'
npm ERR! notarget
npm ERR! A complete log of this run can be found in:
npm ERR!     /home/travis/.npm/_logs/2018-12-10T22_33_26_272Z-debug.log

The command "$TRAVIS_BUILD_DIR/ci/travis_before_script_js.sh" failed and exited with 1 during .
{noformat}
Reported by [~wesmckinn] in https://github.com/apache/arrow/pull/3152#issuecomment-446020105 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Arrow JS 0.4.0 Release
Hi all, It's been quite a while since our last major Arrow JS release (0.3.0 on February 22!), and since then we've added several new features that will make Arrow JS much easier to adopt. We've added convenience functions for creating Arrow vectors and tables natively in JavaScript, an IPC writer, and a row proxy interface that will make integrating with existing JS libraries much simpler. I think it's time we cut 0.4.0, so I spent some time closing out or postponing the last few JIRAs in JS-0.4.0. I got it down to just one JIRA which involves documenting the release process - hopefully we can close that out as we go through it again. Please let me know if you think it makes sense to cut JS-0.4.0 now, or if you have any concerns. Brian
[jira] [Created] (ARROW-3691) [JS] Update dependencies, switch to terser
Brian Hulette created ARROW-3691: Summary: [JS] Update dependencies, switch to terser Key: ARROW-3691 URL: https://issues.apache.org/jira/browse/ARROW-3691 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Fix For: JS-0.4.0 Many dependencies are out of date, give them a bump. The uglifyjs-webpack-plugin [no longer supports|https://github.com/webpack-contrib/uglifyjs-webpack-plugin/releases/tag/v2.0.0] ES6 minification, switch to terser-webpack-plugin -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3689) [JS] Upgrade to TS 3.1
Brian Hulette created ARROW-3689: Summary: [JS] Upgrade to TS 3.1 Key: ARROW-3689 URL: https://issues.apache.org/jira/browse/ARROW-3689 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Fix For: JS-0.5.0 Attempted [here|https://github.com/apache/arrow/pull/2611#issuecomment-431318129], but ran into issues. Should upgrade typedoc to 0.13 at the same time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column
Brian Hulette created ARROW-3667: Summary: [JS] Incorrectly reads record batches with an all null column Key: ARROW-3667 URL: https://issues.apache.org/jira/browse/ARROW-3667 Project: Apache Arrow Issue Type: Bug Affects Versions: JS-0.3.1 Reporter: Brian Hulette Fix For: JS-0.4.0 The JS library seems to incorrectly read any columns that come after an all-null column in IPC buffers produced by pyarrow. Here's a python script that generates two arrow buffers, one with an all-null column followed by a utf-8 column, and a second with those two reversed:
{code:python}
import pyarrow as pa
import pandas as pd

def serialize_to_arrow(df, fd, compress=True):
    batch = pa.RecordBatch.from_pandas(df)
    writer = pa.RecordBatchFileWriter(fd, batch.schema)
    writer.write_batch(batch)
    writer.close()

if __name__ == "__main__":
    df = pd.DataFrame(data={'nulls': [None, None, None], 'not nulls': ['abc', 'def', 'ghi']},
                      columns=['nulls', 'not nulls'])
    with open('bad.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
    df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
    with open('good.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
{code}
JS incorrectly interprets the [null, not null] case:
{code:javascript}
> var arrow = require('apache-arrow')
undefined
> var fs = require('fs')
undefined
> arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
'abc'
> arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
'\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
{code}
Presumably this is because pyarrow is omitting some (or all) of the buffers associated with the all-null column, but the JS IPC reader is still looking for them, causing the buffer count to get out of sync. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3523) [JS] Assign dictionary IDs in IPC writer rather than on creation
Brian Hulette created ARROW-3523: Summary: [JS] Assign dictionary IDs in IPC writer rather than on creation Key: ARROW-3523 URL: https://issues.apache.org/jira/browse/ARROW-3523 Project: Apache Arrow Issue Type: Improvement Reporter: Brian Hulette Fix For: JS-0.5.0 Currently the JS implementation relies on the user assigning IDs for dictionaries that they create; we should do something like the C++ implementation, which uses a dictionary id memo to assign and retrieve dictionary ids in the IPC writer (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L495). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3425) [JS] Programmatically created dictionary vectors don't get dictionary IDs
Brian Hulette created ARROW-3425: Summary: [JS] Programmatically created dictionary vectors don't get dictionary IDs Key: ARROW-3425 URL: https://issues.apache.org/jira/browse/ARROW-3425 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Fix For: JS-0.4.0 This seems to be the cause of the test failures in https://github.com/apache/arrow/pull/2322 Modifying {{getSingleRecordBatchTable}} to [generate its vectors programmatically|https://github.com/apache/arrow/pull/2322/files#diff-eb6e5955a00e92f7bebb15a03f8437d1R359] (rather than deserializing hard-coded JSON), causes the new round-trip tests added in https://github.com/apache/arrow/pull/2638 to fail. The root cause seems to be that an ID is never allocated for the generated dictionary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Putting out a new JavaScript release?
Thanks for bringing this up Wes. My hope was to get out an 0.4.0 release that just includes the IPC writer and usability improvements relatively soon, and push the refactor out to 0.5.0. Paul's refactor is very exciting and will definitely be good for the project, but I don't think either of us has the time to get it into a release in the short-term. Most of the outstanding tasks in 0.4.0 [1] either have PRs up [2] or are relatively minor housekeeping tasks. I'd be fine with merging the currently open PRs and wrapping up the housekeeping tasks so we can cut a release, but I definitely want to be mindful of Paul's input, since there are almost certainly conflicts with the refactor. Brian [1] https://issues.apache.org/jira/projects/ARROW/versions/12342901 [2] https://github.com/apache/arrow/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+%5BJS%5D+in%3Atitle On Mon, Sep 10, 2018 at 6:30 AM Wes McKinney wrote: > hi folks, > > It's been 6 months since the last JavaScript release. I had read that > Paul was working on some refactoring of internals > (https://issues.apache.org/jira/browse/ARROW-2828), and that might be > the major item on route to the 0.4.0 release, but we might consider > making a new release in the meantime. What does everyone think? > > Thanks > Wes >
[jira] [Created] (ARROW-3113) Merge tool can't specify JS fix version
Brian Hulette created ARROW-3113: Summary: Merge tool can't specify JS fix version Key: ARROW-3113 URL: https://issues.apache.org/jira/browse/ARROW-3113 Project: Apache Arrow Issue Type: Bug Components: Developer Tools Reporter: Brian Hulette Assignee: Brian Hulette Specifying a JS-x.x.x fix version doesn't work anymore because of the fix for ARROW-2220. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3074) [JS] Date.indexOf generates an error
Brian Hulette created ARROW-3074: Summary: [JS] Date.indexOf generates an error Key: ARROW-3074 URL: https://issues.apache.org/jira/browse/ARROW-3074 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette Fix For: JS-0.4.0 https://github.com/apache/arrow/blob/master/js/src/vector/flat.ts#L150 {{every}} doesn't exist on {{Date}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3073) [JS] Add DateVector.from
Brian Hulette created ARROW-3073: Summary: [JS] Add DateVector.from Key: ARROW-3073 URL: https://issues.apache.org/jira/browse/ARROW-3073 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette Fix For: JS-0.4.0 It should be possible to construct a {{DateVector}} from a list of Date objects -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Creating a user@ mailing list
Agreed. I was concerned about the plan to drop Slack because it was a place users would come to ask questions (for better or worse). I assumed that was because those users were just uncomfortable with mailing lists, but I think Uwe is right, they're probably just uncomfortable with *this* mailing list, where most of the discussion is about development. Brian On Thu, Aug 16, 2018 at 6:52 AM Wes McKinney wrote: > hi Uwe, > > This sounds like a good idea to me. I think we should go ahead and ask > INFRA to set it up. We'll need to add a "Community" landing page on > the website of sorts to explain the mailing lists better. > > - Wes > > > On Thu, Aug 16, 2018 at 4:49 AM, Uwe L. Korn wrote: > > Hello all, > > > > I would like to create a u...@arrow.apache.org mailing list. Some > people are a bit confused that there is only a dev mailing list. They > interpret this as a mailing list that should be used solely for Arrow > development, not usage questions. This is sadly a psychological barrier for > people to get a bit more involved since we have closed Slack. > > > > What are others thinking about this? > > > > Uwe >
[jira] [Created] (ARROW-2909) [JS] Add convenience function for creating a table from a list of vectors
Brian Hulette created ARROW-2909: Summary: [JS] Add convenience function for creating a table from a list of vectors Key: ARROW-2909 URL: https://issues.apache.org/jira/browse/ARROW-2909 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette Similar to ARROW-2766, but requires users to first turn their arrays into vectors, so we don't have to deduce the type. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2819) [JS] Fails to build with TS 2.8.3
Brian Hulette created ARROW-2819: Summary: [JS] Fails to build with TS 2.8.3 Key: ARROW-2819 URL: https://issues.apache.org/jira/browse/ARROW-2819 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette See the [GitHub issue|https://github.com/apache/arrow/issues/2115#issuecomment-403612925] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2797) [JS] comparison predicates don't work on 64-bit integers
Brian Hulette created ARROW-2797: Summary: [JS] comparison predicates don't work on 64-bit integers Key: ARROW-2797 URL: https://issues.apache.org/jira/browse/ARROW-2797 Project: Apache Arrow Issue Type: Bug Components: JavaScript Affects Versions: JS-0.3.1 Reporter: Brian Hulette The 64-bit integer vector {{get}} function returns a 2-element array, which doesn't compare properly in the comparison predicates. We should special-case the comparisons for 64-bit integers and timestamps. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
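For concreteness, a hedged sketch of the special case this suggests, assuming {{get}} returns a little-endian {{[lo, hi]}} Int32 pair (the helper name is illustrative, not the actual predicate internals):
{code:javascript}
// Collapse the [lo, hi] Int32 pair into a single JS number so the
// existing numeric comparisons work. Precision is lost past 2^53 - 1,
// which is acceptable for millisecond timestamps.
function int64ToNumber(value: Int32Array): number {
  return value[1] * 0x100000000 + (value[0] >>> 0);
}

// e.g. inside a comparison predicate:
// if (lhs instanceof Int32Array) { return int64ToNumber(lhs) === int64ToNumber(rhs); }
{code}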
[jira] [Created] (ARROW-2789) [JS] Minor DataFrame improvements
Brian Hulette created ARROW-2789: Summary: [JS] Minor DataFrame improvements Key: ARROW-2789 URL: https://issues.apache.org/jira/browse/ARROW-2789 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette * Deprecate count() in favor of a readonly length member (implemented with a getter in FilteredDataFrame) * Add an iterator to FilteredDataFrame -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2778) Add Utf8Vector.from
Brian Hulette created ARROW-2778: Summary: Add Utf8Vector.from Key: ARROW-2778 URL: https://issues.apache.org/jira/browse/ARROW-2778 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2772) [JS] Commit package-lock.json and/or yarn.lock
Brian Hulette created ARROW-2772: Summary: [JS] Commit package-lock.json and/or yarn.lock Key: ARROW-2772 URL: https://issues.apache.org/jira/browse/ARROW-2772 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette We should commit one (or both) of these lockfiles to the repo to make the dependency tree explicit and consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2771) [JS] Add row proxy object accessor
Brian Hulette created ARROW-2771: Summary: [JS] Add row proxy object accessor Key: ARROW-2771 URL: https://issues.apache.org/jira/browse/ARROW-2771 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette The {{Table}} class would be much easier to interact with if it returned familiar JavaScript objects representing a row. As Jeff Heer [demonstrated|https://beta.observablehq.com/@jheer/from-apache-arrow-to-javascript-objects] it's possible to create JS Proxy objects that read directly from Arrow memory. We should generate these types of objects in {{Table.get}} and in the {{Table}} iterator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
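A minimal sketch of the Proxy approach, assuming a {{Table.getColumn}}/{{Vector.get}} shape like the JS API of the time; the details are illustrative:
{code:javascript}
// A row object that lazily reads each field from the table's vectors,
// so nothing is copied until a property is actually accessed.
function rowProxy(table: { getColumn(name: string): { get(i: number): any } | null }, rowIndex: number) {
  return new Proxy({}, {
    get(_target, name) {
      const column = table.getColumn(String(name)); // look the field up by name
      return column ? column.get(rowIndex) : undefined;
    }
  });
}
// Table.get(i) and the Table iterator could then hand out rowProxy(this, i).
{code}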
[jira] [Created] (ARROW-2767) [JS] Add generic to Table for column names
Brian Hulette created ARROW-2767: Summary: [JS] Add generic to Table for column names Key: ARROW-2767 URL: https://issues.apache.org/jira/browse/ARROW-2767 Project: Apache Arrow Issue Type: Improvement Reporter: Brian Hulette Requested by [~domoritz] Something like:
{code:javascript}
class Table<ColName extends string> {
  ...
  getColumn(name: ColName): Vector {}
  ...
}
{code}
It would be even better if we could find a way to map the column names to the actual vector data types, but one thing at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2766) [JS] Add ability to construct a Table from a list of Arrays/TypedArrays
Brian Hulette created ARROW-2766: Summary: [JS] Add ability to construct a Table from a list of Arrays/TypedArrays Key: ARROW-2766 URL: https://issues.apache.org/jira/browse/ARROW-2766 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Reporter: Brian Hulette Something like {{Table.from({'col1': [...], 'col2': [...], 'col3': [...]})}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2765) [JS] add Vector.map
Brian Hulette created ARROW-2765: Summary: [JS] add Vector.map Key: ARROW-2765 URL: https://issues.apache.org/jira/browse/ARROW-2765 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Reporter: Brian Hulette Fix For: JS-0.4.0 Add `Vector.map(f)` that returns a new vector transformed with `f` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
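A naive sketch of what this could look like; returning a plain array sidesteps the harder question of inferring the output Vector type (names and the structural vector type are illustrative):
{code:javascript}
// Transform every element of a vector with `f`. A real Vector.map would
// also need to pick an output DataType and rebuild the data buffers.
function map<T, U>(vector: { length: number; get(i: number): T }, f: (value: T, i: number) => U): U[] {
  const out = new Array<U>(vector.length);
  for (let i = 0; i < vector.length; ++i) {
    out[i] = f(vector.get(i), i);
  }
  return out;
}
{code}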
[jira] [Created] (ARROW-2764) [JS] Easy way to add a column to a Table
Brian Hulette created ARROW-2764: Summary: [JS] Easy way to add a column to a Table Key: ARROW-2764 URL: https://issues.apache.org/jira/browse/ARROW-2764 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Fix For: JS-0.4.0 It should be easier to add a new column to a table. API could be either `table.addColumn(vector)` or `table.merge(..tables or vectors)` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2762) [JS] Remove unused perf/config.js
Brian Hulette created ARROW-2762: Summary: [JS] Remove unused perf/config.js Key: ARROW-2762 URL: https://issues.apache.org/jira/browse/ARROW-2762 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette We don't seem to be using {{perf/config.js}} anymore. Let's remove it and replace it with {{perf/table_config.js}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2584) [JS] Node v10 issues
Brian Hulette created ARROW-2584: Summary: [JS] Node v10 issues Key: ARROW-2584 URL: https://issues.apache.org/jira/browse/ARROW-2584 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Assignee: Paul Taylor Build and tests fail with node v10. Fix these issues and bump CI to use node v10 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Continuous benchmarking setup
Is anyone aware of a way we could set up similar continuous benchmarks for JS? We wrote some benchmarks earlier this year but currently have no automated way of running them. Brian On 05/11/2018 08:21 PM, Wes McKinney wrote: Thanks Tom and Antoine! Since these benchmarks are literally running on a machine in my closet at home, there may be some downtime in the future. At some point we should document a process of setting up a new machine from scratch to be the nightly bare metal benchmark slave. - Wes On Fri, May 11, 2018 at 9:08 AM, Antoine Pitrou wrote: Hi again, Tom has configured the benchmarking machine to run and publish Arrow's ASV-based benchmarks. The latest results can now be seen at: https://pandas.pydata.org/speed/arrow/ I expect these are regenerated on a regular (daily?) basis. Thanks Tom :-) Regards Antoine. On Wed, 11 Apr 2018 15:40:17 +0200 Antoine Pitrou wrote: Hello With the following changes, it seems we might reach the point where we're able to run the Python-based benchmark suite across multiple commits (at least the ones not prior to those changes): https://github.com/apache/arrow/pull/1775 To make this truly useful, we would need a dedicated host. Ideally a (Linux) OS running on bare metal, with SMT/HyperThreading disabled. If running virtualized, the VM should have dedicated physical CPU cores. That machine would run the benchmarks on a regular basis (perhaps once per night) and publish the results in static HTML form somewhere. (note: nice to have in the future might be access to NVidia hardware, but right now there are no CUDA benchmarks in the Python benchmarks) What should be the procedure here? Regards Antoine.
Re: [Format] Pointer types / span types
List also references another (data) array which can be a different size, but rather than requiring it to be represented with a second schema, we make it a child of the List type. We could do the same thing for a Span type, and give it a new type of buffer that contains start/stop indices rather than offsets. To Antoine's point, maybe there's not enough demand to justify defining this type right now. I definitely agree that it would be good to see an example dataset before adding something like this. Brian On 05/02/2018 03:54 PM, Wes McKinney wrote: Perhaps that could be an argument for making span a core logical type? I think if anything, this argues that it should not be. Because "span" references another array, which can be a different size, you need two schemas to be able to make sense of it. In either case, I would be interested to see what modifications would be proposed to Schema.fbs and an example dataset described with such a schema (that is a single array, instead of multiple -- i.e. a non-composite representation). For the record, if there are sufficiently common "composite" data representations, I don't see a problem with developing community standards based on the building blocks we already have - Wes On Wed, May 2, 2018 at 3:38 PM, Brian Hulette wrote: If this were accomplished at the application level, how would it work with the IPC formats? I'd think you'd need to have two separate files (or streams), since array 1 and array 2 will be different lengths. Perhaps that could be an argument for making span a core logical type? Brian On 05/02/2018 03:34 PM, Antoine Pitrou wrote: On Wed, 2 May 2018 10:12:37 -0400 Wes McKinney wrote: It sounds like the "span" type could be implemented as a composite of multiple Arrow arrays / schemas: array 1 (data): any schema; array 2 (view): struct<start: int64, stop: int64>. Unless I'm missing something, this feels like an application-level concern rather than something that needs to be addressed in the columnar format / metadata. Well, couldn't the same theoretically be said about list arrays? In the end, I suppose it all depends whether there's enough demand to make it a core logical type inside Arrow, rather than something people write custom code for in their application. Regards Antoine.
Re: [Format] Pointer types / span types
If this were accomplished at the application level, how would it work with the IPC formats? I'd think you'd need to have two separate files (or streams), since array 1 and array 2 will be different lengths. Perhaps that could be an argument for making span a core logical type? Brian On 05/02/2018 03:34 PM, Antoine Pitrou wrote: On Wed, 2 May 2018 10:12:37 -0400 Wes McKinney wrote: It sounds like the "span" type could be implemented as a composite of multiple Arrow arrays / schemas: array 1 (data): any schema; array 2 (view): struct<start: int64, stop: int64>. Unless I'm missing something, this feels like an application-level concern rather than something that needs to be addressed in the columnar format / metadata. Well, couldn't the same theoretically be said about list arrays? In the end, I suppose it all depends whether there's enough demand to make it a core logical type inside Arrow, rather than something people write custom code for in their application. Regards Antoine.
Re: [Format] Pointer types / span types
Yes, my first reaction to both of these requests is: would dictionary-encoding work? would a List work? I think for the former the analogy is more clear; for the latter, technically a List encodes start and stop indices with an offset array rather than separate arrays for start and stop indices. Is there a reason an offset array wouldn't work for the OAMap use-case, though? Brian On 04/30/2018 04:55 PM, Antoine Pitrou wrote: Actually, "pointer type" might just be another name for "dictionary type". Regards Antoine. Le 30/04/2018 à 22:08, Antoine Pitrou a écrit : Hi, Today I got the opportunity to talk with Jim Pivarski, the main developer on the OAMap project (*). Under the hood, he is doing something not unlike the Arrow representation of nested arrays: he stores and processes structured data as linear arrays, allowing very fast processing on seemingly irregular data (in Arrow parlance, think something like lists of lists of structs). It seems that OAMap data requires two kinds of logical types that Arrow misses: - a pointer type, where a physical array of ints is used to represent indices into another array (the logical value being of course the value pointed to) - a span type, where two physical arrays of ints are used to represent start and stop indices into another array (the logical value being the list of values delimited by the start / stop indices) Did such a feature request already come by? Is this something we should add to our roadmap or future wishlist? (*) https://github.com/diana-hep/oamap Regards Antoine.
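To make the offsets-versus-spans distinction concrete, here is the same three-slot list laid out both ways (plain arrays for clarity):
{code:javascript}
// List<Int32> [[1, 2, 3], [], [4, 5]] with Arrow's single offsets buffer:
const values  = [1, 2, 3, 4, 5];
const offsets = [0, 3, 3, 5]; // slot i is values[offsets[i] .. offsets[i+1]]

// The same data with the separate start/stop buffers a Span type implies:
const starts  = [0, 3, 3];    // slot i is values[starts[i] .. stops[i]]
const stops   = [3, 3, 5];

// Offsets force each slot to be contiguous and in increasing order;
// start/stop pairs can overlap, nest, or point anywhere in `values`,
// which is the extra expressiveness (and cost) of a Span type.
{code}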
Re: Allow dictionary-encoded children?
Thanks Uwe, Wes, glad to hear I'm not too far out there :) The dictionary batch ordering seems like a reasonable requirement for this situation. I made a JIRA to add something like this to the integration tests (https://issues.apache.org/jira/browse/ARROW-2412) and I'll put up a PR shortly. On 04/06/2018 01:43 PM, Wes McKinney wrote: Having dictionaries-within-dictionaries does add some complexity, but I think the use case is valid and so it would be good to determine the best way to handle this in the IPC / messaging protocol. I would suggest: dictionaries can use other dictionaries, so long as those dictionaries occur earlier in the stream. I am not sure either the Java or C++ libraries will be able to properly handle these cases right now, but that's what we have integration tests for! On Fri, Apr 6, 2018 at 11:59 AM, Uwe L. Korn wrote: Hello Brian, I would also have considered this a legitimate use of the Arrow specification. We only specify the DictionaryType to have a dictionary of any Arrow Type. In the context of Arrow's IPC this seems to be a bit more complicated as we seem to have the assumption that there is only one type of Dictionary per column. I would argue that we should be able to support this once we work out a reliable way to transfer them via the IPC mechanism. Just as a related thought (might not produce the result you want): In Parquet, only the values on the lowest level are dictionary-encoded. But this is also due to the fact that Parquet uses repetition and definition levels to encode arbitrarily nested data types. These are more space-efficient when they are correctly encoded but don't provide random access. Uwe On Fri, Apr 6, 2018, at 4:42 PM, Brian Hulette wrote: I've been considering a use-case with a dictionary-encoded struct column, which may contain some dictionary-encoded columns itself. More specifically, in this use-case each row represents a single observation in a geospatial track, which includes a position, a time, and some track-level metadata (track id, origin, destination, etc...). I would like to represent the metadata as a dictionary-encoded struct, since unique values will be repeated for each observation of that track, and I would _also_ like to dictionary-encode some of the metadata column's children, since unique values will typically be repeated in multiple tracks. I think one could make a (totally legitimate) argument that this is stretching a format designed for tabular data too far. This use-case could also be accomplished by breaking out the struct metadata column into its own arrow table, and managing a new integer column that references that table. This would look almost identical to what I initially described, it just wouldn't rely on the arrow libraries to manage the "dictionary". The spec doesn't have anything to say on this topic as far as I can tell, but our implementations don't currently allow a dictionary-encoded column's children to be dictionary-encoded themselves [1]. Is this just a simplifying assumption, or a hard rule that should be codified in the spec? Thanks, Brian [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824
[jira] [Created] (ARROW-2412) [Integration] Add nested dictionary integration test
Brian Hulette created ARROW-2412: Summary: [Integration] Add nested dictionary integration test Key: ARROW-2412 URL: https://issues.apache.org/jira/browse/ARROW-2412 Project: Apache Arrow Issue Type: Task Components: Integration Reporter: Brian Hulette Add nested dictionary generator to the integration test. The tests will probably fail at first but can serve as a starting point for developing this capability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2410) [JS] Add DataFrame.scanAsync
Brian Hulette created ARROW-2410: Summary: [JS] Add DataFrame.scanAsync Key: ARROW-2410 URL: https://issues.apache.org/jira/browse/ARROW-2410 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Brian Hulette Add a version of `DataFrame.scan`, `scanAsync` that yields periodically. The yield frequency could be specified either as a number of record batches, or a number of records. This scan should also be cancellable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
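A rough sketch of the idea with hypothetical names ({{batches}}, {{next}}, {{isCancelled}} are stand-ins; a real implementation would live on DataFrame and reuse the existing scan internals):
{code:javascript}
// Scan record batches, yielding to the event loop every `batchesPerTick`
// batches; a cancellation flag is checked at each yield point.
async function scanAsync(
  batches: { length: number }[],
  next: (index: number, batch: { length: number }) => void,
  batchesPerTick = 1,
  isCancelled: () => boolean = () => false
): Promise<void> {
  for (let b = 0; b < batches.length; ++b) {
    const batch = batches[b];
    for (let i = 0; i < batch.length; ++i) { next(i, batch); }
    if ((b + 1) % batchesPerTick === 0) {
      await new Promise(resolve => setTimeout(resolve, 0)); // let the UI breathe
      if (isCancelled()) { return; }
    }
  }
}
{code}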
Allow dictionary-encoded children?
I've been considering a use-case with a dictionary-encoded struct column, which may contain some dictionary-encoded columns itself. More specifically, in this use-case each row represents a single observation in a geospatial track, which includes a position, a time, and some track-level metadata (track id, origin, destination, etc...). I would like to represent the metadata as a dictionary-encoded struct, since unique values will be repeated for each observation of that track, and I would _also_ like to dictionary-encode some of the metadata column's children, since unique values will typically be repeated in multiple tracks. I think one could make a (totally legitimate) argument that this is stretching a format designed for tabular data too far. This use-case could also be accomplished by breaking out the struct metadata column into its own arrow table, and managing a new integer column that references that table. This would look almost identical to what I initially described, it just wouldn't rely on the arrow libraries to manage the "dictionary". The spec doesn't have anything to say on this topic as far as I can tell, but our implementations don't currently allow a dictionary-encoded column's children to be dictionary-encoded themselves [1]. Is this just a simplifying assumption, or a hard rule that should be codified in the spec? Thanks, Brian [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824
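For reference, the schema described above would look roughly like this, using the same type notation as elsewhere in this list (field layout and index widths are illustrative assumptions, not a concrete proposal):
{noformat}
position:  struct<lat: double, lon: double>
time:      timestamp[ms]
metadata:  dictionary<int32, struct<
             track_id:    utf8,
             origin:      dictionary<int16, utf8>,   <-- nested dictionary
             destination: dictionary<int16, utf8>    <-- nested dictionary
           >>
{noformat}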
[jira] [Created] (ARROW-2327) [JS] Table.fromStruct missing from externs
Brian Hulette created ARROW-2327: Summary: [JS] Table.fromStruct missing from externs Key: ARROW-2327 URL: https://issues.apache.org/jira/browse/ARROW-2327 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette {{Table.fromStruct}} is not listed in externs, so it's obfuscated by the Closure Compiler -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Apache Arrow JavaScript 0.3.1 - RC1
+1 (non-binding). Ran js-verify-release-candidate.sh with Node 8.9.1 on Ubuntu 16.04. Thanks Wes! On 03/15/2018 05:17 AM, Uwe L. Korn wrote: +1 (binding). Ran js-verify-release-candidate.sh with Node 9.8.0 On Thu, Mar 15, 2018, at 1:50 AM, Wes McKinney wrote: +1 (binding). Ran js-verify-release-candidate.sh with Node 8.10.0 LTS On Wed, Mar 14, 2018 at 8:40 PM, Paul Taylor wrote: +1 (non-binding) On Mar 14, 2018, at 5:10 PM, Wes McKinney wrote: Hello all, I'd like to propose the following release candidate (rc1) of Apache Arrow JavaScript version 0.3.1. The source release rc1 is hosted at [1]. This release candidate is based on commit 077bd53df590cafe26fc784b3c6d03bf1ac24f67 Please download, verify checksums and signatures, run the unit tests, and vote on the release. The easiest way is to use the JavaScript-specific release verification script dev/release/js-verify-release-candidate.sh. The vote will be open for at least 24 hours and will close once enough PMCs have approved the release. [ ] +1 Release this as Apache Arrow JavaScript 0.3.1 [ ] +0 [ ] -1 Do not release this as Apache Arrow JavaScript 0.3.1 because... How to validate a release signature: https://httpd.apache.org/dev/verification.html [1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.1-rc1/ [2]: https://github.com/apache/arrow/tree/077bd53df590cafe26fc784b3c6d03bf1ac24f67
Re: gReetings
If you prefer Slack over (or in addition to) the mailing list, there's also the Arrow Slack. We recently made a #javascript channel there for discussions about that implementation; you could certainly do the same for R. [1] https://apachearrow.slack.com [2] https://apachearrowslackin.herokuapp.com/ (auto-invite link) On 03/14/2018 02:07 PM, Romain Francois wrote: Sounds great. Le 14 mars 2018 à 19:03, Aneesh Karve a écrit : Hi Romain. Thanks for looking into this. Per discussion with Wes we'll keep the discussion on ASF channels so the community can participate.
[jira] [Created] (ARROW-2297) [JS] babel-jest is not listed as a dev dependency
Brian Hulette created ARROW-2297: Summary: [JS] babel-jest is not listed as a dev dependency Key: ARROW-2297 URL: https://issues.apache.org/jira/browse/ARROW-2297 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette babel-jest is not listed as a dev dependency, leading to the following error on new clones of arrow js: {noformat} [10:21:08] Starting 'test:ts'... ● Validation Error: Module ./node_modules/babel-jest/build/index.js in the transform option was not found. Configuration Documentation: https://facebook.github.io/jest/docs/configuration.html [10:21:09] 'test:ts' errored after 306 ms [10:21:09] Error: exited with error code: 1 at ChildProcess.onexit (/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36) at emitTwo (events.js:126:13) at ChildProcess.emit (events.js:214:7) at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12) [10:21:09] 'test' errored after 311 ms {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
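The fix is a one-line addition to {{package.json}} so fresh clones install babel-jest directly instead of relying on a transitive dependency (version shown is illustrative):
{code:javascript}
{
  "devDependencies": {
    "babel-jest": "^22.4.1"
  }
}
{code}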
Re: [VOTE] Release Apache Arrow JavaScript 0.3.1 - RC0
-1 (non-binding). I get an error when running js-verify-release-candidate.sh, which I can also replicate with a fresh clone of arrow on commit 17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6 by running `npm install` and then `npm run test -- -t ts`: [10:21:08] Starting 'test:ts'... ● Validation Error: Module ./node_modules/babel-jest/build/index.js in the transform option was not found. Configuration Documentation: https://facebook.github.io/jest/docs/configuration.html [10:21:09] 'test:ts' errored after 306 ms [10:21:09] Error: exited with error code: 1 at ChildProcess.onexit (/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36) at emitTwo (events.js:126:13) at ChildProcess.emit (events.js:214:7) at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12) [10:21:09] 'test' errored after 311 ms Seems like the issue is that babel-jest is not included as a dev dependency, so it's not found in node_modules in the new clone. Not sure how it was working in the past, perhaps it was a transitive dependency that was reliably included? I can put up a PR to add the dependency. Brian On 03/10/2018 01:52 PM, Wes McKinney wrote: +1 (binding), ran js-verify-release-candidate.sh with NodeJS 8.10.0 LTS on Ubuntu 16.04 On Sat, Mar 10, 2018 at 1:52 PM, Wes McKinney wrote: Hello all, I'd like to propose the 1st release candidate (rc0) of Apache Arrow JavaScript version 0.3.1. This is a bugfix release from 0.3.0. The source release rc0 is hosted at [1]. This release candidate is based on commit 17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6 Please download, verify checksums and signatures, run the unit tests, and vote on the release. The easiest way is to use the JavaScript-specific release verification script dev/release/js-verify-release-candidate.sh. The vote will be open for at least 24 hours and will close once enough PMCs have approved the release. [ ] +1 Release this as Apache Arrow JavaScript 0.3.1 [ ] +0 [ ] -1 Do not release this as Apache Arrow JavaScript 0.3.1 because... Thanks, Wes How to validate a release signature: https://httpd.apache.org/dev/verification.html [1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.1-rc0/ [2]: https://github.com/apache/arrow/tree/17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6
Re: Making a bugfix Arrow JS release
Naveen, Yes I think when we initially discussed adding the JS dataframe ops we argued that it could be a separate library within the Apache Arrow monorepo, since some users will just want the ability to read/write arrow data, and we shouldn't force them to pull in a dataframe API they won't be using. Right now there's not much to the dataframe parts of arrow js, so I think the cost is pretty minimal, but as it grows it will be a good idea to separate it out. Feel free to make a JIRA for this, maybe it can be a goal for the next JS release. Brian On 03/07/2018 10:00 AM, Naveen Michaud-Agrawal wrote: Hi Brian, Any thoughts on splitting out the dataframe like parts into a separate library, keeping arrowjs to just handle loading data out of the arrow buffer? Regards, Naveen Michaud-Agrawal
Re: Making a bugfix Arrow JS release
We're just wrapping up https://github.com/apache/arrow/pull/1678, and I would also like to merge https://github.com/apache/arrow/pull/1683, even though it's technically not a bugfix... it makes the DataFrame interface much more useful. Once we merge those I'd be happy to cut a bugfix release, unless there's anything else Paul would like to get in. Brian On 03/05/2018 02:21 PM, Wes McKinney wrote: Brian mentioned on GitHub that it might be good to make a 0.3.1 JS release due to bugs fixed since 0.3.0. Is there any other work that needs to be merged before doing this? Thanks Wes
[jira] [Created] (ARROW-2236) [JS] Add more complete set of predicates
Brian Hulette created ARROW-2236: Summary: [JS] Add more complete set of predicates Key: ARROW-2236 URL: https://issues.apache.org/jira/browse/ARROW-2236 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette Assignee: Brian Hulette Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and ||. We should also support !=, <, and > at the very least. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
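For what it's worth, the missing comparisons can be derived from the existing ones by negation. A sketch with hypothetical helper names (the real {{arrow.predicate}} combinators may differ, and null semantics would need care):
{code:javascript}
declare function eq(a: any, b: any): boolean;   // existing ==
declare function gteq(a: any, b: any): boolean; // existing >=
declare function lteq(a: any, b: any): boolean; // existing <=

// a != b  is  !(a == b);  a < b  is  !(a >= b);  a > b  is  !(a <= b)
const neq = (a: any, b: any) => !eq(a, b);
const lt  = (a: any, b: any) => !gteq(a, b);
const gt  = (a: any, b: any) => !lteq(a, b);
{code}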
[jira] [Created] (ARROW-2235) [JS] Add tests for IPC messages split across multiple buffers
Brian Hulette created ARROW-2235: Summary: [JS] Add tests for IPC messages split across multiple buffers Key: ARROW-2235 URL: https://issues.apache.org/jira/browse/ARROW-2235 Project: Apache Arrow Issue Type: Task Components: JavaScript Reporter: Brian Hulette See https://github.com/apache/arrow/pull/1670 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2234) [JS] Read timestamp low bits as Uint32s
Brian Hulette created ARROW-2234: Summary: [JS] Read timestamp low bits as Uint32s Key: ARROW-2234 URL: https://issues.apache.org/jira/browse/ARROW-2234 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Assignee: Paul Taylor -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2233) [JS] Error when slicing a DictionaryVector with nullable indices vector
Brian Hulette created ARROW-2233: Summary: [JS] Error when slicing a DictionaryVector with nullable indices vector Key: ARROW-2233 URL: https://issues.apache.org/jira/browse/ARROW-2233 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Falls through the checks and throws this error: https://github.com/apache/arrow/blob/master/js/src/vector.ts#L416 -- This message was sent by Atlassian JIRA (v7.6.3#76005)