Re: [DISCUSS] Updating what are considered reference implementations?

2023-01-11 Thread Brian Hulette
I think this [1] is the thread where the policy was proposed, but it
doesn't look like we ever settled on "Java and C++" vs. "any two
implementations", or had a vote.

I worry that requiring maintainers to add new format features to two
"complete" implementations will just lead to fragmentation. People might
opt to maintain a fork rather than unblock themselves by implementing a
backlog of features they don't need.

[1] https://lists.apache.org/thread/9t0pglrvxjhrt4r4xcsc1zmgmbtr8pxj

On Fri, Jan 6, 2023 at 12:33 PM Weston Pace  wrote:

> I think it would be reasonable to state that a reference
> implementation must be a complete implementation (i.e. supports all
> existing types) that is not derived from another implementation (e.g.
> you can't pick pyarrow and arrow-c++).  If an implementation does not
> plan on ever supporting a new array type then maintainers of that
> implementation should be empowered to vote against it.  Given that, it
> seems like a reasonable burden to ask maintainers to catch up first
> before expanding in new directions.
>
>
> On Fri, Jan 6, 2023 at 10:20 AM Micah Kornfield 
> wrote:
> >
> > >
> > > Note this wording talks about "two reference implementations" not
> "*the*
> > > two reference implementations". So there can be more than two reference
> > > implementations.
> >
> >
> > Maybe reference implementation is the wrong wording here.  My main
> concern
> > is that we try to maintain two "feature complete" implementations at all
> > times.  I worry that a pick-2-from-N reference implementation scheme
> > potentially leads to fragmentation more quickly.  But maybe this is
> > premature?
> >
> > Cheers,
> > Micah
> >
> >
> > On Fri, Jan 6, 2023 at 10:02 AM Antoine Pitrou 
> wrote:
> >
> > >
> > > Le 06/01/2023 à 18:58, Micah Kornfield a écrit :
> > > > I'm having trouble finding it, but I think we've previously agreed
> that
> > > new
> > > > features needed implementations in 2 reference implementations before
> > > > approval (I had thought the community agreed on Java and C++ as the
> two
> > > > implementations but I can't find the vote thread on it).
> > >
> > > Note this wording talks about "two reference implementations" not
> "*the*
> > > two reference implementations". So there can be more than two reference
> > > implementations.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>


Re: [VOTE] Remove compute from Arrow JS

2021-10-27 Thread Brian Hulette
+1

I don't think there's much reason to keep the compute code around when
there's a more performant, easier to use alternative. I think the only
unique feature of the arrow compute code was the ability to optimize
queries on dictionary-encoded columns, but Jeff added this to Arquero
almost a year ago now [1].

Brian

[1] https://github.com/uwdata/arquero/issues/86

On Wed, Oct 27, 2021 at 4:46 PM Dominik Moritz  wrote:

> Dear Arrow community,
>
> We are proposing to remove the compute code from Arrow JS. Right now, the
> compute code is encapsulated in a DataFrame class that extends Table. The
> DataFrame implements a few functions such as filtering and counting with
> expressions. However, the predicate code is not very efficient (it’s
> interpreted) and most people only use Arrow to read data but don’t need
> compute. There are also more complete alternatives for doing compute on
> Arrow data structures such as Arquero (https://github.com/uwdata/arquero).
> By removing the compute code, we can focus on the IPC reading/writing and
> primitive types.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Remove compute from Arrow JS
> [ ] +0
> [ ] -1 Do not remove compute because…
>
> Thank you,
> Dominik
>


Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Brian Hulette
I review a decent number of PRs for Apache Beam, and I've built some of my
own tooling to help keep track of open PRs. I wrote a script that pulls
metadata about all relevant PRs and uses some heuristics to categorize them
into:
- incoming review
- outgoing review
- "CC'd" - where I've been mentioned but am not the reviewer or author

In the first two cases I try to highlight the ones that need my
attention, simply by detecting if I'm the person who took the most recent
action or not. This works reasonably well but gets tripped up on several
edge cases:
1) The author might push multiple commits before they're actually ready for
more feedback.
2) A PR might need feedback from multiple reviewers (e.g. people with
domain knowledge of certain areas).

I've been planning to make my script stateful so that I can mark a PR as
"not my turn" (i.e. unhighlight this until there is more activity), and
maybe "never my turn" (i.e. I've finished reviewing this, it's waiting on
someone else), to handle these cases.
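
A rough sketch of that "who acted last" heuristic against the GitHub REST
API (not the actual script; the repo and username here are placeholders):

    import requests

    API = "https://api.github.com/repos/apache/beam"
    ME = "TheNeuralBit"  # placeholder reviewer username

    def needs_my_attention(pr_number):
        # Heuristic: highlight the PR unless I was the most recent actor.
        # Commits pushed after the last comment won't show up here, which
        # is exactly edge case #1 above.
        comments = requests.get(f"{API}/issues/{pr_number}/comments").json()
        if not comments:
            return True  # no discussion yet
        return comments[-1]["user"]["login"] != ME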

The idea of a "Addressing Feedback" -> "Waiting on Review" label that is
automatically transitioned when there is activity would run into these same
edge cases.
If a reviewer had the ability to bump the label back to "Addressing
Feedback", that would at least address #1.

I think Wes's proposal (a read-only web UI) would likely also run into
these edge cases since it stores no state of its own to deconflict in those
situations.

Brian

On Tue, Jun 29, 2021 at 6:26 AM Wes McKinney  wrote:

> On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb  wrote:
> >
> > The thing that would make me more efficient reviewing PRs is figuring out
> > which one of the open reviews are ready for additional feedback.
>
> Yes, I think this would be the single most significant quality-of-life
> improvement for reviewers.
>
> > I think the idea of a webapp or something that shows active reviews would
> > be helpful (though I get most of that from appropriate email filters).
> >
> > What about a system involving labels (for which there is already a basic
> > GUI in github)? Something low tech like
> >
> > (Waiting for Review)
> > (Addressing Feedback)
> > (Approved, waiting for Merge)
> >
> > With maybe some automation prompting people to add the "Waiting on
> Review"
> > label when they want feedback
>
> I think it would have to be a bot that automatically sets the labels.
> If it requires contributors to take some action outside of pushing new
> work (new commits or a rebased version of the patch) to the PR and
> leaving responses to comments on the PR, the system is likely to fail
> some non-trivial percentage of the time.


> Given the quality of off-the-shelf web app components nowadays (e.g.
> https://material-ui.com), throwing together a read-only PR dashboard
> that shows what has changed since you last interacted with them (along
> with some other helpful things, like whether the build is passing) is
> "probably" not a super heavy lift. I haven't done any frontend
> development in years so while the backend part (writing Python code to
> wrangle data from GitHub's REST API and put it in a SQLite database)
> wouldn't take very long I would need some help on the front end
> portion and setting it up for deployment on DigitalOcean or somewhere.
>
> > Andrew
> >
> > On Tue, Jun 29, 2021 at 4:28 AM Wes McKinney 
> wrote:
> >
> > > hi folks,
> > >
> > > I've noted that the volume of PRs for Arrow has been steadily
> > > increasing (and will likely continue to increase), and while I've
> > > personally had less time for development / maintenance / code reviews
> > > over the last year, I would like to have a discussion about what we
> > > could do to improve our tooling for maintainers to optimize the
> > > efficiency of time spent tending to the PR queue. In my own
> > > experience, I have felt that I have wasted a lot of time digging
> > > around the queue looking for PRs that are awaiting feedback or need to
> > > be merged.
> > >
> > > I note first of all that around 70 out of 173 open PRs have been
> > > updated in the last 7 days, so while there is some PR staleness, to
> > > have nearly half of the PRs active is pretty good. That said, ~70
> > > active PRs is a lot of PRs to tend to.
> > >
> > > I scraped the project's code review comment history, and here are the
> > > individuals who have left the most comments on PRs since genesis
> > >
> > > pitrou                6802
> > > wesm                  5023
> > > emkornfield           3032
> > > bkietz                2834
> > > kou                   1489
> > > nealrichardson        1439
> > > fsaintjacques         1356
> > > kszucs                1250
> > > alamb                 1133
> > > jorisvandenbossche    1094
> > > liyafan82              831
> > > lidavidm               816
> > > westonpace             794
> > > xhochy                 770
> > > nevi-me                643
> > > BryanCutler            639
> > > jorgecarleitao         635
> > > cpcloud                551
> > > sunc

Re: [ANNOUNCE] New Arrow committer: Dominik Moritz

2021-06-02 Thread Brian Hulette
Congratulations Dominik! Well deserved!

Really excited to see some momentum in the JavaScript library

On Wed, Jun 2, 2021 at 2:44 PM Dominik Moritz  wrote:

>  Thank you for the warm welcome, Wes.
>
> I look forward to continue working with you all on Arrow and in particular
> the Arrow JavaScript library.
>
> Dominik
>
> On Jun 2, 2021 at 14:19:51, Wes McKinney  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Dominik has
> > accepted an
> > invitation to become a committer on Apache Arrow. Welcome, and thank you
> > for your contributions!
> >
> > Wes
> >
>


Re: Long title on github page

2021-05-17 Thread Brian Hulette
Thank you for bringing this up Dominik. I sampled some of the descriptions
for other Apache projects I frequent; the ones with a meaningful
description have a single sentence:

github.com/apache/spark - Apache Spark - A unified analytics engine for
large-scale data processing
github.com/apache/beam - Apache Beam is a unified programming model for
Batch and Streaming
github.com/apache/avro - Apache Avro is a data serialization system

Several others (Flink, Hadoop, ...) just have "[Mirror of] Apache <project>"
as the description.

+1 for Nate's suggestion "Apache Arrow is a cross-language development
platform for in-memory data. It enables systems to process and transport
data more efficiently."

On Mon, May 17, 2021 at 5:23 AM Wes McKinney  wrote:

> It's probably best for description to limit mentions of specific
> features. There are some high level features mentioned in the
> description now ("computational libraries and zero-copy streaming
> messaging and interprocess communication"), but now in 2021 since the
> project has grown so much, it could leave people with a limited view
> of what they might find here.
>
> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
>  wrote:
> >
> > How about
> > 'Apache Arrow is a cross-language development platform for in-memory
> data.
> > It enables systems to process and transport data efficiently, providing a
> > simple and fast library for partitioning of large tables'?
> >
> > Sorry the delay, long election day
> >
> > On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> natebauernfe...@deephaven.io>
> > wrote:
> >
> > > Suggestion: faster -> more efficiently
> > >
> > > "Apache Arrow is a cross-language development platform for in-memory
> > > data. It enables systems to process and transport data more
> efficiently."
> > >
> > > On Sun, May 16, 2021 at 11:35 AM Wes McKinney 
> wrote:
> > >
> > > > Here's what there now:
> > > >
> > > > "Apache Arrow is a cross-language development platform for in-memory
> > > > data. It specifies a standardized language-independent columnar
> memory
> > > > format for flat and hierarchical data, organized for efficient
> > > > analytic operations on modern hardware. It also provides
> computational
> > > > libraries and zero-copy streaming messaging and interprocess
> > > > communication…"
> > > >
> > > > How about something shorter like
> > > >
> > > > "Apache Arrow is a cross-language development platform for in-memory
> > > > data. It enables systems to process and transport data faster."
> > > >
> > > > Suggestions / refinements from others welcome
> > > >
> > > >
> > > > On Sat, May 15, 2021 at 9:12 PM Dominik Moritz 
> wrote:
> > > > >
> > > > > Super minor issue but could someone make the description on GitHub
> > > > shorter?
> > > > >
> > > > >
> > > > >
> > > > > GitHub puts the description into the title of the page and makes it
> > > hard
> > > > to find it in URL autocomplete.
> > > > >
> > > >
> > >
> > >
> > > --
> > >
>


Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2021-04-30 Thread Brian Hulette
+1 this looks good to me.

My only concern is with criterion #3, "Is the underlying encoding of the
type already semantically supported by a type?". I think this is a good
criterion, but it's inconsistent with the current spec. By that criterion
some existing types (Timestamp, Time, Duration, Date) should be well-known
extension types, right?

Perhaps we should explicitly indicate these types are grandfathered in [1]
because they existed before extension types, to avoid tension with this
criterion.

Brian

[1] https://en.wikipedia.org/wiki/Grandfather_clause
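
To make criterion #3 concrete, here is roughly what such a type looks like
with pyarrow's extension API -- a hypothetical "meters" quantity stored as
int64, not anything actually proposed in this thread:

    import pyarrow as pa

    class MetersType(pa.ExtensionType):
        # Physical lengths stored as int64 -- per criterion #3 this stays
        # an extension type rather than a new entry in Schema.fbs.
        def __init__(self):
            super().__init__(pa.int64(), "example.meters")

        def __arrow_ext_serialize__(self):
            return b""  # no parameters to serialize

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return MetersType()

    pa.register_extension_type(MetersType())
    meters = pa.ExtensionArray.from_storage(MetersType(), pa.array([1, 5, 42]))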

On Thu, Apr 29, 2021 at 9:13 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thanks for writing this.
>
> I agree. That is a good decision tree. +1
>
> Best,
> Jorge
>
>
> On Thu, Apr 29, 2021 at 6:08 PM Micah Kornfield 
> wrote:
>
> > The discussion around adding another interval type to the Schema.fbs
> raises
> > the issue of when do we decide to add a new type to the Schema.fbs vs
> using
> > other means (primarily extension types [1]).
> >
> > A few criteria come to mind that could help decide (feedback welcome):
> >
> > 1.  Is the type a new parameterization of an existing type?
> > - If Yes, and we believe the parameterization is useful and can be
> done
> > in a forward/backward compatible manner then we would update Schema.fbs.
> >
> > 2.  Does the type itself have its own specification for processing (e.g.
> > JSON, BSON, Thrift, Avro, Protobuf)?
> >   - If yes, we would NOT add them to Schema.fbs.  I think this would
> > potentially yield too many new types.
> >
> > 3.  Is the underlying encoding of the type already semantically supported
> > by a type? (e.g. if we want to encode physical lengths like meters these
> > can be represented by an integer).
> >- If yes, we would NOT update the specification.  This seems like the
> > exact use-case that extension types are meant for.
> >
> > * How does this apply to Interval? *
> > Interval extends an existing type in the specification and multiple
> "packed
> > fields" cannot be easily communicated with the current version of the
> > specification.  Hence, I feel comfortable making the addition to
> Schema.fbs
> >
> > * What does this mean for other common types? *
> >
> > I think as types come up that are very common but we don't want to add to
> > the Schema.fbs we should invest in formalizing them as "Well Known"
> > Extension types.  In this scenario, we would update the specification to
> > include how to specify the extension type metadata (and still require at
> > least two libraries support the Extension type before inclusion as "Well
> > Known").
> >
> > * Practical implications *
> >
> > I think this means the type system in Schema.fbs is mostly closed (i.e.
> > there is a high bar for adding new types). One potentially useful type to
> > have would be a "packed struct" that supports something similar to python
> > struct library [2].  I think this would likely cover many extension type
> > use-cases.
> >
> > Thoughts?
> >
> > -Micah
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> > [2] https://docs.python.org/3/library/struct.html
> >
>


Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Brian Hulette
I agree this would be a great development. It would also be useful for
leveraging compute engines from JS via wasm.

I've thought about something like this in the context of multi-language
relational workloads in Apache Beam, mostly just leading me to wonder if
something like it already exists. But so far I haven't found it.

On Thu, Mar 18, 2021 at 7:39 AM Wes McKinney  wrote:

> I completely agree with developing a common “query protocol” or “physical
> execution plan” IR + serialization scheme inside Apache Arrow. It may take
> some time to stabilize so we should try to avoid being hasty in closing it
> to change until more time has elapsed to allow requirements to percolate.
>
> On Thu, Mar 18, 2021 at 8:17 AM Andy Grove  wrote:
>
> > Hi Paddy,
> >
> > Thanks for raising this.
> >
> > Ballista defines computations using protobuf [1] to describe logical and
> > physical query plans, which consist of operators and expressions. It is
> > actually based on the Gandiva protobuf [2] for describing expressions.
> >
> > I see a lot of value in standardizing some of this across
> implementations.
> > Ballista is essentially becoming a distributed scheduler for Arrow and
> can
> > work with any implementation that supports this protobuf definition of
> > query plans.
> >
> > It would also make it easier to embed C++ in Rust, or Rust in C++, having
> > this common IR, so I would be all for having something like this as an
> > Arrow specification.
> >
> > Thanks,
> >
> > Andy.
> >
> > [1]
> >
> >
> https://github.com/ballista-compute/ballista/blob/main/rust/core/proto/ballista.proto
> > [2]
> >
> >
> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> >
> >
> > On Thu, Mar 18, 2021 at 7:40 AM paddy horan 
> > wrote:
> >
> > > Hi All,
> > >
> > > I do not have a computer science background so I may not be asking this
> > in
> > > the correct way or using the correct terminology but I wonder if we can
> > > achieve some level of standardization when describing computation over
> > > Arrow data.
> > >
> > > At the moment on the Rust side DataFusion clearly has a way to describe
> > > computation, I believe that Ballista adds the ability to serialize this
> > to
> > > allow distributed computation.  On the C++ side work is starting on a
> > > similar query engine and we already have Gandiva.  Is there an
> > opportunity
> > > to define a kind of IR for computation over Arrow data that could be
> > > adopted across implementations?
> > >
> > > In this case DataFusion could easily incorporate Gandiva to generate
> > > optimized compute kernels if they were using the same IR to describe
> > > computation.  Applications built on Arrow could "describe" computation
> in
> > > any language and take advantage or innovations across the community,
> > adding
> > > this to Arrow's zero copy data sharing could be a game changer in my
> > mind.
> > > I'm not someone who knows enough to drive this forward but I obviously
> > > would like to get involved.  For some time I was playing around with
> > using
> > > TVM's relay IR [1] and applying it to Arrow data.
> > >
> > > As the Arrow memory format has now matured I feel like this could be
> the
> > > next step.  Is there any plan for this kind of work or are we going to
> > > allow sub-projects to "go their own way"?
> > >
> > > Thanks,
> > > Paddy
> > >
> > > [1] - Introduction to Relay IR - tvm 0.8.dev0 documentation (
> apache.org
> > )<
> > > https://tvm.apache.org/docs/dev/relay_intro.html>
> > >
> > >
> >
>


Arrow JS Meetup (02/13)

2021-02-09 Thread Brian Hulette
Hi all,

+Dominik Moritz  recently reached out to +Paul Taylor
 and myself to set up an Arrow JS meetup with the goal
of re-building some momentum around the Arrow JS library. We've scheduled
it for this coming Saturday, 02/13 at 11:30 AM PST. Rough Agenda:

- Arrow JS Design Principles, Future Plans, and How to Contribute (Paul and
Brian)
- Lightning Talks from Arrow JS users
- Discussions/breakouts as needed

If anyone is interested in joining please reach out to Dominik at
domor...@cmu.edu
For anyone who can't join - I will try my best to capture notes and share
them with the mailing list afterward.

Brian


Re: [javascript] streaming IPC examples?

2021-01-24 Thread Brian Hulette
+Paul Taylor  would your work with whatwg streams be
relevant here? Are there any examples that would be useful for Ryan?

Brian
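
For reference, the equivalent pattern in pyarrow (a sketch only -- the JS
reader API differs, and the file name is a placeholder) reads the schema
from the stream header first, then yields one record batch at a time:

    import pyarrow as pa

    with open("stream.arrows", "rb") as f:
        reader = pa.ipc.open_stream(f)
        print(reader.schema)       # available as soon as the header is read
        for batch in reader:       # one iteration per record batch
            print(batch.num_rows)  # stand-in for an rxjs-style callback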

On Sat, Jan 23, 2021 at 4:52 PM Ryan McKinley  wrote:

> Hello-
>
> I am exploring options to support streaming in grafana.  We have a golang
> websocket server and am exploring options to send data to the browser.
>
> Are there any good examples of reading IPC data with callbacks for each
> block?  I see examples for mapd, and for reading whole tables -- but am
> hoping for something that lets me read initial header data, then get each
> record batch as a callback (rxjs)
> https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
>
> Thanks for any pointers
> Ryan
>


Re: [javascript] cant get timestamps in arrow 2.0

2020-12-18 Thread Brian Hulette
Ah good to know, thanks for the clarifications Neal. Clearly I haven't been
keeping up very well.

On Fri, Dec 18, 2020, 09:49 Neal Richardson 
wrote:

> A few clarifications: Feather, in its version 2, _is_ the Arrow IPC file
> format. We've kept the Feather name as a way of referring to Arrow files.
> The original Feather file format, which had differences from the Arrow IPC
> format, did not support compression. The Arrow IPC format may include
> compression (https://issues.apache.org/jira/browse/ARROW-300), but as
> Micah
> brought up on the user mailing list thread, it's only the C++
> implementation and libraries using it that have implemented it yet, and the
> feature is not well documented yet.
>
> So all Arrow libraries support Feather v2 (as it is the IPC file format),
> but currently only C++ (thus Python, R, and glib/Ruby) supports Feather/IPC
> files with compression.
>
> Neal
>
> On Fri, Dec 18, 2020 at 8:18 AM Brian Hulette  wrote:
>
> >  Hi Andrew,
> > I'm glad you got this working! The javascript library only implements the
> > arrow IPC spec, it doesn't have any special handling for feather and its
> > compression support. It's good to know that you can read uncompressed
> > feather files, but I'd only expect it to read an IPC stream or file. This
> > is what I did for the Intro to Arrow JS notebook [1], see scrabble.py
> here
> > [2]. Note that python script was written many versions of arrow ago, I'm
> > sure there's less boilerplate required for this in pyarrow 2.0.
> >
> > Support for feather and compression would certainly be a welcome
> > contribution
> >
> > [1] https://observablehq.com/@theneuralbit/introduction-to-apache-arrow
> > [2]
> https://gist.github.com/TheNeuralBit/64d8cc13050c9b5743281dcf66059de5
> >
> > On Thu, Dec 17, 2020 at 10:10 AM Andrew Clancy  wrote:
> >
> > > So, I figured out the issue here - I had to remove compression from the
> > > pyarrow feather.write_feather(compression='uncompressed'). Is there any
> > way
> > > to read a compressed feather file in arrow js?
> > > See the comment under the first answer here:
> > >
> > >
> >
> https://stackoverflow.com/questions/64629670/how-to-write-a-pandas-dataframe-to-arrow-file/64648955#64648955
> > > I couldn't find anything in the arrow docs or notebooks on this - I'm
> > > assuming that's related to javascript compression libraries being so
> > > limited.
> > >
> > > On Mon, 14 Dec 2020 at 19:02, Andrew Clancy  wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a simple feather file created via a pandas to_feather with a
> > > > datetime64[ns] column, and cannot get timestamps in javascript
> > > > apache-arrow@2.0.0
> > > >
> > > > See this notebook:
> > > > https://observablehq.com/@nite/apache-arrow-timestamp-investigation
> > > >
> > > > I'm guessing I'm missing something, has anyone got any suggestions,
> or
> > > > decent examples of reading a file created in pandas? I've seen in
> > > examples
> > > > of apache-arrow@0.3.1 where dates stored as an array of 2 ints.
> > > >
> > > > File was created with:
> > > >
> > > > import pandas as pd
> > > > pd.read_parquet('sample.parquet')
> > > > df.to_feather('sample-seconds.feather')
> > > >
> > > > Final Q: I'm assuming this is the best place for this question? Happy
> > to
> > > > post elsewhere if there's any other forums, or if this should be a
> JIRA
> > > > ticket?
> > > >
> > > > Thanks!
> > > > Andy
> > > >
> > >
> >
>


Re: [javascript] cant get timestamps in arrow 2.0

2020-12-18 Thread Brian Hulette
 Hi Andrew,
I'm glad you got this working! The javascript library only implements the
arrow IPC spec, it doesn't have any special handling for feather and its
compression support. It's good to know that you can read uncompressed
feather files, but I'd only expect it to read an IPC stream or file. This
is what I did for the Intro to Arrow JS notebook [1], see scrabble.py here
[2]. Note that python script was written many versions of arrow ago, I'm
sure there's less boilerplate required for this in pyarrow 2.0.

Support for feather and compression would certainly be a welcome
contribution

[1] https://observablehq.com/@theneuralbit/introduction-to-apache-arrow
[2] https://gist.github.com/TheNeuralBit/64d8cc13050c9b5743281dcf66059de5
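
For anyone landing here later, a minimal sketch of producing such an IPC
file from pandas with a recent pyarrow (assumed API; roughly what
scrabble.py does, with much less boilerplate):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"word": ["muzjiks", "qi"], "score": [29, 11]})
    table = pa.Table.from_pandas(df)

    # Write the Arrow IPC file format (uncompressed), which the JS library
    # can read directly.
    with pa.ipc.new_file("data.arrow", table.schema) as writer:
        writer.write_table(table)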

On Thu, Dec 17, 2020 at 10:10 AM Andrew Clancy  wrote:

> So, I figured out the issue here - I had to remove compression from the
> pyarrow feather.write_feather(compression='uncompressed'). Is there any way
> to read a compressed feather file in arrow js?
> See the comment under the first answer here:
>
> https://stackoverflow.com/questions/64629670/how-to-write-a-pandas-dataframe-to-arrow-file/64648955#64648955
> I couldn't find anything in the arrow docs or notebooks on this - I'm
> assuming that's related to javascript compression libraries being so
> limited.
>
> On Mon, 14 Dec 2020 at 19:02, Andrew Clancy  wrote:
>
> > Hi,
> >
> > I have a simple feather file created via a pandas to_feather with a
> > datetime64[ns] column, and cannot get timestamps in javascript
> > apache-arrow@2.0.0
> >
> > See this notebook:
> > https://observablehq.com/@nite/apache-arrow-timestamp-investigation
> >
> > I'm guessing I'm missing something, has anyone got any suggestions, or
> > decent examples of reading a file created in pandas? I've seen in
> examples
> > of apache-arrow@0.3.1 where dates stored as an array of 2 ints.
> >
> > File was created with:
> >
> > import pandas as pd
> > pd.read_parquet('sample.parquet')
> > df.to_feather('sample-seconds.feather')
> >
> > Final Q: I'm assuming this is the best place for this question? Happy to
> > post elsewhere if there's any other forums, or if this should be a JIRA
> > ticket?
> >
> > Thanks!
> > Andy
> >
>


Re: [Discuss] [Rust] Looking to add Wasm32 compile target for rust library

2020-07-20 Thread Brian Hulette
On Tue, Jul 14, 2020 at 9:36 AM Micah Kornfield 
wrote:

> Hi Adam,
>
> > This sounds really interesting, how about adding the wasm build (C++) to
> > the releases?
>
> I think this just needs someone to volunteer to do it and maintain it (at a
> minimum if it doesn't already exist we need CI for it).  We would also need
> to figure out details of publishing and integrating it into the release
> process.
>

Yes a wasm build for the core C++ library would be a welcome addition as
well (as long as C++ maintainers agree whatever we do doesn't add a large
maintenance burden). As Micah pointed out folks at JPMC have already done
some work on this as part of Perspective, but we don't have any support in
Arrow itself.
I gave this a shot after being encouraged by [1], but ran into issues that
I can't recall and gave up. Probably someone with more knowledge of C++ and
cmake could get past it, especially given there's an example in Perspective.

As far as release/publishing, for Rust there's wasm-pack [2] which would
let us publish build artifacts to the npm registry for use in JS. I'm not
sure if this is helpful for integrating with Spark or not.

FWIW there was another thread [3] about wasm builds for Rust and C++ a
while back.

[1] https://github.com/apache/arrow/pull/3350#issuecomment-464517253
[2] https://rustwasm.github.io/wasm-pack/book/introduction.html
[3]
https://lists.apache.org/thread.html/e15dc80debf9dea1b33581fa6ba95fd84b57c0ccd0162505d5d25079%40%3Cdev.arrow.apache.org%3E


> I've done a lot of asm.js work (different from wasm) in the past, but my
> > assumption would be that using Rust instead of C++ as source for wasm
> > should result in smaller wasm binaries.
>
> I don't know much about either, but I'm curious why you would expect this
> to be the case?
>
> On Tue, Jul 14, 2020 at 8:07 AM Adam Lippai  wrote:
>
> > This sounds really interesting, how about adding the wasm build (C++) to
> > the releases?
> > I've done a lot of asm.js work (different from wasm) in the past, but my
> > assumption would be that using Rust instead of C++ as source for wasm
> > should result in smaller wasm binaries.
> > Rust Arrow doesn't really use exotic solutions, eg. simd or tokio
> > dependency can be turned off.
> >
> > Having DataFusion + some performant data access in browsers or even in
> > node.js would be useful.
> > Not needing to build fancy HTTP/GraphQL API over the Rust/C++ impl. but
> > moving the data processing code to the client is viable for "small"
> > workloads.
> > Ofc if JS Arrow lands Flight support this may become less of an issue,
> but
> > AFAIK it's gRPC based which would need setting up a gRPC reverse proxy
> for
> > C++/Rust Arrow.
> > Overall both the code-duplication and feature fragmentation would
> decrease
> > by using a single source (like you don't have a full Python impl. for
> > obvious reasons)
> >
> > Best regards,
> > Adam Lippai
> >
> > On Tue, Jul 14, 2020 at 4:27 PM Micah Kornfield 
> > wrote:
> >
> >> Fwiw, I believe at least the core c++ library already can be compiled to
> >> wasm. I  think perspective does this [1]
> >>
> >>
> >>  I'm curious What are you hoping to achieve with embedded wasm  in
> spark?
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://perspective.finos.org/
> >>
> >> On Tuesday, July 14, 2020, Brian Hulette  wrote:
> >>
> >> > That sounds great! I'd like to have some support for using the rust
> >> and/or
> >> > C++ libraries in the browser via wasm as well.
> >> > As long as the community is ok with your overall approach "to add
> >> compiler
> >> > conditionals around any I/O features and libc dependent features of
> >> these
> >> > two libraries," I think it may be best to start with a PR and discuss
> >> > specifics from there.
> >> >
> >> > Do any rust contributors have objections to this?
> >> >
> >> > Brian
> >> >
> >> > On Mon, Jul 13, 2020 at 9:42 PM RJ Atwal  wrote:
> >> >
> >> > >  Hi all,
> >> > >
> >> > > Looking for guidance on how to submit a design and PR to add WASM32
> >> > support
> >> > > to apache arrow's rust libraries.
> >> > >
> >> > > I am looking to use the arrow library to pass data in arrow format
> >> > between
> >> > > the host spark environment and UDFs defined in WASM .
> >> > >
> >> > > I created the following JIRA ticket to capture the work
> >> > > https://issues.apache.org/jira/browse/ARROW-9453
> >> > >
> >> > > Thanks,
> >> > > RJ
> >> > >
> >> >
> >>
> >
>


Re: [Discuss] [Rust] Looking to add Wasm32 compile target for rust library

2020-07-14 Thread Brian Hulette
That sounds great! I'd like to have some support for using the rust and/or
C++ libraries in the browser via wasm as well.
As long as the community is ok with your overall approach "to add compiler
conditionals around any I/O features and libc dependent features of these
two libraries," I think it may be best to start with a PR and discuss
specifics from there.

Do any rust contributors have objections to this?

Brian

On Mon, Jul 13, 2020 at 9:42 PM RJ Atwal  wrote:

>  Hi all,
>
> Looking for guidance on how to submit a design and PR to add WASM32 support
> to apache arrow's rust libraries.
>
> I am looking to use the arrow library to pass data in arrow format between
> the host spark environment and UDFs defined in WASM .
>
> I created the following JIRA ticket to capture the work
> https://issues.apache.org/jira/browse/ARROW-9453
>
> Thanks,
> RJ
>


Re: [JavaScript] how to set column name after creation?

2020-06-26 Thread Brian Hulette
Hi Ryan,
Here or user@arrow.apache.org is a fine place to ask :)

The metadata on Table/Column/Field objects are all immutable, so doing this
right now would require creating a new instance of Table with the field
renamed, which takes quite a lot of boilerplate. A helper for renaming a
column (or even better a generalization of select [1] that lets you do a
full projection, including column renames) would be a great contribution.

Here's an example of creating a renamed column, which should get you most
of the way to creating a Table with a renamed column:
https://observablehq.com/@theneuralbit/renaming-an-arrow-column

Brian

[1]
https://github.com/apache/arrow/blob/ff7ee06020949daf66ac05090753e1a17736d9fa/js/src/table.ts#L249
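
For comparison, a Python analogue of the same rebuild-don't-mutate approach
(a sketch; pyarrow also ships a Table.rename_columns helper):

    import pyarrow as pa

    table = pa.table({"old_name": [1, 2, 3], "other": ["a", "b", "c"]})

    # "Rename" by building a new table over the same columns -- the
    # underlying buffers are shared, not copied.
    new_names = ["new_name" if n == "old_name" else n
                 for n in table.column_names]
    renamed = pa.table(dict(zip(new_names, table.columns)))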

On Thu, Jun 25, 2020 at 4:04 PM Ryan McKinley  wrote:

> Apologies if this is the wrong list or place to ask...
>
> What is the best way to update a column name for a Table in javascript?
>
> const col = table.getColumnAt(i);
> col.name = 'new name!'
>
> Currently: Cannot assign to 'name' because it is a read-only property
>
> Thanks!
>
> ryan
>


Re: Why downloading sources of pyarrow and its requirements takes several minutes?

2020-05-29 Thread Brian Hulette
+1 fo a jira to track this. I looked into it a little bit just out of
curiosity.

I passed --verbose to pip to get insight into what's going on in in the
"Installing build dependencies..." step. I did this for both 0.15.1 and
0.16. They took 4:10 and 5:57 respectively.  It looks like 0.16.0 spent
2:43 installing numpy, which is absent from the 0.15.1 log. I'm not sure
what changed to cause this.

I collected logs with the following command (note it relies on ts in
moreutils for adding timestamps):
  python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary :all:
--verbose 2>&1 | ts | tee /tmp/0.16.0.log
I found the numpy difference and measured its runtime by grepping for
"Running setup.py" in these logs.

The logs are uploaded to google drive:
https://drive.google.com/drive/folders/1rPoYAsVul3HGdrviiCLGPf_P8dOlBCd1?usp=sharing

On Fri, May 29, 2020 at 5:49 AM Wes McKinney  wrote:

> hi Valentyn,
>
> This is the first I've ever heard of anyone doing what you are doing,
> so safe to say that we've given little to no consideration to this use
> case. We have been focused on providing binary packages for pip and
> conda. Could you please open a JIRA and provide more detailed
> information about what you are seeing?
>
> Thanks
> Wes
>
> On Thu, May 28, 2020 at 4:47 PM Valentyn Tymofieiev
>  wrote:
> >
> > Hi Arrow dev community,
> >
> > Do you have any insight why
> >
> >   python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary
> > :all:
> >
> > takes several minutes to execute? From the output we can see that pip gets
> > stuck on:
> >
> >   File was already downloaded /tmp/pyarrow-0.16.0.tar.gz
> >   Installing build dependencies ... |
> >
> > There is a significant increase in runtime between 0.15.1 and 0.16.0. I
> > suspect some build dependencies need to be installed before pip
> > understands the dependencies of pyarrow.  Is there some inefficiency in
> > pyarrow's setup.py that is causing this?
> >
> > Thanks,
> > Valentyn
>


Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

2020-03-12 Thread Brian Hulette
* What kind of devops tooling would be appropriate to provision and
manage the instances, scaling up and down based on need?
* What CI/CD platform would be appropriate to dispatch work to the
cloud nodes (taking into consideration the high costs of sysadmin, and
seeking to minimize nodes sitting unused)?

I looked into solutions for running CI/CD workers on GCP a (very) little
bit and just wanted to shared some findings.
Appveyor claims it can auto-scale GCE instances [1] but I don't think it
would go beyond 5 concurrent "self-hosted" jobs [2]. Would that be a
problem?
BuildKite has documentation about running agents on a scalable GKE cluster
[3], but unfortunately no way to auto-scale based on the backlog. We could
maybe roll our own/contribute something based on their AWS scaler [4].

[1] https://www.appveyor.com/docs/byoc/gce/
[2] https://www.appveyor.com/pricing/
[3]
https://buildkite.com/docs/agent/v3/gcloud#running-the-agent-on-google-kubernetes-engine
[4] https://github.com/buildkite/buildkite-agent-scaler

On Wed, Mar 11, 2020 at 7:49 PM Micah Kornfield 
wrote:

> >
> > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> > donate cloud compute credits to the project
>
> Google has offered a donation of GCP credits based on some estimates I made
> last year when we were facing Travis CI issues. I'm happy to try to do some
> integration work to help make this happen.
>
> For the other questions, I'm happy to do some research, but also happy if
> someone else would like to take up the work here.  I think one blocker in
> the past has been restrictions from Apache Infra, is there any
> documentation on what is and is not supported on that front?
>
> Thanks,
> Micah
> On Wed, Mar 11, 2020 at 3:17 PM Wes McKinney  wrote:
>
> > hi folks,
> >
> > There has periodically been a discussion about employing dedicated
> > compute resources to serve our testing needs beyond what can be
> > accomplished in free / public CI services like GitHub Actions,
> > Appveyor, etc. For example:
> >
> > * Workloads requiring a CUDA-capable GPU
> > * Tests requiring a lot of memory
> > * ARM architecture
> >
> > While physical machines can be hooked up to some CI/CD services like
> > Github Actions and Buildkite, I believe we should not be 100%
> > dependent on the availability of such hardware (the recent tornado in
> > Nashville is a good example of what can go wrong).
> >
> > At some point it will make sense to be able to provision cloud hosts
> > (either temporary spot instances or persistent nodes) to meet these
> > needs. This brings up several questions:
> >
> > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> > donate cloud compute credits to the project
> > * What kind of devops tooling would be appropriate to provision and
> > manage the instances, scaling up and down based on need?
> > * What CI/CD platform would be appropriate to dispatch work to the
> > cloud nodes (taking into consideration the high costs of sysadmin, and
> > seeking to minimize nodes sitting unused)?
> >
> > This will probably take time to work out and there is significant
> > engineering involved in achieving any solution, but it would be good
> > to have all the options on the table with a frank analysis of the
> > pros/cons and costs (both in money and volunteer time) involved.
> >
> > Thanks,
> > Wes
> >
>


Re: [DISCUSS][Java] Support non-nullable vectors

2020-03-11 Thread Brian Hulette
> And there is a "nullable" metadata-only flag at the
> Field level. Could the same kinds of optimizations be implemented in
> Java without introducing a "nullable" concept?

Note Liya Fan did suggest pulling the nullable flag from the Field when the
vector is created in item (1) of the proposed changes.

Brian
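
For reference, a pyarrow sketch of the two pieces being discussed -- the
Field-level nullable flag and the null-count-of-zero fast path:

    import pyarrow as pa

    # The metadata-only "nullable" flag Wes refers to lives on the Field:
    schema = pa.schema([pa.field("id", pa.int64(), nullable=False)])
    assert not schema.field("id").nullable

    # Fast paths can also be elected from the data itself: a null_count of
    # zero means the validity bitmap never needs to be consulted.
    arr = pa.array([1, 2, 3])
    assert arr.null_count == 0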

On Wed, Mar 11, 2020 at 5:54 AM Fan Liya  wrote:

> Hi Micah,
>
> Thanks a lot for your valuable comments. Please see my comments inline.
>
> > I'm a little concerned that this will change assumptions for at least
> some
> > of the clients using the library (some might always rely on the validity
> > buffer being present).
>
> I can understand your concern and I am also concerned.
> IMO, the client should not depend on this assumption, as the specification
> says "Arrays having a 0 null count may choose to not allocate the validity
> bitmap." [1]
> That being said, I think it would be safe to provide a global flag to
> switch on/off the feature (as you suggested).
>
> > I think this is a good feature to have for the reasons you mentioned. It
> > seems like there would need to be some sort of configuration bit to set
> for
> > this behavior.
>
> Good suggestion. We should be able to switch on and off the feature with a
> single global flag.
>
> > But, I'd be worried about code complexity this would
> > introduce.
>
> I agree with you that code complexity is an important factor to consider.
> IMO, our proposal should not involve too much code change, or increase code
> complexity too much.
> To prove this, maybe we need to show some small experimental code change.
>
> Best,
> Liya Fan
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#logical-types
>
> On Wed, Mar 11, 2020 at 1:53 PM Micah Kornfield 
> wrote:
>
> > Hi Liya Fan,
> > I'm a little concerned that this will change assumptions for at least
> some
> > of the clients using the library (some might always rely on the validity
> > buffer being present).
> >
> > I think this is a good feature to have for the reasons you mentioned. It
> > seems like there would need to be some sort of configuration bit to set
> for
> > this behavior. But, I'd be worried about code complexity this would
> > introduce.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Mar 10, 2020 at 6:42 AM Fan Liya  wrote:
> >
> > > Hi Wes,
> > >
> > > Thanks a lot for your quick reply.
> > > I think what you mentioned is almost exactly what we want to do in
> > Java.The
> > > concept is not important.
> > >
> > > Maybe there are only some minor differences:
> > > 1. In C++, the null_count is mutable, while for Java, once a vector is
> > > constructed as non-nullable, its null count can only be 0.
> > > 2. In C++, a non-nullable array's validity buffer is null, while in
> Java,
> > > the buffer is an empty buffer, and cannot be changed.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Tue, Mar 10, 2020 at 9:26 PM Wes McKinney 
> > wrote:
> > >
> > > > hi Liya,
> > > >
> > > > In C++ we elect certain faster code paths when the null count is 0 or
> > > > computed to be zero. When the null count is 0, we do not allocate a
> > > > validity bitmap. And there is a "nullable" metadata-only flag at the
> > > > Field level. Could the same kinds of optimizations be implemented in
> > > > Java without introducing a "nullable" concept?
> > > >
> > > > - Wes
> > > >
> > > > On Tue, Mar 10, 2020 at 8:13 AM Fan Liya 
> wrote:
> > > > >
> > > > > Dear all,
> > > > >
> > > > > A non-nullable vector is one that is guaranteed to contain no
> nulls.
> > We
> > > > > want to support non-nullable vectors in Java.
> > > > >
> > > > > *Motivations:*
> > > > > 1. It is widely used in practice. For example, in a database
> engine,
> > a
> > > > > column can be declared as not null, so it cannot contain null
> values.
> > > > > 2. Non-nullable vectors have significant performance advantages
> > compared
> > > > with
> > > > > their nullable counterparts, such as:
> > > > >   1) the memory space of the validity buffer can be saved.
> > > > >   2) manipulation of the validity buffer can be bypassed
> > > > >   3) some if-else branches can be replaced by sequential
> instructions
> > > (by
> > > > > the JIT compiler), leading to high throughput for the CPU pipeline.
> > > > >
> > > > > *Potential Cost:*
> > > > > For nullable vectors, there can be extra checks against the
> > > nullability
> > > > > flag. So we must change the code in a way that minimizes the cost.
> > > > >
> > > > > *Proposed Changes:*
> > > > > 1. There is no need to create new vector classes. We add a final
> > > boolean
> > > > to
> > > > > the vector base classes as the nullability flag. The value of the
> > flag
> > > > can
> > > > > be obtained from the field when creating the vector.
> > > > > 2. Add a method "boolean isNullable()" to the root interface
> > > ValueVector.
> > > > > 3. If a vector is non-nullable, its validity buffer should be an
> > empty
> > > > > buffer (not null, so much of the existing logic can be left
> > unchanged).

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

2020-02-09 Thread Brian Hulette
> It seems we should potentially disallow dictionaries to contain null
values?
+1 - I've always thought it was odd you could encode null values in two
different places for dictionary encoded columns.
You could argue it's more efficient to encode the nulls in the dictionary,
but I think if we're going to allow that we should go further - we know
there should only be _one_ index with the NULL value in a dictionary, why
encode an entire validity buffer? Maybe this is one place where a sentinel
value makes sense.
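
To illustrate the ambiguity with pyarrow (a sketch): the same logical
column ["a", null, "b"] can put its null in either of two places:

    import pyarrow as pa

    # Null stored inside the dictionary, reached through index 2:
    via_dict = pa.DictionaryArray.from_arrays(
        pa.array([0, 2, 1], type=pa.int8()),
        pa.array(["a", "b", None]))

    # Null stored in the indices' validity bitmap instead:
    via_index = pa.DictionaryArray.from_arrays(
        pa.array([0, None, 1], type=pa.int8()),
        pa.array(["a", "b"]))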


The mailing list thread where I brought up the idea of nested dictionaries
[1] is useful context for item 2. I still think this is a good idea, but
I've changed jobs since then and the use-case I described is no longer
motivating me to actually implement it.

> It seems simpler to keep dictionary encoding at the leaves of the schema.
Do we need to go that far? I think we could still allow dictionary encoding
at any level of a hierarchy, and just disallow nested dictionaries.

Brian

[1]
https://lists.apache.org/thread.html/37c0480c4c7a48dd298e8459938444afb901bf01dcebd5f8c5f1dee6%40%3Cdev.arrow.apache.org%3E

On Sat, Feb 8, 2020 at 10:53 PM Micah Kornfield 
wrote:

> I'd like to understand if any one is making use of the following features
> and if we should revisit them before 1.0.
>
> 1. Dictionaries can encode null values.
> - This become error prone for things like parquet.  We seem to be
> calculating the definition level solely based on the null bitmap.
>
> I might have missed something but it appears that we only check if a
> dictionary contains nulls on the optimized path [1] but not when converting
> the dictionary array back to dense, so I think the values written could get
> out of sync with the rep/def levels?
>
> It seems we should potentially disallow dictionaries to contain null
> values?
>
> 2.  Dictionaries can contain nested columns which are in turn dictionary
> encoded columns.
>
> - Again we aren't handling this in Parquet today, and I'm wondering if it
> worth the effort.
> There was a PR merged a while ago [2] to add a "skipped" integration test
> but it doesn't look like anyone has done follow-up work to make enable
> this/make it pass.
>
> It seems simpler to keep dictionary encoding at the leaves of the schema.
>
> Of the two I'm a little more worried that Option #1 will break people if we
> decide to disallow it.
>
> Thoughts?
>
> Thanks,
> Micah
>
>
> [1]
>
> https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> [2] https://github.com/apache/arrow/pull/1848
>


Re: [Java] PR Reviewers

2020-01-25 Thread Brian Hulette
I'm still pretty new to the Java implementation, but I can probably help
out with some reviews.

On Thu, Jan 23, 2020 at 8:41 PM Micah Kornfield 
wrote:

> I mentioned this elsewhere but my intent is to stop doing java reviews for
> the immediate future once I wrap up the few that I have requested change
> on.
>
> I'm happy to try to triage incoming Java PRs, but in order to do this, I
> need to know which committers have some bandwidth to do reviews (some of
> the existing PRs I've tagged people who never responded).
>
> Thanks,
> Micah
>


Re: [DISCUSS][JAVA] Correct the behavior of ListVector isEmpty

2020-01-24 Thread Brian Hulette
What about returning null for a null list? It looks like now the function
returns a primitive boolean, so I guess that would be a substantial change,
but null seems more correct to me.
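
Modeling that three-valued semantics on Ji Liu's example data (a Python
sketch of the behavior, not the Java API):

    import pyarrow as pa

    arr = pa.array([[1, 2], None, [], [5, 6]])
    # Null for a null list, true only for a genuinely empty one:
    is_empty = [None if v is None else len(v) == 0 for v in arr.to_pylist()]
    assert is_empty == [False, None, True, False]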

On Thu, Jan 23, 2020, 21:38 Micah Kornfield  wrote:

>  I would vote for treating nulls as empty.
>
> On Fri, Jan 10, 2020 at 12:36 AM Ji Liu 
> wrote:
>
> > Hi all,
> >
> > Currently the isEmpty API always returns false in BaseRepeatedValueVector,
> > and its subclass ListVector does not override this method.
> > This will lead to incorrect result, for example, a ListVector with data
> > [1,2], null, [], [5,6] would get [false, false, false, false] which is
> not
> > right.
> > I opened a PR to fix this [1], but I'm not sure what the right behavior
> > for null values is: should it return [false, false, true, false] or
> > [false, true, true, false]?
> >
> >
> > Thanks,
> > Ji Liu
> >
> >
> > [1] https://github.com/apache/arrow/pull/6044
> >
> >
>


[jira] [Created] (ARROW-7674) Add helpful message for captcha challenge in merge_arrow_pr.py

2020-01-24 Thread Brian Hulette (Jira)
Brian Hulette created ARROW-7674:


 Summary: Add helpful message for captcha challenge in 
merge_arrow_pr.py
 Key: ARROW-7674
 URL: https://issues.apache.org/jira/browse/ARROW-7674
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Brian Hulette
Assignee: Brian Hulette


After an incorrect password, jira starts requiring a captcha challenge. When
this happens with merge_arrow_pr.py it's difficult to distinguish from any other
failed login attempt. We should print a helpful message when this happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: pyarrow and macOS 10.15

2019-10-11 Thread Brian Hulette
Thanks Wes.

I'm not sure about static linking but it seems likely, I'll start a
discussion on https://issues.apache.org/jira/browse/BEAM-8368.

On Fri, Oct 11, 2019 at 10:17 AM Wes McKinney  wrote:

> Does Apache Beam statically-link Protocol Buffers?
>
> I opened https://issues.apache.org/jira/browse/ARROW-6860
>
> It would be great if the Beam community could work with us to resolve
> issues around shipping C++ Protocol Buffers. We don't want you to be
> stuck on pyarrow 0.13.0 and have your users be subjected to bugs and
> other issues.
>
> On Thu, Oct 10, 2019 at 3:11 PM Brian Hulette  wrote:
> >
> > In Beam we've had a few users report issues importing Beam Python after
> > upgrading to macOS 10.15 Catalina, and it seems like our pyarrow import
> is
> > the root cause [1]. Given that I don't see any reports of this on the
> arrow
> > side I suspect that this is an issue just with pyarrow 0.14 (in Beam
> we've
> > restricted to <0.15 [2]), can anyone confirm that the pypi release of
> > pyarrow 0.15 is working on macOS 10.15?
> >
> > Thanks,
> > Brian
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-8368
> > [2] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L122
>


pyarrow and macOS 10.15

2019-10-10 Thread Brian Hulette
In Beam we've had a few users report issues importing Beam Python after
upgrading to macOS 10.15 Catalina, and it seems like our pyarrow import is
the root cause [1]. Given that I don't see any reports of this on the arrow
side I suspect that this is an issue just with pyarrow 0.14 (in Beam we've
restricted to <0.15 [2]), can anyone confirm that the pypi release of
pyarrow 0.15 is working on macOS 10.15?

Thanks,
Brian

[1] https://issues.apache.org/jira/browse/BEAM-8368
[2] https://github.com/apache/beam/blob/master/sdks/python/setup.py#L122


Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield

2019-08-09 Thread Brian Hulette
Congratulations Micah! Well deserved :)

On Fri, Aug 9, 2019 at 9:02 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Congrats!
>
> well deserved.
>
> On Fri, Aug 9, 2019 at 11:12 AM Wes McKinney  wrote:
> >
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Micah Kornfield to become a PMC member and we are pleased to announce
> > that Micah has accepted.
> >
> > Congratulations and welcome!
>


Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

2019-07-31 Thread Brian Hulette
I'm a little confused about the proposal now. If the unknown dimension
doesn't have to be the same within a record batch, how would you be able to
deduce it with the approach you described (dividing the logical length of
the values array by the length of the record batch)?

On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney  wrote:

> I agree this sounds like a good application for ExtensionType. At
> minimum, ExtensionType can be used to develop a working version of
> what you need to help guide further discussions.
>
> On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
>  wrote:
> >
> > Hello,
> >
> > if each record has a different size, then I suggest to just use a
> > Struct<List<Dim>> where Dim is a struct (or expand in the outer
> > struct). You can probably add your own logic with the recently
> > introduced ExtensionType [1].
> >
> > François
> > [1]
> https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
> >
> > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper 
> wrote:
> > >
> > > The intention is that each individual record could have a different
> size.
> > > This could be consistent within a given batch, but wouldn't need to be.
> > > For example, if I wanted to send a 3-channel image, but the image size
> may
> > > vary for each record, then I could use
> > > FixedSizeList<FixedSizeList<FixedSizeList<T>[3]>[-1]>[-1].
> > >
> > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette 
> wrote:
> > >
> > > > This isn't really relevant but I feel compelled to point it out - the
> > > > FixedSizeList type has actually been in the Arrow spec for a while,
> but it
> > > > was only implemented in JS and Java initially. It was implemented in
> C++
> > > > just a few months ago.
> > > >
> > >
> > > Thanks for the clarification -- I was going based on the blame history
> for
> > > Layout.rst, but I guess it just didn't get officially documented there
> > > until the c++ implementation was added.
> > >
> > > -Edward
> > >
> > >
> > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper
> 
> > > > wrote:
> > > >
> > > > > The FixedSizeList type, which was added to Arrow a few months ago,
> is an
> > > > > array where each slot contains a fixed-size sequence of values.
> It is
> > > > > specified as FixedSizeList<T>[N], where T is a child type and N is
> a
> > > > signed
> > > > > int32 that specifies the length of each list.
> > > > >
> > > > > This is useful for encoding fixed-size tensors.  E.g., if I have a
> > > > 100x8x10
> > > > > tensor, then I can encode it as
> > > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[8]>[100].
> > > > >
> > > > > But I'm also interested in encoding tensors where some dimension
> sizes
> > > > are
> > > > > not known in advance.  It seems to me that FixedSizeList could be
> > > > extended
> > > > > to support this fairly easily, by simply defining that N=-1 means
> "each
> > > > > array slot has the same length, but that length is not known in
> advance."
> > > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > > FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[-1]>[100].
> > > > >
> > > > > Since these N=-1 row-lengths are not encoded in the type, we need
> some
> > > > way
> > > > > to determine what they are.  Luckily, every Field in the schema
> has a
> > > > > corresponding FieldNode in the message; and those FieldNodes can
> be used
> > > > to
> > > > > deduce the row lengths.  In particular, the row length must be
> equal to
> > > > the
> > > > > length of the child node divided by the length of the
> FixedSizeList.
> > > > E.g.,
> > > > > if we have a FixedSizeList<byte>[-1] array with the values [[1,
> 2], [3,
> > > > 4],
> > > > > [5, 6]] then the message representation is:
> > > > >
> > > > > * Length: 3, Null count: 0
> > > > > * Null bitmap buffer: Not required
> > > > > * Values array (byte array):
> > > > > * Length: 6,  Null count: 0
> > > > > * Null bitmap buffer: Not required
> > > > > * Value buffer: [1, 2, 3, 4, 5, 6, ]
> > > > >
> > > > > So we can deduce that the row length is 6/3=2.
> > > > >
> > > > > It looks to me like it would be fairly easy to add support for
> this.
> > > > E.g.,
> > > > > in the FixedSizeListArray constructor in c++, if
> list_type()->list_size()
> > > > > is -1, then set list_size_ to values.length()/length.  There would
> be no
> > > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > > assigning a
> > > > > meaning to something that's currently meaningless (having
> > > > > FixedSizeList.listSize=-1).
> > > > >
> > > > > If there's support for adding this to Arrow, then I could put
> together a
> > > > > PR.
> > > > >
> > > > > Thanks,
> > > > > -Edward
> > > > >
> > > > > P.S. Apologies if this gets posted twice -- I sent it out a couple
> days
> > > > ago
> > > > > right before subscribing to the mailing list; but I don't see it
> on the
> > > > > archives, presumably because I wasn't subscribed yet when I sent
> it out.
> > > > >
> > > >
>


Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

2019-07-29 Thread Brian Hulette
I think it may be helpful to clarify what you mean by dimensions that are
not known in advance. I believe the intention here is that this unknown
dimension is consistent within a record batch, but it is allowed to vary
from batch to batch. Otherwise, I would say you could just delay creating
the schema until you do know the unknown dimension.

This isn't really relevant but I feel compelled to point it out - the
FixedSizeList type has actually been in the Arrow spec for a while, but it
was only implemented in JS and Java initially. It was implemented in C++
just a few months ago.
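
For reference, the row-length deduction in the quoted proposal below is simple
enough to sketch. A minimal TypeScript illustration (hypothetical names, not
the actual Arrow API):

```
function inferListSize(
  listSize: number,    // from the type; -1 means "consistent but unknown"
  length: number,      // the FixedSizeList FieldNode's length
  childLength: number  // the child FieldNode's length
): number {
  if (listSize !== -1) return listSize;
  if (length === 0 || childLength % length !== 0) {
    throw new Error('child length must be a whole multiple of the list length');
  }
  return childLength / length; // e.g. 6 / 3 = 2 in the quoted example
}
```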

On Mon, Jul 29, 2019 at 7:01 AM Edward Loper 
wrote:

> The FixedSizeList type, which was added to Arrow a few months ago, is an
> array where each slot contains a fixed-size sequence of values.  It is
> specified as FixedSizeList<T>[N], where T is a child type and N is a signed
> int32 that specifies the length of each list.
>
> This is useful for encoding fixed-size tensors.  E.g., if I have a 100x8x10
> tensor, then I can encode it as
> FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[8]>[100].
>
> But I'm also interested in encoding tensors where some dimension sizes are
> not known in advance.  It seems to me that FixedSizeList could be extended
> to support this fairly easily, by simply defining that N=-1 means "each
> array slot has the same length, but that length is not known in advance."
>  So e.g. we could encode a 100x?x10 tensor as
> FixedSizeList<FixedSizeList<FixedSizeList<T>[10]>[-1]>[100].
>
> Since these N=-1 row-lengths are not encoded in the type, we need some way
> to determine what they are.  Luckily, every Field in the schema has a
> corresponding FieldNode in the message; and those FieldNodes can be used to
> deduce the row lengths.  In particular, the row length must be equal to the
> length of the child node divided by the length of the FixedSizeList.  E.g.,
> if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3, 4],
> [5, 6]] then the message representation is:
>
> * Length: 3, Null count: 0
> * Null bitmap buffer: Not required
> * Values array (byte array):
>   * Length: 6, Null count: 0
>   * Null bitmap buffer: Not required
>   * Value buffer: [1, 2, 3, 4, 5, 6, ]
>
> So we can deduce that the row length is 6/3=2.
>
> It looks to me like it would be fairly easy to add support for this.  E.g.,
> in the FixedSizeListArray constructor in C++, if list_type()->list_size()
> is -1, then set list_size_ to values.length()/length.  There would be no
> changes to the schema.fbs/message.fbs files -- we would just be assigning a
> meaning to something that's currently meaningless (having
> FixedSizeList.listSize=-1).
>
> If there's support for adding this to Arrow, then I could put together a
> PR.
>
> Thanks,
> -Edward
>
> P.S. Apologies if this gets posted twice -- I sent it out a couple days ago
> right before subscribing to the mailing list; but I don't see it on the
> archives, presumably because I wasn't subscribed yet when I sent it out.
>


Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-22 Thread Brian Hulette
To me, the most important aspect of this proposal is the addition of sparse
encodings, and I'm curious if there are any more objections to that
specifically. So far I believe the only one is that it will make
computation libraries more complicated. This is absolutely true, but I
think it's worth that cost.

It's been suggested on this list and elsewhere [1] that sparse encodings
that can be operated on without fully decompressing should be added to the
Arrow format. The longer we continue to develop computation libraries
without considering those schemes, the harder it will be to add them.

[1]
https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
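
As a toy illustration of what operating without fully decompressing buys: a
sum over a run-length-encoded column touches one entry per run rather than one
per row. This is a TypeScript sketch with a hypothetical layout, not a
proposed format:

```
interface RleColumn {
  values: Float64Array;    // one entry per run
  runLengths: Int32Array;  // how many times values[i] repeats
}

// O(runs) instead of O(rows); the decoded column is never materialized.
function rleSum(col: RleColumn): number {
  let total = 0;
  for (let i = 0; i < col.values.length; i++) {
    total += col.values[i] * col.runLengths[i];
  }
  return total;
}
```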


On Sat, Jul 13, 2019 at 9:35 AM Wes McKinney  wrote:

> On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou 
> wrote:
> >
> > On Fri, 12 Jul 2019 20:37:15 -0700
> > Micah Kornfield  wrote:
> > >
> > > > If the latter, I wonder why Parquet cannot simply be used instead of
> > > > reinventing something similar but different.
> > >
> > > This is a reasonable point.  However there is a continuum here between
> > > file size and read and write times.  Parquet will likely always be the
> > > smallest with the largest times to convert to and from Arrow.  An
> > > uncompressed Feather/Arrow file will likely always take the most space
> > > but will have much faster conversion times.
> >
> > I'm curious whether the Parquet conversion times are inherent to the
> > Parquet format or due to inefficiencies in the implementation.
> >
>
> Parquet is fundamentally more complex to decode. Consider several
> layers of logic that must happen for values to end up in the right
> place
>
> * Data pages are usually compressed, and a column consists of many
> data pages each having a Thrift header that must be deserialized
> * Values are usually dictionary-encoded, dictionary indices are
> encoded using hybrid bit-packed / RLE scheme
> * Null/not-null is encoded in definition levels
> * Only non-null values are stored, so when decoding to Arrow, values
> have to be "moved into place"
>
> The current C++ implementation could certainly be made faster. One
> consideration with Parquet is that the files are much smaller, so when
> you are reading them over the network the effective end-to-end time
> including IO and deserialization will frequently win.
>
> > Regards
> >
> > Antoine.
> >
> >
>
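
To make the "moved into place" step described above concrete, here is a
minimal sketch (TypeScript with hypothetical names; the real decoder is C++)
of scattering Parquet's packed non-null values into a dense Arrow-style values
buffer plus validity bitmap, driven by definition levels:

```
function scatterValues(
  defLevels: Uint8Array, // flat schema: 1 = value present, 0 = null
  packed: Int32Array     // only the non-null values, in order
): { values: Int32Array; validity: Uint8Array } {
  const n = defLevels.length;
  const values = new Int32Array(n);
  const validity = new Uint8Array(Math.ceil(n / 8));
  let src = 0;
  for (let i = 0; i < n; i++) {
    if (defLevels[i] === 1) {
      values[i] = packed[src++];        // move the value into place
      validity[i >> 3] |= 1 << (i & 7); // set the validity bit
    }
  }
  return { values, validity };
}
```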


[jira] [Created] (ARROW-5741) [JS] Make numeric vector from functions consistent with TypedArray.from

2019-06-26 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5741:


 Summary: [JS] Make numeric vector from functions consistent with 
TypedArray.from
 Key: ARROW-5741
 URL: https://issues.apache.org/jira/browse/ARROW-5741
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette


Described in 
https://lists.apache.org/thread.html/b648a781cba7f10d5a6072ff2e7dab6c03e2d1f12e359d9261891486@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5740) [JS] Add ability to run tests in headless browsers

2019-06-26 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5740:


 Summary: [JS] Add ability to run tests in headless browsers
 Key: ARROW-5740
 URL: https://issues.apache.org/jira/browse/ARROW-5740
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette


Now that we have a compatibility check that modifies behavior based on the 
features in a supported browser, we should really be running our tests in 
various browsers to exercise the various cases.

For example, right now we don't actually run tests on the non-BigNum code.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5714) [JS] Inconsistent behavior in Int64Builder with/without BigNum

2019-06-24 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5714:


 Summary: [JS] Inconsistent behavior in Int64Builder with/without 
BigNum
 Key: ARROW-5714
 URL: https://issues.apache.org/jira/browse/ARROW-5714
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Brian Hulette
Assignee: Brian Hulette
 Fix For: 0.14.0


When the Int64Builder is used in a context without BigNum, appending two 
numbers combines them into a single Int64:

{{
> v = Arrow.Builder.new({type: new Arrow.Int64()}).append(1).append(2).finish().toVector()
> v.get(0)
Int32Array [ 1, 2 ]
}}

Whereas the same process with BigNum creates two new Int64s.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5689) [JS] Remove hard-coded Field.nullable

2019-06-21 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5689:


 Summary: [JS] Remove hard-coded Field.nullable
 Key: ARROW-5689
 URL: https://issues.apache.org/jira/browse/ARROW-5689
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette


Context: https://github.com/apache/arrow/pull/4502#discussion_r296390833

This isn't a huge issue since we can just elide validity buffers when null 
count is zero, but sometimes it's desirable to be able to assert a Field is 
_never_ null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5688) [JS] Add test for EOS in File Format

2019-06-21 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5688:


 Summary: [JS] Add test for EOS in File Format
 Key: ARROW-5688
 URL: https://issues.apache.org/jira/browse/ARROW-5688
 Project: Apache Arrow
  Issue Type: Task
Reporter: Brian Hulette


Either in a unit test, or in the integration tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5491) Remove unnecessary semicolons following MACRO definitions

2019-06-03 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5491:


 Summary: Remove unnecessary semicolons following MACRO definitions
 Key: ARROW-5491
 URL: https://issues.apache.org/jira/browse/ARROW-5491
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Affects Versions: 0.13.0
Reporter: Brian Hulette
Assignee: Brian Hulette
 Fix For: 0.14.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[JS] Proposal for numeric vector `from` functions

2019-05-31 Thread Brian Hulette
I think the current behavior of `from` functions on IntVector and
FloatVector can be quite confusing for new arrow users. The current
behavior can be summarized as:
- if the argument is any type of TypedArray (including one of a mismatched
type), create a new vector backed by that array's buffer.
- otherwise, treat it as an iterable of numbers, and convert them as needed
- ... unless we're making an Int64Vector, then treat each input as a 32-bit
number and pack pairs together

This can give users very unexpected results. For example, you might expect
arrow.Int32Vector.from(Float32Array.from([1.0,2.0,3.0])) to yield a vector
with the values [1,2,3] - but it doesn't, it gives you the integers that
result from re-interpreting that buffer of floating point numbers as
integers.

I put together a notebook with some more examples of this confusing
behavior, compared to TypedArray.from:
https://observablehq.com/d/6aa80e43b5a97361

I'd like to propose that we re-write these from functions with the
following behavior:
- iff the argument is an ArrayBuffer or a TypedArray of the same numeric
type, create a new vector backed by that array's buffer.
- otherwise, treat it as an iterable of numbers and convert to the
appropriate type.
- no exceptions for Int64

If users really want to preserve the current behavior and use a
TypedArray's memory directly without converting, even when the types are
mismatched, they can still just access the underlying ArrayBuffer and pass
that in. So arrow.Int32Vector.from(Float32Array.from([1.0,2.0,3.0])) would
yield a vector with [1,2,3], but you could still use
arrow.Int32Vector.from(Float32Array.from([1.0,2.0,3.0]).buffer) to
replicate the current behavior.
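
To make the proposed rules concrete, a rough sketch of the dispatch logic
(`int32VectorFrom` is a stand-in returning the backing TypedArray, not the
real implementation):

```
function int32VectorFrom(input: ArrayBuffer | Iterable<number>): Int32Array {
  if (input instanceof ArrayBuffer) {
    return new Int32Array(input);  // raw bytes: reinterpret, zero-copy
  }
  if (input instanceof Int32Array) {
    return input;                  // exact type match: reuse the buffer
  }
  return Int32Array.from(input);   // anything else: convert element-wise
}

// int32VectorFrom(Float32Array.of(1.0, 2.0, 3.0))        -> [1, 2, 3]
// int32VectorFrom(Float32Array.of(1.0, 2.0, 3.0).buffer) -> reinterpreted bits
```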

Removing the special case for Int64 does make it a little easier to shoot
yourself in the foot by exceeding JS numbers' 53-bit precision, so maybe we
should mitigate that somehow, but I don't think combining pairs of numbers
is the right way to do that. Maybe a warning?

What do you all think? If there's consensus on this I'd like to make the
change prior to 0.14 to minimize the number of releases with the current
behavior.

Brian


Confluence edit access

2019-05-17 Thread Brian Hulette
Can I get edit access on confluence? I wanted to answer some of the
questions about JS here:
https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone

My username is bhulette

Thanks!
Brian


[jira] [Created] (ARROW-5313) [Format] Comments on Field table are a bit confusing

2019-05-13 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5313:


 Summary: [Format] Comments on Field table are a bit confusing
 Key: ARROW-5313
 URL: https://issues.apache.org/jira/browse/ARROW-5313
 Project: Apache Arrow
  Issue Type: Task
  Components: Format
Affects Versions: 0.13.0
Reporter: Brian Hulette
Assignee: Brian Hulette


Currently Schema.fbs has two different explanations of {{Field.children}}

One says "children is only for nested Arrow arrays" and the other says 
"children apply only to nested data types like Struct, List and Union". I think 
both are technically correct, but the latter is much more explicit; we should 
remove the former.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC1

2019-03-21 Thread Brian Hulette
+1 (non-binding)

Ran `dev/release/js-verify-release-candidate.sh 0.4.1 1` with Node v11.12.0


On Thu, Mar 21, 2019 at 1:54 PM Krisztián Szűcs 
wrote:

> +1 (binding)
>
> Ran `dev/release/js-verify-release-candidate.sh 0.4.1 1`
> with Node v11.12.0 on OSX 10.14.3 and it looks good.
>
> On Thu, Mar 21, 2019 at 8:45 PM Krisztián Szűcs  >
> wrote:
>
> > Hello all,
> >
> > I would like to propose the following release candidate (rc1) of Apache
> > Arrow JavaScript version 0.4.1. This is the second release candidate,
> > including the fix for node version requirement [3].
> >
> > The source release rc1 is hosted at [1].
> >
> > This release candidate is based on commit
> > e9cf83c48b9740d42b5d18158e61c0962fda59c1
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The easiest way is to use the JavaScript-specific release
> > verification script dev/release/js-verify-release-candidate.sh.
> >
> > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 because...
> >
> >
> > How to validate a release signature:
> > https://httpd.apache.org/dev/verification.html
> >
> > [1]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc1/
> > [2]:
> >
> https://github.com/apache/arrow/tree/e9cf83c48b9740d42b5d18158e61c0962fda59c1
> > [3]: https://github.com/apache/arrow/pull/4006/
> >
>


[jira] [Created] (ARROW-4991) [CI] Bump travis node version to 11.12

2019-03-21 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4991:


 Summary: [CI] Bump travis node version to 11.12
 Key: ARROW-4991
 URL: https://issues.apache.org/jira/browse/ARROW-4991
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Brian Hulette
Assignee: Brian Hulette
 Fix For: JS-0.4.1






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0

2019-03-21 Thread Brian Hulette
I just merged https://github.com/apache/arrow/pull/4006 that bumps the node
requirement to 11.12 to avoid this issue. Krisztian, can you cut an RC1
with that change included?

Brian

On Thu, Mar 21, 2019 at 10:06 AM Brian Hulette  wrote:

> It looks like this was an issue with node v11.11 that was resolved in
> v11.12 [1,2]. Can you try upgrading and running again?
>
> [1]
> https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V11.md#2019-03-15-version-11120-current-bridgear
> [2] https://github.com/nodejs/node/pull/26488
>
> On Thu, Mar 21, 2019 at 8:00 AM Uwe L. Korn  wrote:
>
>> This sadly fails locally for me on OSX High Sierra:
>>
>> ```
>> + npm run test
>>
>> > apache-arrow@0.4.1 test
>> /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1
>> > NODE_NO_WARNINGS=1 gulp test
>>
>> [15:23:02] Using gulpfile
>> /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1/gulpfile.js
>> [15:23:02] Starting 'test'...
>> [15:23:02] Starting 'test:ts'...
>> [15:23:02] Starting 'test:src'...
>> [15:23:02] Starting 'test:apache-arrow'...
>>
>>   ● Test suite failed to run
>>
>> TypeError: Cannot assign to read only property
>> 'Symbol(Symbol.toStringTag)' of object '#<process>'
>>
>>   at exports.default
>> (node_modules/jest-util/build/create_process_object.js:15:34)
>> ```
>>
>> This is the same error as in the nightlies but the fix there doesn't help
>> for me locally.
>>
>> Uwe
>>
>> On Thu, Mar 21, 2019, at 2:41 AM, Brian Hulette wrote:
>> > +1 (non-binding)
>> >
>> > Ran js-verify-release-candidate.sh on Archlinux w/ node v11.12.0
>> >
>> > Thanks Krisztian!
>> > Brian
>> >
>> > On Wed, Mar 20, 2019 at 5:40 PM Paul Taylor  wrote:
>> >
>> > > +1 non-binding
>> > >
>> > > Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High
>> > > Sierra w/ node v11.6.0
>> > >
>> > >
>> > > On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou 
>> wrote:
>> > >
>> > > > +1 (binding)
>> > > >
>> > > > I ran the followings on Debian GNU/Linux sid:
>> > > >
>> > > >   * dev/release/js-verify-release-candidate.sh 0.4.1 0
>> > > >
>> > > > with:
>> > > >
>> > > >   * Node.js v11.12.0
>> > > >
>> > > > Thanks,
>> > > > --
>> > > > kou
>> > > >
>> > > > In > z...@mail.gmail.com>
>> > > >   "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019
>> > > > 00:09:54 +0100,
>> > > >   Krisztián Szűcs  wrote:
>> > > >
>> > > > > Hello all,
>> > > > >
>> > > > > I would like to propose the following release candidate (rc0) of
>> Apache
>> > > > > Arrow JavaScript version 0.4.1.
>> > > > >
>> > > > > The source release rc0 is hosted at [1].
>> > > > >
>> > > > > This release candidate is based on commit
>> > > > > f55542eeb59dde8ff4512c707b9eca1b43b62073
>> > > > >
>> > > > > Please download, verify checksums and signatures, run the unit
>> tests,
>> > > and
>> > > > > vote
>> > > > > on the release. The easiest way is to use the JavaScript-specific
>> > > release
>> > > > > verification script dev/release/js-verify-release-candidate.sh.
>> > > > >
>> > > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1
>> > > > > [ ] +0
>> > > > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1
>> because...
>> > > > >
>> > > > >
>> > > > > How to validate a release signature:
>> > > > > https://httpd.apache.org/dev/verification.html
>> > > > >
>> > > > > [1]:
>> > > >
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/
>> > > > > [2]:
>> > > > >
>> > > >
>> > >
>> https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073
>> > > >
>> > >
>> >
>>
>


[jira] [Created] (ARROW-4988) Bump required node version to 11.12

2019-03-21 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4988:


 Summary: Bump required node version to 11.12
 Key: ARROW-4988
 URL: https://issues.apache.org/jira/browse/ARROW-4988
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Brian Hulette
Assignee: Brian Hulette


The cause of ARROW-4948 and 
http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C5ce620e0-0063-4bee-8ad6-a41301ac08c4%40www.fastmail.com%3E

was actually a regression in node v11.11, resolved in v11.12 see 
https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V11.md#2019-03-15-version-11120-current-bridgear
 and https://github.com/nodejs/node/pull/26488

Bump requirement up to 11.12



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0

2019-03-21 Thread Brian Hulette
It looks like this was an issue with node v11.11 that was resolved in
v11.12 [1,2]. Can you try upgrading and running again?

[1]
https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V11.md#2019-03-15-version-11120-current-bridgear
[2] https://github.com/nodejs/node/pull/26488

On Thu, Mar 21, 2019 at 8:00 AM Uwe L. Korn  wrote:

> This sadly fails locally for me on OSX High Sierra:
>
> ```
> + npm run test
>
> > apache-arrow@0.4.1 test
> /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1
> > NODE_NO_WARNINGS=1 gulp test
>
> [15:23:02] Using gulpfile
> /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1/gulpfile.js
> [15:23:02] Starting 'test'...
> [15:23:02] Starting 'test:ts'...
> [15:23:02] Starting 'test:src'...
> [15:23:02] Starting 'test:apache-arrow'...
>
>   ● Test suite failed to run
>
> TypeError: Cannot assign to read only property
> 'Symbol(Symbol.toStringTag)' of object '#<process>'
>
>   at exports.default
> (node_modules/jest-util/build/create_process_object.js:15:34)
> ```
>
> This is the same error as in the nightlies but the fix there doesn't help
> for me locally.
>
> Uwe
>
> On Thu, Mar 21, 2019, at 2:41 AM, Brian Hulette wrote:
> > +1 (non-binding)
> >
> > Ran js-verify-release-candidate.sh on Archlinux w/ node v11.12.0
> >
> > Thanks Krisztian!
> > Brian
> >
> > On Wed, Mar 20, 2019 at 5:40 PM Paul Taylor  wrote:
> >
> > > +1 non-binding
> > >
> > > Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High
> > > Sierra w/ node v11.6.0
> > >
> > >
> > > On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou 
> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > I ran the followings on Debian GNU/Linux sid:
> > > >
> > > >   * dev/release/js-verify-release-candidate.sh 0.4.1 0
> > > >
> > > > with:
> > > >
> > > >   * Node.js v11.12.0
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > > > In  z...@mail.gmail.com>
> > > >   "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019
> > > > 00:09:54 +0100,
> > > >   Krisztián Szűcs  wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I would like to propose the following release candidate (rc0) of
> Apache
> > > > > Arrow JavaScript version 0.4.1.
> > > > >
> > > > > The source release rc0 is hosted at [1].
> > > > >
> > > > > This release candidate is based on commit
> > > > > f55542eeb59dde8ff4512c707b9eca1b43b62073
> > > > >
> > > > > Please download, verify checksums and signatures, run the unit
> tests,
> > > and
> > > > > vote
> > > > > on the release. The easiest way is to use the JavaScript-specific
> > > release
> > > > > verification script dev/release/js-verify-release-candidate.sh.
> > > > >
> > > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1
> because...
> > > > >
> > > > >
> > > > > How to validate a release signature:
> > > > > https://httpd.apache.org/dev/verification.html
> > > > >
> > > > > [1]:
> > > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/
> > > > > [2]:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0

2019-03-20 Thread Brian Hulette
+1 (non-binding)

Ran js-verify-release-candidate.sh on Archlinux w/ node v11.12.0

Thanks Krisztian!
Brian

On Wed, Mar 20, 2019 at 5:40 PM Paul Taylor  wrote:

> +1 non-binding
>
> Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High
> Sierra w/ node v11.6.0
>
>
> On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou  wrote:
>
> > +1 (binding)
> >
> > I ran the followings on Debian GNU/Linux sid:
> >
> >   * dev/release/js-verify-release-candidate.sh 0.4.1 0
> >
> > with:
> >
> >   * Node.js v11.12.0
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019
> > 00:09:54 +0100,
> >   Krisztián Szűcs  wrote:
> >
> > > Hello all,
> > >
> > > I would like to propose the following release candidate (rc0) of Apache
> > > Arrow JavaScript version 0.4.1.
> > >
> > > The source release rc0 is hosted at [1].
> > >
> > > This release candidate is based on commit
> > > f55542eeb59dde8ff4512c707b9eca1b43b62073
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> and
> > > vote
> > > on the release. The easiest way is to use the JavaScript-specific
> release
> > > verification script dev/release/js-verify-release-candidate.sh.
> > >
> > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 because...
> > >
> > >
> > > How to validate a release signature:
> > > https://httpd.apache.org/dev/verification.html
> > >
> > > [1]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/
> > > [2]:
> > >
> >
> https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073
> >
>


Re: [DISCUSS] Cutting a JavaScript 0.4.1 bugfix release

2019-03-20 Thread Brian Hulette
Thanks Wes.

Krisztian - Uwe cut 0.4.0 for us and said he was pretty comfortable with
the process, so you may be able to defer to him if you don't have time.

On Wed, Mar 20, 2019 at 3:26 PM Wes McKinney  wrote:

> It seems based on [1] that we are overdue in cutting a bugfix JS
> release because of a problem with the 0.4.0 release on NPM
>
> If there are no objections to this I suggest we call a vote right away
> and close the vote as soon as we have requisite PMC votes. Krisztian,
> would you be able to help with this since you are set up as an RM from
> the 0.12 release? I am traveling until next Tuesday and do not have my
> code signing key on the laptop I have with me otherwise I would do it.
>
> The release can be cut based off of current master version of js/
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/3630
>


Re: Timeline for 0.13 Arrow release

2019-03-20 Thread Brian Hulette
I think that makes sense. I would really like to make JS part of the
mainstream releases, but we already have JS-0.4.1 ready to go [1] with
primarily bugfixes for JS-0.4.0. I think we should just cut that and
integrate JS in 0.14.

[1] https://issues.apache.org/jira/projects/ARROW/versions/12344961

On Wed, Mar 20, 2019 at 8:20 AM Wes McKinney  wrote:

> In light of the discussion on
> https://github.com/apache/arrow/pull/3630 I think we should wait until
> we have a "not broken" JavaScript-only release on NPM and have
> confidence that we can respond to the community's needs
>
> On Tue, Mar 19, 2019 at 11:24 PM Paul Taylor  wrote:
> >
> > I agree, the JS has matured a lot in the last few months. I think it's
> > ready to join the regular Arrow releases. Let me know if I can help
> > integrate the publish scripts :-)
> >
> > The two main things in progress are docs + Vector Builders, neither of
> > which should block this release.
> >
> > We're going to try to get the docs/recipes ready for a PR this weekend.
> > If that lands shortly after 0.13.0 goes out, would it be possible to
> > update the website independently, or would that need to wait until 0.14?
> >
> > Paul
> >
> > On 3/19/19 10:08 AM, Wes McKinney wrote:
> > > I'm in favor of including JS in the 0.13.0 release.
> > >
> > > I'm going to try to fix a couple of the Python Parquet bugs until the
> > > RC is ready to be cut, but none of them need block the release.
> > >
> > > Seems like we need someone else to volunteer to be the RM for 0.13 if
> > > Uwe is unavailable next week. Antoine -- are you possibly up for it
> > > (the initial setup will be a bit painful)? I don't have access to a
> > > machine with my code signing key on it until next week so I cannot do
> > > it
> > >
> > > - Wes
> > >
> > > On Tue, Mar 19, 2019 at 9:46 AM Kouhei Sutou 
> wrote:
> > >> Hi,
> > >>
> > >> There are no blockers on GLib, Ruby and Linux packages.
> > >>
> > >> Can we include JavaScript into 0.13.0?
> > >> If we include JavaScript into 0.13.0, we can remove
> > >> codes to release JavaScript separately. For example, we can
> > >> remove dev/release/js-*. We can enable version update code
> > >> in dev/release/00-prepare.sh:
> > >>
> https://github.com/apache/arrow/blob/master/dev/release/00-prepare.sh#L67-L74
> > >>
> > >> We can merge "JavaScript Releases" document into our release
> > >> document:
> > >>
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-JavaScriptReleases
> > >>
> > >>
> > >> Thanks,
> > >> --
> > >> kou
> > >>
> > >> In <
> cajpuwmbgjzbwrwybwse6bd9lnn_7xozn_aq2job9_mpvmhc...@mail.gmail.com>
> > >>"Re: Timeline for 0.13 Arrow release" on Mon, 18 Mar 2019 20:51:12
> -0500,
> > >>Wes McKinney  wrote:
> > >>
> > >>> hi folks,
> > >>>
> > >>> I think we're basically at the 0.13 end game here. There's some more
> > >>> patches can get in, but do we all think we can cut an RC by the end
> of
> > >>> the week? What are the blocking issues?
> > >>>
> > >>> Thanks
> > >>> Wes
> > >>>
> > >>> On Sat, Mar 16, 2019 at 9:57 PM Kouhei Sutou 
> wrote:
> >  Hi,
> > 
> > > Submitted the packaging builds:
> > >
> https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-452
> >  I've fixed .deb/.rpm packages:
> https://github.com/apache/arrow/pull/3934
> >  It has been merged.
> >  So .deb/.rpm packages are ready for release.
> > 
> >  Thanks,
> >  --
> >  kou
> > 
> >  In <
> cahm19a5somzxgcphc6ee-mr2usvvhwb252udgjrvocq-cb2...@mail.gmail.com>
> > "Re: Timeline for 0.13 Arrow release" on Thu, 14 Mar 2019
> 16:24:43 +0100,
> > Krisztián Szűcs  wrote:
> > 
> > > Submitted the packaging builds:
> > >
> https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93&query=build-452
> > >
> > > On Thu, Mar 14, 2019 at 4:19 PM Wes McKinney 
> wrote:
> > >
> > >> The CMake refactor is merged! Kudos to Uwe for 3+ weeks of hard
> labor on
> > >> this.
> > >>
> > >> We should run all the packaging tasks and get a full accounting of
> > >> what is broken so we aren't surprised during the release process
> > >>
> > >> On Wed, Mar 13, 2019 at 9:39 AM Krisztián Szűcs
> > >>  wrote:
> > >>> The proof of the pudding is in the eating. You convinced me.
> > >>>
> > >>> On Wed, Mar 13, 2019 at 3:31 PM Wes McKinney <
> wesmck...@gmail.com>
> > >> wrote:
> >  Krisztian -- are you all right with proceeding with merging the
> CMake
> >  refactor? I'm pretty committed to helping fix the problems that
> come
> >  up. Since most consumers of the project don't test until
> _after_ a
> >  release, we won't find out about some problems until we merge
> it and
> >  release it. Thus, IMHO it doesn't make sense to wait another
> 8-10
> >  weeks since we'd be delaying feedback for that long. There are
> also a
> >  number of 

Re: Flaky Travis CI builds on master

2019-02-27 Thread Brian Hulette
Another instance of #1 for the JS builds:
https://travis-ci.org/apache/arrow/jobs/498967250#L992

I filed https://issues.apache.org/jira/browse/ARROW-4695 about it before
seeing this thread. As noted there I was able to replicate the timeout on
my laptop at least once. I didn't think to monitor memory usage to see if
that was the cause.

On Wed, Feb 27, 2019 at 6:52 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> I think we're witnessing multiple issues.
>
> 1. Travis seems to be slow (is it an OOM issue?)
>   - https://travis-ci.org/apache/arrow/jobs/499122041#L1019
>   - https://travis-ci.org/apache/arrow/jobs/498906118#L3694
>   - https://travis-ci.org/apache/arrow/jobs/499146261#L2316
> 2. https://issues.apache.org/jira/browse/ARROW-4694 detect-changes.py is
> confused
> 3. https://issues.apache.org/jira/browse/ARROW-4684 is failing one python
> test consistently
>
> #2 doesn't help with #1, it could be related to PR based out of "old"
> commits and the velocity of our project. I've suggested that we disable the
> failing test in #3 until resolved since it affects all C++ PRs.
>
> On Tue, Feb 26, 2019 at 5:01 PM Wes McKinney  wrote:
>
> > Here's a build that just ran
> >
> >
> >
> https://travis-ci.org/apache/arrow/builds/498906102?utm_source=github_status&utm_medium=notification
> >
> > 2 failed builds
> >
> > * ARROW-4684
> > * Seemingly a GLib Plasma OOM
> > https://travis-ci.org/apache/arrow/jobs/498906118#L3689
> >
> > 24 hours ago:
> >
> https://travis-ci.org/apache/arrow/builds/498501983?utm_source=github_status&utm_medium=notification
> >
> > * The same GLib Plasma OOM
> > * Rust try_from bug that was just fixed
> >
> > It looks like that GLib test has been failing more than it's been
> > succeeding (also failed in the last build on Feb 22).
> >
> > I think it might be worth setting up some more "annoying"
> > notifications when failing builds persist for a long time.
> >
> > On Tue, Feb 26, 2019 at 3:37 PM Michael Sarahan 
> > wrote:
> > >
> > > Yes, please let us know.  We definitely see 500's from anaconda.org,
> > though
> > > I'd expect less of them from CDN-enabled channels.
> > >
> > > On Tue, Feb 26, 2019 at 3:18 PM Uwe L. Korn  wrote:
> > >
> > > > Hello Wes,
> > > >
> > > > if there are 500er errors it might be useful to report them somehow
> to
> > > > Anaconda. They recently migrated conda-forge to a CDN enabled account
> > and
> > > > this could be one of the results of that. Probably they need to still
> > iron
> > > > out some things.
> > > >
> > > > Uwe
> > > >
> > > > On Tue, Feb 26, 2019, at 8:40 PM, Wes McKinney wrote:
> > > > > hi folks,
> > > > >
> > > > > We haven't had a green build on master for about 5 days now (the
> last
> > > > > one was February 21). Has anyone else been paying attention to
> this?
> > > > > It seems we should start cataloging which tests and build
> > environments
> > > > > are the most flaky and see if there's anything we can do to reduce
> > the
> > > > > flakiness. Since we are dependent on anaconda.org for build
> > toolchain
> > > > > packages, it's hard to control for the 500 timeouts that occur
> there,
> > > > > but I'm seeing other kinds of routine flakiness.
> > > > >
> > > > > - Wes
> > > > >
> > > >
> >
>


[jira] [Created] (ARROW-4695) [JS] Tests timing out on Travis

2019-02-27 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4695:


 Summary: [JS] Tests timing out on Travis
 Key: ARROW-4695
 URL: https://issues.apache.org/jira/browse/ARROW-4695
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Brian Hulette


Example build: https://travis-ci.org/apache/arrow/jobs/498967250

JS tests sometimes fail with the following message:

{noformat}
> apache-arrow@ test /home/travis/build/apache/arrow/js
> NODE_NO_WARNINGS=1 gulp test
[22:14:01] Using gulpfile ~/build/apache/arrow/js/gulpfile.js
[22:14:01] Starting 'test'...
[22:14:01] Starting 'test:ts'...
[22:14:49] Finished 'test:ts' after 47 s
[22:14:49] Starting 'test:src'...
[22:15:27] Finished 'test:src' after 38 s
[22:15:27] Starting 'test:apache-arrow'...
No output has been received in the last 10m0s, this potentially indicates a 
stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: 
https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated
{noformat}

I thought maybe we were just running up against some time limit, but that 
particular build was terminated at 22:25:27, exactly ten minutes after the last 
output, at 22:15:27. So it does seem like the build is somehow stalling.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4686) Only accept 'y' or 'n' in merge_arrow_pr.py prompts

2019-02-26 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4686:


 Summary: Only accept 'y' or 'n' in merge_arrow_pr.py prompts
 Key: ARROW-4686
 URL: https://issues.apache.org/jira/browse/ARROW-4686
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
    Reporter: Brian Hulette
    Assignee: Brian Hulette


The current prompt syntax ("y/n" with neither capitalized) implies there's no 
default, which I think is the right behavior, but it's not implemented that 
way. The script should retry until either y or n is received.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow on WebAssembly

2019-02-19 Thread Brian Hulette
Hi Franco,
I'm not aware of anyone trying this in Rust, but Tim Paine at JPMC recently
contributed a patch [1] to make it possible to compile the C++
implementation with emscripten, so that he could use it in Perspective [2].
Could you use the C++ lib instead?

It would be great if either implementation could target WebAssembly though
- do any Rust contributors know more about the libc/wasm issue? Maybe the
rustwasm community [3] could be of assistance?

Brian

[1] https://github.com/apache/arrow/pull/3350
[2] https://github.com/jpmorganchase/perspective
[3] https://github.com/rustwasm/team

On Tue, Feb 19, 2019 at 11:06 AM Franco Nicolas Bellomo 
wrote:

> Hi!
>
> Actually, Apache Arrow has a really nice implementation in Rust. I
> tried to compile this to WebAssembly but I have a problem with libc. I
> understand that this is a general problem with libc and wasm.
> In the Arrow roadmap, do you plan to support wasm?
>
> Thanks!!
>


[jira] [Created] (ARROW-4551) [JS] Investigate using Symbols to access Row columns by index

2019-02-12 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4551:


 Summary: [JS] Investigate using Symbols to access Row columns by 
index
 Key: ARROW-4551
 URL: https://issues.apache.org/jira/browse/ARROW-4551
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette


Can we use row[Symbol.for(0)] instead of row[0] in order to avoid collisions? 
What would the performance impact be?
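
A sketch of the idea with hypothetical names (note that Symbol.for(0) coerces 
its argument to the string '0'):

{code:typescript}
// A column literally named "0" collides with row[0] under the current scheme;
// a registered symbol key can never collide with a string property name.
const row: Record<string | symbol, unknown> = { '0': 'a column named "0"' };
row[Symbol.for('0')] = 42; // value at column index 0

console.log(row['0']);             // 'a column named "0"' -- lookup by name
console.log(row[Symbol.for('0')]); // 42                   -- lookup by index
{code}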



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4524) [JS] only invoke `Object.defineProperty` once per table

2019-02-09 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4524:


 Summary: [JS] only invoke `Object.defineProperty` once per table
 Key: ARROW-4524
 URL: https://issues.apache.org/jira/browse/ARROW-4524
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette
 Fix For: 0.4.1


See 
https://github.com/vega/vega-loader-arrow/commit/19c88e130aaeeae9d0166360db467121e5724352#r32253784
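
The gist, sketched with hypothetical names (the linked discussion has the real 
context): move the Object.defineProperty calls onto a shared per-table 
prototype so they run once per column rather than once per row.

{code:typescript}
// Build one prototype per table; each row object only carries its index.
function makeRowProto(columnNames: string[], columns: unknown[][]) {
  const proto: Record<string, unknown> = {};
  columnNames.forEach((name, i) => {
    Object.defineProperty(proto, name, {
      get(this: { rowIndex: number }) { return columns[i][this.rowIndex]; },
    });
  });
  return proto;
}

const columns = [[1, 2], ['a', 'b']]; // column-major data
const proto = makeRowProto(['id', 'tag'], columns);
const row = Object.assign(Object.create(proto), { rowIndex: 1 });
console.log(row.id, row.tag); // 2 'b'
{code}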



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4523) [JS] Add row proxy generation benchmark

2019-02-09 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4523:


 Summary: [JS] Add row proxy generation benchmark
 Key: ARROW-4523
 URL: https://issues.apache.org/jira/browse/ARROW-4523
 Project: Apache Arrow
  Issue Type: Test
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4519) Publish JS API Docs for v0.4.0

2019-02-08 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-4519:


 Summary: Publish JS API Docs for v0.4.0
 Key: ARROW-4519
 URL: https://issues.apache.org/jira/browse/ARROW-4519
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow JS 0.4.0 - RC1

2019-01-31 Thread Brian Hulette
+1

verified on Archlinux with Node v11.9.0

Thanks a lot for putting the RC together Uwe!

On Thu, Jan 31, 2019 at 8:08 AM Uwe L. Korn  wrote:

> +1 (binding),
>
> verified on Ubuntu 16.04 with
> `./dev/release/js-verify-release-candidate.sh 0.4.0 1` and Node v11.9.0 via
> nvm.
>
> Uwe
>
> On Thu, Jan 31, 2019, at 5:07 PM, Uwe L. Korn wrote:
> > Hello all,
> >
> > I would like to propose the following release candidate (rc1) of Apache
> > Arrow JavaScript version 0.4.0.
> >
> > The source release rc1 is hosted at [1].
> >
> > This release candidate is based on commit
> > 6009eaa49ae29826764eb6e626bf0d12b83f3481
> >
> > Please download, verify checksums and signatures, run the unit tests,
> and vote
> > on the release. The easiest way is to use the JavaScript-specific release
> > verification script dev/release/js-verify-release-candidate.sh.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow JavaScript 0.4.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.0 because...
> >
> >
> > How to validate a release signature:
> > https://httpd.apache.org/dev/verification.html
> >
> > [1]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.0-rc1/
> > [2]:
> >
> https://github.com/apache/arrow/tree/6009eaa49ae29826764eb6e626bf0d12b83f3481
>


Re: Benchmarking dashboard proposal

2019-01-18 Thread Brian Hulette
We also have some JS benchmarks [1]. Currently they're only really run on
an ad-hoc basis to manually test major changes but it would be great to
include them in this.

[1] https://github.com/apache/arrow/tree/master/js/perf

On Fri, Jan 18, 2019 at 12:34 AM Uwe L. Korn  wrote:

> Hello,
>
> note that we have (had?) the Python benchmarks continuously running and
> reported at https://pandas.pydata.org/speed/arrow/. Seems like this
> stopped in July 2018.
>
> UWe
>
> On Fri, Jan 18, 2019, at 9:23 AM, Antoine Pitrou wrote:
> >
> > Hi Areg,
> >
> > That sounds like a good idea to me.  Note our benchmarks are currently
> > scattered accross the various implementations.  The two that I know of:
> >
> > - the C++ benchmarks are standalone executables created using the Google
> > Benchmark library, aptly named "*-benchmark" (or "*-benchmark.exe" on
> > Windows)
> > - the Python benchmarks use the ASV utility:
> >
> https://github.com/apache/arrow/blob/master/docs/source/python/benchmarks.rst
> >
> > There may be more in the other implementations.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 18/01/2019 at 07:13, Melik-Adamyan, Areg wrote:
> > > Hello,
> > >
> > > I want to restart/attach to the discussions for creating an Arrow
> > > benchmarking dashboard. I want to propose a per-commit performance
> > > benchmark run to track changes.
> > > The proposal includes building infrastructure for per-commit tracking,
> > > comprising the following parts:
> > > - Hosted JetBrains for OSS https://teamcity.jetbrains.com/ as a build
> > >   system
> > > - Agents running in the cloud, both VM/container (DigitalOcean, or
> > >   others) and bare-metal (Packet.net/AWS), and on-premise (Nvidia
> > >   boxes?)
> > > - JFrog artifactory storage and management for OSS projects
> > >   https://jfrog.com/open-source/#artifactory2
> > > - Codespeed as a frontend https://github.com/tobami/codespeed
> > >
> > > I am volunteering to build such a system (if needed, more Intel folks
> > > will be involved) so we can start tracking performance on various
> > > platforms and understand how changes affect it.
> > >
> > > Please, let me know your thoughts!
> > >
> > > Thanks,
> > > -Areg.
> > >
> > >
> > >
>


Re: Arrow JS 0.4.0 Release

2018-12-31 Thread Brian Hulette
> >> <https://github.com/graphistry/arrow/commits/master> with the
> >> latest version of the library that we can build against, which I update
> >> when I fix any bugs or add features.
> >>
> > It is common for software vendors to have "downstream" releases, so
> > this is reasonable, so long as this work is not promoted as Apache
> > releases.
> >
> >> The JS project is young, and sometimes has to move at a rapid pace. I've
> >> felt the turnaround time involved in the vote/prepare/verify/publish
> >> release process is slower than would be helpful to me. I'm used to
> >> publishing patch release to npm as soon as possible, possibly multiple
> >> times a day.
> > Well, surely the recent security problems with NPM demonstrate that
> > there is value in giving the community opportunity to vet a package
> > before it is published for the world to use, and that GPG-signing
> > packages is an important security measure to ensure that production
> > code is coming from a network of trust. It is different if you are
> > publishing packages for your own personal or corporate use.
> >
> >> None of the PMCs contribute to or use the JS version (if that's wrong,
> >> hit me up!) so there's been no release pressure from there. None of the
> >> JS contributors are PMCs so even if we want to do releases, we have to
> >> wait for a PMC. My take is that everyone on the project (especially
> >> PMCs) are probably ungodly busy people, and since not releasing to npm
> >> hasn't been blocking me, I opt not to bother folks.
> > I am happy to help release the JS package as often as you like, up to
> > multiple times per month. I stated this early on in the process, but
> > there has not seemed to be much desire to release. Brian's recent
> > request to release caught me at a bad time at the end of the year, but
> > there are other active PMCs who should be able to help. If you do
> > decide you want to release in the next week or two, please let me know
> > and I will make the time to help.
> >
> > The lack of PMCs with an interest in JavaScript is a bit of
> > self-perpetuating issue. One of the responsibilities of PMC members
> > (and what will enable a committer to become a PMC) is to promote the
> > growth and development of a healthy community. This includes making
> > sure that the project releases. The JS developer community hasn't
> > grown much, though. My approach to such a problem is to act as a
> > "community of one" until it changes -- drive a project forward and
> > ensure a steady cadence of releases.
> >
> > - Wes
> >
> >>
> >> On 12/13/18 11:52 AM, Wes McKinney wrote:
> >>> +1 for synchronizing to the main releases when possible. In the 0.12
> >>> thread we have discussed moving to time-based releases (e.g. every 2
> >>> months). Time-based releases are helpful to create urgency around
> >>> getting work completed, and making sure that the project is always
> >>> ready to release.
> >>> On Thu, Dec 13, 2018 at 10:39 AM Brian Hulette 
> wrote:
> >>>> Sounds great Paul! Really excited that this refactor is wrapping up.
> My
> >>>> only concern with including this in 0.4.0 is that I'm not going to
> have the
> >>>> time to thoroughly review it for a few weeks, so gating on that would
> >>>> really delay it. But I can just manually test with some use-cases I
> care
> >>>> about in lieu of a thorough review in the interest of time.
> >>>>
> >>>> I think in the future (after 0.12?) it may behoove us to tie back in
> to the
> >>>> main Arrow release cycle. The idea with the separate JS release was to
> >>>> allow us to release faster, but in practice it has done the opposite.
> Since
> >>>> the fall of 2017 we've cut two major JS releases (0.2, 0.3) while
> there
> >>>> were four major main releases (0.8 - 0.11). Not to mention the
> disjoint
> >>>> version numbers can be confusing to users - perhaps not as much of a
> >>>> concern now that the format is pretty stable, but it can still be a
> >>>> friction point. And finally selfishly - if we had been on the main
> release
> >>>> cycle, the contributions I made in the summer would have been
> released in
> >>>> either 0.10 or 0.11 by now.
> >>>>
> >>>> Brian
> >>>>
> >>>> On Thu, De

Re: Arrow JS 0.4.0 Release

2018-12-13 Thread Brian Hulette
Sounds great Paul! Really excited that this refactor is wrapping up. My
only concern with including this in 0.4.0 is that I'm not going to have the
time to thoroughly review it for a few weeks, so gating on that would
really delay it. But I can just manually test with some use-cases I care
about in lieu of a thorough review in the interest of time.

I think in the future (after 0.12?) it may behoove us to tie back in to the
main Arrow release cycle. The idea with the separate JS release was to
allow us to release faster, but in practice it has done the opposite. Since
the fall of 2017 we've cut two major JS releases (0.2, 0.3) while there
were four major main releases (0.8 - 0.11). Not to mention the disjoint
version numbers can be confusing to users - perhaps not as much of a
concern now that the format is pretty stable, but it can still be a
friction point. And finally selfishly - if we had been on the main release
cycle, the contributions I made in the summer would have been released in
either 0.10 or 0.11 by now.

Brian

On Thu, Dec 13, 2018 at 3:29 AM Paul Taylor  wrote:

> The ongoing JS refactor/upgrade branch
> <https://github.com/trxcllnt/arrow/tree/js-data-refactor/js> is just
> about done. It's passing all the integration tests, as well as a hundred
> or so new unit tests. I have to update existing tests where the APIs
> changed, battle with closure-compiler a bit, then it'll be ready to
> merge in and ship out. I think I'll be able to wrap it up in the next
> couple hours.
>
> I started this branch to clean up the Vector Data classes to make it
> easier to add higher-level Table and Vector operators, but as the Data
> classes are fairly embedded in the core, it lead to a larger refactor of
> the DataTypes, Vectors, Visitors, and IPC readers and writers.
>
> While I was updating the IPC readers and writers, I took the opportunity
> to back-port all the Node and WhatWG (browser) streams integration that
> we've built for Graphistry. Putting it in the Arrow JS library means we
> can better ensure zero-copy when possible, empowers library consumers to
> easily build streaming applications in both server and browser
> environments, and (selfishly) reduces complexity in my code base. It
> also advances a longer term personal goal to more closely adhere to the
> structure and organization of ArrowCPP when reasonable.
>
> A non-exhaustive list of updates includes:
>
> * Updates the Table, Schema, RecordBatch, Visitor, Vector, Data, and
> DataTypes to ensure the generic type signatures cascade recursively
> through the type declarations
> * New io primitives that abstract over the (mutually exclusive) file and
> stream APIs in both node and browser environments
> * New RecordBatchReaders and RecordBatchWriters that directly use the
> zero-copy node and browser io primitives
> * A consolidated reflective Visitor implementation that supports late
> binding to shortcut traversal, provides an easy API for building higher
> level Vector operators
> * Fixed bugs/added support for reading and writing DictionaryBatch
> deltas (tricky)
> * Updated all the dependencies and did some config file gardening to
> make debugging tests easier
> * Added a bunch of new tests
>
> I'd be more than happy to help shepherd a 0.4.0 release of what's in
> arrow/master if that's what everyone wants to do. But in the interest of
> cutting a more feature-rich release and preventing customers from paying the
> cost of updating twice in a short time span, I vote we hold off for
> another day or two and merge + release the work in the refactor branch.
>
> Paul
>
> On 12/9/18 10:51 AM, Wes McKinney wrote:
> > I agree that we should cut a JavaScript release.
> >
> > With the amount of maintenance work on my plate I have to declare
> > bankruptcy on doing any more than I am right now. Can another PMC
> > volunteer to be the RM for the 0.4.0 JavaScript release?
> >
> > Thanks
> > Wes
> > On Tue, Dec 4, 2018 at 10:07 PM Brian Hulette
> wrote:
> >> Hi all,
> >> It's been quite a while since our last major Arrow JS release (0.3.0 on
> >> February 22!), and since then we've added several new features that will
> >> make Arrow JS much easier to adopt. We've added convenience functions
> for
> >> creating Arrow vectors and tables natively in JavaScript, an IPC writer,
> >> and a row proxy interface that will make integrating with existing JS
> >> libraries much simpler.
> >>
> >> I think it's time we cut 0.4.0, so I spent some time closing out or
> >> postponing the last few JIRAs in JS-0.4.0. I got it down to just one
> JIRA
> >> which involves documenting the release process - hopefully we can close
> >> that out as we go through it again.
> >>
> >> Please let me know if you think it makes sense to cut JS-0.4.0 now, or
> if
> >> you have any concerns.
> >>
> >> Brian
>


[jira] [Created] (ARROW-3993) [JS] CI Jobs Failing

2018-12-10 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3993:


 Summary: [JS] CI Jobs Failing
 Key: ARROW-3993
 URL: https://issues.apache.org/jira/browse/ARROW-3993
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Brian Hulette
Assignee: Brian Hulette
 Fix For: JS-0.4.0


JS Jobs failing with:
npm ERR! code ETARGET
npm ERR! notarget No matching version found for gulp@next
npm ERR! notarget In most cases you or one of your dependencies are requesting
npm ERR! notarget a package version that doesn't exist.
npm ERR! notarget 
npm ERR! notarget It was specified as a dependency of 'apache-arrow'
npm ERR! notarget 
npm ERR! A complete log of this run can be found in:
npm ERR! /home/travis/.npm/_logs/2018-12-10T22_33_26_272Z-debug.log
The command "$TRAVIS_BUILD_DIR/ci/travis_before_script_js.sh" failed and exited 
with 1 during .

Reported by [~wesmckinn] in 
https://github.com/apache/arrow/pull/3152#issuecomment-446020105



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Arrow JS 0.4.0 Release

2018-12-04 Thread Brian Hulette
Hi all,
It's been quite a while since our last major Arrow JS release (0.3.0 on
February 22!), and since then we've added several new features that will
make Arrow JS much easier to adopt. We've added convenience functions for
creating Arrow vectors and tables natively in JavaScript, an IPC writer,
and a row proxy interface that will make integrating with existing JS
libraries much simpler.

I think it's time we cut 0.4.0, so I spent some time closing out or
postponing the last few JIRAs in JS-0.4.0. I got it down to just one JIRA
which involves documenting the release process - hopefully we can close
that out as we go through it again.

Please let me know if you think it makes sense to cut JS-0.4.0 now, or if
you have any concerns.

Brian


[jira] [Created] (ARROW-3691) [JS] Update dependencies, switch to terser

2018-11-02 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3691:


 Summary: [JS] Update dependencies, switch to terser
 Key: ARROW-3691
 URL: https://issues.apache.org/jira/browse/ARROW-3691
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette
 Fix For: JS-0.4.0


Many dependencies are out of date, give them a bump.

The uglifyjs-webpack-plugin [no longer 
supports|https://github.com/webpack-contrib/uglifyjs-webpack-plugin/releases/tag/v2.0.0]
 ES6 minification, switch to terser-webpack-plugin



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3689) [JS] Upgrade to TS 3.1

2018-11-01 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3689:


 Summary: [JS] Upgrade to TS 3.1
 Key: ARROW-3689
 URL: https://issues.apache.org/jira/browse/ARROW-3689
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette
 Fix For: JS-0.5.0


Attempted 
[here|https://github.com/apache/arrow/pull/2611#issuecomment-431318129], but 
ran into issues.

Should upgrade typedoc to 0.13 at the same time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column

2018-10-31 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3667:


 Summary: [JS] Incorrectly reads record batches with an all null 
column
 Key: ARROW-3667
 URL: https://issues.apache.org/jira/browse/ARROW-3667
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: JS-0.3.1
Reporter: Brian Hulette
 Fix For: JS-0.4.0


The JS library seems to incorrectly read any columns that come after an 
all-null column in IPC buffers produced by pyarrow.

Here's a python script that generates two arrow buffers, one with an all-null 
column followed by a utf-8 column, and a second with those two reversed

{code:python}
import pyarrow as pa
import pandas as pd

def serialize_to_arrow(df, fd, compress=True):
    batch = pa.RecordBatch.from_pandas(df)
    writer = pa.RecordBatchFileWriter(fd, batch.schema)
    writer.write_batch(batch)
    writer.close()

if __name__ == "__main__":
    df = pd.DataFrame(data={'nulls': [None, None, None],
                            'not nulls': ['abc', 'def', 'ghi']},
                      columns=['nulls', 'not nulls'])
    with open('bad.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
    df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
    with open('good.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
{code}

JS incorrectly interprets the [null, not null] case:

{code:javascript}
> var arrow = require('apache-arrow')
undefined
> var fs = require('fs')
undefined
> arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
'abc'
> arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
'\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
{code}

Presumably this is because pyarrow is omitting some (or all) of the buffers 
associated with the all-null column, but the JS IPC reader is still looking for 
them, causing the buffer count to get out of sync.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3523) [JS] Assign dictionary IDs in IPC writer rather than on creation

2018-10-15 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3523:


 Summary: [JS] Assign dictionary IDs in IPC writer rather than on 
creation
 Key: ARROW-3523
 URL: https://issues.apache.org/jira/browse/ARROW-3523
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Brian Hulette
 Fix For: JS-0.5.0


Currently the JS implementation relies on the user assigning IDs for 
dictionaries that they create; we should do something like the C++ 
implementation, which uses a dictionary id memo to assign and retrieve 
dictionary ids in the IPC writer 
(https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L495).
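
Roughly what the JS equivalent could look like (a sketch with hypothetical 
names, modeled on the linked C++ code, not an actual API):

{code:typescript}
// Assign dictionary IDs at IPC-write time instead of at vector creation.
// The memo hands out a fresh ID the first time it sees a dictionary and
// returns the same ID on every subsequent lookup.
class DictionaryMemo {
  private ids = new Map<object, number>();
  private nextId = 0;

  idFor(dictionary: object): number {
    let id = this.ids.get(dictionary);
    if (id === undefined) {
      id = this.nextId++;
      this.ids.set(dictionary, id);
    }
    return id;
  }
}
{code}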



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3425) [JS] Programmatically created dictionary vectors don't get dictionary IDs

2018-10-03 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3425:


 Summary: [JS] Programmatically created dictionary vectors don't 
get dictionary IDs
 Key: ARROW-3425
 URL: https://issues.apache.org/jira/browse/ARROW-3425
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette
 Fix For: JS-0.4.0


This seems to be the cause of the test failures in 
https://github.com/apache/arrow/pull/2322

Modifying {{getSingleRecordBatchTable}} to [generate its vectors 
programmatically|https://github.com/apache/arrow/pull/2322/files#diff-eb6e5955a00e92f7bebb15a03f8437d1R359]
 (rather than deserializing hard-coded JSON) causes the new round-trip tests 
added in https://github.com/apache/arrow/pull/2638 to fail. The root cause 
seems to be that an ID is never allocated for the generated dictionary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Putting out a new JavaScript release?

2018-09-10 Thread Brian Hulette
Thanks for bringing this up Wes. My hope was to get out an 0.4.0 release
that just includes the IPC writer and usability improvements relatively
soon, and push the refactor out to 0.5.0. Paul's refactor is very exciting
and will definitely be good for the project, but I don't think either of us
has the time to get it into a release in the short-term. Most of the
outstanding tasks in 0.4.0 [1] either have PRs up [2] or are relatively
minor housekeeping tasks. I'd be fine with merging the currently open PRs
and wrapping up the housekeeping tasks so we can cut a release, but I
definitely want to be mindful of Paul's input, since there are almost
certainly conflicts with the refactor.

Brian

[1] https://issues.apache.org/jira/projects/ARROW/versions/12342901
[2]
https://github.com/apache/arrow/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+%5BJS%5D+in%3Atitle

On Mon, Sep 10, 2018 at 6:30 AM Wes McKinney  wrote:

> hi folks,
>
> It's been 6 months since the last JavaScript release. I had read that
> Paul was working on some refactoring of internals
> (https://issues.apache.org/jira/browse/ARROW-2828), and that might be
> the major item on route to the 0.4.0 release, but we might consider
> making a new release in the meantime. What does everyone think?
>
> Thanks
> Wes
>


[jira] [Created] (ARROW-3113) Merge tool can't specify JS fix version

2018-08-23 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3113:


 Summary: Merge tool can't specify JS fix version
 Key: ARROW-3113
 URL: https://issues.apache.org/jira/browse/ARROW-3113
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Brian Hulette
Assignee: Brian Hulette


Specifying a JS-x.x.x fix version doesn't work anymore because of the fix for 
ARROW-2220.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3074) [JS] Date.indexOf generates an error

2018-08-17 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3074:


 Summary: [JS] Date.indexOf generates an error
 Key: ARROW-3074
 URL: https://issues.apache.org/jira/browse/ARROW-3074
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette
 Fix For: JS-0.4.0


https://github.com/apache/arrow/blob/master/js/src/vector/flat.ts#L150

{{every}} doesn't exist on {{Date}}, so this code path throws whenever 
{{indexOf}} is called with a {{Date}} search value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3073) [JS] Add DateVector.from

2018-08-17 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3073:


 Summary: [JS] Add DateVector.from
 Key: ARROW-3073
 URL: https://issues.apache.org/jira/browse/ARROW-3073
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette
 Fix For: JS-0.4.0


It should be possible to construct a {{DateVector}} from a list of JS {{Date}} objects.
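
For illustration, the intended usage might look like this (hypothetical, 
mirroring the existing {{Vector.from}} helpers):

{code:javascript}
// Hypothetical target API: build a DateVector directly from JS Dates
const dates = DateVector.from([new Date('2018-01-01'), new Date('2018-06-01')]);
dates.get(0); // the first date back out
{code}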



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Creating a user@ mailing list

2018-08-16 Thread Brian Hulette
Agreed. I was concerned about the plan to drop Slack because it was a place
users would come to ask questions (for better or worse). I assumed that was
because those users were just uncomfortable with mailing lists, but I think
Uwe is right, they're probably just uncomfortable with *this* mailing list,
where most of the discussion is about development.

Brian

On Thu, Aug 16, 2018 at 6:52 AM Wes McKinney  wrote:

> hi Uwe,
>
> This sounds like a good idea to me. I think we should go ahead and ask
> INFRA to set it up. We'll need to add a "Community" landing page on
> the website of sorts to explain the mailing lists better.
>
> - Wes
>
>
> On Thu, Aug 16, 2018 at 4:49 AM, Uwe L. Korn  wrote:
> > Hello all,
> >
> > I would like to create a u...@arrow.apache.org mailing list. Some
> people are a bit confused that there is only a dev mailing list. They
> interpret this as a mailing list that should be used solely for Arrow
> development, not usage questions. This is sadly a psychological barrier for
> people to get a bit more involved since we have closed Slack.
> >
> > What are others thinking about this?
> >
> > Uwe
>


[jira] [Created] (ARROW-2909) [JS] Add convenience function for creating a table from a list of vectors

2018-07-25 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2909:


 Summary: [JS] Add convenience function for creating a table from a 
list of vectors
 Key: ARROW-2909
 URL: https://issues.apache.org/jira/browse/ARROW-2909
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette


Similar to ARROW-2766, but requires users to first turn their arrays into 
vectors, so we don't have to deduce types.
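
For illustration, a possible shape (all names hypothetical, including 
{{Table.fromVectors}}):

{code:javascript}
// Hypothetical sketch: callers pass already-constructed, typed Vectors,
// so the library never has to infer Arrow types from raw JS arrays.
const table = Table.fromVectors(
  [FloatVector.from(Float32Array.of(1.0, 2.0)), Utf8Vector.from(['a', 'b'])],
  ['nums', 'strs']
);
{code}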



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2819) [JS] Fails to build with TS 2.8.3

2018-07-09 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2819:


 Summary: [JS] Fails to build with TS 2.8.3
 Key: ARROW-2819
 URL: https://issues.apache.org/jira/browse/ARROW-2819
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette


See the [GitHub 
issue|https://github.com/apache/arrow/issues/2115#issuecomment-403612925]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2797) [JS] comparison predicates don't work on 64-bit integers

2018-07-05 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2797:


 Summary: [JS] comparison predicates don't work on 64-bit integers
 Key: ARROW-2797
 URL: https://issues.apache.org/jira/browse/ARROW-2797
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Brian Hulette


The 64-bit integer vector {{get}} function returns a 2-element array, which 
doesn't compare properly in the comparison predicates. We should special-case 
the comparisons for 64-bit integers and timestamps.
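
For illustration, comparing the returned {{[lo, hi]}} pairs with plain JS 
operators gives the wrong answers:

{code:javascript}
// Why naive comparison of the [lo, hi] halves breaks:
[1, 0] === [1, 0]  // false -- distinct array objects, compared by reference
[2, 0] >= [10, 0]  // true  -- arrays coerce to strings: "2,0" >= "10,0"
{code}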



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2789) [JS] Minor DataFrame improvements

2018-07-03 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2789:


 Summary: [JS] Minor DataFrame improvements
 Key: ARROW-2789
 URL: https://issues.apache.org/jira/browse/ARROW-2789
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette


* Deprecate count() in favor of a readonly length member (implemented with a 
getter in FilteredDataFrame)
* Add an iterator to FilteredDataFrame



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2778) Add Utf8Vector.from

2018-07-01 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2778:


 Summary: Add Utf8Vector.from
 Key: ARROW-2778
 URL: https://issues.apache.org/jira/browse/ARROW-2778
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2772) [JS] Commit package-lock.json and/or yarn.lock

2018-07-01 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2772:


 Summary: [JS] Commit package-lock.json and/or yarn.lock
 Key: ARROW-2772
 URL: https://issues.apache.org/jira/browse/ARROW-2772
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette


We should commit one (or both) of these lockfiles to the repo to make the 
dependency tree explicit and consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2771) [JS] Add row proxy object accessor

2018-06-30 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2771:


 Summary: [JS] Add row proxy object accessor
 Key: ARROW-2771
 URL: https://issues.apache.org/jira/browse/ARROW-2771
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette


The {{Table}} class would be much easier to interact with if it returned 
familiar Javascript objects representing a row. As Jeff Heer 
[demonstrated|https://beta.observablehq.com/@jheer/from-apache-arrow-to-javascript-objects]
 it's possible to create JS Proxy objects that read directly from Arrow memory. 
We should generate these types of objects in {{Table.get}} and in the {{Table}} 
iterator.
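
For illustration, a minimal sketch of the idea (hypothetical helper; a real 
implementation would live inside {{Table}} and also handle enumeration):

{code:javascript}
// Minimal sketch: property reads are forwarded to the column vectors,
// so no plain row object is ever materialized.
function rowProxy(table, rowIndex) {
  return new Proxy({}, {
    get(_, columnName) {
      const column = table.getColumn(columnName);
      return column ? column.get(rowIndex) : undefined;
    }
  });
}

const row = rowProxy(table, 0);
row.lat; // reads table.getColumn('lat').get(0) on demand
{code}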



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2767) [JS] Add generic to Table for column names

2018-06-29 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2767:


 Summary: [JS] Add generic to Table for column names
 Key: ARROW-2767
 URL: https://issues.apache.org/jira/browse/ARROW-2767
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Brian Hulette


Requested by [~domoritz]
Something like:

{code:javascript}
class Table<ColName extends string> {
    ...
    getColumn(name: ColName): Vector {
    }
    ...
}
{code}

It would be even better if we could find a way to map the column names to the 
actual vector data types, but one thing at a time.
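
For illustration, the stronger version might look something like this 
(hypothetical sketch):

{code:javascript}
// Hypothetical: a type-level map from column names to vector types,
// so each lookup returns the right Vector subtype.
interface TypedTable<C extends Record<string, Vector>> {
    getColumn<K extends keyof C>(name: K): C[K];
}
// e.g. TypedTable<{lat: FloatVector, name: Utf8Vector}> gives
// getColumn('lat'): FloatVector and getColumn('name'): Utf8Vector
{code}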



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2766) [JS] Add ability to construct a Table from a list of Arrays/TypedArrays

2018-06-29 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2766:


 Summary: [JS] Add ability to construct a Table from a list of 
Arrays/TypedArrays
 Key: ARROW-2766
 URL: https://issues.apache.org/jira/browse/ARROW-2766
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: Brian Hulette


Something like {{Table.from({'col1': [...], 'col2': [...], 'col3': [...]})}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2765) [JS] add Vector.map

2018-06-29 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2765:


 Summary: [JS] add Vector.map
 Key: ARROW-2765
 URL: https://issues.apache.org/jira/browse/ARROW-2765
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: Brian Hulette
 Fix For: JS-0.4.0


Add `Vector.map(f)`, which returns a new vector transformed with `f`.
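
For illustration (hypothetical usage):

{code:javascript}
// Hypothetical: derive a new vector element-wise
const fahrenheit = celsius.map((c) => c * 9 / 5 + 32);
{code}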



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2764) [JS] Easy way to add a column to a Table

2018-06-29 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2764:


 Summary: [JS] Easy way to add a column to a Table
 Key: ARROW-2764
 URL: https://issues.apache.org/jira/browse/ARROW-2764
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette
 Fix For: JS-0.4.0


It should be easier to add a new column to a table. The API could be either 
`table.addColumn(vector)` or `table.merge(...tablesOrVectors)`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2762) [JS] Remove unused perf/config.js

2018-06-28 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2762:


 Summary: [JS] Remove unused perf/config.js
 Key: ARROW-2762
 URL: https://issues.apache.org/jira/browse/ARROW-2762
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette


We don't seem to be using {{perf/config.js}} anymore. Let's remove it and 
replace it with {{perf/table_config.js}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2584) [JS] Node v10 issues

2018-05-15 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2584:


 Summary: [JS] Node v10 issues
 Key: ARROW-2584
 URL: https://issues.apache.org/jira/browse/ARROW-2584
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Paul Taylor


Build and tests fail with Node v10. Fix these issues and bump CI to use Node v10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Continuous benchmarking setup

2018-05-14 Thread Brian Hulette
Is anyone aware of a way we could set up similar continuous benchmarks 
for JS? We wrote some benchmarks earlier this year but currently have no 
automated way of running them.


Brian


On 05/11/2018 08:21 PM, Wes McKinney wrote:

Thanks Tom and Antoine!

Since these benchmarks are literally running on a machine in my closet
at home, there may be some downtime in the future. At some point we
should document a process of setting up a new machine from scratch to
be the nightly bare metal benchmark slave.

- Wes

On Fri, May 11, 2018 at 9:08 AM, Antoine Pitrou  wrote:

Hi again,

Tom has configured the benchmarking machine to run and publish Arrow's
ASV-based benchmarks.  The latest results can now be seen at:
https://pandas.pydata.org/speed/arrow/

I expect these are regenerated on a regular (daily?) basis.

Thanks Tom :-)

Regards

Antoine.


On Wed, 11 Apr 2018 15:40:17 +0200
Antoine Pitrou  wrote:

Hello

With the following changes, it seems we might reach the point where
we're able to run the Python-based benchmark suite across multiple
commits (at least the ones not anterior to those changes):
https://github.com/apache/arrow/pull/1775

To make this truly useful, we would need a dedicated host.  Ideally a
(Linux) OS running on bare metal, with SMT/HyperThreading disabled.
If running virtualized, the VM should have dedicated physical CPU cores.

That machine would run the benchmarks on a regular basis (perhaps once
per night) and publish the results in static HTML form somewhere.

(note: nice to have in the future might be access to NVidia hardware,
but right now there are no CUDA benchmarks in the Python benchmarks)

What should be the procedure here?

Regards

Antoine.





Re: [Format] Pointer types / span types

2018-05-02 Thread Brian Hulette
List also references another (data) array which can be a different size, 
but rather than requiring it to be represented with a second schema, we 
make it a child of the List type. We could do the same thing for a Span 
type, and give it a new type of buffer that contains start/stop indices 
rather than offsets.
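
For illustration, a hypothetical Span layout by analogy with List:

    List<T>: one offsets buffer [o0, o1, ..., on] over a child array of T;
             row i is child[offsets[i] .. offsets[i+1])
    Span<T>: a starts buffer [s0, ..., sn-1] and a stops buffer
             [e0, ..., en-1] over a child array of T; row i is
             child[starts[i] .. stops[i]), so spans may overlap, nest,
             or leave gaps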


To Antoine's point, maybe there's not enough demand to justify defining 
this type right now. I definitely agree that it would be good to see an 
example dataset before adding something like this.


Brian

On 05/02/2018 03:54 PM, Wes McKinney wrote:

Perhaps that could be an argument for making span a core logical type?

I think if anything, this argues that it should not be. Because "span"
references another array, which can be a different size, you need two
schemas to be able to make sense of it.

In either case, I would be interested to see what modifications would
be proposed to Schema.fbs and an example dataset described with such a
schema (that is a single array, instead of multiple -- i.e. a
non-composite representation).

For the record, if there are sufficiently common "composite" data
representations, I don't see a problem with developing community
standards based on the building blocks we already have

- Wes

On Wed, May 2, 2018 at 3:38 PM, Brian Hulette  wrote:

If this were accomplished at the application level, how would it work with
the IPC formats? I'd think you'd need to have two separate files (or
streams), since array 1 and array 2 will be different lengths. Perhaps that
could be an argument for making span a core logical type?

Brian



On 05/02/2018 03:34 PM, Antoine Pitrou wrote:

On Wed, 2 May 2018 10:12:37 -0400
Wes McKinney  wrote:

It sounds like the "span" type could be implemented as a composite of
multiple Arrow arrays / schemas:

array 1 (data)
any schema

array 2 (view)
struct <
  start: int64,
  stop: int64
>


Unless I'm missing something, this feels like an application-level
concern rather than something that needs to be addressed in the
columnar format / metadata.

Well, couldn't the same theoretically be said about list arrays?
In the end, I suppose it all depends whether there's enough demand to
make it a core logical type inside Arrow, rather than something people
write custom code for in their application.

Regards

Antoine.






Re: [Format] Pointer types / span types

2018-05-02 Thread Brian Hulette
If this were accomplished at the application level, how would it work 
with the IPC formats? I'd think you'd need to have two separate files 
(or streams), since array 1 and array 2 will be different lengths. 
Perhaps that could be an argument for making span a core logical type?


Brian


On 05/02/2018 03:34 PM, Antoine Pitrou wrote:

On Wed, 2 May 2018 10:12:37 -0400
Wes McKinney  wrote:

It sounds like the "span" type could be implemented as a composite of
multiple Arrow arrays / schemas:

array 1 (data)
any schema

array 2 (view)
struct <
  start: int64,
  stop: int64
>

Unless I'm missing something, this feels like an application-level
concern rather than something that needs to be addressed in the
columnar format / metadata.

Well, couldn't the same theoretically be said about list arrays?
In the end, I suppose it all depends whether there's enough demand to
make it a core logical type inside Arrow, rather than something people
write custom code for in their application.

Regards

Antoine.




Re: [Format] Pointer types / span types

2018-04-30 Thread Brian Hulette

Yes my first reaction to both of these requests is
- would dictionary-encoding work?
- would a List work?

I think the analogy is clearer for the former. For the latter, a List 
technically encodes start and stop indices with a single offset array 
rather than separate start and stop arrays. Is there a reason an offset 
array wouldn't work for the OAMap use-case, though?
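
For example, an offsets buffer of [0, 3, 6, 9] encodes the three rows 
[0, 3), [3, 6), and [6, 9): row i spans offsets[i] through offsets[i+1], 
so each row's start is always the previous row's stop.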


Brian


On 04/30/2018 04:55 PM, Antoine Pitrou wrote:

Actually, "pointer type" might just be another name for "dictionary type".

Regards

Antoine.


Le 30/04/2018 à 22:08, Antoine Pitrou a écrit :

Hi,

Today I got the opportunity to talk with Jim Pivarski, the main
developer on the OAMap project (*).  Under the hood, he is doing
something not unlike the Arrow representation of nested arrays: he
stores and processes structured data as linear arrays, allowing very
fast processing on seemingly irregular data (in Array parlance, think
something like lists of lists of structs).  It seems that OAMap data
requires two kinds of logical types that Arrow misses :

- a pointer type, where a physical array of ints is used to represent
indices into another array (the logical value being of course the value
pointed to)
- a span type, where two physical arrays of ints are used to represent
start and stop indices into another array (the logical value being the
list of values delimited by the start / stop indices)

Did such a feature request already come by?  Is this something we should
add to our roadmap or future wishlist?

(*) https://github.com/diana-hep/oamap

Regards

Antoine.





Re: Allow dictionary-encoded children?

2018-04-06 Thread Brian Hulette
Thanks Uwe, Wes, glad to hear I'm not too far out there :) The 
dictionary batch ordering seems like a reasonable requirement for this 
situation.


I made a JIRA to add something like this to the integration tests 
(https://issues.apache.org/jira/browse/ARROW-2412) and I'll put up a PR 
shortly.


On 04/06/2018 01:43 PM, Wes McKinney wrote:

Having dictionaries-within-dictionaries does add some complexity, but
I think the use case is valid and so it would be good to determine the
best way to handle this in the IPC / messaging protocol.

I would suggest: dictionaries can use other dictionaries, so long as
those dictionaries occur earlier in the stream. I am not sure either
the Java or C++ libraries will be able to properly handle these cases
right now, but that's what we have integration tests for!

On Fri, Apr 6, 2018 at 11:59 AM, Uwe L. Korn  wrote:

Hello Brian,

I would also have considered this a legitimate use of the Arrow specification. 
We only specify the DictionaryType to have a dictionary of any Arrow Type. In 
the context of Arrow's IPC this seems to be a bit more complicated as we seem 
to have the assumption that there is only one type of Dictionary per column. I 
would argue that we should be able to support this once we work out a reliable 
way to transfer them via the IPC mechanism.

Just as a related thought (might not produce the result you want): In Parquet, 
only the values on the lowest level are dictionary-encoded. But this is also 
due to the fact that Parquet uses repetition and definition levels to encode 
arbitrarily nested data types. These are more space-efficient when they are 
correctly encoded but don't provide random access.

Uwe

On Fri, Apr 6, 2018, at 4:42 PM, Brian Hulette wrote:

I've been considering a use-case with a dictionary-encoded struct
column, which may contain some dictionary-encoded columns itself. More
specifically, in this use-case each row represents a single observation
in a geospatial track, which includes a position, a time, and some
track-level metadata (track id, origin, destination, etc...). I would
like to represent the metadata as a dictionary-encoded struct, since
unique values will be repeated for each observation of that track, and I
would _also_ like to dictionary-encode some of the metadata column's
children, since unique values will typically be repeated in multiple tracks.

I think one could make a (totally legitimate) argument that this is
stretching a format designed for tabular data too far. This use-case
could also be accomplished by breaking out the struct metadata column
into its own arrow table, and managing a new integer column that
references that table. This would look almost identical to what I
initially described, it just wouldn't rely on the arrow libraries to
manage the "dictionary".


The spec doesn't have anything to say on this topic as far as I can
tell, but our implementations don't currently allow a dictionary-encoded
column's children to be dictionary-encoded themselves [1]. Is this just
a simplifying assumption, or a hard rule that should be codified in the
spec?

Thanks,
Brian

[1]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824




[jira] [Created] (ARROW-2412) [Integration] Add nested dictionary integration test

2018-04-06 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2412:


 Summary: [Integration] Add nested dictionary integration test
 Key: ARROW-2412
 URL: https://issues.apache.org/jira/browse/ARROW-2412
 Project: Apache Arrow
  Issue Type: Task
  Components: Integration
Reporter: Brian Hulette


Add a nested dictionary generator to the integration test. The tests will 
probably fail at first but can serve as a starting point for developing this 
capability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2410) [JS] Add DataFrame.scanAsync

2018-04-06 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2410:


 Summary: [JS] Add DataFrame.scanAsync
 Key: ARROW-2410
 URL: https://issues.apache.org/jira/browse/ARROW-2410
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette


Add a version of `DataFrame.scan`, `scanAsync`, that yields periodically. The 
yield frequency could be specified either as a number of record batches or a 
number of records.

This scan should also be cancellable.
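
For illustration, a minimal sketch of the yielding behavior (all names 
hypothetical; the real scan internals would differ):

{code:javascript}
// Minimal sketch, assuming an iterable of record batches and a scan
// callback next(index, batch).
async function scanAsync(batches, next, signal) {
  for (const batch of batches) {
    if (signal && signal.cancelled) { return; } // cooperative cancellation
    for (let i = -1, n = batch.length; ++i < n;) {
      next(i, batch);
    }
    // yield to the event loop between record batches
    await new Promise((resolve) => setTimeout(resolve));
  }
}
{code}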



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Allow dictionary-encoded children?

2018-04-06 Thread Brian Hulette
I've been considering a use-case with a dictionary-encoded struct 
column, which may contain some dictionary-encoded columns itself. More 
specifically, in this use-case each row represents a single observation 
in a geospatial track, which includes a position, a time, and some 
track-level metadata (track id, origin, destination, etc...). I would 
like to represent the metadata as a dictionary-encoded struct, since 
unique values will be repeated for each observation of that track, and I 
would _also_ like to dictionary-encode some of the metadata column's 
children, since unique values will typically be repeated in multiple tracks.
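
For illustration, the schema shape I have in mind (hypothetical notation):

    observation: struct <
        position: struct <lat: double, lon: double>,
        time: timestamp,
        metadata: dictionary <struct <
            track_id: utf8,
            origin: dictionary <utf8>,       <- dictionary-encoded child
            destination: dictionary <utf8>   <- dictionary-encoded child
        >>
    >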


I think one could make a (totally legitimate) argument that this is 
stretching a format designed for tabular data too far. This use-case 
could also be accomplished by breaking out the struct metadata column 
into its own arrow table, and managing a new integer column that 
references that table. This would look almost identical to what I 
initially described, it just wouldn't rely on the arrow libraries to 
manage the "dictionary".



The spec doesn't have anything to say on this topic as far as I can 
tell, but our implementations don't currently allow a dictionary-encoded 
column's children to be dictionary-encoded themselves [1]. Is this just 
a simplifying assumption, or a hard rule that should be codified in the 
spec?


Thanks,
Brian

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824


[jira] [Created] (ARROW-2327) [JS] Table.fromStruct missing from externs

2018-03-19 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2327:


 Summary: [JS] Table.fromStruct missing from externs
 Key: ARROW-2327
 URL: https://issues.apache.org/jira/browse/ARROW-2327
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette


{{Table.fromStruct}} is not listed in externs, so it's obfuscated by the Closure 
Compiler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Apache Arrow JavaScript 0.3.1 - RC1

2018-03-15 Thread Brian Hulette
+1 (non-binding). Ran js-verify-release-candidate.sh with Node 8.9.1 on 
Ubuntu 16.04. Thanks Wes!



On 03/15/2018 05:17 AM, Uwe L. Korn wrote:

+1 (binding). Ran js-verify-release-candidate.sh with Node 9.8.0

On Thu, Mar 15, 2018, at 1:50 AM, Wes McKinney wrote:

+1 (binding). Ran js-verify-release-candidate.sh with Node 8.10.0 LTS

On Wed, Mar 14, 2018 at 8:40 PM, Paul Taylor  wrote:

+1 (non-binding)


On Mar 14, 2018, at 5:10 PM, Wes McKinney  wrote:

Hello all,

I'd like to propose the following release candidate (rc1) of Apache Arrow
JavaScript version 0.3.1.

The source release rc1 is hosted at [1].

This release candidate is based on commit
077bd53df590cafe26fc784b3c6d03bf1ac24f67

Please download, verify checksums and signatures, run the unit tests, and vote
on the release. The easiest way is to use the JavaScript-specific release
verification script dev/release/js-verify-release-candidate.sh.

The vote will be open for at least 24 hours and will close once
enough PMCs have approved the release.

[ ] +1 Release this as Apache Arrow JavaScript 0.3.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow JavaScript 0.3.1 because...


How to validate a release signature:
https://httpd.apache.org/dev/verification.html

[1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.1-rc1/
[2]: 
https://github.com/apache/arrow/tree/077bd53df590cafe26fc784b3c6d03bf1ac24f67




Re: gReetings

2018-03-14 Thread Brian Hulette
If you prefer slack over (or in addition to) the mailing list there's 
also the Arrow slack. We recently made a #javascript channel there for 
discussions about that implementation; you could certainly do the same 
for R.


[1] https://apachearrow.slack.com
[2] https://apachearrowslackin.herokuapp.com/ (auto-invite link)

On 03/14/2018 02:07 PM, Romain Francois wrote:

Sounds great.


Le 14 mars 2018 à 19:03, Aneesh Karve  a écrit :

Hi Romain. Thanks for looking into this. Per discussion with Wes we'll keep
the discussion on ASF channels so the community can participate.




[jira] [Created] (ARROW-2297) [JS] babel-jest is not listed as a dev dependency

2018-03-12 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2297:


 Summary: [JS] babel-jest is not listed as a dev dependency
 Key: ARROW-2297
 URL: https://issues.apache.org/jira/browse/ARROW-2297
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette


babel-jest is not listed as a dev dependency, leading to the following error on 
new clones of arrow js:

{noformat}
[10:21:08] Starting 'test:ts'...
● Validation Error:

  Module ./node_modules/babel-jest/build/index.js in the transform option was 
not found.

  Configuration Documentation:
  https://facebook.github.io/jest/docs/configuration.html

[10:21:09] 'test:ts' errored after 306 ms
[10:21:09] Error: exited with error code: 1
at ChildProcess.onexit 
(/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36)
at emitTwo (events.js:126:13)
at ChildProcess.emit (events.js:214:7)
at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12)
[10:21:09] 'test' errored after 311 ms
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow JavaScript 0.3.1 - RC0

2018-03-12 Thread Brian Hulette

-1 (non-binding)

I get an error when running js-verify-release-candidate.sh, which
I can also replicate with a fresh clone of arrow on commit
17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6 by running `npm install`
and then `npm run test -- -t ts`:

[10:21:08] Starting 'test:ts'...
● Validation Error:

  Module ./node_modules/babel-jest/build/index.js in the transform option was 
not found.

  Configuration Documentation:
  https://facebook.github.io/jest/docs/configuration.html

[10:21:09] 'test:ts' errored after 306 ms
[10:21:09] Error: exited with error code: 1
at ChildProcess.onexit 
(/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36)
at emitTwo (events.js:126:13)
at ChildProcess.emit (events.js:214:7)
at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12)
[10:21:09] 'test' errored after 311 ms


Seems like the issue is that babel-jest is not included as a dev
dependency, so it's not found in node_modules in the new clone.
Not sure how it was working in the past, perhaps it was a
transitive dependency that was reliably included?

I can put up a PR to add the dependency

Brian


On 03/10/2018 01:52 PM, Wes McKinney wrote:

+1 (binding), ran js-verify-release-candidate.sh with NodeJS 8.10.0
LTS on Ubuntu 16.04

On Sat, Mar 10, 2018 at 1:52 PM, Wes McKinney  wrote:

Hello all,

I'd like to propose the 1st release candidate (rc0) of Apache Arrow
JavaScript version 0.3.1. This is a bugfix release from 0.3.0.

The source release rc0 is hosted at [1].

This release candidate is based on commit
17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6

Please download, verify checksums and signatures, run the unit tests, and vote
on the release. The easiest way is to use the JavaScript-specific release
verification script dev/release/js-verify-release-candidate.sh.

The vote will be open for at least 24 hours and will close once
enough PMCs have approved the release.

[ ] +1 Release this as Apache Arrow JavaScript 0.3.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow JavaScript 0.3.1 because...

Thanks,
Wes

How to validate a release signature:
https://httpd.apache.org/dev/verification.html

[1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.1-rc0/
[2]: 
https://github.com/apache/arrow/tree/17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6




Re: Making a bugfix Arrow JS release

2018-03-07 Thread Brian Hulette

Naveen,

Yes, I think when we initially discussed adding the JS dataframe ops we 
argued that it could be a separate library within the Apache Arrow 
monorepo, since some users will just want the ability to read/write 
arrow data, and we shouldn't force them to pull in a dataframe API they 
won't be using.


Right now there's not much to the dataframe parts of arrow js, so I 
think the cost is pretty minimal, but as it grows it will be a good idea 
to separate it out. Feel free to make a JIRA for this, maybe it can be a 
goal for the next JS release.


Brian

On 03/07/2018 10:00 AM, Naveen Michaud-Agrawal wrote:

Hi Brian,

Any thoughts on splitting out the dataframe like parts into a separate
library, keeping arrowjs to just handle loading data out of the arrow
buffer?

Regards,
Naveen Michaud-Agrawal





Re: Making a bugfix Arrow JS release

2018-03-06 Thread Brian Hulette
We're just wrapping up https://github.com/apache/arrow/pull/1678, and I 
would also like to merge https://github.com/apache/arrow/pull/1683, even 
though it's technically not a bugfix: it makes the df interface much 
more useful.


Once we merge those I'd be happy cutting a bugfix release, unless 
there's anything else Paul would like to get in.


Brian


On 03/05/2018 02:21 PM, Wes McKinney wrote:

Brian mentioned on GitHub that it might be good to make a 0.3.1 JS
release due to bugs fixed since 0.3.0. Is there any other work that
needs to be merged before doing this?

Thanks
Wes




[jira] [Created] (ARROW-2236) [JS] Add more complete set of predicates

2018-02-28 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2236:


 Summary: [JS] Add more complete set of predicates
 Key: ARROW-2236
 URL: https://issues.apache.org/jira/browse/ARROW-2236
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Brian Hulette


Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and ||. 
We should also support !=, <, and > at the very least.
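
For illustration, with the missing comparators the predicate DSL could 
express things like this (hypothetical method names {{lt}}, {{gt}}, {{ne}}):

{code:javascript}
// Hypothetical usage, mirroring the existing col(...).eq/ge/le builders:
const pred = col('value').gt(10).and(col('label').ne('skip'));
df.filter(pred).count();
{code}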



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2235) [JS] Add tests for IPC messages split across multiple buffers

2018-02-28 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2235:


 Summary: [JS] Add tests for IPC messages split across multiple 
buffers
 Key: ARROW-2235
 URL: https://issues.apache.org/jira/browse/ARROW-2235
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Brian Hulette


See https://github.com/apache/arrow/pull/1670



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2234) [JS] Read timestamp low bits as Uint32s

2018-02-28 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2234:


 Summary: [JS] Read timestamp low bits as Uint32s
 Key: ARROW-2234
 URL: https://issues.apache.org/jira/browse/ARROW-2234
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Paul Taylor






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2233) [JS] Error when slicing a DictionaryVector with nullable indices vector

2018-02-28 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2233:


 Summary: [JS] Error when slicing a DictionaryVector with nullable 
indices vector
 Key: ARROW-2233
 URL: https://issues.apache.org/jira/browse/ARROW-2233
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette


Falls through the checks and throws this error: 
https://github.com/apache/arrow/blob/master/js/src/vector.ts#L416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

