Re: [RESULT][VOTE] Release Apache Arrow 6.0.1 - RC1

2021-11-18 Thread Paul Taylor

v6.0.1 JS packages have been uploaded.

On 11/18/21 2:28 AM, Sutou Kouhei wrote:

Hi,

Could someone help with the post-release tasks? Especially
JavaScript, Conda, R, vcpkg, and documentation.

2.  [done] upload source
3.  [done] upload binaries
4.  [done] update website
5.  [kou] upload ruby gems
6.  [ ] upload js packages
8.  [done] upload C# packages
10. [ ] update conda recipes
11. [done] upload wheels/sdist to pypi
12. [done] update homebrew packages
13. [done] update maven artifacts
14. [done] update msys2
15. [ ] update R packages
16. [ ] update vcpkg port
17. [done] update tags for Go modules
18. [ ] update docs
 After https://github.com/apache/arrow/pull/11728 is merged.

Thanks,




Re: [VOTE] Remove compute from Arrow JS

2021-11-02 Thread Paul Taylor
+1 from me as well

> On Oct 27, 2021, at 6:58 PM, Brian Hulette  wrote:
> 
> 
> +1
> 
> I don't think there's much reason to keep the compute code around when 
> there's a more performant, easier to use alternative. I think the only unique 
> feature of the arrow compute code was the ability to optimize queries on 
> dictionary-encoded columns, but Jeff added this to Arquero almost a year ago 
> now [1].
> 
> Brian
> 
> [1] https://github.com/uwdata/arquero/issues/86
> 
>> On Wed, Oct 27, 2021 at 4:46 PM Dominik Moritz  wrote:
>> Dear Arrow community,
>> 
>> We are proposing to remove the compute code from Arrow JS. Right now, the 
>> compute code is encapsulated in a DataFrame class that extends Table. The 
>> DataFrame implements a few functions such as filtering and counting with 
>> expressions. However, the predicate code is not very efficient (it’s 
>> interpreted) and most people only use Arrow to read data but don’t need 
>> compute. There are also more complete alternatives for doing compute on 
>> Arrow data structures such as Arquero (https://github.com/uwdata/arquero). 
>> By removing the compute code, we can focus on the IPC reading/writing and 
>> primitive types.
>> 
>> The vote will be open for at least 72 hours.
>> 
>> [ ] +1 Remove compute from Arrow JS
>> [ ] +0
>> [ ] -1 Do not remove compute because…
>> 
>> Thank you,
>> Dominik


Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-27 Thread Paul Taylor

JS packages have been uploaded.

Paul

On 4/27/21 9:47 AM, Neal Richardson wrote:

R package has been accepted by CRAN.

Neal

On Tue, Apr 27, 2021 at 7:25 AM Krisztián Szűcs 
wrote:


I've just opened a PR with the updated documentation.

The remaining tasks:

3.  [in-pr|Kou] upload binaries
6.  [Paul] upload js packages
10. [Uwe] update conda recipes
12. [todo] update homebrew packages
14. [Kou] update msys2
15. [Neal] update R packages
16. [in-pr|Krisztian] update docs

On Tue, Apr 27, 2021 at 2:42 PM Krisztián Szűcs
 wrote:

On Tue, Apr 27, 2021 at 2:21 PM Paul Taylor wrote:

These look like the errors resolved in
https://github.com/apache/arrow/pull/10156. Can we cherry-pick that
commit to the release branch?

Great, I'll cherry-pick that commit.

Could you please release the JS packages to npm? I think the
lerna.json needs to be updated before npm publish.

Thanks Paul!


On 4/27/21 7:04 AM, Krisztián Szűcs wrote:

I'd need some help to both release the JS packages using the new lerna
configuration and to fix the JS documentation generation [1]. We
should backport these changes to the release-4.0.0 branch.

[1]: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4297&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

On Tue, Apr 27, 2021 at 1:50 AM Sutou Kouhei wrote:

I'll also update MSYS2 packages:

1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [kou] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [x] upload C# packages
9.  [x] upload rust crates
10. [ ] update conda recipes
11. [x] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [x] update maven artifacts
14. [kou] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

In <r...@mail.gmail.com>
   "Re: [VOTE] Release Apache Arrow 4.0.0 - RC3" on Tue, 27 Apr 2021 01:48:37 +0200,
   Krisztián Szűcs wrote:


On Tue, Apr 27, 2021 at 1:05 AM Andy Grove wrote:

The following Rust crates have been published: arrow, arrow-flight,
parquet, parquet_derive, datafusion

Thanks Andy!

The current status is:
1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [kou] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [x] upload C# packages
9.  [x] upload rust crates
10. [ ] update conda recipes
11. [x] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [x] update maven artifacts
14. [ ] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

On Mon, Apr 26, 2021 at 4:34 PM Andy Grove wrote:

Yes, I can handle the Rust release.

On Mon, Apr 26, 2021, 4:17 PM Krisztián Szűcs <szucs.kriszt...@gmail.com> wrote:

@Andy Grove could you please handle the rust release?

On Mon, Apr 26, 2021 at 11:51 PM Krisztián Szűcs
 wrote:

1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [kou] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [x] upload C# packages
9.  [ ] upload rust crates
10. [ ] update conda recipes
11. [in-progress] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [x] update maven artifacts
14. [ ] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

The JS post release task is failing with:

lerna ERR! ENOLERNA `lerna.json` does not exist, have you run `lerna init`?

I assume the lerna configuration should be updated including the version number.

@Paul Taylor could you please handle the JS release?

On Mon, Apr 26, 2021 at 9:01 PM Krisztián Szűcs
 wrote:

The current status of the post-release tasks:

1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [can't do] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [ ] upload C# packages
9.  [ ] upload rust crates
10. [ ] update conda recipes
11. [kszucs] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [kszucs] update maven artifacts
14. [ ] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

On Mon, Apr 26, 2021 at 8:19 PM Krisztián Szűcs
 wrote:

The VOTE carries with 4 binding +1 and 5 non-binding +1 votes.

Thanks everyone!

I'm starting the post-release tasks and will keep you posted about the
current status.

On Mon, Apr 26, 2021 at 6:06 PM Neal Richardson
 wrote:

+1 (binding)

GitHub Actions verifications are green and R artifact builds are successful.

Neal

On Mon, Apr 26, 2021 at 6:02 AM Krisztián Szűcs <szucs.kriszt...@gmail.com> wrote:


On Sun, Apr 25, 2021 at 10:59 PM Sutou Kouhei <k...@clear-code.com> wrote:

Here: https://github.com/apache/arrow/pull/10126

I've incorporated the automatic verification step to the release
procedure so we can start the VOTE after having positive feedback from
the verification tasks.

Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-27 Thread Paul Taylor
These look like the errors resolved in 
https://github.com/apache/arrow/pull/10156. Can we cherry-pick that 
commit to the release branch?



On 4/27/21 7:04 AM, Krisztián Szűcs wrote:

I'd need some help to both release the JS packages using the new lerna
configuration and to fix the JS documentation generation [1]. We
should backport these changes to the release-4.0.0 branch.

[1]: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=4297&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

On Tue, Apr 27, 2021 at 1:50 AM Sutou Kouhei  wrote:

I'll also update MSYS2 packages:

1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [kou] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [x] upload C# packages
9.  [x] upload rust crates
10. [ ] update conda recipes
11. [x] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [x] update maven artifacts
14. [kou] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

In 
   "Re: [VOTE] Release Apache Arrow 4.0.0 - RC3" on Tue, 27 Apr 2021 01:48:37 +0200,
   Krisztián Szűcs wrote:


On Tue, Apr 27, 2021 at 1:05 AM Andy Grove  wrote:

The following Rust crates have been published: arrow, arrow-flight, parquet, 
parquet_derive, datafusion

Thanks Andy!

The current status is:
1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [kou] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [x] upload C# packages
9.  [x] upload rust crates
10. [ ] update conda recipes
11. [x] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [x] update maven artifacts
14. [ ] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

On Mon, Apr 26, 2021 at 4:34 PM Andy Grove  wrote:

Yes, I can handle the Rust release.

On Mon, Apr 26, 2021, 4:17 PM Krisztián Szűcs  wrote:

@Andy Grove could you please handle the rust release?

On Mon, Apr 26, 2021 at 11:51 PM Krisztián Szűcs
 wrote:

1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [kou] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [x] upload C# packages
9.  [ ] upload rust crates
10. [ ] update conda recipes
11. [in-progress] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [x] update maven artifacts
14. [ ] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

The JS post release task is failing with:

lerna ERR! ENOLERNA `lerna.json` does not exist, have you run `lerna init`?

I assume the lerna configuration should be updated including the version number.

@Paul Taylor could you please handle the JS release?

On Mon, Apr 26, 2021 at 9:01 PM Krisztián Szűcs
 wrote:

The current status of the post-release tasks:

1.  [x] open a pull request to bump the version numbers in the source code
2.  [x] upload source
3.  [can't do] upload binaries
4.  [x] update website
5.  [x] upload ruby gems
6.  [ ] upload js packages
8.  [ ] upload C# packages
9.  [ ] upload rust crates
10. [ ] update conda recipes
11. [kszucs] upload wheels/sdist to pypi
12. [ ] update homebrew packages
13. [kszucs] update maven artifacts
14. [ ] update msys2
15. [nealrichardson] update R packages
16. [ ] update docs

On Mon, Apr 26, 2021 at 8:19 PM Krisztián Szűcs
 wrote:

The VOTE carries with 4 binding +1 and 5 non-binding +1 votes.

Thanks everyone!

I'm starting the post-release tasks and will keep you posted about the
current status.

On Mon, Apr 26, 2021 at 6:06 PM Neal Richardson
 wrote:

+1 (binding)

GitHub Actions verifications are green and R artifact builds are successful.

Neal

On Mon, Apr 26, 2021 at 6:02 AM Krisztián Szűcs 
wrote:


On Sun, Apr 25, 2021 at 10:59 PM Sutou Kouhei  wrote:

Here: https://github.com/apache/arrow/pull/10126

I've incorporated the automatic verification step to the release
procedure so we can start the VOTE after having positive feedback from
the verification tasks.

In 
   "Re: [VOTE] Release Apache Arrow 4.0.0 - RC3" on Sun, 25 Apr 2021 15:12:30 -0500,
   Wes McKinney wrote:


Have we run the GitHub Actions release verifications, or can we do
that? I will try to run the RC verification on my dev machine (I
recently reinstalled Linux so wasn't equipped to immediately run the
verification script)

On Sun, Apr 25, 2021 at 2:31 PM Jorge Cardoso Leitão
 wrote:

+1, based on Rust alone. All tests pass as they should.

Thanks a lot everyone for making this happen.

Best,
Jorge


On Thu, Apr 22, 2021 at 5:17 PM Jonathan Keane wrote:

+1 (non-binding)

Verified wheels, sources, and binaries on macOS 11.2 using the verification
script (except for Java Integration, Glib, and Ruby). Like Antoine I ran
into the same issue with Ruby.

I also installed Arrow and the R package locally + ran some adhoc test

Re: [JS] Exploring usage of apache arrow at my company for complex table rendering

2021-02-26 Thread Paul Taylor
Hi Michael,

The answer to your question about metadata will likely be
application-specific.

For small amounts of metadata (i.e. communicating a bounding box of
included geometry), there isn't much room for optimization, so a string
could be fine.

For larger amounts of metadata (or other constraints, like if the metadata
needs to be constantly modified independent of the data), custom encodings
or a second service and/or arrow table of the metadata could be the way to
go.

The metadata keys/values are UTF-8 strings, so nothing should prevent you
from stuffing a base64-encoded protobuf in there.
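
For illustration, here's a minimal sketch of attaching such a blob as
custom schema metadata (the key name and payload below are invented; the
Arrow JS Schema and Field constructors accept an optional Map of
key/value metadata):

import { Schema, Field, Utf8 } from 'apache-arrow';

// hypothetical payload, e.g. a base64-encoded protobuf of render hints
const protoBytes = new Uint8Array([/* serialized protobuf */]);
const hints = Buffer.from(protoBytes).toString('base64');

const schema = new Schema(
  [new Field('city', new Utf8())],
  new Map([['myapp.render-hints', hints]]) // custom key/value metadata
);

// the receiving side reads it back off the deserialized schema
const blob = schema.metadata.get('myapp.render-hints');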

As for whether the library is maintained -- yes it is, but lately I've only
had time to work on bug fixes or features required to maintain parity with
the spec and other libs.

I will be using Arrow JS in my work again soon, and that could justify more
"quality of life" improvements again, but without other maintainers jumping
in to contribute or needing it for my work, those things don't get done.

I'd be happy to do a call with you or your team to give a short overview
and introduction to the JS lib. You can also email me directly or in the
#arrow-js channel on the-asf.slack.com with any questions.

Best,
Paul

On Fri, Feb 26, 2021 at 1:47 PM Michael Lavina 
wrote:

> Hey Neal,
>
> Thanks for the response and I am glad I am using this correctly. I have
> never really used email servers so hopefully this works.
>
> That’s exactly what I was thinking of doing is to create a standard
> metadata schema to built on top of Apache Arrow with some predefined user
> types.
>
> I guess I was just wondering if I was trying to use a screwdriver as a
> hammer. It can work because we are using the metadata and that could be
> anything but maybe like you said we should be creating a separate standard
> entirely for defining the schema to render tables instead of defining it
> within Arrow.
>
> Does it defeat the value of Arrow if we are sending the data using buffers
> and streams and a giant string of stringified metadata when I could maybe
> define the metadata in protobuf binary separately?
>
> In addition, I was curious with all these visualization tools has someone
> already developed a standard metadata for arrow to help with rendering.
> Stuff like how to denote grouping of data, relationship between columns and
> hidden information.
>
> -Michael
>
> From: Neal Richardson 
> Date: Friday, February 26, 2021 at 1:38 PM
> To: dev 
> Subject: Re: [JS] Exploring usage of apache arrow at my company for
> complex table rendering
> The Arrow IPC specification allows for custom metadata in both the Schema
> and the individual Fields:
>
> https://arrow.apache.org/docs/format/Columnar.html#schema-message
>
> Might that work for you? Another alternative would be to track your
> metadata in a separate object outside of the Arrow data.
>
> Neal
>
> On Fri, Feb 26, 2021 at 5:02 AM Michael Lavina wrote:
>
> > Hello Everyone,
> >
> >
> >
> > Some background. My name is Michael and I work at FactSet, which if you
> > use Arrow you may have heard because one of our architects did a talk on
> > using Arrow and Dremio.
> >
> >
> >
> > https://hello.dremio.com/eliminate-data-transfer-bottlenecks-with-apache-arrow-flight.html
> >
> >
> >
> > His team has decided to use Arrow as a tabular data interchange format.
> > Other teams are doing other things. We are working on standardizing our
> > tabular data interchange format at our company.
> >
> >
> >
> > We have our own open-sourced columnar based schema defined in protobuf.
> >
> > https://github.com/factset/stachschema
> >
> >
> >
> > We looked into Apache Arrow a few years ago, but decided not to use it as
> > it was not mature enough at the time and we had two specific requirements
> >
> > 1) We needed this data not just for analytics but rendering as well and
> > 

Re: [javascript] streaming IPC examples?

2021-01-26 Thread Paul Taylor

Hey Ryan,

Yes, the IPC primitives were designed explicitly for the use-case you're 
describing.


Rather than building on Observables, they use a similar fundamental 
primitive native to JS, AsyncIterables. You may already be familiar with 
AsyncIterables, as they're returned by async generator functions (the 
`async function*` syntax), and are consumed via `for await...of` loops.


AsyncIterables are async-pull streams, i.e. the consumer pulls a Promise 
on demand, awaits the Promise to unwrap the value, and repeats. This is in 
contrast to the Observable, where the producer pushes values on demand 
to the consumer without considering whether the consumer has capacity to 
process the value.


AsyncIterables map 1-1 to native node streams, as well as browser WhatWG 
DOM streams <https://github.com/whatwg/streams/blob/main/FAQ.md>, and we 
provide methods to convert to either of those if you need. It's also 
possible to make an AsyncIterable from an Observable source (with a bit 
of buffering at the junction), and an Observable from an AsyncIterable 
(no buffering required).
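
To make that concrete, here's a minimal sketch of pulling batches from an
IPC stream with `for await` (the URL is made up, and the method names
assume a recent apache-arrow release):

import { RecordBatchReader } from 'apache-arrow';

async function logBatches(url: string) {
  const res = await fetch(url);
  // from() also accepts node streams, AsyncIterables, and Promises thereof
  const reader = await RecordBatchReader.from(res.body!);
  await reader.open(); // reads the initial schema/header message
  console.log(reader.schema.fields.map((f) => f.name));
  for await (const batch of reader) {
    // per-batch "callback" point: each RecordBatch as it arrives
    console.log(`batch with ${batch.numRows} rows`);
  }
}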


There are quite a few options baked into the IPC RecordBatchReader and 
RecordBatchWriter for handling advanced use-cases, but I've put together 
this small example to illustrate some of the basics.


https://codepen.io/trxcllnt/pen/OJReoeW

This example uses IxJS <https://github.com/ReactiveX/IxJS>, the sister 
library to RxJS for AsyncIterables. If you're familiar with Rx, Ix 
should feel similar.


I also have a number of other repositories that can serve as examples 
for reading/writing Arrow IPC streams:


https://github.com/trxcllnt/fastify-arrow

https://github.com/trxcllnt/arrow-to-parquet-js

https://github.com/trxcllnt/csv-to-arrow-js

If you need to go even lower-level, the Arrow repository has few 
debugging utilities that use more of the IPC internals:


bin/print-buffer-alignment.js 
<https://github.com/apache/arrow/blob/919980184fe2b27063adec0d0908c75cd17a8437/js/bin/print-buffer-alignment.js>


src/bin/arrow2csv.ts 
<https://github.com/apache/arrow/blob/919980184fe2b27063adec0d0908c75cd17a8437/js/src/bin/arrow2csv.ts>


If you have Arrow installed locally in a project, you can use the above 
script via `npx` to view a table from the command line:


$ cat ./some-table.arrow | npx arrow2csv

Feel free to reach out or @ me on GitHub if you have more questions 
about the Grafana integration.


Best,

Paul



On 1/24/21 4:21 PM, Brian Hulette wrote:
+Paul Taylor <ptay...@apache.org> would your work with whatwg 
streams be relevant here? Are there any examples that would be useful 
for Ryan?


Brian

On Sat, Jan 23, 2021 at 4:52 PM Ryan McKinley <ryan...@gmail.com> wrote:


Hello-

I am exploring options to support streaming in grafana.  We have a golang
websocket server and am exploring options to send data to the browser.

Are there any good examples of reading IPC data with callbacks for each
block?  I see examples for mapd, and for reading whole tables -- but am
hoping for something that lets me read initial header data, then get each
record batch as a callback (rxjs)
https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format

Thanks for any pointers
Ryan



Re: [JS] BigNum toJSON returns a string with quotations in it

2020-10-28 Thread Paul Taylor
The JSON rep will still have quotes when returned by JSON.stringify(), by 
virtue of being a string. The current code is causing the decimal to be 
double-quoted, with a set of escaped inner quotes.
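
In other words (an illustrative sketch, not the actual bn.ts source):

class BigNumLike {
  constructor(private digits: string) {}
  // buggy: JSON.stringify(this.digits) yields the string '"12"', which the
  // outer stringify then quotes again -> "\"12\""
  toJSON() { return JSON.stringify(this.digits); }
  // fixed: return the bare string and let the outer stringify add the one
  // set of quotes -> "12"
  // toJSON() { return this.digits; }
}
console.log(JSON.stringify({ a: new BigNumLike('12') })); // {"a":"\"12\""}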

> On Oct 28, 2020, at 10:12 AM, Lee, David  
> wrote:
> 
> Quotes shouldn't be removed. BigNum is a decimal derivative and decimals in 
> JSON have quotes.
> 
> This is needed to distinguish values like "2.1" vs "2.100", etc. which are 
> decimal(2,1) vs decimal(4,3) datatypes.
> 
> -Original Message-
> From: Paul Taylor  
> Sent: Wednesday, October 28, 2020 8:05 AM
> To: dev@arrow.apache.org
> Subject: Re: [JS] BigNum toJSON returns a string with quotations in it
> 
> 
> Yes, the quotes should be removed.
> 
> I'd recommend using the binary IPC format to send data to the browser though.
> 
> Deserializing to JSON is expensive, loses the benefits of features like 
> dictionary encoding, and reconstructing types on the receiving end is 
> error-prone (as illustrated here).
> 
> Paul
> 
>> On Oct 28, 2020, at 9:31 AM, Sam Peka  wrote:
>> 
>> Hi there,
>> 
>> The toJSON method of BigNum is currently returning a string with quotation 
>> marks around it: 
>> https://github.com/apache/arrow/blob/fbb781bcdd15d54c52c71881c5c53d0c68069be6/js/src/util/bn.ts#L39
>> 
>> JSON.stringify will include those quotation marks in the output JSON. Here’s 
>> an example illustrating the issue:
>> 
>> const a = BN.new([12])
>> console.log(JSON.stringify({ a }))
>>> {"a":"\"12\""}
>> 
>> Our use case is to take some arrow data and return it to a web client as json 
>> – but because the format is bad we’re having to work around this by 
>> converting all BigNums to strings before serializing the response.
>> 
>> Is BigNum.toJSON intended to work with JSON.stringify? Happy to open a PR if 
>> so.
>> 
>> Best,
>> Sam
> 


Re: [JS] BigNum toJSON returns a string with quotations in it

2020-10-28 Thread Paul Taylor
Yes, the quotes should be removed.

I'd recommend using the binary IPC format to send data to the browser though.

Deserializing to JSON is expensive, loses the benefits of features like 
dictionary encoding, and reconstructing types on the receiving end is 
error-prone (as illustrated here).

Paul

> On Oct 28, 2020, at 9:31 AM, Sam Peka  wrote:
> 
> Hi there,
> 
> The toJSON method of BigNum is currently returning a string with quotation 
> marks around it: 
> https://github.com/apache/arrow/blob/fbb781bcdd15d54c52c71881c5c53d0c68069be6/js/src/util/bn.ts#L39
> 
> JSON.stringify will include those quotation marks in the output JSON. Here’s 
> an example illustrating the issue:
> 
> const a = BN.new([12])
> console.log(JSON.stringify({ a }))
>> {"a":"\"12\""}
> 
> Our use case is to take some arrow data and return it to a web client as json 
> – but because the format is bad we’re having to work around this by 
> converting all BigNums to strings before serializing the response.
> 
> Is BigNum.toJSON intended to work with JSON.stringify? Happy to open a PR if 
> so.
> 
> Best,
> Sam


Re: MEAN Stack Use-case understanding

2020-09-29 Thread Paul Taylor

Hi Thomas,

You can read CSVs in the browser using the browser's File input APis and 
an appropriate CSV library. The CSV library should be able to parse rows 
into JS objects, which can then be passed to the Arrow Struct Builder 
for serialization.


In this example[1] I'm parsing the first row of the CSV to determine the 
schema, constructing an Arrow Builder transform function to parse each 
row into a Struct column, then piping all the rows through the transform 
stream and constructing Arrow RecordBatches.


The Builders propagate options like the null value representations down 
to the child builders, or they can be configured separately by 
specifying their own options.


A limitation of the Builders is that the schema must be known up-front. 
If the schema needs to change mid-stream, either the current stream 
should be terminated and a new one created, or the already-written data 
should be re-run through a Builder with the new schema.


Best,

Paul

1. https://github.com/trxcllnt/csv-to-arrow-js/blob/master/index.js

On 9/28/20 3:19 AM, thomasroshin wrote:

Hello ,

  I am working on a proof-of-concept for which I am having a bit of
trouble understanding apache-arrow with JS and wanted to clarify a few
things in this regard.

My use case-
I have a MEAN (MongoDB/Express/Angular/NodeJS) app that connects to
customer databases and third-party data and performs analytics and
experimentation. In this regard I am looking at Apache arrow from an
interoperability angle and a performant analytics angle.

Right now I am working on the analytics side - from the JS front end I need
to be able to read parquet and big-data CSV files. In this regard please
clarify my understanding:

1. I cannot read parquet files using arrow libraries directly (due to this
issue). I have to use something like parquetjs-lite for this.
2. To read big-data CSV into apache-arrow, I have to first use Python
(pyarrow) to convert CSV to arrow format (as in
using-apache-arrow-js-with-large-datasets) and then read the arrow file
in my JS application.
   a). If (2) above is correct then can I convert any third-party CSV
to arrow or should I have a predefined schema ahead of time?
   b). Are nulls and NaNs allowed in the CSV?

If the above understandings are right it seems rather a roundabout way (or
is it just me). Are there any other paths you can suggest?

regards,
Thomas



Re: JavaScript/TypeScript lib - officially supported?

2020-09-14 Thread Paul Taylor

Hi Tim,

I've started working on it in a branch on my fork[1] in the evenings, 
but it's slow going as I've been busy with work/moving.


I'll try to get it finished up and PR'd this week, there's not much left 
to do.


Thanks,

Paul

1. https://github.com/trxcllnt/arrow/tree/typescript-3.9

On 9/14/20 11:00 AM, Tim Conkling wrote:

Hello Arrow devs & community,

I'm new to the Arrow project, and am using arrow-js to work with DataFrame
data that's shipped from a Python app to the browser.

I'm using TypeScript for frontend development, and it seems that the
TypeScript type definitions for arrow-js are broken for versions of
TypeScript beyond 3.5 (https://issues.apache.org/jira/browse/ARROW-8394). A
dev in that thread stated that the issue isn't being worked on.

I'm curious if the JavaScript library (and TypeScript support) is
"officially" maintained by the Arrow team, or if this is more of a
community project?

Thanks!
Tim



Re: Upcoming JS fixes and release timeline

2020-07-10 Thread Paul Taylor

Hey Micah,

npm allows you to set the version to anything you wish, but semantic 
versioning[1] is the convention. A few large-ish packages don't follow 
this (closure-compiler uses a timestamp as its version), but the tooling 
strongly nudges package owners and consumers towards semver.


1.0.0 releases are a significant milestone to npm. Before 1.0.0, npm 
enforces strict versioning -- users who depend on `^0.8.0` will receive 
0.8.0, even if there's a `0.9.0` or `0.9.1` available. After 1.0.0, 
users who depend on `^1.0.0` will receive any newer minor and patch 
releases (but not major version bumps). In this sense, Arrow doing major 
version bumps is fine, if a bit foreign to most node devs.
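
A quick illustration of that caret-range behavior, using the same `semver`
package npm itself resolves ranges with:

import semver from 'semver';

semver.satisfies('0.9.1', '^0.8.0'); // false -- pre-1.0, ^ pins the minor
semver.satisfies('1.9.1', '^1.0.0'); // true  -- post-1.0, ^ allows minor/patch
semver.satisfies('2.0.0', '^1.0.0'); // false -- major bumps still excluded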


I worry more about releasing a 1.0.0 with type definitions that require 
library consumers use a TypeScript compiler that's a year old (aka a 
decade in JavaScript-years ;]). Bumping to 1.0.0 communicates a level of 
maturity the JS project still needs to achieve IMO.


I'm personally ambivalent and will defer to the Arrow community on 
versioning, but these are the general expectations of larger packages in 
the node community as I understand them.


Best,

Paul

1. https://docs.npmjs.com/about-semantic-versioning


On 7/9/20 10:34 PM, Micah Kornfield wrote:

Hi Paul,
I'm not sure if this was ever resolved, but I think the plan going 
forward is to start bumping major versions on each release.  Would NPM 
allow such changes in that case?


Cheers,
Micah

On Wed, Jul 1, 2020 at 9:23 AM Paul Taylor <ptaylor.apa...@gmail.com> wrote:


The TypeScript compiler has made breaking changes in recent releases,
meaning we can't easily upgrade past 3.5 and projects on 3.6+ can't
compile our types.

I'm working on upgrading our tsc dependency to 3.9. The fixes could
include a few backwards-incompatible API changes, and might not be done
in time for the general Arrow 1.0 release.

JS shouldn't block the 1.0 release, so can we exclude JS from 1.0 if the
fixes aren't ready by then? npm's semantic versioning allows breaking
changes in any version before 1.0, but not between minor versions after
1.0. I've heard directly from some of our JS users who'd prefer if we
made these changes before bumping to 1.0 on npm.

Thanks,

Paul



Upcoming JS fixes and release timeline

2020-07-01 Thread Paul Taylor
The TypeScript compiler has made breaking changes in recent releases, 
meaning we can't easily upgrade past 3.5 and projects on 3.6+ can't 
compile our types.


I'm working on upgrading our tsc dependency to 3.9. The fixes could 
include a few backwards-incompatible API changes, and might not be done 
in time for the general Arrow 1.0 release.


JS shouldn't block the 1.0 release, so can we exclude JS from 1.0 if the 
fixes aren't ready by then? npm's semantic versioning allows breaking 
changes in any version before 1.0, but not between minor versions after 
1.0. I've heard directly from some of our JS users who'd prefer if we 
made these changes before bumping to 1.0 on npm.


Thanks,

Paul



Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Paul Taylor

Responding to this comment from GitHub[1]:

If we had to make a bet about what % of dictionaries empirically are 
between 128 and 255 elements, I would bet that the percentage is 
small. If it turned out that 40% of dictionaries fell in that range 
then I would agree that this makes sense.


I agree, the case where you'd use uint8 vs. int16 isn't very motivating, 
since those are probably small tables or temporary allocations inside 
larger operations.


The case where you'd use uint16 vs. int32 is more motivating, since 
dictionaries between 32767 and 65535 elements could easily back larger 
columns and saving up to 1GiB on the keys can quickly add up.
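
Back-of-envelope for that claim (the row count is an assumption):

// ~540M dictionary-encoded values, int32 vs. uint16 keys
const rows = 540e6;
const savedBytes = rows * (4 - 2); // 2 bytes saved per index
console.log(savedBytes / 2 ** 30); // ~1.0 GiB saved on the keys alone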



I would recommend that the specification advise
against use of 64-bit indices at all unless they are actually needed
to represent the data (i.e. dictionaries have more than INT32_MAX /
UINT32_MAX elements


Agreed.


but doesn't strike me as being a common occurrence


This is somewhat common in certain graph operations on large datasets, 
but I concede I may not be a representative sample of Arrow users :)


Paul

1. https://github.com/rapidsai/cudf/pull/5501#issuecomment-649936352

On 6/26/20 1:53 PM, Wes McKinney wrote:

I think that situations where you need uint64 indices are likely to be
exceedingly esoteric. I would recommend that the specification advise
against use of 64-bit indices at all unless they are actually needed
to represent the data (i.e. dictionaries have more than INT32_MAX /
UINT32_MAX elements, but doesn't strike me as being a common
occurrence)

IIUC, the only change that would be necessary would be a revision to
Columnar.rst and perhaps some language augmentation in Schema.fbs, and
we might want to add some integration tests for unsigned indices to
probe whether each language supports them. The changes needed to
support unsigned integers in C++ are probably not that invasive but I
haven't taken a close look at it yet

On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor  wrote:

If positive integers are expected, I'm in favor of supporting unsigned
index types. I was surprised by the Arrow C++ restriction on signed indices
in the RAPIDS thread; perhaps it's newer than when I ported the logic to JS.

Based on the flatbuffer schemas, dictionary indices could technically be
any Arrow type, which I assumed was to allow for more
complex/exotic/custom indexing schemes in the future. JS will allow you
to specify _any_ Arrow type as the dictionary codes, though using a
non-numeric type without a custom enumerator is UB.

I'm also curious about how the restriction on dictionary index types
interacts with streaming delta dictionaries. In theory, you could have a
streaming data source produce enough delta dictionaries such that the
total dictionary size grows beyond 2^31-1 elements.

I think that's a valid use-case of delta dictionaries assuming Arrow
aggregates the dictionaries into multiple RecordBatches (or a
ChunkedArray), which is what JS does. But if that were allowed, we would
have to allow 64-bit (signed or unsigned) dictionary index types.

Paul


On 6/26/20 5:58 AM, Wes McKinney wrote:

hi folks,

At the moment, using unsigned integers for dictionary indices/codes
isn't exactly forbidden by the metadata [1], which says that the
indices must be "positive integers". Meanwhile, the columnar format
specification says

"When a field is dictionary encoded, the values are represented by an
array of signed integers representing the index of the value in the
dictionary. The memory layout for a dictionary-encoded array is the
same as that of a primitive signed integer layout."

I was looking at a discussion in RAPIDS about this topic [2]

When we drafted the columnar specification for this, the intention as
I recall was only to support signed integer indices. The rationale
basically is that:

* Better cross platform / language support for signed (e.g. certainly
in the JVM)
* Supporting 4 instead of 8 index types is less burdensome for the
developer and means less code to generate to support them
* Unsigned wraparound bugs could pass silently

I think it would be feasible to support the unsigned indices with the
following caveats:

* Signed integers are recommended as the "compatible" and preferred choice
* Most algorithms in the reference libraries should choose signed over
unsigned when generating indices
* Libraries may choose to promote unsigned to signed (e.g. in Java) if
they don't support unsigned well

I can't say I'm thrilled about having to maintain extra code for the
unsigned case, but it also seems like it would not do great harm
overall. Additionally, if you are certain that the indices are all
non-negative, then you can use the same code to process both intX and
uintX -- we use this trick in the Take implementation in C++ to
generate half as much binary code.

Thoughts?

Thanks
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
[2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509


Re: [Format][C++] Offering limited support for unsigned dictionary indices

2020-06-26 Thread Paul Taylor
If positive integers are expected, I'm in favor of supporting unsigned 
index types. I was surprised by the Arrow C++ restriction on signed indices 
in the RAPIDS thread; perhaps it's newer than when I ported the logic to JS.


Based on the flatbuffer schemas, dictionary indices could technically be 
any Arrow type, which I assumed was to allow for more 
complex/exotic/custom indexing schemes in the future. JS will allow you 
to specify _any_ Arrow type as the dictionary codes, though using a 
non-numeric type without a custom enumerator is UB.


I'm also curious about how the restriction on dictionary index types 
interacts with streaming delta dictionaries. In theory, you could have a 
streaming data source produce enough delta dictionaries such that the 
total dictionary size grows beyond 2^31-1 elements.


I think that's a valid use-case of delta dictionaries assuming Arrow 
aggregates the dictionaries into multiple RecordBatches (or a 
ChunkedArray), which is what JS does. But if that were allowed, we would 
have to allow 64-bit (signed or unsigned) dictionary index types.


Paul


On 6/26/20 5:58 AM, Wes McKinney wrote:

hi folks,

At the moment, using unsigned integers for dictionary indices/codes
isn't exactly forbidden by the metadata [1], which says that the
indices must be "positive integers". Meanwhile, the columnar format
specification says

"When a field is dictionary encoded, the values are represented by an
array of signed integers representing the index of the value in the
dictionary. The memory layout for a dictionary-encoded array is the
same as that of a primitive signed integer layout."

I was looking at a discussion in RAPIDS about this topic [2]

When we drafted the columnar specification for this, the intention as
I recall was only to support signed integer indices. The rationale
basically is that:

* Better cross platform / language support for signed (e.g. certainly
in the JVM)
* Supporting 4 instead of 8 index types is less burdensome for the
developer and means less code to generate to support them
* Unsigned wraparound bugs could pass silently

I think it would be feasible to support the unsigned indices with the
following caveats:

* Signed integers are recommended as the "compatible" and preferred choice
* Most algorithms in the reference libraries should choose signed over
unsigned when generating indices
* Libraries may choose to promote unsigned to signed (e.g. in Java) if
they don't support unsigned well

I can't say I'm thrilled about having to maintain extra code for the
unsigned case, but it also seems like it would not do great harm
overall. Additionally, if you are certain that the indices are all
non-negative, then you can use the same code to process both intX and
uintX -- we use this trick in the Take implementation in C++ to
generate half as much binary code.

Thoughts?

Thanks
Wes

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
[2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509


Re: [JavaScript] how to set column name after creation?

2020-06-26 Thread Paul Taylor
You can also use the `Field.prototype.clone()` method[1] like this to 
further reduce the boilerplate:



import { Column } from 'apache-arrow';

function renameColumn(col, new_name) {
  return Column.new(col.field.clone(new_name), col.chunks);
}
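
Usage would look something like this (a sketch, assuming the
`Table.new(...columns)` factory and a `table` already in scope):

import { Table } from 'apache-arrow';

const cols = table.schema.fields.map((_, i) => table.getColumnAt(i));
cols[0] = renameColumn(cols[0], 'renamed'); // rename the first column
const newTable = Table.new(...cols);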


1. https://github.com/apache/arrow/blob/master/js/src/schema.ts#L139-L146

On 6/26/20 7:54 AM, Brian Hulette wrote:

Hi Ryan,
Here or user@arrow.apache.org is a fine place to ask :)

The metadata on Table/Column/Field objects are all immutable, so doing this
right now would require creating a new instance of Table with the field
renamed, which takes quite a lot of boilerplate. A helper for renaming a
column (or even better a generalization of select [1] that lets you do a
full projection, including column renames) would be a great contribution.

Here's an example of creating a renamed column, which should get you most
of the way to creating a Table with a renamed column:
https://observablehq.com/@theneuralbit/renaming-an-arrow-column

Brian

[1]
https://github.com/apache/arrow/blob/ff7ee06020949daf66ac05090753e1a17736d9fa/js/src/table.ts#L249

On Thu, Jun 25, 2020 at 4:04 PM Ryan McKinley  wrote:


Apologies if this is the wrong list or place to ask...

What is the best way to update a column name for a Table in javascript?

const col = table.getColumnAt(i);
col.name = 'new name!'

Currently: Cannot assign to 'name' because it is a read-only property

Thanks!

ryan



Re: [DISCUSS] Need for Arrow 0.17.1 patch release (binary only?)

2020-05-05 Thread Paul Taylor
Would it be possible to include the variant.hpp update for nvcc in 0.17.1?


Thanks,

Paul

On 5/4/20 4:17 PM, Wes McKinney wrote:

hi folks,

We have accumulated a few regressions

ARROW-8657 https://github.com/apache/arrow/pull/7089
ARROW-8694 https://github.com/apache/arrow/pull/7103

there may be a few others.

I think we should try to make a "streamlined" patch release (after
surveying incoming bug reports for other serious regressions) if
possible focused on providing patched binaries to the impacted users
(in the above, this would be any user of the Parquet portion of the
C++ library). The hope would be to be able to trim down the work
required of a release manager in a normal major release in these
scenarios where we need to get out bugfixes sooner.

Thoughts?

Thanks
Wes


[jira] [Created] (ARROW-6886) [C++] arrow::io header nvcc compiler warnings

2019-10-14 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-6886:
--

 Summary: [C++] arrow::io header nvcc compiler warnings
 Key: ARROW-6886
 URL: https://issues.apache.org/jira/browse/ARROW-6886
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.15.0
Reporter: Paul Taylor


Seeing the following compiler warnings statically linking the arrow::io headers 
with nvcc:

{noformat}
arrow/install/include/arrow/io/file.h(189): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MemoryMappedFile"

arrow/install/include/arrow/io/memory.h(98): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MockOutputStream"

arrow/install/include/arrow/io/memory.h(116): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::FixedSizeBufferWriter"

arrow/install/include/arrow/io/file.h(189): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MemoryMappedFile"

arrow/install/include/arrow/io/memory.h(98): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MockOutputStream"

arrow/install/include/arrow/io/memory.h(116): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::FixedSizeBufferWriter"

arrow/install/include/arrow/io/file.h(189): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MemoryMappedFile"

arrow/install/include/arrow/io/memory.h(98): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::MockOutputStream"

arrow/install/include/arrow/io/memory.h(116): warning: overloaded virtual 
function "arrow::io::Writable::Write" is only partially overridden in class 
"arrow::io::FixedSizeBufferWriter"
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-08-23 Thread Paul Taylor
I'll do the JS updates. Is it safe to validate against the Arrow C++ 
integration tests?



On 8/22/19 7:28 PM, Micah Kornfield wrote:

I created https://issues.apache.org/jira/browse/ARROW-6313 as a tracking
issue with sub-issues on the development work.  So far no-one has claimed
Java and Javascript tasks.

Would it make sense to have a separate dev branch for this work?

Thanks,
Micah

On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney  wrote:


The vote carries with 4 binding +1 votes and 1 non-binding +1

I'll merge the specification patch later today and we can begin
working on implementations so we can get this done for 0.15.0

On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler  wrote:

+1 (non-binding)

On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou wrote:

Sorry, had forgotten to send my vote on this.

+1 from me.

Regards

Antoine.


On Wed, 14 Aug 2019 17:42:33 -0500
Wes McKinney  wrote:

hi all,

As we've been discussing [1], there is a need to introduce 4 bytes of
padding into the preamble of the "encapsulated IPC message" format to
ensure that the Flatbuffers metadata payload begins on an 8-byte
aligned memory offset. The alternative to this would be for Arrow
implementations where alignment is important (e.g. C or C++) to copy
the metadata (which is not always small) into memory when it is
unaligned.

Micah has proposed to address this by adding a
4-byte "continuation" value at the beginning of the payload
having the value 0xFFFFFFFF. The reason to do it this way is that
old clients will see an invalid length (what is currently the
first 4 bytes of the message -- a 32-bit little endian signed
integer indicating the metadata length) rather than potentially
crashing on a valid length. We also propose to expand the "end of
stream" marker used in the stream and file format from 4 to 8
bytes. This has the additional effect of aligning the file footer
defined in File.fbs.

This would be a backwards incompatible protocol change, so older Arrow
libraries would not be able to read these new messages. Maintaining
forward compatibility (reading data produced by older libraries) would
be possible as we can reason that a value other than the continuation
value was produced by an older library (and then validate the
Flatbuffer message of course). Arrow implementations could offer a
backward compatibility mode for the sake of old readers if they desire
(this may also assist with testing).

Additionally with this vote, we want to formally approve the change to
the Arrow "file" format to always write the (new 8-byte) end-of-stream
marker, which enables code that processes Arrow streams to safely read
the file's internal messages as though they were a normal stream.

The PR making these changes to the IPC documentation is here

https://github.com/apache/arrow/pull/4951

Please vote to accept these changes. This vote will be open for at
least 72 hours

[ ] +1 Adopt these Arrow protocol changes
[ ] +0
[ ] -1 I disagree because...

Here is my vote: +1

Thanks,
Wes

[1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E






[jira] [Created] (ARROW-6053) [Python] RecordBatchStreamReader::Open2 cdef type signature doesn't match C++

2019-07-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-6053:
--

 Summary: [Python] RecordBatchStreamReader::Open2 cdef type 
signature doesn't match C++
 Key: ARROW-6053
 URL: https://issues.apache.org/jira/browse/ARROW-6053
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 0.14.1
Reporter: Paul Taylor
Assignee: Paul Taylor


The Cython method signature for RecordBatchStreamReader::Open2 doesn't match 
the C++ type signature and causes a compiler type error trying to call Open2 
from Cython.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-24 Thread Paul Taylor

I'm not sure I understand this suggestion:
1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
(and provide meaningless bytes at the beginning).
2.  The current proposal on the other thread is to have the pattern be
<0xFFFFFFFF>


Sorry I didn't mean to say an int64_t length, just that now we'd be 
reserving 8 bytes in the "metadata length" position where today we 
reserve 4.


I'm not sure about every language, but at least in Python/JS an external 
forwards-compatible solution would involve slicing the message buffer up 
front like this:


import struct
import pyarrow as pa

def adjust_message_buffer(message_bytes):
  buf = pa.py_buffer(message_bytes)
  # new-style messages lead with the 0xFFFFFFFF continuation marker
  if struct.unpack('<I', message_bytes[:4])[0] == 0xFFFFFFFF:
    return buf.slice(4)
  return buf



On 7/23/19 7:31 PM, Micah Kornfield wrote:

Could we detect the 4-byte length, incur a penalty copying the memory to
an aligned buffer, then continue consuming the stream?

I think that is the plan (or at least would be my plan) if we go ahead with
the change




(It's probably
fine if we only write the 8-byte length, since consumers on older
versions of Arrow could slice from the 4th byte before passing a buffer
to the reader).

I'm not sure I understand this suggestion:
1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
(and provide meaningless bytes at the beginning).
2.  The current proposal on the other thread is to have the pattern be
<0xFFFFFFFF>

Thanks,
Micah

On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor 
wrote:


+1 for a 0.15.0 before 1.0 if we go ahead with this.

I'm curious to hear others' thoughts about compatibility. I think we
should avoid breaking backwards compatibility if possible. It's common
for apps/libs to be pinned on specific Arrow versions, and I worry it'd
cause a lot of work for downstream devs to audit their tool suite for
full Arrow binary compatibility (and/or require their customers to do
the same).

Could we detect the 4-byte length, incur a penalty copying the memory to
an aligned buffer, then continue consuming the stream? (It's probably
fine if we only write the 8-byte length, since consumers on older
versions of Arrow could slice from the 4th byte before passing a buffer
to the reader).

I've always understood the metadata to be a few dozen/hundred KB, a
small percentage of the total message size. I could be underestimating
the ratios though -- is it common to have tables w/ 1000+ columns? I've
seen a few reports like that in cuDF, but I'm curious to hear
Jacques'/Dremio's experience too.

If copying is feasible, it doesn't seem so bad a trade-off to maintain
backwards-compatibility. As libraries and consumers upgrade their Arrow
dependencies, the 4-byte length will be less and less common, and
they'll be less likely to pay the cost.



On 7/23/19 2:22 AM, Uwe L. Korn wrote:

It is also a good way to test the change in public. We don't want to
adjust something like this anymore in a 1.0.0 release. Already doing this
in 0.15.0 and then maybe doing adjustments due to issues that appear "in
the wild" is psychologically the easier way. There is a lot of thinking of
users bound with the magic 1.0, thus I would plan to minimize what is
changed between 1.0 and pre-1.0. This also should save us maintainers some
time as I would expect different behaviour in bug reports between 1.0 and
pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:

I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney  wrote:


I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
wrote:

Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:


https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right?
Once that is fine and settled we can move to the 1.0.0 release?

Regards

Antoine.







Building on Arrow CUDA

2019-07-24 Thread Paul Taylor
I'm looking at options to replace the custom Arrow logic in cuDF with 
Arrow library calls. What's the recommended way to declare a dependency 
on pyarrow/arrowcpp with CUDA support?


I see in the docs it says to build from source, but that's only an 
option for an (advanced) end-user. And building/vendoring 
libarrow_cuda.so isn't a great option for a non-Arrow library, because 
someone who does a source build of Arrow-with-CUDA will conflict with 
the version we ship.


Right now we're considering statically linking libarrow_cuda into 
libcudf.so and vendoring Arrow's cuda cython alongside ours, but this 
increases compile times/library size.


Is there a package management solution (like pip/conda install 
pyarrow[cuda]) that I'm missing? If not, should there be?


Best,

Paul



Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Paul Taylor

+1 for a 0.15.0 before 1.0 if we go ahead with this.

I'm curious to hear others' thoughts about compatibility. I think we 
should avoid breaking backwards compatibility if possible. It's common 
for apps/libs to be pinned on specific Arrow versions, and I worry it'd 
cause a lot of work for downstream devs to audit their tool suite for 
full Arrow binary compatibility (and/or require their customers to do 
the same).


Could we detect the 4-byte length, incur a penalty copying the memory to 
an aligned buffer, then continue consuming the stream? (It's probably 
fine if we only write the 8-byte length, since consumers on older  
versions of Arrow could slice from the 4th byte before passing a buffer 
to the reader).


I've always understood the metadata to be a few dozen/hundred KB, a 
small percentage of the total message size. I could be underestimating 
the ratios though -- is it common to have tables w/ 1000+ columns? I've 
seen a few reports like that in cuDF, but I'm curious to hear 
Jacques'/Dremio's experience too.


If copying is feasible, it doesn't seem so bad a trade-off to maintain 
backwards-compatibility. As libraries and consumers upgrade their Arrow 
dependencies, the 4-byte length will be less and less common, and 
they'll be less likely to pay the cost.




On 7/23/19 2:22 AM, Uwe L. Korn wrote:

It is also a good way to test the change in public. We don't want to adjust something 
like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing 
adjustments due to issues that appear "in the wild" is psychologically the 
easier way. There is a lot of thinking of users bound with the magic 1.0, thus I would 
plan to minimize what is changed between 1.0 and pre-1.0. This also should save us 
maintainers some time as I would expect different behaviour in bug reports between 1.0 
and pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:

I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(its still not clear to me if there is consensus to fix the issue).

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney  wrote:


I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
wrote:


Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:


https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 in order to get those format fixes right?
Once that is fine and settled we can move to the 1.0.0 release?

Regards

Antoine.





Re: Error building cuDF on new Arrow with std::variant backport

2019-07-19 Thread Paul Taylor

Hi Micah,

We were able to build Arrow standalone with both C++11 and C++14, but cuDF 
needs C++14.


I found this line[1] in one of our cuda files after sending and realized 
we may have a collision/polluted namespace. Does that sound like a 
possibility?


Thanks,
Paul

1. 
https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30


On 7/19/19 8:41 PM, Micah Kornfield wrote:

Hi Paul,
This actually looks like it might be a problem with arrow-4800.   Did 
the build of arrow use c++14 or c++11?


Thanks,
Micah

On Friday, July 19, 2019, Paul Taylor <mailto:ptaylor.apa...@gmail.com>> wrote:


We're updating cuDF to Arrow 0.14 but encountering errors building
that look related to PR #4259
<https://github.com/apache/arrow/pull/4259
<https://github.com/apache/arrow/pull/4259>>. We can build Arrow
itself, but we can't build cuDF when we include Arrow headers.
Using C++ 14 and have tried gcc/g++ 5, 7, and clang.

Has anyone seen these before or know of a fix?

Thanks,

Paul

/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
warning: attribute does not apply to any entity

/cudf/cpp/build/arrow/install/include/arrow/result.h: In
member function 'void
arrow::Result<T>::AssignVariant(mpark::variant<T, arrow::Status, const char*>&&)':
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:24:
error: expected primary-expression before ',' token
 variant_.~variant<T, Status, const char*>();
                    ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:32:
error: expected primary-expression before ',' token
 variant_.~variant<T, Status, const char*>();
                            ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:34:
error: expected primary-expression before 'const'
 variant_.~variant<T, Status, const char*>();
                              ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:34:
error: expected ')' before 'const'
/cudf/cpp/build/arrow/install/include/arrow/result.h: In
member function 'void arrow::Result<T>::AssignVariant(const
mpark::variant<T, arrow::Status, const char*>&)':
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:24:
error: expected primary-expression before ',' token
 variant_.~variant<T, Status, const char*>();
                    ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:32:
error: expected primary-expression before ',' token
 variant_.~variant<T, Status, const char*>();
                            ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:34:
error: expected primary-expression before 'const'
 variant_.~variant<T, Status, const char*>();
                              ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:34:
error: expected ')' before 'const'






Error building cuDF on new Arrow with std::variant backport

2019-07-19 Thread Paul Taylor
We're updating cuDF to Arrow 0.14 but encountering errors building that 
look related to PR #4259 . We 
can build Arrow itself, but we can't build cuDF when we include Arrow 
headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.


Has anyone seen these before or know of a fix?

Thanks,

Paul

/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195): 
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196): 
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195): 
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196): 
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195): 
warning: attribute does not apply to any entity
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196): 
warning: attribute does not apply to any entity


/cudf/cpp/build/arrow/install/include/arrow/result.h: In member 
function 'void arrow::Result<T>::AssignVariant(mpark::variant<T, 
arrow::Status, const char*>&&)':
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: error: 
expected primary-expression before ',' token

 variant_.~variant<T, Status, const char*>();
                    ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: error: 
expected primary-expression before ',' token

 variant_.~variant<T, Status, const char*>();
                            ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error: 
expected primary-expression before 'const'

 variant_.~variant<T, Status, const char*>();
                              ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error: 
expected ')' before 'const'
/cudf/cpp/build/arrow/install/include/arrow/result.h: In member 
function 'void arrow::Result<T>::AssignVariant(const mpark::variant<T, 
arrow::Status, const char*>&)':
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:24: error: 
expected primary-expression before ',' token

 variant_.~variant<T, Status, const char*>();
                    ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:32: error: 
expected primary-expression before ',' token

 variant_.~variant<T, Status, const char*>();
                            ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error: 
expected primary-expression before 'const'

 variant_.~variant<T, Status, const char*>();
                              ^
/cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error: 
expected ')' before 'const'




Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-06 Thread Paul Taylor

Hi Micah,

Similar to Jacques I'm not disagreeing, but wondering if they belong in 
Arrow vs. can be done externally. I'm mostly interested in changes that 
might impact SIMD processing, considering Arrow's already made conscious 
design decisions to trade memory for speed. Apologies in advance if I've 
misunderstood any of the proposals.



a. Add a run-length encoding scheme to efficiently represent repeated
values (the actual scheme encodes run ends instead of length to preserve
sub-linear random access).
Couldn't one do RLE at the buffer level via a custom 
FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
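(For what it's worth, run-ends keep random access sub-linear because a lookup 
is just a binary search over the run ends. A tiny TypeScript sketch of the 
idea, not any existing Arrow API:)

    // values[i] covers the logical indices [runEnds[i - 1] (or 0), runEnds[i])
    function runEndValue(runEnds: Int32Array, values: Float64Array, index: number): number {
      let lo = 0, hi = runEnds.length - 1;
      while (lo < hi) {
        const mid = (lo + hi) >>> 1; // find the first run whose end exceeds `index`
        if (runEnds[mid] <= index) lo = mid + 1;
        else hi = mid;
      }
      return values[lo]; // O(log runs) rather than O(rows)
    }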



b. Add a “packed” sparse representation (null values don’t take up
space in value buffers)
This would be fine for simple SIMD aggregations like count/avg/mean, but 
compacting null slots complicates more advanced parallel routines that 
execute independently and rely on indices aligning with an element's 
logical position.


It sounds like the logical position here depends on knowing the number 
of nulls up to that point (via something like sequentially iterating 
both data and validity buffers). An efficient parallel routine would 
likely need to scan beforehand to inflate the packed representation, 
where today it can simply slice/mmap the data buffer directly.
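To make that concrete, here's the logical-to-physical mapping a packed 
representation forces on consumers (plain TypeScript, purely illustrative):

    // With packed values, the physical slot of logical row `i` is the count
    // of valid bits before `i`, so positional access degrades to a scan
    // unless a prefix sum over the validity bitmap is computed up front.
    function physicalIndex(validity: Uint8Array, i: number): number {
      let count = 0;
      for (let bit = 0; bit < i; ++bit) {
        if (validity[bit >> 3] & (1 << (bit & 7))) count++;
      }
      return count;
    }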



a. Add frame of reference integer encoding [7] (this allows for lower
bit-width encoding of integer types by subtracting a
“reference” value from all values in the buffer).
I agree this is useful, but couldn't it also live in userland/an 
ExtensionType?
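(Frame-of-reference in miniature, as a hedged TypeScript sketch rather than 
anything from the proposal:)

    // Store the minimum once, then only the narrower deltas.
    function forEncode(values: Int32Array) {
      let reference = values[0]; // assumes a non-empty input
      for (const v of values) reference = Math.min(reference, v);
      // assumes the deltas fit in 8 bits; a real encoder would pick the width
      const deltas = Uint8Array.from(values, (v) => v - reference);
      return { reference, deltas };
    }

    function forDecode(reference: number, deltas: Uint8Array, i: number): number {
      return reference + deltas[i];
    }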



b. Add a sparse integer set encoding.  This encoding allows more
efficient encoding of validity bit-masks for cases when all values are
either null or not null.
If this is in reference to the discussion at link #4 [1], it sounds 
similar to the BufferLayout metadata that used to exist but was removed 
a while back [2]. Knowing the buffer layouts allows an implementation to 
generically elide any buffer at will, but would probably be a lot to 
bring back in. I can't say whether adding a different set of metadata 
would raise the same concerns issues Jacques mentioned in the JIRA 
thread in [2].



Data compression.  Similar to encodings but compression is solely for
reduction of data at rest/on the wire.  The proposal is to allow
compression of individual buffers. Right now zstd is proposed, but I don’t
feel strongly on the specific technologies here.
What's the goal for this? Random element access into compressed 
in-memory columns, or compression at I/O boundaries?


* If the former, is Parquet a better alternative here? Again, I'm 
cautious about the impact to parallel routines. CPU speeds are 
plateauing while memory and tx/rx keep growing. Compressed element 
access seems to be on the CPU side of that equation (meanwhile parallel 
deflate already exists, and I remember seeing research into parallel 
inflate).


* If the later, could we do a comparison of Arrow dictionary-encoding + 
different compression formats, vs. building them into the spec? I know 
content-aware compression yields significant size reductions, but I 
wonder if the maintenance burden on Arrow contributors is worth the cost 
vs. a simpler dictionary-encoding + streaming gzip.



Data Integrity.  While the arrow file format isn’t meant for archiving
data, I think it is important to allow for optional native data integrity
checks in the format.  To this end, I proposed a new “Digest” message type
that can be added after other messages to record a digest/hash of the
preceding data. I suggested xxhash, but I don’t have a strong opinion here,
as long as there is some minimal support that can potentially be expanded
later.

:thumbs up:


Best,
Paul


1. 
https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E


2. 
https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902


On 7/5/19 11:53 AM, Micah Kornfield wrote:

Hi Arrow-dev,

I’d like to make a straw-man proposal to cover some features that I think
would be useful to Arrow, and that I would like to make a proof-of-concept
implementation for in Java and C++.  In particular, the proposal covers
allowing for smaller data sizes via compression and encoding [1][2][8],
data integrity [3] and avoiding unnecessary data transfer [4][5].

I’ve put together a PR  [6] that has proposed changes to the flatbuffer
metadata to support the new features.  The PR introduces:

-

A new “SparseRecordBatch” that can support one of multiple possible
encodings (both dense and sparse), compression and column elision.
-

A “Digest” message type to support optional data integrity.


Going into more details on the specific features in the PR:

1.

Sparse encodings for arrays and buffers.  The guiding principles behind
the suggested encodings are to support encodings that 

Re: JS - proper way to instantiate a RecordBatch with nulls

2019-06-21 Thread Paul Taylor

Hi Jenny,

In 0.13 you have to pre-allocate the null bitmap and data buffers 
ahead of time, then use `vec.set(idx, val)` to write each value in.


In 0.14 you can instead use the new Builders from PR #4476 
. These will create and 
resize the underlying buffers as you append values, and includes 
chunking and streaming APIs for convenience. You can use them by 
pointing npm to my github repo like I did here 
 
if you don't mind using a prerelease build until 0.14 is out.
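A rough usage sketch against the Builder API in that PR (treat the exact 
names and signatures as provisional until 0.14 is released):

    import { Builder, Bool } from 'apache-arrow';

    const builder = Builder.new({ type: new Bool(), nullValues: [null] });
    for (const value of [true, null, false]) {
      builder.append(value); // buffers are created/resized as values arrive
    }
    const vector = builder.finish().toVector(); // nulls stay null, not coerced to false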


Best,
Paul

On 6/20/19 7:03 PM, Jenny Kwan wrote:

Apologies if this has been discussed before. I'm attempting to create Arrow 
files in an Electron desktop app using JS. I can't find docs showing how to 
create a RecordBatch with a column of, say, nullable bools. Using Table.from 
and Table.new both keep coercing nulls to false. Any pointers?

Thanks,
Jenny





[jira] [Created] (ARROW-5537) [JS] Support delta dictionaries in RecordBatchWriter and DictionaryBuilder

2019-06-09 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-5537:
--

 Summary: [JS] Support delta dictionaries in RecordBatchWriter and 
DictionaryBuilder
 Key: ARROW-5537
 URL: https://issues.apache.org/jira/browse/ARROW-5537
 Project: Apache Arrow
  Issue Type: New Feature
Affects Versions: 0.13.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.14.0


The new JS DictionaryBuilder and RecordBatchWriter should support building 
and writing delta dictionary batches, to enable creating DictionaryVectors while 
streaming.





[jira] [Created] (ARROW-5396) [JS] Ensure reader and writer support files and streams with no RecordBatches

2019-05-22 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-5396:
--

 Summary: [JS] Ensure reader and writer support files and streams 
with no RecordBatches
 Key: ARROW-5396
 URL: https://issues.apache.org/jira/browse/ARROW-5396
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.13.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.14.0


Re: https://issues.apache.org/jira/browse/ARROW-2119 and 
[https://github.com/apache/arrow/pull/3871], the JS reader and writer should 
support files and streams with a Schema but no RecordBatches.





Re: [Discuss][Format] Zero size record batches

2019-05-21 Thread Paul Taylor

I'd be happy to PR a fix for JS today if someone can link me to Wes's PR.

On 5/21/19 11:02 AM, Wes McKinney wrote:

I agree also. As a practical use case, the results of a request made
with Arrow Flight might yield an empty result set. I'm not sure if
this needs to be formally noted in the specification documents but it
might not hurt.

If someone can fix the Java implementation we could enable the
integration test (minus JavaScript for now) in my PR

On Tue, May 21, 2019 at 12:47 AM Ravindra Pindikura  wrote:

On Tue, May 21, 2019 at 10:35 AM Micah Kornfield 
wrote:


Today, the format docs are ambiguous on whether zero sized batches are
supported.  Wes opened a PR [1] for empty record batches that shows C++
handles them but Java and javascript fail to handle them.


I'd like to propose:
1.  Make it explicit in the format docs, that 0 size record batches are
supported
2.  Update Java and javascript implementations to work with them (I can put
the Java work on my backlog, but would need a volunteer for JS).  And any
other implementations that don't currently handle them.

Thoughts?


Will need to add a test case for gandiva also - and fix if it shows up any
bugs. but, I agree we should support zero sized batches.




Thanks,
Micah



--
Thanks and regards,
Ravindra.


[jira] [Created] (ARROW-5115) [JS] Implement the Vector Builders

2019-04-03 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-5115:
--

 Summary: [JS] Implement the Vector Builders
 Key: ARROW-5115
 URL: https://issues.apache.org/jira/browse/ARROW-5115
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.13.0
Reporter: Paul Taylor
Assignee: Paul Taylor


We should implement the streaming Vector Builders in JS.





[jira] [Created] (ARROW-5100) [JS] Writer swaps byte order if buffers share the same underlying ArrayBuffer

2019-04-03 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-5100:
--

 Summary: [JS] Writer swaps byte order if buffers share the same 
underlying ArrayBuffer
 Key: ARROW-5100
 URL: https://issues.apache.org/jira/browse/ARROW-5100
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.13.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.14.0


We collapse contiguous Uint8Arrays that share the same underlying ArrayBuffer 
and have overlapping byte ranges. This was done to maintain true zero-copy 
behavior when using certain node core streams that use a buffer pool 
internally, and could write chunks of the same logical Arrow Message at 
out-of-order byte offsets in the pool.

Unfortunately this can also lead to a bug where, in rare cases, buffers are 
swapped while writing Arrow Messages too. We could have a flag to indicate 
whether we think collapsing out-of-order same-buffer chunks is safe, but I'm 
not sure if we can always know that, so I'd prefer to take it out and incur the 
copy cost.





Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC1

2019-03-24 Thread Paul Taylor
Yes, the solution here is to publish all the packages again, and ignore errors 
from ones that have already been published via:

npx lerna exec --no-bail -- npm publish

Best,
Paul

> On Mar 24, 2019, at 2:10 PM, Krisztián Szűcs  
> wrote:
> 
> Hi Kou,
> 
> Paul has already added me, and I was trying to publish the
> packages, but it fails with:
> 19 verbose stack Error: 403 Forbidden - PUT
> https://registry.npmjs.org/@apache-arrow%2fes2015-umd - You cannot publish
> over the previously published versions: 0.4.1.
> 
> It seems like the script has updated three of the packages:
> https://www.npmjs.com/settings/apache-arrow/packages
> 
> And now it fails to publish again.
> Any suggestions?
> 
> On Sun, Mar 24, 2019 at 10:06 PM Kouhei Sutou  wrote:
> 
>> Hi,
>> 
>> I've published 0.4.1:
>>  https://www.npmjs.com/package/apache-arrow/v/0.4.1
>> 
>> (It seems that "npx lerna exec -- npm publish" in
>> npm-release.sh doesn't work with 2FA enabled account. I
>> couldn't input one time password from the standard input. I
>> passed one time password by --otp option: npm publish --otp OTP)
>> 
>> Krisztian, could you tell me your user name at npmjs? I'll
>> add you to maintainers.
>> 
>> 
>> Thanks,
>> --
>> kou
>> 
>> In 
>>  "Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC1" on Sun, 24 Mar 2019
>> 19:27:25 +0100,
>>  Krisztián Szűcs  wrote:
>> 
>>> Hi All,
>>> 
>>> The vote carries with 3 binding +1 votes. Thanks to everyone for
>>> helping verify the release!
>>> 
>>> I've published the release to the Apache dist system [1], however
>>> I don't have rights to push the NPM package [2].
>>> Could someone either publish it to NPM or grant me rights?
>>> 
>>> Thanks, Krisztian
>>> 
>>> [1]: https://dist.apache.org/repos/dist/release/arrow/arrow-js-0.4.1/
>>> [2]: https://www.npmjs.com/package/apache-arrow
>>> 
>>> On Thu, Mar 21, 2019 at 10:19 PM Brian Hulette 
>> wrote:
>>> 
 +1 (non-binding)
 
 Ran `dev/release/js-verify-release-candidate.sh 0.4.1 1` with Node
>> v11.12.0
 
 
 On Thu, Mar 21, 2019 at 1:54 PM Krisztián Szűcs <
>> szucs.kriszt...@gmail.com
> 
 wrote:
 
> +1 (binding)
> 
> Ran `dev/release/js-verify-release-candidate.sh 0.4.1 1`
> with Node v11.12.0 on OSX 10.14.3 and it looks good.
> 
> On Thu, Mar 21, 2019 at 8:45 PM Krisztián Szűcs <
 szucs.kriszt...@gmail.com
>> 
> wrote:
> 
>> Hello all,
>> 
>> I would like to propose the following release candidate (rc1) of
>> Apache
>> Arrow JavaScript version 0.4.1. This is the second release
>> candidate,
>> including the fix for node version requirement [3].
>> 
>> The source release rc1 is hosted at [1].
>> 
>> This release candidate is based on commit
>> e9cf83c48b9740d42b5d18158e61c0962fda59c1
>> 
>> Please download, verify checksums and signatures, run the unit
>> tests,
 and
>> vote
>> on the release. The easiest way is to use the JavaScript-specific
 release
>> verification script dev/release/js-verify-release-candidate.sh.
>> 
>> [ ] +1 Release this as Apache Arrow JavaScript 0.4.1
>> [ ] +0
>> [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1
>> because...
>> 
>> 
>> How to validate a release signature:
>> https://httpd.apache.org/dev/verification.html
>> 
>> [1]:
>> 
 https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc1/
>> [2]:
>> 
> 
 
>> https://github.com/apache/arrow/tree/e9cf83c48b9740d42b5d18158e61c0962fda59c1
>> [3]: https://github.com/apache/arrow/pull/4006/
>> 
> 
 
>> 



Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0

2019-03-20 Thread Paul Taylor
+1 non-binding

Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High
Sierra w/ node v11.6.0


On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou  wrote:

> +1 (binding)
>
> I ran the followings on Debian GNU/Linux sid:
>
>   * dev/release/js-verify-release-candidate.sh 0.4.1 0
>
> with:
>
>   * Node.js v11.12.0
>
> Thanks,
> --
> kou
>
> In 
>   "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019
> 00:09:54 +0100,
>   Krisztián Szűcs  wrote:
>
> > Hello all,
> >
> > I would like to propose the following release candidate (rc0) of Apache
> > Arrow JavaScript version 0.4.1.
> >
> > The source release rc0 is hosted at [1].
> >
> > This release candidate is based on commit
> > f55542eeb59dde8ff4512c707b9eca1b43b62073
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The easiest way is to use the JavaScript-specific release
> > verification script dev/release/js-verify-release-candidate.sh.
> >
> > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 because...
> >
> >
> > How to validate a release signature:
> > https://httpd.apache.org/dev/verification.html
> >
> > [1]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/
> > [2]:
> >
> https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073
>


[jira] [Created] (ARROW-4976) [JS] RecordBatchReader should reset its Node/DOM streams

2019-03-20 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4976:
--

 Summary: [JS] RecordBatchReader should reset its Node/DOM streams
 Key: ARROW-4976
 URL: https://issues.apache.org/jira/browse/ARROW-4976
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


RecordBatchReaders should reset their internal platform streams when the reader is 
reset, so they can be piped to separate output streams afterward.





Union typeIds width

2019-03-20 Thread Paul Taylor
I noticed that the DenseUnion docs[1] say the typeIds buffer is 8-bit 
signed integers, but in the flatbuffer schema[2] it's typed as int (and 
flatc generates a function that returns an Int32Array).


How are the other implementations treating this buffer, and should we 
update the docs or the flatbuffers schema?


Thanks,

Paul

1. https://arrow.apache.org/docs/format/Layout.html#dense-union-type

2. 
https://github.com/apache/arrow/blob/50bc9f49671afb56594910f49b9bf34e080a70e7/format/Schema.fbs#L92




Re: Timeline for 0.13 Arrow release

2019-03-19 Thread Paul Taylor
I agree, the JS implementation has matured a lot in the last few months. I think 
it's ready to join the regular Arrow releases. Let me know if I can help 
integrate the publish scripts :-)


The two main things in progress are docs + Vector Builders, neither of 
which should block this release.


We're going to try to get the docs/recipes ready for a PR this weekend. 
If that lands shortly after 0.13.0 goes out, would it be possible to 
update the website independently, or would that need to wait until 0.14?


Paul

On 3/19/19 10:08 AM, Wes McKinney wrote:

I'm in favor of including JS in the 0.13.0 release.

I'm going to try to fix a couple of the Python Parquet bugs until the
RC is ready to be cut, but none of them need block the release.

Seems like we need someone else to volunteer to be the RM for 0.13 if
Uwe is unavailable next week. Antoine -- are you possibly up for it
(the initial setup will be a bit painful)? I don't have access to a
machine with my code signing key on it until next week so I cannot do
it

- Wes

On Tue, Mar 19, 2019 at 9:46 AM Kouhei Sutou  wrote:

Hi,

There are no blockers on GLib, Ruby and Linux packages.

Can we include JavaScript into 0.13.0?
If we include JavaScript into 0.13.0, we can remove
codes to release JavaScript separately. For example, we can
remove dev/release/js-*. We can enable version update code
in dev/release/00-prepare.sh:
https://github.com/apache/arrow/blob/master/dev/release/00-prepare.sh#L67-L74

We can merge "JavaScript Releases" document into our release
document:
https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-JavaScriptReleases


Thanks,
--
kou

In 
   "Re: Timeline for 0.13 Arrow release" on Mon, 18 Mar 2019 20:51:12 -0500,
   Wes McKinney  wrote:


hi folks,

I think we're basically at the 0.13 end game here. There's some more
patches can get in, but do we all think we can cut an RC by the end of
the week? What are the blocking issues?

Thanks
Wes

On Sat, Mar 16, 2019 at 9:57 PM Kouhei Sutou  wrote:

Hi,


Submitted the packaging builds:
https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93=build-452

I've fixed .deb/.rpm packages: https://github.com/apache/arrow/pull/3934
It has been merged.
So .deb/.rpm packages are ready for release.

Thanks,
--
kou

In 
   "Re: Timeline for 0.13 Arrow release" on Thu, 14 Mar 2019 16:24:43 +0100,
   Krisztián Szűcs  wrote:


Submitted the packaging builds:
https://github.com/kszucs/crossbow/branches/all?utf8=%E2%9C%93=build-452

On Thu, Mar 14, 2019 at 4:19 PM Wes McKinney  wrote:


The CMake refactor is merged! Kudos to Uwe for 3+ weeks of hard labor on
this.

We should run all the packaging tasks and get a full accounting of
what is broken so we aren't surprised during the release process

On Wed, Mar 13, 2019 at 9:39 AM Krisztián Szűcs
 wrote:

The proof of the pudding is in the eating. You convinced me.

On Wed, Mar 13, 2019 at 3:31 PM Wes McKinney 

wrote:

Krisztian -- are you all right with proceeding with merging the CMake
refactor? I'm pretty committed to helping fix the problems that come
up. Since most consumers of the project don't test until _after_ a
release, we won't find out about some problems until we merge it and
release it. Thus, IMHO it doesn't make sense to wait another 8-10
weeks since we'd be delaying feedback for that long. There are also a
number of follow-on issues blocking on the refactor

On Tue, Mar 12, 2019 at 11:39 AM Andy Grove 

wrote:

I've cleaned up my issues for Rust, moving most of them to 0.14.0.

I have two PRs in progress that I would appreciate reviews on:

https://github.com/apache/arrow/pull/3671 - [Rust] Table API (a.k.a
DataFrame)

https://github.com/apache/arrow/pull/3851 - [Rust] Parquet data

source

in

DataFusion

Once these are merged I have some small follow up PRs for 0.13.0

that I

can

get done this week.

Thanks,

Andy.


On Tue, Mar 12, 2019 at 8:21 AM Wes McKinney 

wrote:

hi folks,

I think we are on track to be able to release toward the end of

this

month. My proposed timeline:

* This week (March 11-15): feature/improvement push mostly
* Next week (March 18-22): shift to bug fixes, stabilization, empty
backlog of feature/improvement JIRAs
* Week of March 25: propose release candidate

Does this seem reasonable? This puts us at about 9-10 weeks from

0.12.

We need an RM for 0.13, any PMCs want to volunteer?

Take a look at our release page:



https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091219

Out of the open or in-progress issues, we have:

* C#: 3 issues
* C++ (all components): 51 issues
* Java: 3 issues
* Python: 38 issues
* Rust (all components): 33 issues

Please help curating the backlogs for each component. There's a
smattering of issues in other categories. There are also 10 open
issues with No Component (and 20 resolved issues), those need their
metadata fixed.

Thanks,
Wes

On Wed, Feb 27, 2019 at 1:49 PM Wes McKinney 

wrote:

The timeline for the 0.13 release is 

Re: [DISCUSS] Format changes: process and requirements

2019-03-17 Thread Paul Taylor

Hi Jacques,


I think we should have two complete implementations. I don't think having
one feature in C# and Go and another in JavaScript and Rust does justice to
the project goals.


Agree 100%. We may already be in this situation with the DictionaryBatch 
"isDelta" flag. I haven't checked the C++ in a while so I may be 
mistaken, but I think JS is the only impl with support for interleaved 
Dictionary/RecordBatches. It'd be good to put a process in place that 
helps avoid this in the future.



I think Java and C++ should always be complete. They are
the first two implementations. I believe they are the most complete and
broadly used/popular (C++ given Python & Pandas integration and Java via
Spark & Dremio).
No argument here either, though I should mention with the exception of 
Tensor messages the JS version is also feature-complete from the 
standpoint of the format.


It's still early in terms of adoption, but we've seen some interest from 
the Vega, Jupyter, and Uber Deck.gl projects in either contributing to 
or integrating with ArrowJS.


So while we're certainly not at the level of Spark or Pandas, we may be 
poised for wider adoption, and I'd request we take the JS implementation 
into account when making format changes. I'm happy to implement new 
features and update the integration tests as necessary.



Are there specific changes to format/ that have been merged that you
are concerned about that you feel need to be discussed separately?
The thing that springs to mind is anything to do with 64-bit indexing, 
as recently discussed in the sparse matrix thread. IIRC none of the JS 
engines presently allow allocating buffers greater than 2GiB. 
Limitations in JS shouldn't block other implementations from moving 
ahead, but it would be good for the community to come to a consensus on 
guidance or workarounds for JS interop when we are in that sort of 
situation.


Thanks,

Paul


On 3/17/19 6:07 PM, Jacques Nadeau wrote:

How about "at least two native implementations" instead of
"Java and C++"? Now, we have multiple native
implementations:


I think we should have two complete implementations. I don't think having
one feature in C# and Go and another in JavaScript and Rust does justice to
the project goals. I think Java and C++ should always be complete. They are
the first two implementations. I believe they are the most complete and
broadly used/popular (C++ given Python & Pandas integration and Java via
Spark & Dremio). This is a compromise between setting a high barrier for
creation of new features and making sure that we have validated things
across impls.

Are there specific changes to format/ that have been merged that you
are concerned about that you feel need to be discussed separately?
There have been some changes related to serializing tensor metadata
that are clearly marked as experimental, and they also do not interact
with the columnar format.

There are several things we've introduced over time that suffered this
problem. Alignment changes, dictionary encoding, union behavior, interval
behavior, tensors, unsigned integers, etc. that we've failed to make
sure we have integration tests for. I've meant to send this email for
months but saw a couple of recent proposed changes which made me feel like
we should discuss further.



[jira] [Created] (ARROW-4781) [JS] Ensure empty data initializes empty typed arrays

2019-03-05 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4781:
--

 Summary: [JS] Ensure empty data initializes empty typed arrays
 Key: ARROW-4781
 URL: https://issues.apache.org/jira/browse/ARROW-4781
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


Empty ArrayData instances should initialize with the appropriate 0-length 
buffers.





[jira] [Created] (ARROW-4780) [JS] Package sourcemap files, update default package JS version

2019-03-05 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4780:
--

 Summary: [JS] Package sourcemap files, update default package JS 
version
 Key: ARROW-4780
 URL: https://issues.apache.org/jira/browse/ARROW-4780
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Paul Taylor
Assignee: Paul Taylor


The build should split the sourcemaps out to speed up client builds, include a 
"module" entry in the package.json for @pika/web, and ship the latest ESNext JS 
versions from the main package.





[jira] [Created] (ARROW-4738) [JS] NullVector should include a null data buffer

2019-03-01 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4738:
--

 Summary: [JS] NullVector should include a null data buffer
 Key: ARROW-4738
 URL: https://issues.apache.org/jira/browse/ARROW-4738
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


Arrow C++ and pyarrow expect NullVectors to include a null data buffer, so 
ArrowJS should write one into the buffer layout.





Re: Passing user-defined "extension" types in the Arrow protocol

2019-02-26 Thread Paul Taylor
An alternative that's worked for us is (ab)using single-child 
SparseUnions to represent custom types. We have an enum of "well-known" 
typeIds (UUID, vec2's, IP addresses, etc), whose data is stored in one 
of the known Arrow types, as you've done.


Pros are the typeIds buffer is tiny, and doesn't require metadata 
propagation or string matching to maintain type information.


Cons are this is really an abuse of the Union type, and since the typeId 
buffer is (implicitly?) an Int8, we can only have 255 extension types 
today. We don't have that many yet, but that could be an issue if this 
pattern were generalized to any number of custom types.


I'm not sure how widely supported Unions are across the Arrow 
implementations or ecosystem (unsure about pandas, Rapids/cuDF no 
support yet), but maybe this pattern could work more generally if we 
defined an enum of "well-known" extension typeIds?
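For illustration, a single-child SparseUnion "UUID" column along these lines might 
be declared like so (TypeScript; treat the constructor shapes as assumptions, this 
isn't a spec proposal):

    import { Field, FixedSizeBinary, Union, UnionMode } from 'apache-arrow';

    // app-level registry of "well-known" extension typeIds (8-bit budget)
    enum WellKnownType { UUID = 1, IPv4 = 2, Vec2 = 3 }

    // the single child carries the storage; the union typeId carries the semantics
    const storage = new Field('uuid', new FixedSizeBinary(16));
    const uuidType = new Union(UnionMode.Sparse, [WellKnownType.UUID], [storage]);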


Thanks,

Paul


On 2/25/19 3:32 PM, Wes McKinney wrote:

hi folks,

I recently wrote a patch to propose a C++ API for user-defined "extension" types

https://github.com/apache/arrow/pull/3694

The idea is that an extension type wraps a pre-existing Arrow type.
For example a UUIDType can be represented as FixedSizeBinary(16). The
intent is that Arrow consumers which are not aware of an extension
type can ignore the additional type metadata and still interact with
the raw storage

One question is how to permit such metadata to be preserved through
IPC / RPC messages (i.e., Schema.fbs) and how other languages can
interact with it. There are couple options:

* What I implemented in my patch: use the Field-level custom_metadata
field with known key names "arrow_extension_name" and
"arrow_extension_data" for the type's unique identifier and serialized
form, respectively. If we opt for this, then we should add a section
to the specification to codify the convention used

* Add a new field to the Field table in Schema.fbs

The former is attractive in the sense that consumers who don't have
special handling for an extension type will carry along the Field
metadata in their schema, so it can be passed on in subsequent IPC
messages without writing any extra code.

Thoughts about this? With a C++ implementation landing, it would be
great to identify a champion to create a Java implementation and also
add integration test support to ensure that consumers do not destroy
the extension type metadata for unrecognized types (i.e. if I send you
data that says it's "uuid" and you don't know what that is yet, you
preserve the metadata fields anyway).

Thanks
Wes


[jira] [Created] (ARROW-4682) [JS] Writer should be able to write empty tables

2019-02-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4682:
--

 Summary: [JS] Writer should be able to write empty tables
 Key: ARROW-4682
 URL: https://issues.apache.org/jira/browse/ARROW-4682
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


The writer should be able to write empty tables.





[jira] [Created] (ARROW-4674) [JS] Update arrow2csv to new Row API

2019-02-25 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4674:
--

 Summary: [JS] Update arrow2csv to new Row API
 Key: ARROW-4674
 URL: https://issues.apache.org/jira/browse/ARROW-4674
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


The {{arrow2csv}} utility uses {{row.length}} to measure cells, but now that 
we've made Rows use Symbols for their internal properties, it should enumerate 
the values with the iterator.





[jira] [Created] (ARROW-4652) [JS] RecordBatchReader throughNode should respect autoDestroy

2019-02-21 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4652:
--

 Summary: [JS] RecordBatchReader throughNode should respect 
autoDestroy
 Key: ARROW-4652
 URL: https://issues.apache.org/jira/browse/ARROW-4652
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


The Reader transform stream closes after reading one set of tables even when 
autoDestroy is false. Instead it should reset/reopen the reader, like 
{{RecordBatchReader.readAll()}} does.





[jira] [Created] (ARROW-4579) [JS] Add more interop with BigInt/BigInt64Array/BigUint64Array

2019-02-14 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4579:
--

 Summary: [JS] Add more interop with 
BigInt/BigInt64Array/BigUint64Array
 Key: ARROW-4579
 URL: https://issues.apache.org/jira/browse/ARROW-4579
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


We should use or return the new native [BigInt 
types|https://developers.google.com/web/updates/2018/05/bigint] whenever it's 
available.

* Use the native {{BigInt}} to convert/stringify i64s/u64s
* Support the {{BigInt}} type in element comparator and {{indexOf()}}
* Add zero-copy {{toBigInt64Array()}} and {{toBigUint64Array()}} methods to 
{{Int64Vector}} and {{Uint64Vector}}, respectively






[jira] [Created] (ARROW-4580) [JS] Accept Iterables in IntVector/FloatVector from() signatures

2019-02-14 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4580:
--

 Summary: [JS] Accept Iterables in IntVector/FloatVector from() 
signatures
 Key: ARROW-4580
 URL: https://issues.apache.org/jira/browse/ARROW-4580
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


Right now {{IntVector.from()}} and {{FloatVector.from()}} expect the data is 
already in typed-array form. But if we know the desired Vector type before hand 
(e.g. if {{Int32Vector.from()}} is called), we can accept any JS iterable of 
the values.

In order to do this, we should ensure {{Float16Vector.from()}} properly clamps 
incoming f32/f64 values to u16s, in case the source is a vanilla 64-bit JS 
float.
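The clamping step amounts to re-encoding each JS float as half-float bits, roughly 
like the sketch below (it flushes subnormals to zero and collapses NaN to Infinity; 
the eventual implementation may differ):

{code:actionscript}
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

// Convert a 64-bit JS number to the nearest 16-bit half-float bit pattern.
function float16Bits(value: number): number {
  f32[0] = value;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = ((x >>> 23) & 0xff) - 127 + 15;        // rebias f32 exponent for f16
  if (exp <= 0) { return sign; }                     // too small: signed zero
  if (exp >= 31) { return sign | 0x7c00; }           // too large: signed Infinity
  return sign | (exp << 10) | ((x >>> 13) & 0x3ff);  // truncate the mantissa
}
{code}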





[jira] [Created] (ARROW-4578) [JS] Float16Vector toArray should be zero-copy

2019-02-14 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4578:
--

 Summary: [JS] Float16Vector toArray should be zero-copy
 Key: ARROW-4578
 URL: https://issues.apache.org/jira/browse/ARROW-4578
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


The {{Float16Vector#toArray()}} implementation currently transforms each half 
float into a single float, and returns a Float32Array. All the other 
{{toArray()}} implementations are zero-copy, and this deviation would break 
anyone expecting to give two-byte half floats to native APIs like WebGL. We 
should instead include {{Float16Vector#toFloat32Array()}} and 
{{Float16Vector#toFloat64Array()}} convenience methods that do rely on copying.





[jira] [Created] (ARROW-4557) [JS] Add Table/Schema/RecordBatch `selectAt(...indices)` method

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4557:
--

 Summary: [JS] Add Table/Schema/RecordBatch `selectAt(...indices)` 
method
 Key: ARROW-4557
 URL: https://issues.apache.org/jira/browse/ARROW-4557
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.5.0


Presently Table, Schema, and RecordBatch have basic {{select(...colNames)}} 
implementations. Having an easy {{selectAt(...colIndices)}} impl would be a 
nice complement, especially when there are duplicate column names.





[jira] [Created] (ARROW-4555) [JS] Add high-level Table and Column creation methods

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4555:
--

 Summary: [JS] Add high-level Table and Column creation methods
 Key: ARROW-4555
 URL: https://issues.apache.org/jira/browse/ARROW-4555
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.1


It'd be great to have a few high-level functions that implicitly create the 
Schema, RecordBatches, etc. from a Table and a list of Columns. For example:
{code:actionscript}
const table = Table.new(
  Column.new('foo', ...),
  Column.new('bar', ...)
);
{code}






[jira] [Created] (ARROW-4554) [JS] Implement logic for combining Vectors with different lengths/chunksizes

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4554:
--

 Summary: [JS] Implement logic for combining Vectors with different 
lengths/chunksizes
 Key: ARROW-4554
 URL: https://issues.apache.org/jira/browse/ARROW-4554
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.1


We should add logic to combine, and possibly slice/re-chunk, Vectors so that their 
chunks can be uniformly partitioned into separate RecordBatches. This will make it 
easier to create Tables or RecordBatches from Vectors of different lengths. This is 
also necessary for {{Table#assign()}}. PR incoming.





[jira] [Created] (ARROW-4553) [JS] Implement Schema/Field/DataType comparators

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4553:
--

 Summary: [JS] Implement Schema/Field/DataType comparators
 Key: ARROW-4553
 URL: https://issues.apache.org/jira/browse/ARROW-4553
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.1


Some basic type comparison logic is necessary for {{Table#assign()}}. PR 
incoming.





[jira] [Created] (ARROW-4552) [JS] Table and Schema assign implementations

2019-02-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4552:
--

 Summary: [JS] Table and Schema assign implementations
 Key: ARROW-4552
 URL: https://issues.apache.org/jira/browse/ARROW-4552
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor


It'd be really handy to have a basic {{assign}} methods on the Table and 
Schema. I've extracted and cleaned up some internal helper methods I have that 
does this. PR incoming.





Re: [RESULT][VOTE] Release Apache Arrow JS 0.4.0 - RC1

2019-02-05 Thread Paul Taylor

Yeah, I'd be happy to do a write up tomorrow evening.

On 2/4/19 10:48 PM, Uwe L. Korn wrote:

With 4 +1 (3 binding), this VOTE passes. I will upload the release and send out 
the announcement later.

@Paul, Brian can one of you write a short blog post about the new things in 0.4 
for the Arrow blog?

Uwe

On Mon, Feb 4, 2019, at 10:45 PM, Paul Taylor wrote:

+1

verified on 18.04.1-Ubuntu and node v11.6.0


On 1/31/19 9:38 AM, Krisztián Szűcs wrote:
+1 (binding),

verified on OSX Mojave and Node v11.4.0


On Thu, Jan 31, 2019 at 6:06 PM Brian Hulette  wrote:

+1

verified on Archlinux with Node v11.9.0

Thanks a lot for putting the RC together Uwe!


On Thu, Jan 31, 2019 at 8:08 AM Uwe L. Korn  wrote:

+1 (binding),

verified on Ubuntu 16.04 with
`./dev/release/js-verify-release-candidate.sh 0.4.0 1` and Node v11.9.0

via

nvm.

Uwe


On Thu, Jan 31, 2019, at 5:07 PM, Uwe L. Korn wrote:
Hello all,

I would like to propose the following release candidate (rc1) of Apache
Arrow JavaScript version 0.4.0.

The source release rc1 is hosted at [1].

This release candidate is based on commit
6009eaa49ae29826764eb6e626bf0d12b83f3481

Please download, verify checksums and signatures, run the unit tests,

and vote

on the release. The easiest way is to use the JavaScript-specific

release

verification script dev/release/js-verify-release-candidate.sh.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow JavaScript 0.4.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow JavaScript 0.4.0 because...


How to validate a release signature:
https://httpd.apache.org/dev/verification.html

[1]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.0-rc1/

[2]:

https://github.com/apache/arrow/tree/6009eaa49ae29826764eb6e626bf0d12b83f3481


[jira] [Created] (ARROW-4477) [JS] Bn shouldn't override constructor of the resulting typed array

2019-02-04 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4477:
--

 Summary: [JS] Bn shouldn't override constructor of the resulting 
typed array
 Key: ARROW-4477
 URL: https://issues.apache.org/jira/browse/ARROW-4477
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.0


There's an undefined constructor property definition in the {{Object.assign()}} 
call for the BigNum mixins that's overriding the constructor of the returned 
TypedArrays. I think this was left over from the first iteration, where I used 
{{Object.create()}}. It should be removed.





Re: [VOTE] Release Apache Arrow JS 0.4.0 - RC1

2019-02-04 Thread Paul Taylor
+1

verified on 18.04.1-Ubuntu and node v11.6.0

> On 1/31/19 9:38 AM, Krisztián Szűcs wrote:
> +1 (binding),
> 
> verified on OSX Mojave and Node v11.4.0
> 
>> On Thu, Jan 31, 2019 at 6:06 PM Brian Hulette  wrote:
>> 
>> +1
>> 
>> verified on Archlinux with Node v11.9.0
>> 
>> Thanks a lot for putting the RC together Uwe!
>> 
>>> On Thu, Jan 31, 2019 at 8:08 AM Uwe L. Korn  wrote:
>>> 
>>> +1 (binding),
>>> 
>>> verified on Ubuntu 16.04 with
>>> `./dev/release/js-verify-release-candidate.sh 0.4.0 1` and Node v11.9.0
>> via
>>> nvm.
>>> 
>>> Uwe
>>> 
 On Thu, Jan 31, 2019, at 5:07 PM, Uwe L. Korn wrote:
 Hello all,
 
 I would like to propose the following release candidate (rc1) of Apache
 Arrow JavaScript version 0.4.0.
 
 The source release rc1 is hosted at [1].
 
 This release candidate is based on commit
 6009eaa49ae29826764eb6e626bf0d12b83f3481
 
 Please download, verify checksums and signatures, run the unit tests,
>>> and vote
 on the release. The easiest way is to use the JavaScript-specific
>> release
 verification script dev/release/js-verify-release-candidate.sh.
 
 The vote will be open for at least 72 hours.
 
 [ ] +1 Release this as Apache Arrow JavaScript 0.4.0
 [ ] +0
 [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.0 because...
 
 
 How to validate a release signature:
 https://httpd.apache.org/dev/verification.html
 
 [1]:
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.0-rc1/
 [2]:
>> https://github.com/apache/arrow/tree/6009eaa49ae29826764eb6e626bf0d12b83f3481



[jira] [Created] (ARROW-4442) [JS] Overly broad type annotation for Chunked typeId leading to type mismatches in generated typing

2019-01-31 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4442:
--

 Summary: [JS] Overly broad type annotation for Chunked typeId 
leading to type mismatches in generated typing
 Key: ARROW-4442
 URL: https://issues.apache.org/jira/browse/ARROW-4442
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor


Typescript is generating an overly broad type for the `typeId` property of the 
Chunked vector class, leading to a type mismatch and failure to infer that 
Column<T> is a Vector<T>:


{code:actionscript}

let col: Vector<Utf8>;
col = new Chunked(new Utf8());
      ^
/*
Argument of type 'Chunked<Utf8>' is not assignable to parameter of type 
'Vector<Utf8>'.
  Type 'Chunked<Utf8>' is not assignable to type 'Vector<Utf8>'.
    Types of property 'typeId' are incompatible.
      Type 'Type' is not assignable to type 'Type.Utf8'.
*/
{code}

The fix is to add an explicit return annotation to the Chunked typeId getter.





Re: [Format] Passing selection masks with Arrow record batches

2019-01-27 Thread Paul Taylor
We’ve been doing this in a few different ways at Graphistry, mostly guided by 
use case and device characteristics.

For temporary/in-memory/microservice CPU workloads, we’ll compute a set of 
valid row indices as one side of a DictionaryVector, with the original 
table/column as the dictionary side. If we want to filter on the filtered set, 
we compute a new set of indices, and use the DictionaryVector as the dictionary 
side of another DictionaryVector. Usually we aren’t filtering more than three 
or four levels, and with this approach we can always map back to the original 
indices in a tight loop.
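In spirit it looks like this (plain TypeScript standing in for the DictionaryVector 
machinery, not the actual Arrow JS API):

    class FilteredView<T> {
      constructor(readonly dictionary: { get(i: number): T }, readonly indices: Int32Array) {}
      get length() { return this.indices.length; }
      get(i: number): T { return this.dictionary.get(this.indices[i]); } // one hop per level
      filter(keep: (value: T) => boolean): FilteredView<T> {
        const next: number[] = [];
        for (let i = 0; i < this.length; ++i) { if (keep(this.get(i))) next.push(i); }
        // the filtered view becomes the dictionary side of the next level
        return new FilteredView(this, Int32Array.from(next));
      }
    }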

For GPU workloads we have two approaches:

The physics kernels always operate on the graph in place, with host/device 
transfers kept to a minimum (and absolutely zero reallocation). We have two 
buffers of indices that the kernels use for row lookups into the node and edge 
tables. Even though we don’t resize these, we do update them, and set the 
“logical length” for the tables to their length. In our testing we've observed 
that coalesced global memory accesses are more performant/predictable than 
validity bitmaps, due to Barnes-Hut's irregularity at each step.

The other approach is our cuda/cudf-backed analytical services. For these I’ve 
been generating and shuttling validity bitmaps between services. These 
workloads are bursty and infrequent enough (per client) that it’s more 
important to free the GPU memory for another client. The masks are shipped back 
to the service tier, where they’re used to build indices for those 
DictionaryVectors I mentioned before.

For sending materialized datasets over the wire, we also have a few approaches.

For the GPU/visualization state, we have a server-side kernel that prepares the 
filtered rows as arrow recordbatch bodies, so the only thing the CPU has to do 
is copy them to the network interface and write the arrow metadata.

For general-purpose (and pre-arrow) rich-client/server APIs we use a project I 
built at Netflix called Falcor, which is essentially Netflix-flavored GraphQL. 
Clients issue queries for a subset of resources from the server’s expansive 
“virtual graph” data model, intelligently caching/invalidating as necessary. We 
read directly from those nested-DictionaryVectors at the service tier, just 
copying out to small json payloads.

I did have a “dictionary-filter” transform stream for selecting only the 
referenced rows in the set of dictionary batches, but abandoned it shortly 
after I finished the big JS refactor. It’s probably still useful in other 
contexts, but for now we’re happy with just sending around indices/masks, and 
have opted to keep dictionaries in blob storage or shared memory (when 
necessary).

Paul


> On Jan 27, 2019, at 9:52 PM, Wes McKinney  wrote:
> 
> I was having a discussion recently about Arrow and the topic of
> server-side filtering vs. client-side filtering came up.
> 
> The basic problem is this:
> 
> If you have a RecordBatch that you wish to filter out some of the
> "rows", one way to track this in-memory is to create a separate array
> of true/false values instead of forcing a materialization of a
> filtered RecordBatch.
> 
> So you might have a record batch with 2 fields and 4 rows
> 
> a: [1, 2, 3, 4]
> b: ['foo', 'bar', 'baz', 'qux']
> 
> and then a filter
> 
> is_selected: [true, true, false, true]
> 
> This can be easily handled as an application-level concern. Creating a
> bitmask is generally cheap relative to materializing the filtered
> version, and some operators may support "pushing down" such filters
> (e.g. aggregations may accept a selection mask to exclude "off"
> values). I myself implemented such a scheme in the past for a query
> engine I built in the 2013-2014 time frame and it yielded material
> performance improvements in some cases.
> 
> One question is what you should do when you want to put the data on
> the wire, e.g. via RPC / Flight or IPC. Two options
> 
> * Pass the complete RecordBatch, plus the filter as a "special" field
> and attach some metadata so that you know that filter field is
> "special"
> 
> * Filter the RecordBatch before sending, send only the selected rows
> 
> The first option can of course be implemented as an application-level
> detail, much as we handle the serialization of pandas row indexes
> right now (where we have custom pandas metadata and "special"
> fields/columns for the index arrays). But it could be a common enough
> use case to merit a more well-defined standardized approach.
> 
> I'm not sure what the answer is, but I wanted to describe the problem
> as I see it and see if anyone has any thoughts about it.
> 
> I'm aware that Dremio is a user of "selection vectors" (selected
> indices, instead of boolean true/false values, so we would have [0, 1,
> 3] in the above case), so the similar discussion may apply to passing
> selection vectors on the wire.
> 
> - Wes


[jira] [Created] (ARROW-4396) Update Typedoc to support TypeScript 3.2

2019-01-27 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4396:
--

 Summary: Update Typedoc to support TypeScript 3.2
 Key: ARROW-4396
 URL: https://issues.apache.org/jira/browse/ARROW-4396
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor


Update TypeDoc now that it supports TypeScript 3.2





[jira] [Created] (ARROW-4395) ts-node throws type error running `bin/arrow2csv.js`

2019-01-27 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4395:
--

 Summary: ts-node throws type error running `bin/arrow2csv.js`
 Key: ARROW-4395
 URL: https://issues.apache.org/jira/browse/ARROW-4395
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.0


ts-node is being too strict, throws this (inaccurate) error JIT'ing the TS 
source:

{code:none}
$ cat test/data/cpp/stream/simple.arrow | ./bin/arrow2csv.js 

/home/ptaylor/dev/arrow/js/node_modules/ts-node/src/index.ts:228
return new TSError(diagnosticText, diagnosticCodes)
   ^
TSError: ⨯ Unable to compile TypeScript:
src/vector/map.ts(25,57): error TS2345: Argument of type 'Field[]' is not assignable to parameter of type 'Field[]'.
  Type 'Field' is not assignable to type 
'Field'.
Type 'T[string] | T[number] | T[symbol]' is not assignable to type 'T[keyof 
T]'.
  Type 'T[symbol]' is not assignable to type 'T[keyof T]'.
Type 'DataType' is not assignable to type 'T[keyof T]'.
  Type 'symbol' is not assignable to type 'keyof T'.
Type 'symbol' is not assignable to type 'string | number'.
{code}






[jira] [Created] (ARROW-4283) Should RecordBatchStreamReader/Writer be AsyncIteraable?

2019-01-17 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4283:
--

 Summary: Should RecordBatchStreamReader/Writer be AsyncIteraable?
 Key: ARROW-4283
 URL: https://issues.apache.org/jira/browse/ARROW-4283
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Paul Taylor


Filing this issue after a discussion today with [~xhochy] about how to 
implement streaming pyarrow http services. I had attempted to use both Flask 
and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s streaming 
interfaces because they seemed familiar, but no dice. I have no idea how hard 
this would be to add -- supporting all the asynciterable primitives in JS was 
non-trivial.





Re: Problems building Arrow Java

2018-12-31 Thread Paul Taylor

Thanks Wes, that was the fix.

After verifying everything's still good, I opened the JS refactor PR 
here: https://github.com/apache/arrow/pull/3290.


Y'all feel free to review it on your own time, I'm in no rush. Just 
happy I'm not holding everybody else up finally.


Paul

On 12/30/18 5:11 PM, Wes McKinney wrote:

Hi Paul -- Java development has all been happening on Java 8 and 9. You
might want to try an older JDK since newer ones aren't being formally
supported yet

Wes

On Sun, Dec 30, 2018, 12:39 PM Paul Taylor wrote:

Ah, just realized I was missing javac. I installed
`openjdk-11-jdk-headless` and verified it exists; now the build gets
further along, but fails due to an undeclared
`com.google.code.findbugs` dependency. Thanks in advance for any
guidance here.

[...build transcripts repeated verbatim in the two messages below...]

Re: Problems building Arrow Java

2018-12-30 Thread Paul Taylor
Ah, just realized I was missing javac. I installed
`openjdk-11-jdk-headless` and verified it exists; now the build gets
further along, but fails due to an undeclared
`com.google.code.findbugs` dependency. Thanks in advance for any
guidance here.


~/dev/arrow/java$ sudo update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java):
/usr/lib/jvm/java-11-openjdk-amd64/bin/java
Nothing to configure.

~/dev/arrow/java$ sudo update-alternatives --config javac
There is only one alternative in link group javac (providing /usr/bin/javac):
/usr/lib/jvm/java-11-openjdk-amd64/bin/javac
Nothing to configure.

~/dev/arrow/java$ mvn install -e
[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
0.119 s - in org.apache.arrow.vector.TestCopyFrom
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 187, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] --- maven-jar-plugin:3.0.0:jar (default-jar) @ arrow-vector ---
[INFO] Building jar:
/home/ptaylor/dev/arrow/java/vector/target/arrow-vector-0.12.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-site-plugin:3.5.1:attach-descriptor (attach-descriptor)
@ arrow-vector ---
[INFO]
[INFO] --- maven-jar-plugin:3.0.0:test-jar (default) @ arrow-vector ---
[INFO] Building jar:
/home/ptaylor/dev/arrow/java/vector/target/arrow-vector-0.12.0-SNAPSHOT-tests.jar
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (avoid_bad_dependencies)
@ arrow-vector ---
[INFO]
[INFO] --- maven-dependency-plugin:3.0.1:analyze-only (analyze)
@ arrow-vector ---
[WARNING] Used undeclared dependencies found:
[WARNING]    com.google.code.findbugs:jsr305:jar:3.0.2:compile
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Arrow Java Root POM ........................ SUCCESS [  2.865 s]
[INFO] Arrow Format ...................................... SUCCESS [  2.170 s]
[INFO] Arrow Memory ...................................... SUCCESS [  2.669 s]
[INFO] Arrow Vectors ..................................... FAILURE [  6.199 s]
[INFO] Arrow Tools ....................................... SKIPPED
[INFO] Arrow JDBC Adapter ................................ SKIPPED
[INFO] Arrow Plasma Client ............................... SKIPPED
[INFO] Arrow Flight ...................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14.101 s
[INFO] Finished at: 2018-12-30T10:40:11-08:00
[INFO] Final Memory: 97M/376M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute
goal org.apache.maven.plugins:maven-dependency-plugin:3.0.1:analyze-only
(analyze) on project arrow-vector: Dependency problems found -> [Help 1]

On 12/30/18 10:21 AM, Paul Taylor wrote:


Is anyone else having issues building Arrow Java? I'm trying to run 
the integration tests locally, but can't figure out why `mvn install` 
is failing. I see a number of warnings, and a few checkstyle errors, 
but nothing besides that stands out.


Thanks,
Paul

[...java/mvn version info and build transcript repeated in the original
message below...]

Problems building Arrow Java

2018-12-30 Thread Paul Taylor
Is anyone else having issues building Arrow Java? I'm trying to run the 
integration tests locally, but can't figure out why `mvn install` is 
failing. I see a number of warnings, and a few checkstyle errors, but 
nothing besides that stands out.


Thanks,
Paul

~/dev/arrow/java$ java --version
Picked up JAVA_TOOL_OPTIONS:
openjdk 10.0.2 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)

~/dev/arrow/java$ mvn --version
Picked up JAVA_TOOL_OPTIONS:
Apache Maven 3.5.2
Maven home: /usr/share/maven
Java version: 10.0.2, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-11-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-43-generic", arch: "amd64", family: "unix"

~/dev/arrow/java$ mvn install -e
Picked up JAVA_TOOL_OPTIONS:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by
com.google.inject.internal.cglib.core.$ReflectUtils$1
(file:/usr/share/maven/lib/guice.jar) to method
java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of
com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release
[...lots of info-level logs...]
[INFO] Reading existing properties file
[/home/ptaylor/dev/arrow/java/format/target/classes/git.properties] (for module
Arrow Format)...
[INFO] Properties file
[/home/ptaylor/dev/arrow/java/format/target/classes/git.properties] is
up-to-date (for module Arrow Format)...
[INFO]
[INFO] --- maven-dependency-plugin:3.0.1:copy (copy-flatc) @ arrow-format ---
[INFO] Configured Artifact: com.github.icexelloss:flatc-linux-x86_64:1.9.0:exe
[INFO] Copying flatc-linux-x86_64-1.9.0.exe to
/home/ptaylor/dev/arrow/java/format/target/flatc-linux-x86_64-1.9.0.exe
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:exec (script-chmod) @ arrow-format ---
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:exec (default) @ arrow-format ---
[INFO]
[INFO] --- build-helper-maven-plugin:1.9.1:add-source
(add-generated-sources-to-classpath) @ arrow-format ---
[INFO] Source directory:
/home/ptaylor/dev/arrow/java/format/target/generated-sources/flatc added.
[INFO]
[INFO] --- license-maven-plugin:2.3:format (default) @ arrow-format ---
[INFO] Updating license headers...
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process
(process-resource-bundles) @ arrow-format ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources)
@ arrow-format ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory
/home/ptaylor/dev/arrow/java/format/src/main/resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.2:compile (default-compile)
@ arrow-format ---
[INFO] Compiling 39 source files to
/home/ptaylor/dev/arrow/java/format/target/classes
[WARNING] Unable to autodetect 'javac' path, using 'javac' from the
environment.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Arrow Java Root POM ........................ SUCCESS [  2.542 s]
[INFO] Arrow Format ...................................... FAILURE [  1.154 s]
[INFO] Arrow Memory ...................................... SKIPPED
[INFO] Arrow Vectors ..................................... SKIPPED
[INFO] Arrow Tools ....................................... SKIPPED
[INFO] Arrow JDBC Adapter ................................ SKIPPED
[INFO] Arrow Plasma Client ............................... SKIPPED
[INFO] Arrow Flight ...................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.885 s
[INFO] Finished at: 2018-12-30T10:16:26-08:00
[INFO] Final Memory: 67M/280M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute
goal org.apache.maven.plugins:maven-compiler-plugin:3.6.2:compile
(default-compile) on project arrow-format: Compilation failure -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:3.6.2:compile (default-compile)
on project arrow-format: Compilation failure

    at org.apache.maven.lifecycle.internal.MojoExecutor.execute
(MojoExecutor.java:213)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute
(MojoExecutor.java:154)
   

Re: npmjs.com account to release Apache Arrow JavaScript

2018-12-14 Thread Paul Taylor

Hi Kouhei,

I've added you as a maintainer of the apache-arrow top level package, as 
well as an owner on the @apache-arrow organization on npm.


Paul

On 12/14/18 1:59 PM, Kouhei Sutou wrote:

Hi Brian,

I read this change:

   
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=87298036=46=45

Can you add me to collaborators of apache-arrow? I may
release apache-arrow npm package as a PMC member.

Here is my account on npmjs.com:

   https://www.npmjs.com/~kou


Thanks,
--
kou


Re: Arrow JS 0.4.0 Release

2018-12-14 Thread Paul Taylor

Wes,

I didn't mean to sound like I was criticizing you, the project, or the 
release process. You're doing an outstanding job as a project lead, and 
it's a fine release process that helps ensure quality and security. Nor 
was I passive-aggressively expressing desire to be a PMC -- I'm 
overworked as it is and don't have the bandwidth to take on that 
responsibility. If I was and did, I'd be much more explicit about taking 
on those responsibilities regardless of PMC status :-)


I was only attempting to describe some of the reasons I (and perhaps 
others) haven't/don't push to release the JS package more often, and 
compare reality with the original intent behind having JS on a separate 
release track.


I also don't mean to criticize when I say I think a reason we don't 
release often might be because none of the JS users or maintainers are 
PMCs -- only trying to acknowledge the maintenance and release cycle is 
an attention-driven process. Since most of us contribute in conjunction 
with our other professional responsibilities, it's totally reasonable 
that if JS isn't part of a PMC's day-to-day, it'd be left to us to drive 
it forward.


I have been curious if there isn't a middle ground between the full 
RC/GM release process, and releasing what are essentially nightlies. npm 
has a feature to publish tagged releases that aren't considered mainline 
releases yet are still accessible to CI/auditing services. As long as 
the list of npm users authorized to publish the packages are Arrow 
contributors (and we force npm 2FA), we could have a lane for rapid 
iteration and release while we work out the kinks.


And lastly, an update on the refactor branch: all the features are 
working again, now just fixing the last few issues in the build scripts. 
I'm especially pleased that `cat ./some-gigantic-table.arrow | npx 
arrow2csv | less` doesn't stream the entire table to less and terminate 
with a broken-pipe error anymore :-)


Paul


On 12/14/18 10:31 AM, Wes McKinney wrote:

hi Paul,

On Thu, Dec 13, 2018 at 8:59 PM Paul Taylor  wrote:

Another update: all the existing features and unit tests are working
again except for the Table/RecordBatch streaming toString()
implementations (and the `arrow2csv` utility), which I'll update later
tonight.

On JS release cadence, I think Brian's right that the current setup is
working counter to our original intent. I am used to (and prefer) a
faster-paced release cycle, essentially releasing early and as often as
bugs are fixed or features are added. Indeed, Graphistry maintains a
repo <https://github.com/graphistry/arrow/commits/master> with the
latest version of the library that we can build against, which I update
when I fix any bugs or add features.


It is common for software vendors to have "downstream" releases, so
this is reasonable, so long as this work is not promoted as Apache
releases


The JS project is young, and sometimes has to move at a rapid pace. I've
felt the turnaround time involved in the vote/prepare/verify/publish
release process is slower than would be helpful to me. I'm used to
publishing patch releases to npm as soon as possible, possibly multiple
times a day.

Well, surely the recent security problems with NPM demonstrate that
there is value in giving the community opportunity to vet a package
before it is published for the world to use, and that GPG-signing
packages is an important security measure to ensure that production
code is coming from a network of trust. It is different if you are
publishing packages for your own personal or corporate use.


None of the PMCs contribute to or use the JS version (if that's wrong,
hit me up!) so there's been no release pressure from there. None of the
JS contributors are PMCs so even if we want to do releases, we have to
wait for a PMC. My take is that everyone on the project (especially
PMCs) are probably ungodly busy people, and since not releasing to npm
hasn't been blocking me, I opt not to bother folks.

I am happy to help release the JS package as often as you like, up to
multiple times per month. I stated this early on in the process, but
there has not seemed to be much desire to release. Brian's recent
request to release caught me at a bad time at the end of the year, but
there are other active PMCs who should be able to help. If you do
decide you want to release in the next week or two, please let me know
and I will make the time to help.

The lack of PMCs with an interest in JavaScript is a bit of
self-perpetuating issue. One of the responsibilities of PMC members
(and what will enable a committer to become a PMC) is to promote the
growth and development of a healthy community. This includes making
sure that the project releases. The JS developer community hasn't
grown much, though. My approach to such a problem is to act as a
"community of one" until it changes -- drive a project forward and
ensure a steady cadence of releases.

- Wes



On 12/13/18 11:52 A

Re: Arrow JS 0.4.0 Release

2018-12-13 Thread Paul Taylor
Another update: all the existing features and unit tests are working 
again except for the Table/RecordBatch streaming toString() 
implementations (and the `arrow2csv` utility), which I'll update later 
tonight.


On JS release cadence, I think Brian's right that the current setup is 
working counter to our original intent. I am used to (and prefer) a 
faster-paced release cycle, essentially releasing early and as often as 
bugs are fixed or features are added. Indeed, Graphistry maintains a 
repo <https://github.com/graphistry/arrow/commits/master> with the 
latest version of the library that we can build against, which I update 
when I fix any bugs or add features.


The JS project is young, and sometimes has to move at a rapid pace. I've 
felt the turnaround time involved in the vote/prepare/verify/publish 
release process is slower than would be helpful to me. I'm used to 
publishing patch releases to npm as soon as possible, possibly multiple
times a day.


None of the PMCs contribute to or use the JS version (if that's wrong, 
hit me up!) so there's been no release pressure from there. None of the 
JS contributors are PMCs so even if we want to do releases, we have to 
wait for a PMC. My take is that everyone on the project (especially
PMCs) are probably ungodly busy people, and since not releasing to npm 
hasn't been blocking me, I opt not to bother folks.



On 12/13/18 11:52 AM, Wes McKinney wrote:

+1 for synchronizing to the main releases when possible. In the 0.12
thread we have discussed moving to time-based releases (e.g. every 2
months). Time-based releases are helpful to create urgency around
getting work completed, and making sure that the project is always
ready to release.
On Thu, Dec 13, 2018 at 10:39 AM Brian Hulette  wrote:

Sounds great Paul! Really excited that this refactor is wrapping up. My
only concern with including this in 0.4.0 is that I'm not going to have the
time to thoroughly review it for a few weeks, so gating on that would
really delay it. But I can just manually test with some use-cases I care
about in lieu of a thorough review in the interest of time.

I think in the future (after 0.12?) it may behoove us to tie back in to the
main Arrow release cycle. The idea with the separate JS release was to
allow us to release faster, but in practice it has done the opposite. Since
the fall of 2017 we've cut two major JS releases (0.2, 0.3) while there
were four major main releases (0.8 - 0.11). Not to mention the disjoint
version numbers can be confusing to users - perhaps not as much of a
concern now that the format is pretty stable, but it can still be a
friction point. And finally selfishly - if we had been on the main release
cycle, the contributions I made in the summer would have been released in
either 0.10 or 0.11 by now.

Brian

On Thu, Dec 13, 2018 at 3:29 AM Paul Taylor  wrote:


The ongoing JS refactor/upgrade branch
<https://github.com/trxcllnt/arrow/tree/js-data-refactor/js> is just
about done. It's passing all the integration tests, as well as a hundred
or so new unit tests. I have to update existing tests where the APIs
changed, battle with closure-compiler a bit, then it'll be ready to
merge in and ship out. I think I'll be able to wrap it up in the next
couple hours.

I started this branch to clean up the Vector Data classes to make it
easier to add higher-level Table and Vector operators, but as the Data
classes are fairly embedded in the core, it led to a larger refactor of
the DataTypes, Vectors, Visitors, and IPC readers and writers.

While I was updating the IPC readers and writers, I took the opportunity
to back-port all the Node and WhatWG (browser) streams integration that
we've built for Graphistry. Putting it in the Arrow JS library means we
can better ensure zero-copy when possible, empowers library consumers to
easily build streaming applications in both server and browser
environments, and (selfishly) reduces complexity in my code base. It
also advances a longer term personal goal to more closely adhere to the
structure and organization of ArrowCPP when reasonable.

A non-exhaustive list of updates includes:

* Updated the Table, Schema, RecordBatch, Visitor, Vector, Data, and
DataTypes to ensure the generic type signatures cascade recursively
through the type declarations
* New io primitives that abstract over the (mutually exclusive) file and
stream APIs in both node and browser environments
* New RecordBatchReaders and RecordBatchWriters that directly use the
zero-copy node and browser io primitives
* A consolidated reflective Visitor implementation that supports late
binding to shortcut traversal, provides an easy API for building higher
level Vector operators
* Fixed bugs/added support for reading and writing DictionaryBatch
deltas (tricky)
* Updated all the dependencies and did some config file gardening to
make debugging tests easier
* Added a bunch of new tests

I'd be more than happy to help shepherd a 0.4.0 relea

Re: Arrow JS 0.4.0 Release

2018-12-13 Thread Paul Taylor
The ongoing JS refactor/upgrade branch
<https://github.com/trxcllnt/arrow/tree/js-data-refactor/js> is just
about done. It's passing all the integration tests, as well as a hundred 
or so new unit tests. I have to update existing tests where the APIs 
changed, battle with closure-compiler a bit, then it'll be ready to 
merge in and ship out. I think I'll be able to wrap it up in the next 
couple hours.


I started this branch to clean up the Vector Data classes to make it 
easier to add higher-level Table and Vector operators, but as the Data 
classes are fairly embedded in the core, it led to a larger refactor of
the DataTypes, Vectors, Visitors, and IPC readers and writers.


While I was updating the IPC readers and writers, I took the opportunity 
to back-port all the Node and WhatWG (browser) streams integration that 
we've built for Graphistry. Putting it in the Arrow JS library means we 
can better ensure zero-copy when possible, empowers library consumers to 
easily build streaming applications in both server and browser 
environments, and (selfishly) reduces complexity in my code base. It 
also advances a longer term personal goal to more closely adhere to the 
structure and organization of ArrowCPP when reasonable.


A non-exhaustive list of updates includes:

* Updated the Table, Schema, RecordBatch, Visitor, Vector, Data, and
DataTypes to ensure the generic type signatures cascade recursively 
through the type declarations
* New io primitives that abstract over the (mutually exclusive) file and 
stream APIs in both node and browser environments
* New RecordBatchReaders and RecordBatchWriters that directly use the
zero-copy node and browser io primitives (see the sketch just after this
list)
* A consolidated reflective Visitor implementation that supports late 
binding to shortcut traversal, provides an easy API for building higher 
level Vector operators
* Fixed bugs/added support for reading and writing DictionaryBatch 
deltas (tricky)
* Updated all the dependencies and did some config file gardening to 
make debugging tests easier

* Added a bunch of new tests
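
To make the reader item above concrete, here's the flavor of node usage
this work enables. This is a sketch only: the `RecordBatchReader.from`
entry point, the async iteration, and the `numRows` accessor are assumed
from the feature list, not a final published API.

import * as fs from 'fs';
// Hypothetical import, per the feature list above; final names may differ:
import { RecordBatchReader } from 'apache-arrow';

// Count the rows in an Arrow stream without buffering the whole file:
// each RecordBatch is parsed as its bytes arrive from the ReadStream.
async function countRows(path: string): Promise<number> {
  const reader = await RecordBatchReader.from(fs.createReadStream(path));
  let rows = 0;
  for await (const batch of reader) {
    rows += batch.numRows; // assumed accessor for the batch's row count
  }
  return rows;
}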

I'd be more than happy to help shepherd a 0.4.0 release of what's in 
arrow/master if that's what everyone wants to do. But in the interest of 
cutting a more feature-rich release and preventing customers paying the 
cost of updating twice in a short time span, I vote we hold off for 
another day or two and merge + release the work in the refactor branch.


Paul

On 12/9/18 10:51 AM, Wes McKinney wrote:

I agree that we should cut a JavaScript release.

With the amount of maintenance work on my plate I have to declare
bankruptcy on doing any more than I am right now. Can another PMC
volunteer to be the RM for the 0.4.0 JavaScript release?

Thanks
Wes
On Tue, Dec 4, 2018 at 10:07 PM Brian Hulette  wrote:

Hi all,
It's been quite a while since our last major Arrow JS release (0.3.0 on
February 22!), and since then we've added several new features that will
make Arrow JS much easier to adopt. We've added convenience functions for
creating Arrow vectors and tables natively in JavaScript, an IPC writer,
and a row proxy interface that will make integrating with existing JS
libraries much simpler.

I think it's time we cut 0.4.0, so I spent some time closing out or
postponing the last few JIRAs in JS-0.4.0. I got it down to just one JIRA
which involves documenting the release process - hopefully we can close
that out as we go through it again.

Please let me know if you think it makes sense to cut JS-0.4.0 now, or if
you have any concerns.

Brian


Re: Assign/update : NA bitmap vs sentinel

2018-11-10 Thread Paul Taylor
While I'm not qualified to debate the merits of various physical 
representations inside databases, I would like to chime in from the 
perspective of both an Arrow contributor and architect of perhaps one of 
the more exotic applications of Arrow in the wild (client/server + 
JavaScript + GPUs + graphs).


As Wes mentioned, the Arrow design needs to accommodate a wide range of 
use-cases across analytic workflows. A premium has been placed on 
simplicity and consistency, as design decisions related to on-the-wire 
representation have an outsize impact on the complexity and architecture 
of the tools and products we all build with it. To rip off Einstein, 
Arrow should be as simple as it can be, but no simpler.


I can't speak for others here, but the questions I ask when evaluating a 
physical format are things like:


1. What impact do these decisions have on streaming architectures, both
   on-device and on-the-wire?
2. How difficult is it to write optimized routines on massively
   parallel (either on or off-chip) architectures?
3. Does this add complexity to (already complicated) byte alignment
   requirements of non-CPU devices?
4. Is this straightforward to implement in multiple languages, and can
   I support the maintenance burden necessary to fix issues as they arise?

And specifically for handling nulls, the questions I'd pose are:

1. How should Arrow represent nulls in nested Struct, Map, or Union
   types, which don't allocate an underlying data buffer of their own?
2. If variable-width types encode their nulls in the offsets table, how
   does that impact thread divergence?
3. How should nulls be encoded in fixed-size binary columns?

A validity bitmap is a consistent approach to handle nulls across the
primitive, variable-width, and nested types. It's been my experience in
the past that other representations may be simpler locally, but turn out
to be more complex, or are precluded outright, once the rest of the
pipeline is taken into account.


Additionally, the validity bitmap doesn't preclude you from using 
sentinels informally in your own systems. The Arrow spec leaves the data 
bytes for null positions unspecified, which you could certainly fill in 
with INT_MIN for any tasks where that's most efficient or convenient, 
and is very fast with a trivial bit of opencl or cuda (this is a common 
technique working with GPUs, as it's often faster to scan over a 
streaming dataset up front vs. incur the cost of branching or thread 
divergence later). One might argue this is abusing the format, but from 
the perspective of shipping high-quality features quickly, I've found 
Arrow's flexibility in this area and others a net plus.
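
For illustration, that up-front scan looks something like this in
TypeScript (a sketch of the idea only; the real version would be a
trivial opencl/cuda kernel, and INT32_MIN as the sentinel is just an
example):

const INT32_MIN = -0x80000000;

// Fill the data slots at null positions with a sentinel. The validity
// bitmap (1 bit per slot, least-significant bit first) remains the source
// of truth; the sentinel just lets later passes avoid branching on nulls.
function fillNullsWithSentinel(data: Int32Array, validity: Uint8Array): void {
  for (let i = 0; i < data.length; ++i) {
    const valid = (validity[i >> 3] & (1 << (i & 7))) !== 0;
    if (!valid) { data[i] = INT32_MIN; }
  }
}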


Best,
Paul

On 11/9/18 3:53 PM, Wes McKinney wrote:

hi Matt,
On Fri, Nov 9, 2018 at 6:36 PM Matt Dowle  wrote:

On Fri, Nov 9, 2018 at 2:14 PM Wes McKinney  wrote:


On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle  wrote:

There is one database that I'm aware of that uses sentinels _and_
supports complex types with missing values: Kx's KDB+.

I read this and was pleased that KDB is being used as a reference. It
is a seriously good database: the gold-standard in many people's eyes.

This has led to some seriously strange choices like the ASCII space
character being used as the sentinel value for strings.

But then I saw this. Surely if sentinels are good enough for KDB then
isn't that a sign that sentinels are not as bad as this group fears?

KDB has a good reputation in the financial world, but it is a very
niche product. I personally wouldn't draw any inferences about
database design from something with such a small and specialized
audience.


I find this view hard to understand. Why not draw some inference from a
highly respected product.

To make an analogy, KDB is like a Formula 1 car. Formula 1 cars are
built to drive in a very particular way in a particular environment,
and I don't think it's representative of driving or car design in
general. KDB cannot be used interchangeably where PostgreSQL is used,
for example.




What about grouping and joining columns that contain NA? Here's an
example from R data.table:

DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6)
DT
    x v
1:  1 1
2:  3 2
3:  3 3
4: NA 4
5:  1 5
6: NA 6

DT[,sum(v),keyby=x]
    x V1
1: NA 10
2:  1  6
3:  3  5

The NAs are grouped as a distinct value and are not excluded for
statistical robustness reasons. This is very easy to achieve efficiently
internally; in fact there is no special code to deal with the NA values
because they are just another distinct value (the sentinel). In Arrow,
if a bitmap is present, there would be more code needed to deal with the
NAs (either way: including the NA group or excluding the NA group), if I
understand correctly.

It depends on who's doing the analysis. Some database systems exclude
nulls in aggregations altogether. In others you indeed would need to
reserve an 

[jira] [Created] (ARROW-3337) JS writer doesn't serialize the dictionary of nested Vectors

2018-09-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-3337:
--

 Summary: JS writer doesn't serialize the dictionary of nested 
Vectors
 Key: ARROW-3337
 URL: https://issues.apache.org/jira/browse/ARROW-3337
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.0


The JS writer only serializes dictionaries for [top-level 
children|https://github.com/apache/arrow/blob/ee9b1ba426e2f1f117cde8d8f4ba6fbe3be5674c/js/src/ipc/writer/binary.ts#L40]
 of a Table. This is wrong, and an oversight on my part. The fix here is to put 
the actual Dictionary vectors in the `schema.dictionaries` map instead of the 
dictionary fields, like I understand the C++ does.
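
A sketch of the traversal the fix needs -- the `Field`/`DataType` shapes
below are simplified, hypothetical stand-ins, and the point is only that
dictionary collection has to recurse into children rather than stop at
the table's top-level fields:

{code:none}
// Simplified, hypothetical shapes for illustration:
interface DataType { children?: Field[]; dictionaryId?: number; }
interface Field { name: string; type: DataType; }

// Collect dictionary-encoded fields at every depth, not just the top level:
function collectDictionaryFields(fields: Field[], out: Field[] = []): Field[] {
  for (const field of fields) {
    if (field.type.dictionaryId !== undefined) { out.push(field); }
    collectDictionaryFields(field.type.children || [], out);
  }
  return out;
}
{code}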



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3336) JS writer doesn't serialize sliced Vectors correctly

2018-09-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-3336:
--

 Summary: JS writer doesn't serialize sliced Vectors correctly
 Key: ARROW-3336
 URL: https://issues.apache.org/jira/browse/ARROW-3336
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.0


The JS IPC writer is slicing the data and valueOffset buffers by starting from 
the data's current logical offset. This is incorrect, since the slice function 
already does this for the data, type, and valueOffset TypedArrays internally. 
PR incoming.
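
The bug in miniature, using plain TypedArray semantics (illustrative
values, not the writer's actual code):

{code:none}
const offset = 2;
const backing = Int32Array.of(0, 1, 2, 3, 4, 5);
// `values` is already a logical view that starts at `offset`:
const values = backing.subarray(offset);   // [2, 3, 4, 5]
// Slicing again from `offset` applies it twice -- the bug:
const wrong  = values.subarray(offset);    // [4, 5]
// The view's offset is already baked in, so slice from 0:
const right  = values.subarray(0);         // [2, 3, 4, 5]
{code}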



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3304) JS stream reader should yield all messages

2018-09-23 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-3304:
--

 Summary: JS stream reader should yield all messages
 Key: ARROW-3304
 URL: https://issues.apache.org/jira/browse/ARROW-3304
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.0


The JS stream reader should yield all parsed messages from the source stream so 
an external consumer of the iterator can read multiple tables from one combined 
source stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [RESULT][VOTE] Release Apache Arrow 0.10.0 (RC1)

2018-08-06 Thread Paul Taylor
Looks like typedoc is using the wrong version of typescript. I can take a
look at updating it later this afternoon if it’s blocking the release.

On Mon, Aug 6, 2018 at 12:55 PM Li Jin  wrote:

> I was trying to build the documentation but hit a JavaScript error. I am
> not familiar with JavaScript and TypeScript ... Any suggestion?
>
> Error log:
>
> /apache-arrow/arrow/js /apache-arrow
>
> + npm install
>
> npm WARN optional Skipping failed optional dependency /chokidar/fsevents:
>
> npm WARN notsup Not compatible with your operating system or architecture:
> fsevents@1.2.4
>
> npm WARN optional Skipping failed optional dependency /sane/fsevents:
>
> npm WARN notsup Not compatible with your operating system or architecture:
> fsevents@1.2.4
>
> npm WARN optional Skipping failed optional dependency
> /watchpack/chokidar/fsevents:
>
> npm WARN notsup Not compatible with your operating system or architecture:
> fsevents@1.2.4
>
> npm WARN ajv-keywords@3.2.0 requires a peer of ajv@^6.0.0 but none was
> installed.
>
> npm WARN uglifyjs-webpack-plugin@1.1.6 requires a peer of webpack@^2.0.0
> ||
> ^3.0.0 but none was installed.
>
> + npm run doc
>
>
> > apache-arrow@0.3.0 doc /apache-arrow/arrow/js
>
> > shx rm -rf ./doc && typedoc --mode file --out doc src/Arrow.ts
>
>
>
> Using TypeScript 2.7.2 from
> /apache-arrow/arrow/js/node_modules/typedoc/node_modules/typescript/lib
>
> Error: /apache-arrow/arrow/js/src/vector.ts(161)
>
>  Property 'childData' does not exist on type 'any[] | Data'.
>
>   Property 'childData' does not exist on type 'any[]'.
>
>
> npm ERR! Linux 4.9.93-linuxkit-aufs
>
> npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "run" "doc"
>
> npm ERR! node v8.10.0
>
> npm ERR! npm  v3.5.2
>
> npm ERR! code ELIFECYCLE
>
> npm ERR! apache-arrow@0.3.0 doc: `shx rm -rf ./doc && typedoc --mode file
> --out doc src/Arrow.ts`
>
> npm ERR! Exit status 4
>
> npm ERR!
>
> npm ERR! Failed at the apache-arrow@0.3.0 doc script 'shx rm -rf ./doc &&
> typedoc --mode file --out doc src/Arrow.ts'.
>
> npm ERR! Make sure you have the latest version of node.js and npm
> installed.
>
> npm ERR! If you do, this is most likely a problem with the apache-arrow
> package,
>
> npm ERR! not with npm itself.
>
> npm ERR! Tell the author that this fails on your system:
>
> npm ERR! shx rm -rf ./doc && typedoc --mode file --out doc src/Arrow.ts
>
> npm ERR! You can get information on how to open an issue for this project
> with:
>
> npm ERR! npm bugs apache-arrow
>
> npm ERR! Or if that isn't available, you can get their info via:
>
> npm ERR! npm owner ls apache-arrow
>
> npm ERR! There is likely additional logging output above.
>
>
> npm ERR! Please include the following file with any support request:
>
> npm ERR! /apache-arrow/arrow/js/npm-debug.log
>
>
> On Mon, Aug 6, 2018 at 7:41 PM, Wes McKinney  wrote:
>
> > I have just rebased master on tag apache-arrow-0.10.0. I will rebase
> > the PRs that might be affected
> >
> > On Mon, Aug 6, 2018 at 2:55 PM, Krisztián Szűcs
> >  wrote:
> > > Wes, I can help You with the forge packages. I'm creating the PRs.
> > >
> > > On Aug 6 2018, at 8:52 pm, Wes McKinney  wrote:
> > >>
> > >> I'm going to start kicking along the conda-forge packages. If anyone
> > >> can assist with these, it would be much appreciated.
> > >>
> > >> I will update the website and write a blog post summarizing the 0.10.0
> > >> release (it's the biggest release we've ever done).
> > >>
> > >> I'm going to rebase master as soon as I merge the build for this PR
> > >> completes: https://travis-ci.org/apache/arrow/builds/412745140
> > >>
> > >> On Mon, Aug 6, 2018 at 12:25 PM, Phillip Cloud 
> > wrote:
> > >> > With 4 binding +1 votes (myself, Uwe, Wes, Kou), 2 non-binding +1
> > (Krisztián,
> > >> > Li), and no other votes, the vote passes. Thanks everyone!
> > >> >
> > >> > I will upload the Java packages as per the release management wiki.
> > >> > Would some folks please volunteer to get the Python packaging,
> > >> > documentation and website updates rolling?
> > >> >
> > >> > -Phillip
> >
>


Re: Help understanding IPC Message/Buffer structure

2018-07-12 Thread Paul Taylor

Hi Randy,

The first four bytes are the int32 length of the flatbuffers Message
metadata, plus 4 bytes of padding between the length and the Message
metadata itself. The Message metadata starts on the 8th byte.


So to read an entire Message, read and store the first four bytes (the 
metadata length). Then advance past the 4 padding bytes, and use the 
flatbuffers API to read the Message table.


The Message table has a bodyLength field, which is byte length of all 
the buffers (data, validity, offsets, and typeIds) for all the Arrays in 
the Message (since Schema messages don't contain any data, its 
bodyLength is always 0).


Once you've read the Message table via flatbuffers, advance `metadata 
length` number of bytes to position yourself to read the Array buffers.


After reading the buffers, advance another `bodyLength` number of bytes 
to read the next message. Repeat this process to read all Messages from 
an Arrow stream.
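
In TypeScript terms, the loop looks roughly like this (a minimal sketch
that assumes the whole stream is already in memory and delegates the
flatbuffers decode of the Message table to a caller-supplied function;
padding and end-of-stream details are glossed over):

function* eachMessage(
  bytes: Uint8Array,
  bodyLengthOf: (metadata: Uint8Array) => number
): IterableIterator<{ metadata: Uint8Array; body: Uint8Array }> {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let offset = 0;
  while (offset + 4 <= bytes.byteLength) {
    // 1. the first four bytes are the int32 metadata length
    const metadataLength = view.getInt32(offset, true);
    if (metadataLength <= 0) { break; }
    // 2. the flatbuffers Message metadata follows the length prefix
    const metadata = bytes.subarray(offset + 4, offset + 4 + metadataLength);
    // 3. Message.bodyLength (read via flatbuffers) is the byte length of
    //    all the Array buffers that follow the metadata
    const bodyLength = bodyLengthOf(metadata);
    const bodyStart = offset + 4 + metadataLength;
    yield { metadata, body: bytes.subarray(bodyStart, bodyStart + bodyLength) };
    // 4. advance past the body to the next message and repeat
    offset = bodyStart + bodyLength;
  }
}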


If you're familiar with JavaScript/TypeScript, you can reference the
implementation in the Arrow JS source.


Hope this clears things up,

Paul


On 07/12/2018 11:30 AM, Randy Zwitch wrote:

I’m trying to understand how to parse a Buffer into a Schema, but
using pdb with Python and reading the TS/Python/C++ Arrow source hasn’t
really cleared much up for me. Nor has studying
https://arrow.apache.org/docs/ipc.html


Here are the steps of what I’ve tried (the code is Julia, but only
because I’m trying to do this natively, rather than wrap the Arrow C code):


# Thrift API method returning a struct (sm_buf, sm_size, df_buf, df_size)
  (works as expected)
julia> tdf = sql_execute_df(conn, "select * from flights_2008_7m limit
1000", 0, 0, 1000)

MapD.TDataFrame(UInt8[0xba, 0x58, 0x1b, 0x3d], 93856, UInt8[0xab, 0xd7,
0x7e, 0x50], 10)

# Wrap shared memory into julia array, based on handle and size (works as
expected)
julia> sm_buf = MapD.load_buffer(tdf.sm_handle, tdf.sm_size) #wrapper using
shmget/shmat
93856-element Array{UInt8,1}:
  0x2c
  0x16
  0x00
  0x00
  0x14
  0x00
  0x00
  0x00
  0x00
  0x00
 ⋮
  0x20
  0x74
  0x6f
  0x20
  0x4d
  0x66
  0x72
  0x00
  0x00

At this point, walking through a similar Python process, I know that
sm_buf represents:
- type: Schema
- metadata length: 5676
- body_length: 0

Where I’m confused is how to proceed.

I am getting metadata_length by reinterpreting the first 4-bytes as Int32.

julia> mlen = reinterpret(Int32, sm_buf[1:4])[1]
5676

I then assumed that I could start at byte 5 and take the next `mlen-1`
bytes:

julia> metadata = sm_buf[5:5+mlen-1]
5676-element Array{UInt8,1}:
  0x14
  0x00
  0x00
  0x00
  0x00
  0x00
  0x00
  0x00
  0x0c
  0x00
 ⋮
  0x79
  0x65
  0x61
  0x72
  0x00
  0x00
  0x00
  0x00
  0x00


Am I on the right track here? I *think* that my `metadata` variable above
is a FlatBuffer, but how do I know what its structure is? Additionally,
what am I supposed to do with all of the bytes that haven’t been read from
`sm_buf` yet? `sm_buf` is 93856 bytes and I’ve only read the first 4 bytes
+ metadata length, leaving some 88,000 bytes not processed yet.

Any help would be greatly appreciated here. Please note that I’m not asking
for julia coding help, but rather what the Arrow bytes actually mean/their
structure and how to process them further.

Thanks,
Randy Zwitch





[jira] [Created] (ARROW-2839) [JS] Support whatwg/streams in IPC reader/writer

2018-07-12 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2839:
--

 Summary: [JS] Support whatwg/streams in IPC reader/writer
 Key: ARROW-2839
 URL: https://issues.apache.org/jira/browse/ARROW-2839
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.4.0


We should make it easy to stream Arrow in the browser via 
[whatwg/streams|https://github.com/whatwg/streams]. I already have this working 
at Graphistry, but I had to use some of the IPC internal methods. Creating this 
issue to track back-porting that work and the few minor refactors to the IPC 
internals that we'll need to do.
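
For illustration, the kind of browser usage this should enable (a sketch
only: a {{RecordBatchReader.from}} that accepts a whatwg ReadableStream,
and the {{numRows}} accessor, are the capability being proposed here, not
an API that exists yet):

{code:none}
import { RecordBatchReader } from 'apache-arrow';

// Stream record batches straight out of a fetch() response body:
async function logBatches(url: string) {
  const response = await fetch(url);
  const reader = await RecordBatchReader.from(response.body!);
  for await (const batch of reader) {
    console.log(`received a batch of ${batch.numRows} rows`);
  }
}
{code}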



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2828) [JS] Refactor Vector Data classes

2018-07-10 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2828:
--

 Summary: [JS] Refactor Vector Data classes
 Key: ARROW-2828
 URL: https://issues.apache.org/jira/browse/ARROW-2828
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.0


In order to make it easier to build some of the higher-level APIs, we need to 
slim the Vector Data classes down to just one base implementation.

Initial WIP commit here, and work will continue in this branch: 
https://github.com/trxcllnt/arrow/commit/dfad9023583bef4f8d2a50ea25f643e4bccbc805#diff-2512057432c4ebf55c6308cb06b43b08



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Concerns about the Arrow Slack channel

2018-07-09 Thread Paul Taylor
Brian, Leo, and I yesterday discussed creating an ArrowJS channel in the 
Graphistry Slack as an alternative. Graphistry doesn't mind being the 
channel admins; we just don't want to run afoul of any ASF guidelines 
for project organization.



On 07/09/2018 11:47 AM, Uwe L. Korn wrote:

Bumping this thread again as we still have to discuss how to deal with the 
JavaScript community in Slack.

The main difference here with all other parts of the Arrow community is that 
they are very active users and also use Slack for communication between 
developers.

Paul, Brian & co: Is there an alternative that you could think of that works
as well as the current approach? Maybe we should do a Slack solely for Arrow JS?

Uwe

On Sat, Jul 7, 2018, at 5:20 PM, Wes McKinney wrote:

I have just started a vote about closing the channel.

On Tue, Jun 26, 2018 at 11:03 AM, Wes McKinney  wrote:

I would personally prefer to have all questions on the mailing list
for now. I don't know if the community is large enough to provide
consistent attention on an additional communication channel. If there
end up being too many user-centric questions on dev@, we can activate
and use user@ as some other projects (like Spark) do.

I see Discourse as a longer-term possibility. We also need to see if
ASF Infrastructure will support it, which would be the ideal route.

- Wes

On Tue, Jun 26, 2018 at 10:55 AM, Dhruv Madeka  wrote:

It might be nice to have the discourse option before shutting it down. As
someone who asks, that would be a nice way to get me to migrate

On Tue, Jun 26, 2018 at 10:53 AM, Wes McKinney  wrote:


hi folks,

How would you like to proceed on the Slack channel discussion? It
seems there is reasonable consensus to close the channel. Should we
have a vote?

It would be a good idea to export the data / chat history from the
channel before closing it down.

Thanks
Wes

On Thu, Jun 21, 2018 at 11:36 PM, Wes McKinney 
wrote:

It's sort of unrelated to this conversation, but since someone
mentioned MXNet I want to call attention to a thread on their podling
mailing list about JIRA vs. GitHub issues:
https://lists.apache.org/thread.html/b4d174223d68c5822ea538f2609281c8023c7cc1eaef298bb2c4c186@%3Cdev.mxnet.apache.org%3E.

To summarize the thread: a lot of people don't like change, but
sometimes change is good. Some people have complained privately to me
that Arrow doesn't work like any other random project on GitHub.
Anyone who doesn't contribute to the project on account of that is,
IMHO, not a serious contributor.

I personally find JIRA to be an excellent tool, but it's a steeper
learning curve than GitHub and so it does take a bit of effort to
learn its features.

- Wes


On Thu, Jun 21, 2018 at 11:05 PM, Wes McKinney wrote:

Thanks all. I'm intrigued by Discourse (for some reason I keep typing
"discourge.org"); we should inquire with ASF infra to see if they
would be willing to support it for us. It's important that we develop
a public record for the project, and for that data to be archived and
indexed in some place that is owned by the ASF. I'm frankly -0 on
having a fourth communication channel (outside of e-mail/JIRA/GitHub)
since three is already a lot to keep track of. If we had a larger
maintainer team, I might feel differently.

Travis had some questions about GitHub and JIRA. JIRA is the only
system of record for concrete development activity in the project. We
use GitHub pull requests to submit patches (some projects use Gerrit,
or attach patch files to JIRA), but all of the data generated on these
PRs (code review comments, etc.) is mirrored back to JIRA.
Furthermore, JIRA activity is relayed to the iss...@arrow.apache.org
mailing list. So ultimately we have a public record for the project on
mailing lists.

Many newcomers have never interacted with an Apache project before,
and so when they go to http://github.com/apache/arrow their first
reaction is to look for the Issues tab to report a bug or ask for
something. For a long time we didn't have issues turned on, and we
found that people were "bouncing" rather than seeking out the mailing
list or JIRA. We'd rather capture the information somewhere rather
than lose it. We have an issue template asking people to either use
the mailing list or JIRA, but a lot of people ignore it unfortunately:
https://github.com/apache/arrow/blob/master/.github/ISSUE_TEMPLATE.md.

- Wes

On Thu, Jun 21, 2018 at 10:52 PM, Kenta Murata wrote:

Hi everyone,

I heard from Kou that you’re discussing to stop using Slack.
So I want to propose another way to use Discourse.

On 2018/06/21 18:46:54, Dhruv Madeka  wrote:

The issue with discourse is that you either have to host it or pay
for them to host it

Discourse provides a free hosting plan for community-friendly opensource
projects.

See this article for the details:

but still +1 for discourse, it's a really nice 

Re: [VOTE] Close down Arrow Slack channel

2018-07-09 Thread Paul Taylor

-1

I use it to coordinate with folks on the JS side. As I already have to
participate in a number of other Slack workspaces on a daily basis, Arrow
Slack is generally less intrusive than even opening and responding to email.



On 07/09/2018 10:30 AM, Bryan Cutler wrote:

+1

On Mon, Jul 9, 2018 at 1:46 AM, Li Jin  wrote:


+1
On Mon, Jul 9, 2018 at 3:08 AM Kouhei Sutou  wrote:


+1

In 
   "[VOTE] Close down Arrow Slack channel" on Sat, 7 Jul 2018 11:20:07
-0400,
   Wes McKinney  wrote:


Dear all,

Based on our mailing list discussion [1], I am proposing to close
down the Arrow Slack channel over concerns that its use in
practice is not consistent with the Apache Way.

 [ ] +1 : Shut down Slack channel
 [ ]  0 : No opinion
 [ ] -1 : Do not shut down Slack channel because...

Here is my vote: +1

The vote will be open for at least 72 hours.

Thanks,
Wes

[1]:

https://mail-archives.apache.org/mod_mbox/arrow-dev/201806.mbox/%3CCAJPUwMDyiZfXUWdW1Mk7UwKbbvd--Dws6oOVqjBF3ZadTmXTGg%40mail.gmail.com%3E




[jira] [Created] (ARROW-2779) [JS] Fix node stream reader/writer compatibility

2018-07-01 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2779:
--

 Summary: [JS] Fix node stream reader/writer compatibility
 Key: ARROW-2779
 URL: https://issues.apache.org/jira/browse/ARROW-2779
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.0


Emit Buffers not Uint8Arrays, and guard against reading 0-length buffers



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2650) [JS] Finish implementing Unions

2018-05-31 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2650:
--

 Summary: [JS] Finish implementing Unions
 Key: ARROW-2650
 URL: https://issues.apache.org/jira/browse/ARROW-2650
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.0


Finish implementing Unions in JS and add to integration tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2640) JS Writer should serialize schema metadata

2018-05-27 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2640:
--

 Summary: JS Writer should serialize schema metadata
 Key: ARROW-2640
 URL: https://issues.apache.org/jira/browse/ARROW-2640
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor


JS writer should serialize schema metadata



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Proposed Arrow Graph representations

2018-05-19 Thread Paul Taylor
At GTC San Jose last month, NVidia's Joe Eaton (cc'd) presented on the
nvGraph team's goals for
accelerating in-memory graph processing and analytics. A major component of
that is advancing and standardizing a common, efficient representation for
graphs that can support a broad range of use-cases, from small to large.

To that end, I'd like to kick off the discussion about native graph
representations in Arrow.

Joe's team has prepared a preliminary FlatBuffers schema for efficient
columnar representations of the four most common graph formats. It includes
embedded edge and vertex property tables, and is designed to be compatible
with the existing Arrow column types. My initial thoughts are that we could
add an optional 5th Graph Message type, similar to how Tensor Messages are
presently implemented.

I've pushed Joe's initial GraphSchema.fbs to this branch on my Arrow fork
.
From what I understand, the tables have been expanded into separate
definitions for the sake of comprehension, and the final forms will be
collapsed into each distinct Graph type, parameterized by sizes defined at
the top.

I also understand the nvGraph team supports these layouts natively,
enabling the community to take advantage of high-performance GPU kernels
very early on, and possibly align with libraries like Hornet
 (previously cuStinger).

Cheers,
Paul


Re: [JS] Arrow output from JS library?

2018-05-10 Thread Paul Taylor
Quick update on the Arrow JS ipc buffer writer:

I had a chance to revisit this branch on my fork
<https://github.com/trxcllnt/arrow/commits/js-buffer-writer>  last night,
and managed to get a working prototype of the RecordBatchStreamWriter
correctly serializing the integration test data to ArrayBuffers.

Next steps are to get more tests in place, finish the
RecordBatchFileWriter, do the JSON writers, validate against Arrow
cpp/java, implement builders, fixes/optimizations, and get a PR ready.

Best,
Paul




On Tue, Apr 17, 2018 at 7:09 PM, Paul Taylor <ptaylor.apa...@gmail.com>
wrote:

> Hi Naveen,
>
> I have some work in a branch
> <https://github.com/trxcllnt/arrow/blob/js-buffer-writer/js/src/ipc/writer/binary.ts>
> on my fork, and perhaps a bit more locally, but it's not finished.
>
> Feel free to reach out if you want to collaborate. Otherwise Graphistry
> intends to staff it full time in the next few weeks.
>
> Best,
> Paul
>
>
> On Tue, Apr 17, 2018 at 3:33 PM, Naveen Michaud-Agrawal <
> naveen.michaudagra...@gmail.com> wrote:
>
> Hi, Are there any plans to be able to create arrow objects from the JS
> library? Naveen Michaud-Agrawal
>
>


Re: [JS] Arrow output from JS library?

2018-04-17 Thread Paul Taylor

Hi Naveen,

I have some work in a branch on my fork, and perhaps a bit more 
locally, but it's not finished.


Feel free to reach out if you want to collaborate. Otherwise Graphistry 
intends to staff it full time in the next few weeks.


Best,
Paul


On Tue, Apr 17, 2018 at 3:33 PM, Naveen Michaud-Agrawal 
 wrote:

Hi,

Are there any plans to be able to create arrow objects from the JS 
library?


Naveen Michaud-Agrawal


Re: Next Arrow sync call

2018-03-29 Thread Paul Taylor
I'd like to join the gcal invite as well. Thanks!

> On Mar 29, 2018, at 11:10 AM, Wes McKinney  wrote:
> 
> Looks good.
> 
> The next Arrow sync will be Wednesday April 4 at 12:00 US Eastern time
> 
> On Thu, Mar 29, 2018 at 7:53 AM, Uwe L. Korn  wrote:
>> Hi,
>> 
>> I've added all who have requested an invite. Hope this worked even though I'm
>> not the organiser.
>> 
>> Uwe
>> 
>> On Thu, Mar 29, 2018, at 1:11 PM, Deepak Majeti wrote:
>>> Wes,
>>> 
>>> Can you add me too? Thanks!
>>> 
>>> On Wed, Mar 28, 2018 at 9:52 PM, Alex Hagerman 
>>> wrote:
>>> 
 Hi,
 
 Can I get an invite as well?
 
 Thank you.
 
 Alex
 
 
 
 On 03/28/2018 09:28 PM, Aneesh Karve wrote:
 
> Hi Wes, please add me to the Gcal invite. Thank you.
> 
> 
 
>>> 
>>> 
>>> --
>>> regards,
>>> Deepak Majeti



[jira] [Created] (ARROW-2356) [JS] JSON reader fails on FixedSizeBinary data buffer

2018-03-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2356:
--

 Summary: [JS] JSON reader fails on FixedSizeBinary data buffer
 Key: ARROW-2356
 URL: https://issues.apache.org/jira/browse/ARROW-2356
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.3.1
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.0


The JSON reader doesn't ingest the FixedSizeBinary data buffer correctly, and
we haven't known about it because the JS integration test runner was
accidentally exiting with code 0 on failures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Working towards getting 0.9.0 release candidate up next week

2018-03-14 Thread Paul Taylor
This should fix it: https://github.com/apache/arrow/pull/1751 


> On Mar 14, 2018, at 6:42 PM, Wes McKinney  wrote:
> 
> Last item https://issues.apache.org/jira/browse/ARROW-2312
> 
> I can start the release vote as soon as we have the release
> verification script working again
> 
> On Wed, Mar 14, 2018 at 2:28 PM, Wes McKinney  wrote:
>> OK, patch is up for ARROW-2307:
>> https://github.com/apache/arrow/pull/1747. Once that is reviewed and
>> merged I will start the release vote
>> 
>> On Wed, Mar 14, 2018 at 10:18 AM, Wes McKinney  wrote:
>>> I'm going to have a look at ARROW-2307 to see if it's an easy fix. If
>>> not, I will go ahead with the RC
>>> 
>>> On Wed, Mar 14, 2018 at 7:36 AM, Uwe L. Korn  wrote:
 The mentioned bugfixes were merged and I also tested the 
 Arrow<->Parquet-cpp as well as the Arrow<->Dask(parquet) integration. Both 
 seem to work fine. From my side it looks like we're ready to make an RC.
 
 Uwe
 
 On Wed, Mar 14, 2018, at 5:07 AM, Wes McKinney wrote:
> I fixed these bugs
> 
> https://github.com/apache/arrow/pull/1742
> https://github.com/apache/arrow/pull/1743
> 
> As soon as these patches go in, we can cut the RC0. I can do this as
> soon as tomorrow (Wednesday) morning
> 
> On Tue, Mar 13, 2018 at 5:00 PM, Wes McKinney  wrote:
>> I found 2 bugs -- ARROW-2304 and ARROW-2306 -- while doing some final
>> testing of master (stuff we haven't been testing in CI -- we _really_
>> need to set up nightly CI jobs for more time consuming tests, like
>> HDFS, that we want to test periodically but perhaps not on every
>> commit). I'm going to see if these are easy to fix
>> 
>> On Tue, Mar 13, 2018 at 11:21 AM, Wes McKinney  
>> wrote:
>>> I fixed ARROW-2227 in https://github.com/apache/arrow/pull/1740, so if
>>> someone could review that would be great.
>>> 
>>> I am going to make sure the C++/Python HDFS tests run locally, then
>>> after the patch above is merged we should be good to make the RC.
>>> 
>>> Note to other onlookers -- please feel free to keep writing new
>>> patches not mentioned here. They may just have to go in to the next
>>> release
>>> 
>>> Thanks
>>> Wes
>>> 
>>> On Tue, Mar 13, 2018 at 12:56 AM, Wes McKinney  
>>> wrote:
 Things are looking pretty good. I'm waiting on a build for ARROW-1643
 to go in, and there's a last blocker bug (ARROW-2227) that we ought to
 try to fix tomorrow before cutting the RC. I can start the vote after
 that if nothing else comes up
 
 I created ARROW-2300 in the course of trying to test ARROW-1643. I
 have an alternative way to run the HDFS tests (since we aren't running
 them in Travis CI) so most likely I will try out the HDFS tests and
 then move ARROW-2300 to the next release milestone.
 
 Thanks
 Wes
 
 On Mon, Mar 12, 2018 at 12:03 AM, Wes McKinney  
 wrote:
> I've done a pass over the remaining JIRAs -- I think we're going to
> need at least another full day to get things buttoned up, so I would
> say we're looking at an RC on Tuesday.
> 
> In progress:
> - ARROW-1425: Patch up, needs some editing, may be moved to 0.10.0
> - ARROW-2282: Patch up, needs some test cases
> - ARROW-1974: Patch in review in apache/parquet-cpp -- may want to
> move this JIRA to the Parquet project
> - ARROW-2122: Patch to be reviewed
> - ARROW-2135: Patch to be merged
> 
> TODO
> - ARROW-2082: Parquet segfault <- to be investigated, may be moved to 
> 0.10.0
> - ARROW-2118: Fix rough edge with reading length-0 files
> - ARROW-2227: Bug with creating chunked arrays in Table.from_pandas
> - ARROW-2292: Deprecation / renaming a Python method
> 
> Out of the two items in TODO, ARROW-2118 and ARROW-2292 are small
> matters, so I will take care of them. ARROW-2227 may not be fixable
> within ~1 day but it would be useful to have a diagnosis in case the
> fix is easy -- multiple users hit this bug.
> 
> Thanks,
> Wes
> 
> On Thu, Mar 8, 2018 at 8:47 PM, Kouhei Sutou  
> wrote:
>> Thanks!
>> 
>> --
>> kou
>> 
>> In 
>> 
>>  "Re: Working towards getting 0.9.0 release candidate up next week" 
>> on Thu, 8 Mar 2018 20:44:14 -0500,
>>  Wes McKinney  wrote:
>> 

Re: [VOTE] Apache Arrow JavaScript 0.3.1 - RC1

2018-03-14 Thread Paul Taylor
+1 (non-binding)

> On Mar 14, 2018, at 5:10 PM, Wes McKinney  wrote:
> 
> Hello all,
> 
> I'd like to propose the following release candidate (rc1) of Apache Arrow
> JavaScript version 0.3.1.
> 
> The source release rc1 is hosted at [1].
> 
> This release candidate is based on commit
> 077bd53df590cafe26fc784b3c6d03bf1ac24f67
> 
> Please download, verify checksums and signatures, run the unit tests, and vote
> on the release. The easiest way is to use the JavaScript-specific release
> verification script dev/release/js-verify-release-candidate.sh.
> 
> The vote will be open for at least 24 hours and will close once
> enough PMCs have approved the release.
> 
> [ ] +1 Release this as Apache Arrow JavaScript 0.3.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow JavaScript 0.3.1 because...
> 
> 
> How to validate a release signature:
> https://httpd.apache.org/dev/verification.html
> 
> [1]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.1-rc1/
> [2]: 
> https://github.com/apache/arrow/tree/077bd53df590cafe26fc784b3c6d03bf1ac24f67



Re: [VOTE] Release Apache Arrow JavaScript 0.3.1 - RC0

2018-03-14 Thread Paul Taylor
This issue has been resolved. I'm available this week to help with anything 
else blocking this release. Thx

> On Mar 12, 2018, at 9:10 AM, Wes McKinney  wrote:
> 
> OK, thanks Brian. I will cancel this release and we can cut a new RC
> after this issue is resolved.
> 
> On Mon, Mar 12, 2018 at 10:49 AM, Brian Hulette  
> wrote:
>> -1 (non-binding)
>> 
>> I get an error when running js-verify-release-candidate.sh, which
>> I can also replicate with a fresh clone of arrow on commit
>> 17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6 by running `npm install`
>> and then `npm run test -- -t ts`:
>> 
>> [10:21:08] Starting 'test:ts'...
>> ● Validation Error:
>> 
>>  Module ./node_modules/babel-jest/build/index.js in the transform option
>> was not found.
>> 
>>  Configuration Documentation:
>>  https://facebook.github.io/jest/docs/configuration.html
>> 
>> [10:21:09] 'test:ts' errored after 306 ms
>> [10:21:09] Error: exited with error code: 1
>>at ChildProcess.onexit
>> (/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36)
>>at emitTwo (events.js:126:13)
>>at ChildProcess.emit (events.js:214:7)
>>at Process.ChildProcess._handle.onexit
>> (internal/child_process.js:198:12)
>> [10:21:09] 'test' errored after 311 ms
>> 
>> 
>> Seems like the issue is that babel-jest is not included as a dev
>> dependency, so it's not found in node_modules in the new clone.
>> Not sure how it was working in the past, perhaps it was a
>> transitive dependency that was reliably included?
>> 
>> I can put up a PR to add the dependency
>> 
>> Brian
>> 
>> 
>> 
>> On 03/10/2018 01:52 PM, Wes McKinney wrote:
>>> 
>>> +1 (binding), ran js-verify-release-candidate.sh with NodeJS 8.10.0
>>> LTS on Ubuntu 16.04
>>> 
>>> On Sat, Mar 10, 2018 at 1:52 PM, Wes McKinney  wrote:
 
 Hello all,
 
 I'd like to propose the 1st release candidate (rc0) of Apache Arrow
 JavaScript version 0.3.1. This is a bugfix release from 0.3.0.
 
 The source release rc0 is hosted at [1].
 
 This release candidate is based on commit
 17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6
 
 Please download, verify checksums and signatures, run the unit tests, and
 vote
 on the release. The easiest way is to use the JavaScript-specific release
 verification script dev/release/js-verify-release-candidate.sh.
 
 The vote will be open for at least 24 hours and will close once
 enough PMCs have approved the release.
 
 [ ] +1 Release this as Apache Arrow JavaScript 0.3.1
 [ ] +0
 [ ] -1 Do not release this as Apache Arrow JavaScript 0.3.1 because...
 
 Thanks,
 Wes
 
 How to validate a release signature:
 https://httpd.apache.org/dev/verification.html
 
 [1]:
 https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.1-rc0/
 [2]:
 https://github.com/apache/arrow/tree/17b09ca0676995cb62ea1f9b6d6fa2afd99c33c6
>> 
>> 



[jira] [Created] (ARROW-2226) [JS] DictionaryData should use indices' offset in constructor

2018-02-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2226:
--

 Summary: [JS] DictionaryData should use indices' offset in 
constructor
 Key: ARROW-2226
 URL: https://issues.apache.org/jira/browse/ARROW-2226
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: JS-0.3.0
Reporter: Paul Taylor
Assignee: Paul Taylor






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2225) [JS] Vector reader should support reading tables split across buffers

2018-02-26 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2225:
--

 Summary: [JS] Vector reader should support reading tables split 
across buffers
 Key: ARROW-2225
 URL: https://issues.apache.org/jira/browse/ARROW-2225
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: JS-0.3.0
Reporter: Paul Taylor
Assignee: Paul Taylor






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2214) [JS] proxy DictionaryVector's nullBitmap to its indices' nullBitmap

2018-02-25 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2214:
--

 Summary: [JS] proxy DictionaryVector's nullBitmap to its indices' 
nullBitmap
 Key: ARROW-2214
 URL: https://issues.apache.org/jira/browse/ARROW-2214
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: JS-0.3.0
Reporter: Paul Taylor
Assignee: Paul Taylor


We need to add a {{nullBitmap}} getter to {{DictionaryData}} that proxies to 
the indices' {{nullBitmap}}, like we do with the {{nullCount}}. This is blocking 
the PR that updates JPMC Perspective to v0.3.0: 
https://github.com/jpmorganchase/perspective/pull/55#issuecomment-368164271
[~wesmckinn] can we do a patch release v0.3.1 once this PR is merged, since 
it's blocking a 3rd party PR?
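
For illustration, a minimal sketch of the proxy (hypothetical shapes, not the 
actual classes): in a dictionary-encoded vector the validity bitmap lives on 
the indices, so {{DictionaryData}} just forwards to it, the same way 
{{nullCount}} already does.

{code}
// A minimal sketch, assuming a hypothetical Data shape.
interface Data {
  nullBitmap: Uint8Array;
  nullCount: number;
}

class DictionaryData {
  constructor(private indices: Data) {}
  // A dictionary value is null exactly when its index is null, so both
  // getters forward to the indices' data:
  get nullBitmap(): Uint8Array { return this.indices.nullBitmap; }
  get nullCount(): number { return this.indices.nullCount; }
}
{code}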



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2213) [JS] Fix npm-release.sh

2018-02-25 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-2213:
--

 Summary: [JS] Fix npm-release.sh 
 Key: ARROW-2213
 URL: https://issues.apache.org/jira/browse/ARROW-2213
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor


Fix two publishing issues:
 1. timeouts caused by npm 2FA settings:
    https://github.com/lerna/lerna/issues/1137
 2. silent failure publishing the main apache-arrow module due to a "dist" key
    in that module's generated package.json (see the sketch below)
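
For the second issue, a minimal sketch (hypothetical output path; the real 
release script differs) of scrubbing the generated manifest before publish:

{code}
// A minimal sketch, assuming a hypothetical build output location.
import { readFileSync, writeFileSync } from 'fs';

const manifestPath = 'targets/apache-arrow/package.json'; // hypothetical path
const pkg = JSON.parse(readFileSync(manifestPath, 'utf8'));
delete pkg.dist; // the npm registry reserves "dist" for tarball/shasum metadata
writeFileSync(manifestPath, JSON.stringify(pkg, null, 2) + '\n');
{code}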



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-1903) [JS] Fix typings consuming apache-arrow module when noImplicitAny is false

2017-12-07 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-1903:
--

 Summary: [JS] Fix typings consuming apache-arrow module when 
noImplicitAny is false
 Key: ARROW-1903
 URL: https://issues.apache.org/jira/browse/ARROW-1903
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.8.0
Reporter: Paul Taylor
Assignee: Paul Taylor


The TypeScript compiler has a few bugs that raise compiler errors when valid 
strict-mode code is compiled with some of the strict-mode settings disabled. 
Since we ship the TS source code in the main `apache-arrow` npm module, 
consumers will encounter the following TypeScript compiler errors under these 
conditions:

{code}
# --strictNullChecks=true, --noImplicitAny=false
vector/numeric.ts(57,17): error TS2322: Type 'number' is not assignable to type 
'never'.
vector/numeric.ts(61,35): error TS2322: Type 'number' is not assignable to type 
'never'.
vector/numeric.ts(63,18): error TS2322: Type '0' is not assignable to type 
'never'.
vector/virtual.ts(98,38): error TS2345: Argument of type 'TypedArray' is not 
assignable to parameter of type 'never'.
{code}

The fixes are minor, and I'll add a step in the unit tests to validate that the 
build targets compile with different compilation flags than ours.
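
Roughly, that validation step could look like the sketch below (hypothetical 
paths and flag sets, chosen for illustration): compile the published targets 
once per flag combination and fail the run if any of them errors.

{code}
// A minimal sketch, assuming a hypothetical target layout.
import { execSync } from 'child_process';

// Flag combinations chosen for illustration; the errors above hit the first one.
const flagSets = [
  '--strictNullChecks true --noImplicitAny false',
  '--strictNullChecks false --noImplicitAny true',
];

for (const flags of flagSets) {
  // execSync throws if tsc exits non-zero, failing the test step.
  execSync(`tsc --noEmit ${flags} targets/apache-arrow/Arrow.ts`, { stdio: 'inherit' });
}
{code}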

Related:
https://github.com/ReactiveX/IxJS/pull/167
https://github.com/Microsoft/TypeScript/issues/20299



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Arrow JS tasks and roadmap

2017-10-19 Thread Paul Taylor
Brian Hulette and I have outlined this list of tasks/improvements for the 
expanded Arrow JS implementation:

https://docs.google.com/document/d/142dek89oM2TVI2Yql106Zo8IB1Ff_9zDg_EG6jPWS0M/edit?usp=sharing
 


No timelines yet, but I'm back full-time on Arrow now. Working my way down the 
first two lists, hoping to PR a few by early/mid next week.

Please edit, comment, and share at will!

Thanks,
Paul

[jira] [Created] (ARROW-1590) Flow TS Table method generics

2017-09-21 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-1590:
--

 Summary: Flow TS Table method generics
 Key: ARROW-1590
 URL: https://issues.apache.org/jira/browse/ARROW-1590
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor


The Table method generics should thread the Vector and value types through from 
the call site.
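
As a sketch of the intent (hypothetical shapes; the real signatures differ): 
the generic parameter should flow from whatever Vector type the caller names 
down to the returned values, rather than collapsing to a base type.

{code}
// A minimal sketch with hypothetical types illustrating call-site threading.
interface Vector<T> {
  get(index: number): T | null;
}

class Table {
  private columns = new Map<string, Vector<any>>();
  // T flows from the call site down to the values the vector yields:
  getColumn<T>(name: string): Vector<T> | undefined {
    return this.columns.get(name);
  }
}

declare const table: Table;
const prices = table.getColumn<number>('price'); // typed as Vector<number>
const first = prices ? prices.get(0) : undefined; // number | null | undefined
{code}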



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1544) [JS] Export Vector type definitions

2017-09-15 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-1544:
--

 Summary: [JS] Export Vector type definitions
 Key: ARROW-1544
 URL: https://issues.apache.org/jira/browse/ARROW-1544
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Paul Taylor
Assignee: Paul Taylor
Priority: Minor
 Fix For: 0.7.0


We should export the Vector type definitions on the main Arrow export.
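
Concretely, something along these lines in the package's entry point 
(hypothetical module layout):

{code}
// index.ts — a minimal sketch, assuming the vector classes live in ./vector;
// the point is that consumers can import the type definitions from the root.
export * from './vector';
{code}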



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)