Problem with master build failing

2020-07-02 Thread Fan Liya
Dear all,

Currently, the master build is failing occasionally.
After investigation, we found that it was caused by a cyclic dependency
during class loading.

We have provided a patch for it [1]. Please take a look.

Best,
Liya Fan

[1] https://github.com/apache/arrow/pull/7628


Re: [RESULT] [VOTE] Add a "Feature" enum to Schema.fbs

2020-07-02 Thread Micah Kornfield
I added JIRAs for incorporating this into implementations.

On Thu, Jul 2, 2020 at 6:25 AM Wes McKinney  wrote:

> Forwarding with [RESULT] subject line
>
> On Wed, Jul 1, 2020 at 1:24 AM Micah Kornfield 
> wrote:
> >
> > The vote carries with 4 binding +1 votes and 0 non-binding +1. I will
> > merge the change and open some JIRAs about reading/writing the new field
> > from reference implementations (hopefully tomorrow).
> >
> > Thanks,
> > Micah
> >
> > On Tue, Jun 30, 2020 at 11:21 PM Micah Kornfield 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > On Sun, Jun 28, 2020 at 4:39 PM Sutou Kouhei 
> wrote:
> > >
> > >> +1 (binding)
> > >>
> > >> In <
> cak7z5t8c6pcsojdtvoga9vf868h-abdx-oqvw47as44mb-u...@mail.gmail.com>
> > >>   "[VOTE] Add a "Feature" enum to Schema.fbs" on Sat, 27 Jun 2020
> > >> 20:46:55 -0700,
> > >>   Micah Kornfield  wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > As discussed on the mailing list [1] I would like to add a feature
> enum
> > >> to
> > >> > enhance our ability to evolve Arrow in a forward compatible manner
> and
> > >> > allow clients and servers to negotiate which features are supported
> > >> before
> > >> > finalizing the features used.
> > >> >
> > >> > The PR adds a new enum and a field on schema.fbs [2]. We may make
> > >> > modifications to the language in comments but this vote is whether
> to
> > >> > accept the addition of this enum and field.  Details for how this
> will
> > >> be
> > >> > used in flight, are not part of this change.
> > >> >
> > >> > For clarity, this change is non-breaking and fully backwards
> > >> > compatible. The field ensures that current libraries will be able to
> > >> > determine if a future library version used features that it doesn't
> > >> support
> > >> > (by checking for out of range enum values).  It will require
> libraries
> > >> to
> > >> > both populate the field on writing and check values when reading.
> > >> >
> > >> > The vote will be open for at least 72 hours.
> > >> >
> > >> > [ ] +1 Accept addition of Feature enum flatbuffers field
> > >> > [ ] +0
> > >> > [ ] -1 Do not accept addition because...
> > >> >
> > >> > [1]:
> > >> >
> > >>
> https://mail-archives.apache.org/mod_mbox/arrow-dev/202006.mbox/%3CCAK7Z5T-mUB1ipO7YGqwW%3DtcW7eA8_aYvrjWAzLmHw7ZtS09naQ%40mail.gmail.com%3E
> > >> > [2]: https://github.com/apache/arrow/pull/7502
> > >>
> > >
>


RE: [CI] Reliability of s390x Travis CI build

2020-07-02 Thread Kazuaki Ishizaki
I have seen this failure multiple times; however, it has not been addressed yet.
https://travis-ci.community/t/s390x-no-space-left-on-device/8953

That is fine with me until we see more stable results.

Regards,
Kazuaki Ishizaki



From:   Wes McKinney 
To: dev 
Date:   2020/07/03 05:32
Subject:[EXTERNAL] Re: [CI] Reliability of s390x Travis CI build



Just looking at https://travis-ci.org/github/apache/arrow/builds the
failure rate on master (which should be green > 95% of the time) is
really high. I'm going to open a patch adding it to allow_failures until
we see this become less flaky

On Thu, Jul 2, 2020 at 8:39 AM Antoine Pitrou  wrote:
>
>
> In my experience, both the s390x and ARM builds are flaky on Travis CI,
> for reasons which seem unrelated to Arrow.  The infrastructure seems a
> bit unreliable.
>
> Regards
>
> Antoine.
>
>
> Le 02/07/2020 à 15:15, Wes McKinney a écrit :
> > I would be interested to know the empirical reliability of the s390x
> > Travis CI build, but my guess is that it is flaking at least 20% of
> > the time, maybe more than that. If that's the case, then I think it
> > should be added back to allow_failures and at best we can look at it
> > periodically to make sure it's passing some of the time including
> > near releases. Thoughts?
> >






[RESULT] [VOTE] Increment MetadataVersion in Schema.fbs from V4 to V5 for 1.0.0 release

2020-07-02 Thread Wes McKinney
The vote carries with 6 binding +1 votes and 2 non-binding +1

On Tue, Jun 30, 2020 at 4:03 PM Sutou Kouhei  wrote:
>
> +1 (binding)
>
> In 
>   "[VOTE] Increment MetadataVersion in Schema.fbs from V4 to V5 for 1.0.0 
> release" on Mon, 29 Jun 2020 16:42:45 -0500,
>   Wes McKinney  wrote:
>
> > Hi,
> >
> > As discussed on the mailing list [1], in order to demarcate the
> > pre-1.0.0 and post-1.0.0 worlds, and to allow the
> > forward-compatibility-protection changes we are making to actually
> > work (i.e. so that libraries can recognize that they have received
> > data with a feature that they do not support), I have proposed to
> > increment the MetadataVersion from V4 to V5. Additionally, if the
> > union validity bitmap changes are accepted, the MetadataVersion could
> > be used to control whether unions are permitted to be serialized or
> > not (with V4 -- used by v0.8.0 to v0.17.1, unions would not be
> > permitted).
> >
> > Since there have been no backward incompatible changes to the Arrow
> > format since 0.8.0, this would be no different, and (aside from the
> > union issue) libraries supporting V5 are expected to accept BOTH V4
> > and V5 so that backward compatibility is not broken, and any
> > serialized data from prior versions of the Arrow libraries (0.8.0
> > onward) will continue to be readable.
> >
> > Implementations are recommended, but not required, to provide an
> > optional "V4 compatibility mode" for forward compatibility
> > (serializing data from >= 1.0.0 that needs to be readable by older
> > libraries, e.g. Spark deployments stuck on an older Java-Arrow
> > version). In this compatibility mode, non-forward-compatible features
> > added in 1.0.0 and beyond would not be permitted.
> >
> > A PR with the changes to Schema.fbs (possibly subject to some
> > clarifying changes to the comments) is at [2].
> >
> > Once the PR is merged, it will be necessary for implementations to be
> > updated and tested as appropriate at minimum to validate that backward
> > compatibility is preserved (i.e. V4 IPC payloads are still readable --
> > we have some in apache/arrow-testing and can add more as needed).
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Accept addition of MetadataVersion::V5 along with its general
> > implications above
> > [ ] +0
> > [ ] -1 Do not accept because...
> >
> > [1]: 
> > https://lists.apache.org/thread.html/r856822cc366d944b3ecdf32c2ea9b1ad8fc9d12507baa2f2840a64b6%40%3Cdev.arrow.apache.org%3E
> > [2]: https://github.com/apache/arrow/pull/7566


[RESULT] [VOTE] Permitting unsigned integers for Arrow dictionary indices

2020-07-02 Thread Wes McKinney
The vote carries with 6 binding +1 and 1 non-binding +1. Thanks all

On Tue, Jun 30, 2020 at 10:07 AM Francois Saint-Jacques
 wrote:
>
> +1 (binding)
>
> On Tue, Jun 30, 2020 at 10:55 AM Neal Richardson
>  wrote:
> >
> > +1 (binding)
> >
> > On Tue, Jun 30, 2020 at 2:52 AM Antoine Pitrou  wrote:
> >
> > >
> > > +1 (binding)
> > >
> > > Le 29/06/2020 à 23:59, Wes McKinney a écrit :
> > > > Hi,
> > > >
> > > > As discussed on the mailing list [1], it has been proposed to allow
> > > > the use of unsigned dictionary indices (which is already technically
> > > > possible in our metadata serialization, but not allowed according to
> > > > the language of the columnar specification), with the following
> > > > caveats:
> > > >
> > > > * Unless part of an application's requirements (e.g. if it is
> > > > necessary to store dictionaries with size 128 to 255 more compactly),
> > > > implementations are recommended to prefer signed over unsigned
> > > > integers, with int32 continuing to be the "default" when the indexType
> > > > field of DictionaryEncoding is null
> > > > * uint64 dictionary indices, while permitted, are strongly not
> > > > recommended unless required by an application as they are more
> > > > difficult to work with in some programming languages (e.g. Java) and
> > > > they do not offer the storage size benefits that uint8 and uint16 do.
> > > >
> > > > This change is backwards compatible, but not forward compatible for
> > > > all implementations (for example, C++ will reject unsigned integers).
> > > > Assuming that the V5 MetadataVersion change is accepted, to protect
> > > > against forward compatibility issues such implementations would be
> > > > recommended to not allow unsigned dictionary indices to be serialized
> > > > using V4 MetadataVersion.
> > > >
> > > > A PR with the changes to the columnar specification (possibly subject
> > > > to some clarifying language) is at [2].
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Accept changes to allow unsigned integer dictionary indices
> > > > [ ] +0
> > > > [ ] -1 Do not accept because...
> > > >
> > > > [1]:
> > > https://lists.apache.org/thread.html/r746e0a76c4737a2cf48dec656103677169bebb303240e62ae1c66d35%40%3Cdev.arrow.apache.org%3E
> > > > [2]: https://github.com/apache/arrow/pull/7567
> > > >
> > >


Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
Thanks!


> You should be able to store different length vectors in Parquet. Think of
> strings simply as an array of bytes, and those are variable length. You
> would want to make sure you don’t use DICTIONARY_ENCODING in that case.
>

Interesting. We'll look at that.


> No, I'm not aware of any tools that do diffs between Parquet files. I'm
> not sure how you could perform a byte for byte diff without reading one
> into memory and decoding it. My question here would be who is trying to
> consume the diff you want to generate? Is the diff something you want to
> display to a user? i.e. column A, row 132 was "foo" but has now changed to
> "bar"
>

Yes. A typical scenario is that there is a public dataset, and
different people have made incremental improvements. This could be, for
instance, removing constant columns, fixing typos, formatting dates, or
removing data from a broken sensor. It would be interesting if users could
see how two datasets differ.
Another scenario is a reviewing process where the author of a dataset wants
to review changes made by a contributor before accepting them.


> Or are you looking to apply an update to a dataset? i.e. I recently
> trained and stored embeddings and now I need to update them but I don't
> want to override the data because I would like to be able to retrieve what
> they were in the last training iteration so I can roll back, run parallel
> tests, etc..
>

Possibly, although updating an embedding will likely change every value in
the dataset. That seems to call for file versioning and metadata about the
process that generated it.


> Thanks, you may mention me as a contributor to the blog post if you'd like!
>

Done ;).

Thanks again,
Joaquin





> On Thu, Jul 2, 2020 at 9:40 AM Joaquin Vanschoren 
> wrote:
>
>> Hi Nick, all,
>>
>> Thanks! I updated the blog post to specify the requirements better.
>>
>> First, we plan to store the datasets in S3 (on min.io). I agree this
>> works nicely with Parquet.
>>
>> Do you know whether there is any activity on supporting partial read/writes
>> in arrow or fastparquet? That would change things a lot.
>>
>>
>> > If doing the transform from/to various file formats is a feature you
>> feel
>> > strongly about, I would suggest doing the transforms via out-of-band ETL
>> > jobs where the user can then request the files asynchronously later.
>>
>>
>> That's what we were thinking about, yes. We need a 'core' format to store
>> the data and write ETL jobs for, but secondary formats could be stored in
>> S3 and returned on demand.
>>
>>
>> > To your point of storing images with meta data such as tags. I haven’t
>> > actually tried it but I suppose you could in theory write the images in
>> one
>> > Parquet binary type column and the tags in another.
>> >
>>
>> Even then, there are different numbers of bounding boxes / tags per image.
>> Can you store different-length vectors in Parquet?
>>
>>
>> > Versioning is difficult and I believe there are many attempts at this right
>> > now. DeltaLake for example has the ability to query a dataset at a point
>> > in time. They basically have Parquet files with some extra json files on
>> > the side describing the changes.
>>
>>
>> I've looked at DeltaLake, but as far as I understand, its commit log
>> depends on spark operations done on the dataframe? Hence, any change to
>> the
>> dataset has to be performed via spark? Is that correct?
>>
>>
>> > Straight up versions of file could be achieved with your underlying file
>> > system. S3 has file versioning.
>> >
>>
>> Do you know of any tools to compute diffs between Parquet files? What I
>> could find was basically: export both files to CSV and run git diff.
>> DeltaLake would help here, but again, it seems that it only 'tracks' Spark
>> operations done directly on the file?
>>
>> Thanks!
>> Joaquin
>>
>> PS. Nick, would you like to be mentioned as a contributor in the blog
>> post?
>> Your comments helped a lot to improve it ;).
>>
>>
>>
>>
>> On Tue,  Jun 30, 2020 at 6:46 AM Joaquin Vanschoren 
>> > wrote:
>> >
>> > > Hi all,
>> > >
>> > > Sorry for restarting an old thread, but we've had a _lot_ of
>> discussions
>> > > over the past 9 months or so on how to store machine learning datasets
>> > > internally. We've written a blog post about it and would love to hear
>> > your
>> > > thoughts:
>> > >
>> > >
>> >
>> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
>> > >
>> > > To be clear: what we need is a data format for archival storage on the
>> > > server, and preferably one that supports versioning/diff, multi-table
>> > > storage, and sparse data.
>> > > Hence, this is for *internal* storage. When OpenML users want to
>> > download a
>> > > dataset in parquet or arrow we can always convert it on the fly (or
>> from
>> > a
>> > > cache). We already use Arrow/Feather to cache the datasets after it is
>> > > downloaded (when possible).
>> > >
>> > > One specific 

[RESULT] [VOTE] Removing validity bitmap from Arrow union types

2020-07-02 Thread Wes McKinney
The vote carries with 3 binding +1 votes, 2 non-binding +1, and 1 +0

Thanks all for voting. I will update the Format PR and plan to merge
the C++ PR soon thereafter

On Tue, Jun 30, 2020 at 4:00 PM Sutou Kouhei  wrote:
>
> +1 (binding)
>
> In 
>   "[VOTE] Removing validity bitmap from Arrow union types" on Mon, 29 Jun 
> 2020 16:23:23 -0500,
>   Wes McKinney  wrote:
>
> > Hi,
> >
> > As discussed on the mailing list [1], it has been proposed to remove
> > the validity bitmap buffer from Union types in the columnar format
> > specification and instead let value validity be determined exclusively
> > by constituent arrays of the union.
> >
> > One of the primary motivations for this is to simplify the creation of
> > unions, since constructing a validity bitmap that merges the
> > information contained in the child arrays' bitmaps is quite
> > complicated.
> >
> > Note that this change breaks IPC forward compatibility for union types,
> > however implementations with hitherto spec-compliant union
> > implementations would be able to (at their discretion, of course)
> > preserve backward compatibility for deserializing "old" union data in
> > the case that the parent null count of the union is zero. The expected
> > impact of this breakage is low, particularly given that Unions have
> > been absent from integration testing and thus not recommended for
> > anything but ephemeral serialization.
> >
> > Under the assumption that the MetadataVersion V4 -> V5 version bump is
> > accepted, in order to protect against forward compatibility problems,
> > Arrow implementations would be forbidden from serializing union types
> > using the MetadataVersion::V4.
> >
> > A PR with the changes to Columnar.rst is at [2].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Accept changes to Columnar.rst (removing union validity bitmaps)
> > [ ] +0
> > [ ] -1 Do not accept changes because...
> >
> > [1]: 
> > https://lists.apache.org/thread.html/r889d7532cf1e1eff74b072b4e642762ad39f4008caccef5ecde5b26e%40%3Cdev.arrow.apache.org%3E
> > [2]: https://github.com/apache/arrow/pull/7535


Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Nicholas Poorman
Joaquin,

> Do you know whether there is any activity on supporting partial read/writes
> in arrow or fastparquet?

I’m not entirely sure about the status of partial read/writes in Arrow’s
Parquet implementation but
https://github.com/xitongsys/parquet-go for example has this capability.
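On the read side at least, pyarrow's Parquet reader can already select individual
columns and row groups; a minimal sketch (the file name is hypothetical):

import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")
# Read only one row group and two columns instead of the whole file.
subset = pf.read_row_group(0, columns=["id", "label"])
# Column projection across the whole file.
table = pq.read_table("dataset.parquet", columns=["id"])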

> Even then, there are different numbers of bounding boxes / tags per image.
> Can you store different-length vectors in Parquet?

You should be able to store different length vectors in Parquet. Think of
strings simply as an array of bytes, and those are variable length. You
would want to make sure you don’t use DICTIONARY_ENCODING in that case.
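For illustration, a minimal pyarrow sketch of that layout (column names and data
are hypothetical): a binary column for the raw image bytes next to a
variable-length list of strings for the tags, written without dictionary encoding.

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data: raw image bytes plus a variable number of tags per image.
table = pa.table({
    "image": pa.array([b"\x89PNG...", b"\xff\xd8..."], type=pa.binary()),
    "tags": pa.array([["cat", "outdoor"], ["dog"]], type=pa.list_(pa.string())),
})
# Disable dictionary encoding, as suggested above.
pq.write_table(table, "images.parquet", use_dictionary=False)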

> I've looked at DeltaLake, but as far as I understand, its commit log
depends on spark operations done on the dataframe? Hence, any change to the
dataset has to be performed via spark? Is that correct?

Until someone replicates the functionality outside of Spark, yes that is
the drawback and why I have been hesitant to adopt DeltaLake.

> Do you know of any tools to compute diffs between Parquet files? What I
> could find was basically: export both files to CSV and run git diff.
> DeltaLake would help here, but again, it seems that it only 'tracks' Spark
> operations done directly on the file?

No, I'm not aware of any tools that do diffs between Parquet files. I'm not
sure how you could perform a byte for byte diff without reading one into
memory and decoding it. My question here would be who is trying to consume
the diff you want to generate? Is the diff something you want to display to
a user? i.e. column A, row 132 was "foo" but has now changed to "bar" Or
are you looking to apply an update to a dataset? i.e. I recently trained
and stored embeddings and now I need to update them but I don't want to
override the data because I would like to be able to retrieve what they
were in the last training iteration so I can roll back, run parallel tests,
etc..

I believe DeltaLake has a commit log. However, it probably doesn't provide
a diff. The commit log does give them the ability to ask "What did the data
look like at this point in time?".

Thanks, you may mention me as a contributor to the blog post if you'd like!

Best,
Nick Poorman



On Thu, Jul 2, 2020 at 9:40 AM Joaquin Vanschoren 
wrote:

> Hi Nick, all,
>
> Thanks! I updated the blog post to specify the requirements better.
>
> First, we plan to store the datasets in S3 (on min.io). I agree this works
> nicely with Parquet.
>
> Do you know whether there is any activity on supporting partial read/writes in
> arrow or fastparquet? That would change things a lot.
>
>
> > If doing the transform from/to various file formats is a feature you feel
> > strongly about, I would suggest doing the transforms via out-of-band ETL
> > jobs where the user can then request the files asynchronously later.
>
>
> That's what we were thinking about, yes. We need a 'core' format to store
> the data and write ETL jobs for, but secondary formats could be stored in
> S3 and returned on demand.
>
>
> > To your point of storing images with meta data such as tags. I haven’t
> > actually tried it but I suppose you could in theory write the images in
> one
> > Parquet binary type column and the tags in another.
> >
>
> Even then, there are different numbers of bounding boxes / tags per image.
> Can you store different-length vectors in Parquet?
>
>
> > Versioning is difficult and I believe there are many attempts at this right
> > now. DeltaLake for example has the ability to query a dataset at a point
> > in time. They basically have Parquet files with some extra json files on
> > the side describing the changes.
>
>
> I've looked at DeltaLake, but as far as I understand, its commit log
> depends on spark operations done on the dataframe? Hence, any change to the
> dataset has to be performed via spark? Is that correct?
>
>
> > Straight up versions of file could be achieved with your underlying file
> > system. S3 has file versioning.
> >
>
> Do you know of any tools to compute diffs between Parquet files? What I
> could find was basically: export both files to CSV and run git diff.
> DeltaLake would help here, but again, it seems that it only 'tracks' Spark
> operations done directly on the file?
>
> Thanks!
> Joaquin
>
> PS. Nick, would you like to be mentioned as a contributor in the blog post?
> Your comments helped a lot to improve it ;).
>
>
>
>
> On Tue,  Jun 30, 2020 at 6:46 AM Joaquin Vanschoren 
> > wrote:
> >
> > > Hi all,
> > >
> > > Sorry for restarting an old thread, but we've had a _lot_ of
> discussions
> > > over the past 9 months or so on how to store machine learning datasets
> > > internally. We've written a blog post about it and would love to hear
> > your
> > > thoughts:
> > >
> > >
> >
> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> > >
> > > To be clear: what we need is a data format for archival storage on the
> > > server, and preferably one that supports 

Re: [CI] Reliability of s390x Travis CI build

2020-07-02 Thread Wes McKinney
Just looking at https://travis-ci.org/github/apache/arrow/builds the
failure rate on master (which should be green > 95% of the time) is
really high. I'm going to open a patch adding it to allow_failures until
we see this become less flaky

On Thu, Jul 2, 2020 at 8:39 AM Antoine Pitrou  wrote:
>
>
> In my experience, both the s390x and ARM builds are flaky on Travis CI,
> for reasons which seem unrelated to Arrow.  The infrastructure seems a
> bit unreliable.
>
> Regards
>
> Antoine.
>
>
> Le 02/07/2020 à 15:15, Wes McKinney a écrit :
> > I would be interested to know the empirical reliability of the s390x
> > Travis CI build, but my guess is that it is flaking at least 20% of
> > the time, maybe more than that. If that's the case, then I think it
> > should be added back to allow_failures and at best we can look at it
> > periodically to make sure it's passing some of the time including
> > near releases. Thoughts?
> >


Re: Timeline for next major Arrow release (1.0.0)

2020-07-02 Thread Wes McKinney
hi folks,

I hope you and your families are all well.

We're heading into a holiday weekend here in the US -- I would guess
given the state of the backlog and nightly builds that the earliest we
could contemplate making the release will be the week of July 13. That
should give enough time next week to resolve the code changes related
to the Format votes under way along with other things that come up.

In the meantime, if all stakeholders could please review the 1.0.0
backlog and remove issues that you do not believe will be completed in
the next 10 days with > 0.5 probability, that would be very helpful to
know where things stand vis-à-vis cutting an RC

Thank you,
Wes

On Mon, Jun 15, 2020 at 11:21 AM Wes McKinney  wrote:
>
> hi folks,
>
> Based on the previous discussions about release timelines, the window
> for the next major release would be around the week of July 6. Does
> this sound reasonable?
>
> I see that Neal has created a wiki page to help track the burndown
>
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+1.0.0+Release
>
> There are currently 214 issues in the 1.0.0 backlog. Some of these are
> indeed blockers based on the criteria we've indicated for making the
> 1.0.0 release. Would project stakeholders please review their parts of
> the backlog and remove issues that aren't likely to be completed in
> the next 21 days?
>
> Thanks,
> Wes


Re: Developing a C++ Python extension

2020-07-02 Thread Maarten Breddels
I can confirm what Uwe said, manylinux doesn't cause issues.

Here I've built a C++ Python extension (using the Arrow C++ API) inside a
manylinux2010 docker image:
https://github.com/vaexio/vaex-arrow-ext/runs/831763024?check_suite_focus=true
It's built against the manylinux1 and manylinux2010 pyarrow wheels (the
manylinux2014 wheel obviously cannot be installed in the manylinux2010 docker image).

After that, I've installed the manylinux2010 Python extension in the host
OS and ran the tests with all manylinux versions of pyarrow.

I've done the same but building the extension in the 2014 docker image,
where I also check building against the manylinux2014 pyarrow wheel.
All combinations work.

Regarding linking, as Uwe said, setting the symbol resolution to global,
and not linking to the (libarrow.so and libarrow_python.so) libraries works:
https://github.com/vaexio/vaex-arrow-ext/blob/0def54cc056710db5305477c308a1e68c9e9aba2/vaex_arrow_ext/__init__.py
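A minimal sketch of that approach (the extension module name is hypothetical; this
is not the actual vaex-arrow-ext code):

import os
import sys

# Temporarily switch to RTLD_GLOBAL so the Arrow shared libraries loaded by
# `import pyarrow` expose their symbols to the whole process, then import a
# C++ extension that was built without linking against libarrow.
flags = sys.getdlopenflags()
sys.setdlopenflags(flags | os.RTLD_GLOBAL)
import pyarrow  # noqa: E402  -- loads libarrow.so / libarrow_python.so with global symbols
sys.setdlopenflags(flags)

import my_arrow_ext  # noqa: E402  -- hypothetical extension; resolves Arrow symbols at import time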

Note that this solution does not ship libarrow/libarrow_python in the
wheel, but it could result in issues because symbols exported by
libarrow.so and libarrow_python.so might be visible to other libraries
(reading Uwe's article on arrow+tensorflow, I guess that's settled).

I'm not sure what people think of this solution, but it seems to work well.

PS: I tried 'restoring' the symbol resolution to local, but it
seems to have no effect.

On Thu, 2 Jul 2020 at 17:10, Tim Paine wrote:

> We build pyarrow in the docker image because auditwheel complains about
> pyarrow otherwise which causes our wheels to fail auditwheel and not allow
> the manylinux tag. But assuming we build pyarrow in the docker image, our
> manylinux wheels that result are then compatible with the pyarrow manylinux
> wheels.
>
> It has taken a few months of on-and-off work, you may need to also consult
> this PR
> https://github.com/finos/perspective/pull/1105/files <
> https://github.com/finos/perspective/pull/1105/files>
>
>
> > On Jul 2, 2020, at 11:04, Uwe L. Korn  wrote:
> >
> > Hello Tim,
> >
> > thanks for the hint. I see that you build arrow by yourselves in the
> Dockerfile. Could it be that in the end you statically link the arrow
> libraries?
> >
> > As there are no wheel on PyPI, I couldn't verify whether that assumption
> is true.
> >
> > Best
> > Uwe
> >
> > On Thu, Jul 2, 2020, at 4:53 PM, Tim Paine wrote:
> >> We spent a ton of time on this for perspective, the end result is a
> >> mostly compatible set of wheels for most platforms, I believe we
> >> skipped py2 but nobody cares about those anyway. We link against
> >> libarrow and libarrow_python on Linux, on windows we vendor them all
> >> into our library. Feel free to scrape the perspective repo's cmake
> >> lists and setup.py for details.
> >>
> >> Tim Paine
> >> tim.paine.nyc
> >>
> >>> On Jul 2, 2020, at 10:32, Uwe L. Korn  wrote:
> >>>
> >>> I had so much fun with the wheels in the past, I'm now a happy member
> of conda-forge core instead :D
> >>>
> >>> The good thing first:
> >>>
> >>> * The C++ ABI didn't change between the manylinux versions, it is the
> old one in all cases. So you mix & match manylinux versions.
> >>>
> >>> The sad things:
> >>>
> >>> * The manylinuxX standard are intented to provide a way to ship
> *self-contained* wheels that run on any recent Linux. The important part
> here is that they need to be self-contained. Having a binary dependency on
> another wheel is actually not allowed.
> >>> * Thus the snowflake-python-connector ships the libarrow.so it was
> build with as part of its wheel. In this case auditwheel is happy with the
> wheel.
> >>> * It is working with numpy as a dependency because NumPy linkage is
> similar to the import lib behaviour on Windows: You don't actually link
> against numpy but you statically link a set of functions that are resolved
> to NumPy's function when you import numpy. Quick googling leads to
> https://github.com/yugr/Implib.so which could provide something similar
> for Linux.
> >>> * You could actually omit linking to libarrow and try to populate the
> symbols before you load the library. This is how the Python symbols are
> available to extensions without linking to libpython.
> >>>
> >>>
>  On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
>  Ok, thanks!
> 
>  I'm setting up a repo with an example here, using pybind11:
>  https://github.com/vaexio/vaex-arrow-ext
> 
>  and I'll just try all possible combinations and report back.
> 
>  cheers,
> 
>  Maarten Breddels
>  Software engineer / consultant / data scientist
>  Python / C++ / Javascript / Jupyter
>  www.maartenbreddels.com / vaex.io
>  maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
> 
> 
> 
> 
>  Op do 

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-02 Thread Antoine Pitrou


Well, it depends on how important speed is, but LZ4 has extremely fast
decompression, even compared to Snappy:
https://github.com/lz4/lz4#benchmarks

Regards

Antoine.


On 02/07/2020 at 19:47, Christian Hudon wrote:
> At least for us, the advantages of Parquet are speed and interoperability
> in the context of longer-term data storage, so I would tend to say
> "reasonably conservative".
> 
> Le mer. 1 juill. 2020, à 09 h 32, Antoine Pitrou  a
> écrit :
> 
>>
>> I don't have a sense of how conservative Parquet users generally are.
>> Is it worth adding a LZ4_FRAMED compression option in the Parquet
>> format, or would people just not use it?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Tue, 30 Jun 2020 14:33:17 +0200
>> "Uwe L. Korn"  wrote:
>>> I'm also in favor of disabling support for now. Having to deal with
>> broken files or the detection of various incompatible implementations in
>> the long-term will harm more than not supporting LZ4 for a while. Snappy is
>> generally more used than LZ4 in this category as it has been available
>> since the inception of Parquet and thus should be considered as a viable
>> alternative.
>>>
>>> Cheers
>>> Uwe
>>>
>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
 On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou 
>> wrote:
>
>
> Le 25/06/2020 à 00:02, Wes McKinney a écrit :
>> hi folks,
>>
>> (cross-posting to dev@arrow and dev@parquet since there are
>> stakeholders in both places)
>>
>> It seems there are still problems at least with the C++
>> implementation
>> of LZ4 compression in Parquet files
>>
>> https://issues.apache.org/jira/browse/PARQUET-1241
>> https://issues.apache.org/jira/browse/PARQUET-1878
>
> I don't have any particular opinion on how to solve the LZ4 issue,
>> but
> I'd like to mention that LZ4 and ZStandard are the two most efficient
> compression algorithms available, and they span different parts of
>> the
> speed/compression spectrum, so it would be a pity to disable one of
>> them.

 It's true, however I think it's worse to write LZ4-compressed files
 that cannot be read by other Parquet implementations (if that's what's
 happening as I understand it?). If we are indeed shipping something
 broken then we either should fix it or disable it until it can be
 fixed.

> Regards
>
> Antoine.

>>>
>>
>>
>>
>>
> 


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-02 Thread Christian Hudon
At least for us, the advantages of Parquet are speed and interoperability
in the context of longer-term data storage, so I would tend to say
"reasonably conservative".

On Wed, 1 Jul 2020 at 09:32, Antoine Pitrou  wrote:

>
> I don't have a sense of how conservative Parquet users generally are.
> Is it worth adding a LZ4_FRAMED compression option in the Parquet
> format, or would people just not use it?
>
> Regards
>
> Antoine.
>
>
> On Tue, 30 Jun 2020 14:33:17 +0200
> "Uwe L. Korn"  wrote:
> > I'm also in favor of disabling support for now. Having to deal with
> broken files or the detection of various incompatible implementations in
> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> generally more used than LZ4 in this category as it has been available
> since the inception of Parquet and thus should be considered as a viable
> alternative.
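For what it's worth, a minimal, assumed pyarrow sketch of selecting the codec when
writing Parquet, preferring Snappy over LZ4 until the interoperability issues are
settled (the table and path are hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})  # hypothetical data
pq.write_table(table, "data.parquet", compression="snappy")  # instead of compression="lz4"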
> >
> > Cheers
> > Uwe
> >
> > On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou 
> wrote:
> > > >
> > > >
> > > > Le 25/06/2020 à 00:02, Wes McKinney a écrit :
> > > > > hi folks,
> > > > >
> > > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > > stakeholders in both places)
> > > > >
> > > > > It seems there are still problems at least with the C++
> implementation
> > > > > of LZ4 compression in Parquet files
> > > > >
> > > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > > >
> > > > I don't have any particular opinion on how to solve the LZ4 issue,
> but
> > > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > > compression algorithms available, and they span different parts of
> the
> > > > speed/compression spectrum, so it would be a pity to disable one of
> them.
> > >
> > > It's true, however I think it's worse to write LZ4-compressed files
> > > that cannot be read by other Parquet implementations (if that's what's
> > > happening as I understand it?). If we are indeed shipping something
> > > broken then we either should fix it or disable it until it can be
> > > fixed.
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
> >
>
>
>
>

-- 


│ Christian Hudon

│ Applied Research Scientist

   Element AI, 6650 Saint-Urbain #500

   Montréal, QC, H2S 3G9, Canada
   Elementai.com


Re: Arrow for low-latency streaming of small batches?

2020-07-02 Thread Christian Hudon
Very interesting. This is something that I would potentially also be
interested in, so if there were some code available out there, I could
potentially contribute to it, or at least use it. At the very least, I'd love
for something that allows Arrow to work with both larger and very small
record batches (a few rows) in a seamless and efficient way to make it into
the Arrow codebase.

Le lun. 29 juin 2020, à 17 h 05, Wes McKinney  a
écrit :

> On Fri, Jun 26, 2020 at 8:56 AM Chris Osborn 
> wrote:
> >
> > Yes, it would be quite feasible to preallocate a region large enough for
> several thousand rows for each column, assuming I read from that region
> while it's still filling in. When that region is full, I could either
> allocate a new big chunk or loop around if I no longer need the data. I'm
> now doing something like that in a revised prototype. Specifically I'm
> creating builders and calling Reserve() once up front to get a large
> region, which I then fill in with multiple batches. As the producer fills
> it in using ArrayBuilder::Append(), the consumers read out earlier rows
> using ArrayBuilder::GetValue(). This works, but I'm clearly going against
> the spirit of the library by using builders as ersatz Arrays and a set of
> builders in lieu of a Table.
> >
> > In short, it's feasible (and preferable) to preallocate the memory
> needed, whether it's the builders' memory or the RecordBatch/Table's memory
> (ideally that's the same thing?). I just haven't been able to figure out
> how to do that gracefully.
>
> By following the columnar format's buffer layouts [1] it should be
> straightforward to compute the size of a memory region to preallocate
> that represents a RecordBatch's memory and then construct the Buffer
> and ArrayData objects that reference each constituent buffer, and then
> create a RecordBatch from those ArrayData objects. Some assumptions
> must be made of course:
>
> * If a field is nullable, then an empty validity bitmap must be
> preallocated (and you can initialize it to all valid or all null based
> on what your application prefers)
> * Must decide what to do about variable-size allocations for
> binary/string types (and extrapolating, analogously for list types if
> you have Array/List-like data). So if you preallocated a region that
> can accommodate 1024 values then you might allocate 32KB data buffers
> for string data (or some factor of the length if you have bigger
> strings). If you fill up the data buffer then you will have to move on
> to the next region. Another approach might be to let the string data
> buffer be a separate ResizableBuffer that you reallocate when you need
> to make it bigger
>
> I could envision creating a C++ implementation to manage this whole
> process that becomes a part of the Arrow C++ codebase -- preallocate
> memory given some global / field-level options and then provide
> effectively "UnsafeAppend" APIs to write data into the preallocated
> region.
>
> If you create a "parent" RecordBatch that references the preallocated
> memory then you can use `RecordBatch::Slice` to "chop" off the filled
> portion to pass to your consumer.
>
> [1]:
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#buffer-listing-for-each-layout
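For illustration, a rough pyarrow-level sketch of the preallocate-then-slice idea
for a single fixed-width column (sizes and names are assumptions; a real
implementation would live in C++ as described above):

import pyarrow as pa

capacity = 1024
validity = pa.allocate_buffer(capacity // 8)   # 1 validity bit per value
values = pa.allocate_buffer(8 * capacity)      # fixed-width int64 data region
arr = pa.Array.from_buffers(pa.int64(), capacity, [validity, values])
batch = pa.RecordBatch.from_arrays([arr], names=["col"])

# The producer fills `values`/`validity` in place; once n rows are written,
# a consumer can be handed the filled prefix without copying:
n = 100
filled = batch.slice(0, n)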
>
> > Thanks!
> > Chris Osborn
> >
> > 
> > From: Wes McKinney 
> > Sent: Thursday, June 25, 2020 10:13 PM
> > To: dev 
> > Subject: Re: Arrow for low-latency streaming of small batches?
> >
> > Is it feasible to preallocate the memory region where you are writing
> > the record batch?
> >
> > On Thu, Jun 25, 2020 at 1:06 PM Chris Osborn  wrote:
> > >
> > > Hi,
> > >
> > > I am investigating Arrow for a project that needs to transfer records
> from
> > > a producer to one or more consumers in small batches (median batch
> size is
> > > 1) and with low latency. The usual structure for something like this
> would
> > > be a single producer multi-consumer queue*. Is there any sane way to
> use
> > > Arrow in this fashion? I have a little C++ prototype that works, but it
> > > does the following for each batch of rows:
> > >
> > > Producer side:
> > > 1. construct a set of builders
> > > 2. append a value to each builder for each record in the batch
> > > 3. finish the builders and use them to make a RecordBatch
> > > 4. append the RecordBatch to a vector
> > >
> > > Consumer side:
> > > 1. construct a Table from the vector of RecordBatches
> > > 2. slice out the part of the table that the consumer requires (each
> > > consumer keeps its own offset)
> > > 3. read the data from the resulting sliced table
> > >
> > > Considering how much work this has to do it performs better than I
> would
> > > have expected, but there's definitely a big fixed cost for each batch
> of
> > > rows (constructing and destructing builders, making Tables that can
> only be
> > > used once since they're immutable, etc). If the batches weren't so
> small it
> > > would probably make sense, 

Re: Upcoming JS fixes and release timeline

2020-07-02 Thread Wes McKinney
Since publishing artifacts to NPM is somewhat independent of the
Apache source release, if you aren't ready to push to NPM then the
release manager can simply not push the artifacts.

Note that the plan hasn't been to go from 1.0.0 to 1.1.0, rather that
almost every Apache release (aside from patch releases) would be a
major version bump, so the next release after 1.0.0 is 2.0.0.

On Wed, Jul 1, 2020 at 11:23 AM Paul Taylor  wrote:
>
> The TypeScript compiler has made breaking changes in recent releases,
> meaning we can't easily upgrade past 3.5 and projects on 3.6+ can't
> compile our types.
>
> I'm working on upgrading our tsc dependency to 3.9. The fixes could
> include a few backwards-incompatible API changes, and might not be done
> in time for the general Arrow 1.0 release.
>
> JS shouldn't block the 1.0 release, so can we exclude JS from 1.0 if the
> fixes aren't ready by then? npm's semantic versioning allows breaking
> changes in any version before 1.0, but not between minor versions after
> 1.0. I've heard directly from some of our JS users who'd prefer if we
> made these changes before bumping to 1.0 on npm.
>
> Thanks,
>
> Paul
>


Re: Developing a C++ Python extension

2020-07-02 Thread Tim Paine
We build pyarrow in the docker image because auditwheel complains about pyarrow 
otherwise, which causes our wheels to fail auditwheel and not be allowed the 
manylinux tag. But assuming we build pyarrow in the docker image, the resulting 
manylinux wheels are then compatible with the pyarrow manylinux wheels.

It has taken a few months of on-and-off work; you may also need to consult this 
PR:
https://github.com/finos/perspective/pull/1105/files 



> On Jul 2, 2020, at 11:04, Uwe L. Korn  wrote:
> 
> Hello Tim,
> 
> thanks for the hint. I see that you build arrow by yourselves in the 
> Dockerfile. Could it be that in the end you statically link the arrow 
> libraries?
> 
> As there are no wheel on PyPI, I couldn't verify whether that assumption is 
> true.
> 
> Best
> Uwe
> 
> On Thu, Jul 2, 2020, at 4:53 PM, Tim Paine wrote:
>> We spent a ton of time on this for perspective, the end result is a 
>> mostly compatible set of wheels for most platforms, I believe we 
>> skipped py2 but nobody cares about those anyway. We link against 
>> libarrow and libarrow_python on Linux, on windows we vendor them all 
>> into our library. Feel free to scrape the perspective repo's cmake 
>> lists and setup.py for details.
>> 
>> Tim Paine
>> tim.paine.nyc
>> 
>>> On Jul 2, 2020, at 10:32, Uwe L. Korn  wrote:
>>> 
>>> I had so much fun with the wheels in the past, I'm now a happy member of 
>>> conda-forge core instead :D
>>> 
>>> The good thing first:
>>> 
>>> * The C++ ABI didn't change between the manylinux versions, it is the old 
>>> one in all cases. So you mix & match manylinux versions.
>>> 
>>> The sad things:
>>> 
>>> * The manylinuxX standard are intented to provide a way to ship 
>>> *self-contained* wheels that run on any recent Linux. The important part 
>>> here is that they need to be self-contained. Having a binary dependency on 
>>> another wheel is actually not allowed.
>>> * Thus the snowflake-python-connector ships the libarrow.so it was build 
>>> with as part of its wheel. In this case auditwheel is happy with the wheel.
>>> * It is working with numpy as a dependency because NumPy linkage is similar 
>>> to the import lib behaviour on Windows: You don't actually link against 
>>> numpy but you statically link a set of functions that are resolved to 
>>> NumPy's function when you import numpy. Quick googling leads to 
>>> https://github.com/yugr/Implib.so which could provide something similar for 
>>> Linux.
>>> * You could actually omit linking to libarrow and try to populate the 
>>> symbols before you load the library. This is how the Python symbols are 
>>> available to extensions without linking to libpython.
>>> 
>>> 
 On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
 Ok, thanks!
 
 I'm setting up a repo with an example here, using pybind11:
 https://github.com/vaexio/vaex-arrow-ext
 
 and I'll just try all possible combinations and report back.
 
 cheers,
 
 Maarten Breddels
 Software engineer / consultant / data scientist
 Python / C++ / Javascript / Jupyter
 www.maartenbreddels.com / vaex.io
 maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
 
 
 
 
 Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
 jorisvandenboss...@gmail.com>:
 
> Also no concrete answer, but one such example is turbodbc, I think.
> But it seems they only have conda binary packages, and don't
> distribute wheels ..
> (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> so not that relevant as comparison (they also need to build against an
> odbc driver in addition to arrow).
> But maybe Uwe has some more experience in this regard (and with
> attempts building wheels for turbodbc, eg
> https://github.com/blue-yonder/turbodbc/pull/108).
> 
> Joris
> 
> On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
>> 
>> 
>> Hi Maarten,
>> 
>> Le 02/07/2020 à 10:53, Maarten Breddels a écrit :
>>> 
>>> Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a
> vaex
>>> extension distributed as a 2010 wheel, and build with the pyarrow 2010
>>> wheel, work in an environment where someone installed a pyarrow 2014
>>> wheel, or build from source, or installed from conda-forge?
>> 
>> I have no idea about the concrete answer, but it probably depends
>> whether the libstdc++ ABI changed between those two versions.  I'm
>> afraid you'll have to experiment yourself.
>> 
>> (if you want to eschew C++ ABI issues, you may use the C Data Interface:
>> https://arrow.apache.org/docs/format/CDataInterface.html
>> though of course you 

Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
I did try the approach of not linking against pyarrow and leaving the symbols 
unresolved, just ensuring pyarrow is imported before the vaex extension. This works 
out of the box on macOS but fails on Linux, as symbols have a scope there. 
Adding the following lines to load Arrow into the global scope made it work, 
though:

import ctypes

# Load the Arrow shared libraries with RTLD_GLOBAL so their symbols become
# visible to extension modules that were not linked against them.
libarrow = ctypes.CDLL('libarrow.so', ctypes.RTLD_GLOBAL)
libarrow_python = ctypes.CDLL('libarrow_python.so', ctypes.RTLD_GLOBAL)

On Thu, Jul 2, 2020, at 4:32 PM, Uwe L. Korn wrote:
> I had so much fun with the wheels in the past, I'm now a happy member 
> of conda-forge core instead :D
> 
> The good thing first:
> 
> * The C++ ABI didn't change between the manylinux versions, it is the 
> old one in all cases. So you can mix & match manylinux versions.
> 
> The sad things:
> 
> * The manylinuxX standards are intended to provide a way to ship 
> *self-contained* wheels that run on any recent Linux. The important 
> part here is that they need to be self-contained. Having a binary 
> dependency on another wheel is actually not allowed.
> * Thus the snowflake-python-connector ships the libarrow.so it was 
> built with as part of its wheel. In this case auditwheel is happy with 
> the wheel.
> * It is working with numpy as a dependency because NumPy linkage is 
> similar to the import lib behaviour on Windows: You don't actually link 
> against numpy but you statically link a set of functions that are 
> resolved to NumPy's function when you import numpy. Quick googling 
> leads to https://github.com/yugr/Implib.so which could provide 
> something similar for Linux.
> * You could actually omit linking to libarrow and try to populate the 
> symbols before you load the library. This is how the Python symbols are 
> available to extensions without linking to libpython.
> 
> 
> On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
> > Ok, thanks!
> > 
> > I'm setting up a repo with an example here, using pybind11:
> > https://github.com/vaexio/vaex-arrow-ext
> > 
> > and I'll just try all possible combinations and report back.
> > 
> > cheers,
> > 
> > Maarten Breddels
> > Software engineer / consultant / data scientist
> > Python / C++ / Javascript / Jupyter
> > www.maartenbreddels.com / vaex.io
> > maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
> > 
> > 
> > 
> > 
> > Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
> > jorisvandenboss...@gmail.com>:
> > 
> > > Also no concrete answer, but one such example is turbodbc, I think.
> > > But it seems they only have conda binary packages, and don't
> > > distribute wheels ..
> > > (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> > > so not that relevant as comparison (they also need to build against an
> > > odbc driver in addition to arrow).
> > > But maybe Uwe has some more experience in this regard (and with
> > > attempts building wheels for turbodbc, eg
> > > https://github.com/blue-yonder/turbodbc/pull/108).
> > >
> > > Joris
> > >
> > > On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
> > > >
> > > >
> > > > Hi Maarten,
> > > >
> > > > Le 02/07/2020 à 10:53, Maarten Breddels a écrit :
> > > > >
> > > > > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a
> > > vaex
> > > > > extension distributed as a 2010 wheel, and build with the pyarrow 2010
> > > > > wheel, work in an environment where someone installed a pyarrow 2014
> > > > > wheel, or build from source, or installed from conda-forge?
> > > >
> > > > I have no idea about the concrete answer, but it probably depends
> > > > whether the libstdc++ ABI changed between those two versions.  I'm
> > > > afraid you'll have to experiment yourself.
> > > >
> > > > (if you want to eschew C++ ABI issues, you may use the C Data Interface:
> > > > https://arrow.apache.org/docs/format/CDataInterface.html
> > > > though of course you won't have access to all the useful helpers in the
> > > > Arrow C++ library)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > >
> >
>


Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
Hello Tim,

thanks for the hint. I see that you build Arrow yourselves in the 
Dockerfile. Could it be that in the end you statically link the Arrow libraries?

As there are no wheels on PyPI, I couldn't verify whether that assumption is 
true.

Best
Uwe

On Thu, Jul 2, 2020, at 4:53 PM, Tim Paine wrote:
> We spent a ton of time on this for perspective, the end result is a 
> mostly compatible set of wheels for most platforms, I believe we 
> skipped py2 but nobody cares about those anyway. We link against 
> libarrow and libarrow_python on Linux, on windows we vendor them all 
> into our library. Feel free to scrape the perspective repo's cmake 
> lists and setup.py for details.
> 
> Tim Paine
> tim.paine.nyc
> 
> > On Jul 2, 2020, at 10:32, Uwe L. Korn  wrote:
> > 
> > I had so much fun with the wheels in the past, I'm now a happy member of 
> > conda-forge core instead :D
> > 
> > The good thing first:
> > 
> > * The C++ ABI didn't change between the manylinux versions, it is the old 
> > one in all cases. So you mix & match manylinux versions.
> > 
> > The sad things:
> > 
> > * The manylinuxX standard are intented to provide a way to ship 
> > *self-contained* wheels that run on any recent Linux. The important part 
> > here is that they need to be self-contained. Having a binary dependency on 
> > another wheel is actually not allowed.
> > * Thus the snowflake-python-connector ships the libarrow.so it was build 
> > with as part of its wheel. In this case auditwheel is happy with the wheel.
> > * It is working with numpy as a dependency because NumPy linkage is similar 
> > to the import lib behaviour on Windows: You don't actually link against 
> > numpy but you statically link a set of functions that are resolved to 
> > NumPy's function when you import numpy. Quick googling leads to 
> > https://github.com/yugr/Implib.so which could provide something similar for 
> > Linux.
> > * You could actually omit linking to libarrow and try to populate the 
> > symbols before you load the library. This is how the Python symbols are 
> > available to extensions without linking to libpython.
> > 
> > 
> >> On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
> >> Ok, thanks!
> >> 
> >> I'm setting up a repo with an example here, using pybind11:
> >> https://github.com/vaexio/vaex-arrow-ext
> >> 
> >> and I'll just try all possible combinations and report back.
> >> 
> >> cheers,
> >> 
> >> Maarten Breddels
> >> Software engineer / consultant / data scientist
> >> Python / C++ / Javascript / Jupyter
> >> www.maartenbreddels.com / vaex.io
> >> maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
> >> 
> >> 
> >> 
> >> 
> >> Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
> >> jorisvandenboss...@gmail.com>:
> >> 
> >>> Also no concrete answer, but one such example is turbodbc, I think.
> >>> But it seems they only have conda binary packages, and don't
> >>> distribute wheels ..
> >>> (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> >>> so not that relevant as comparison (they also need to build against an
> >>> odbc driver in addition to arrow).
> >>> But maybe Uwe has some more experience in this regard (and with
> >>> attempts building wheels for turbodbc, eg
> >>> https://github.com/blue-yonder/turbodbc/pull/108).
> >>> 
> >>> Joris
> >>> 
> >>> On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
>  
>  
>  Hi Maarten,
>  
>  Le 02/07/2020 à 10:53, Maarten Breddels a écrit :
> > 
> > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a
> >>> vaex
> > extension distributed as a 2010 wheel, and build with the pyarrow 2010
> > wheel, work in an environment where someone installed a pyarrow 2014
> > wheel, or build from source, or installed from conda-forge?
>  
>  I have no idea about the concrete answer, but it probably depends
>  whether the libstdc++ ABI changed between those two versions.  I'm
>  afraid you'll have to experiment yourself.
>  
>  (if you want to eschew C++ ABI issues, you may use the C Data Interface:
>  https://arrow.apache.org/docs/format/CDataInterface.html
>  though of course you won't have access to all the useful helpers in the
>  Arrow C++ library)
>  
>  Regards
>  
>  Antoine.
>  
>  
> >>> 
> >> 
>


Re: Performance of ArrowJS in the DOM

2020-07-02 Thread Tim Paine
The virtual table sounds a lot like regular-table:
https://github.com/jpmorganchase/regular-table

Used in Perspective:
https://perspective.finos.org/

We use Arrow C++ compiled to WebAssembly with some front-end grid and chart 
plugins. Perspective can run in a client-server fashion and only sends diffs 
across the wire, so it works well for random access, e.g. in pivoted views.


Tim Paine
tim.paine.nyc


> On Jul 2, 2020, at 09:45, Matthias Vallentin  wrote:
> 
> Hi folks,
> 
> We are reaching out to better understand the performance of ArrowJS when it
> comes to viewing large amounts of data (> 1M records) in the browser’s DOM.
> Our backend (https://github.com/tenzir/vast) spits out record batches,
> which we are accumulating in the frontend with a RecordBatchReader.
> 
> At first, we only want to render the data fast, line by line, with minimal
> markup according to its types from the schema. We use a virtual scrolling
> window to avoid overloading the DOM, that is, we lazily convert the record
> batch data to DOM elements according to a scroll window defined by the
> user. As the user scrolls, elements outside the window get removed and new
> ones added.
> 
> The data consists of one or more Tables that we are pulling in through the
> RecordBatchReader. We use the Async Iterator interface to go over the
> record batches and convert them into rows. This API feels suboptimal for
> our use cases, where we want random access to the data. Is there a
> faster/better way to do this?
> 
> Does anyone have any experience worth sharing with doing something similar?
> The DOM is the main bottleneck, but if there are some clever things we can
> do with Arrow to pull out the data in the most efficient way, that would be
> nice.
> 
>Matthias


Re: [Discuss] Extremely dubious Python equality semantics

2020-07-02 Thread Wes McKinney
On Wed, Jul 1, 2020 at 9:52 AM Joris Van den Bossche
 wrote:
>
> I am personally fine with removing the compute dunder methods again (i.e.
> Array.__richcmp__), if that resolves the ambiguity. Although they *are*
> convenient IMO, even for developers (question might also come up if we want
> to add __add__, __sub__ etc, though). So it could also be an option to say
> that for "data structure equality", one should use the ".equals(..)"
> method, and not rely on __eq__.

Yeah, I think this is a slippery slope. My vote is to avoid the
conflict by removing the dunder methods and providing analytics
exclusively through named / non-dunder instance methods and normal
functions (e.g. stuff in pyarrow.compute).

> If we use __eq__ for data structure equality, there is still the question
> of what "null == null" should return: True or False? Although somewhat
> counter-intuitive, it should probably then return True? (given that
> "pa.array([null]) == pa.array([null])" would give True, and I suppose the
> C++ Equals method will also return True).

Right, comparing two null scalars of the same type returns true, same
with two all-null arrays of the same type
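For illustration, a small sketch of the distinction with a recent pyarrow
(data-structure equality via .equals versus element-wise comparison via
pyarrow.compute, where nulls propagate):

import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, None], type=pa.int64())
b = pa.array([1, None], type=pa.int64())

assert a.equals(b)      # data-structure equality: nulls in the same positions compare equal
print(pc.equal(a, b))   # element-wise comparison: the second element is null, not False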

> On Wed, 1 Jul 2020 at 16:30, Wes McKinney  wrote:
>
> > I think we need to have a hard separation between "data structure equality"
> > (do these objects contain equivalent data) and "analytical/semantic
> > equality". The latter is more the domain of pyarrow.compute and I am not
> > sure we should be overloading dunder methods with compute functions. I
> > might recommend actually that we remove the compute functions from
> > Array.__richcmp__ also
> >
> > Keep in mind that the pyarrow data structures **are not for end users**.
> > They are intended for developer use and so analytical conveniences should
> > not outweigh consistency for use by library developers.
> >
> > On Wed, Jul 1, 2020, 9:03 AM Maarten Breddels 
> > wrote:
> >
> > > I think that if __eq__ does not return True/False exclusively, __bool__
> > > should raise an exception to avoid these unexpected truthies. Python
> > users
> > > are used to that due to Numpy.
> > >
> > >
> > > Op wo 1 jul. 2020 om 15:40 schreef Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com>:
> > >
> > > > On Wed, 1 Jul 2020 at 09:46, Antoine Pitrou 
> > wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > Recent changes to PyArrow seem to have taken the stance that
> > comparing
> > > > > null values should return null.
> > > >
> > > >
> > > > Small note that it is not a *very* recent change (
> > > > https://github.com/apache/arrow/pull/5330, ARROW-6488
> > > > ), at least for
> > > scalars
> > > > (I am supposing you are talking about scalars in specific? Or also
> > about
> > > > array equality?).
> > > >
> > > >
> > > > > The problem is that it breaks the
> > > > > expectation that comparisons should return booleans, and percolates
> > > > > into crazy behaviour in other places.
> > > >
> > > >
> > > > It's certainly true that not returning a boolean from __eq__ gives a
> > > whole
> > > > set of surprising/strange behaviour, but doing so would also introduce
> > an
> > > > inconsistency with array equality (where nulls propagate in
> > comparisons).
> > > > I think this might boil down to different "expectations" about what
> > > __eq__
> > > > should do or is used for:
> > > >
> > > > 1) Be the scalar equivalent of element-wise array equality
> > > (Array.__eq__).
> > > > Since right now nulls propagate in comparisons, this would lead to
> > > > Scalar.__eq__ also to return null (and not True/False) if one of the
> > > > operands is null. The null propagation of array comparisons can also be
> > > > discussed of course.
> > > > 2) Be a general equality check of the two objects (matching the
> > > > ".equals(..)" method). For example, Array.equals considers nulls at the
> > > > same location in the array as equal.
> > > > (however, for this second case it's actually not fully clear if nulls
> > > > evaluate to True or False ..)
> > > >
> > > >  Some other inline comments about the examples:
> > > >
> > > > Here is an example of such
> > > > > misbehaviour in the scalar refactor PR:
> > > > >
> > > > > >>> import pyarrow as pa
> > > > > >>> na = pa.scalar(None)
> > > > > >>> na == na
> > > > > 
> > > > >
> > > > > >>> na == 5
> > > > > 
> > > > >
> > > > > >>> bool(na == 5)
> > > > > True
> > > > >
> > > >
> > > > This could also be changed to raise an error instead. And that way, the
> > > > next example would also raise (requiring the user to be specific on how
> > > to
> > > > handle a scalar null: is it truthy or falsey?)
> > > >
> > > >
> > > > > >>> if na == 5: print("yo!")
> > > > > yo!
> > > > >
> > > > > >>> na in [5]
> > > > > True
> > > > >
> > > > > But you can see it also with arrays containing null values:
> > > > >
> > > > > >>> pa.array([1, None]) in [pa.scalar(42)]
> > > > > True
> > > > >
> > > > > Note that this one 

Re: Developing a C++ Python extension

2020-07-02 Thread Tim Paine
We spent a ton of time on this for Perspective; the end result is a mostly 
compatible set of wheels for most platforms (I believe we skipped py2, but 
nobody cares about those anyway). We link against libarrow and libarrow_python 
on Linux; on Windows we vendor them all into our library. Feel free to scrape 
the Perspective repo's CMake lists and setup.py for details.

Tim Paine
tim.paine.nyc

> On Jul 2, 2020, at 10:32, Uwe L. Korn  wrote:
> 
> I had so much fun with the wheels in the past, I'm now a happy member of 
> conda-forge core instead :D
> 
> The good thing first:
> 
> * The C++ ABI didn't change between the manylinux versions, it is the old one 
> in all cases. So you mix & match manylinux versions.
> 
> The sad things:
> 
> * The manylinuxX standards are intended to provide a way to ship 
> *self-contained* wheels that run on any recent Linux. The important part here 
> is that they need to be self-contained. Having a binary dependency on another 
> wheel is actually not allowed.
> * Thus the snowflake-python-connector ships the libarrow.so it was built with 
> as part of its wheel. In this case auditwheel is happy with the wheel.
> * It is working with numpy as a dependency because NumPy linkage is similar 
> to the import lib behaviour on Windows: You don't actually link against numpy 
> but you statically link a set of functions that are resolved to NumPy's 
> function when you import numpy. Quick googling leads to 
> https://github.com/yugr/Implib.so which could provide something similar for 
> Linux.
> * You could actually omit linking to libarrow and try to populate the symbols 
> before you load the library. This is how the Python symbols are available to 
> extensions without linking to libpython.
> 
> 
>> On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
>> Ok, thanks!
>> 
>> I'm setting up a repo with an example here, using pybind11:
>> https://github.com/vaexio/vaex-arrow-ext
>> 
>> and I'll just try all possible combinations and report back.
>> 
>> cheers,
>> 
>> Maarten Breddels
>> Software engineer / consultant / data scientist
>> Python / C++ / Javascript / Jupyter
>> www.maartenbreddels.com / vaex.io
>> maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
>> 
>> 
>> 
>> 
>> Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
>> jorisvandenboss...@gmail.com>:
>> 
>>> Also no concrete answer, but one such example is turbodbc, I think.
>>> But it seems they only have conda binary packages, and don't
>>> distribute wheels ..
>>> (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
>>> so not that relevant as comparison (they also need to build against an
>>> odbc driver in addition to arrow).
>>> But maybe Uwe has some more experience in this regard (and with
>>> attempts building wheels for turbodbc, eg
>>> https://github.com/blue-yonder/turbodbc/pull/108).
>>> 
>>> Joris
>>> 
>>> On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
 
 
 Hi Maarten,
 
 On 02/07/2020 at 10:53, Maarten Breddels wrote:
> 
> Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a vaex
> extension distributed as a 2010 wheel, and built with the pyarrow 2010
> wheel, work in an environment where someone installed a pyarrow 2014
> wheel, or built from source, or installed from conda-forge?
 
 I have no idea about the concrete answer, but it probably depends
 whether the libstdc++ ABI changed between those two versions.  I'm
 afraid you'll have to experiment yourself.
 
 (if you want to eschew C++ ABI issues, you may use the C Data Interface:
 https://arrow.apache.org/docs/format/CDataInterface.html
 though of course you won't have access to all the useful helpers in the
 Arrow C++ library)
 
 Regards
 
 Antoine.
 
 
>>> 
>> 


Re: Decimal128 scale limits

2020-07-02 Thread Wes McKinney
I think the intention so far has been to support precision between 0
and 38 and scale <= precision. 128-bit integers max out at 38 digits,
I think that's the rationale for the limit. See e.g. the Impala docs
(also uses 128-bit decimals) [1]

[1]: https://impala.apache.org/docs/build/html/topics/impala_decimal.html
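
For reference, a quick sketch of why 38 is the cap (the type itself is just
pa.decimal128(precision, scale); exact reprs may differ between versions):

    import pyarrow as pa

    # A signed 128-bit integer tops out just above 1.7 * 10**38, so every
    # 38-digit unscaled value fits, while 39-digit values may not:
    print(10**38 - 1 < 2**127)   # True  -> all 38-digit values representable
    print(10**39 - 1 < 2**127)   # False -> 39 digits no longer fit

    # Precision is capped at 38; the scale is stored separately on the type:
    t = pa.decimal128(38, 10)
    print(t.precision, t.scale)  # 38 10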

On Wed, Jul 1, 2020 at 10:16 AM Kazuaki Ishizaki  wrote:
>
> Hi,
> According to https://arrow.apache.org/docs/cpp/api/utilities.html,
> Decimal128 comes from the Apache ORC C++ implementation.
>
> When I see the Hive document at
> https://hive.apache.org/javadocs/r1.2.2/api/index.html?org/apache/hadoop/hive/common/type/Decimal128.html
> , there is the following statement. Does it help you?
> > A 128-bit fixed-length Decimal value in the ANSI SQL Numeric semantics,
> representing unscaledValue / 10**scale where scale is 0 or positive.
>
> Regards,
> Kazuaki Ishizaki
>
>
>
> From:   Jacek Pliszka 
> To: dev@arrow.apache.org
> Date:   2020/07/02 00:08
> Subject:[EXTERNAL] Re: Decimal128 scale limits
>
>
>
> Hi!
>
> I am aware of at least 2 different decimal128 things:
>
> a) the one we have - where we use 128 bits to store an integer which is
> later shifted by the scale - 38 is the number of digits of the significand,
> i.e. the digits fitting in 128 bits (2**128/10**38) - IMHO it is completely
> unrelated to the scale, which we store separately
>
> b) the IEEE 754 one, which has an exponent from -6143 to +6144
>
> BR,
>
> Jacek
>
> śr., 1 lip 2020 o 16:16 Antoine Pitrou  napisał(a):
> >
> >
> > Hello,
> >
> > Are there limits to the value of the scale for either decimal128 or
> > decimal?  Can it be negative?  Can it be greater than 38 (and/or lower
> > than -38)?
> >
> > It's not clear from looking either at the spec or at the C++ code...
> >
> > Regards
> >
> > Antoine.
>
>
>
>


Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
I had so much fun with the wheels in the past, I'm now a happy member of 
conda-forge core instead :D

The good thing first:

* The C++ ABI didn't change between the manylinux versions, it is the old one 
in all cases. So you mix & match manylinux versions.

The sad things:

* The manylinuxX standards are intended to provide a way to ship 
*self-contained* wheels that run on any recent Linux. The important part here 
is that they need to be self-contained. Having a binary dependency on another 
wheel is actually not allowed.
* Thus the snowflake-python-connector ships the libarrow.so it was built with 
as part of its wheel. In this case auditwheel is happy with the wheel.
* It is working with numpy as a dependency because NumPy linkage is similar to 
the import lib behaviour on Windows: You don't actually link against numpy but 
you statically link a set of functions that are resolved to NumPy's function 
when you import numpy. Quick googling leads to 
https://github.com/yugr/Implib.so which could provide something similar for 
Linux.
* You could actually omit linking to libarrow and try to populate the symbols 
before you load the library. This is how the Python symbols are available to 
extensions without linking to libpython.


On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
> Ok, thanks!
> 
> I'm setting up a repo with an example here, using pybind11:
> https://github.com/vaexio/vaex-arrow-ext
> 
> and I'll just try all possible combinations and report back.
> 
> cheers,
> 
> Maarten Breddels
> Software engineer / consultant / data scientist
> Python / C++ / Javascript / Jupyter
> www.maartenbreddels.com / vaex.io
> maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
> 
> 
> 
> 
> Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
> jorisvandenboss...@gmail.com>:
> 
> > Also no concrete answer, but one such example is turbodbc, I think.
> > But it seems they only have conda binary packages, and don't
> > distribute wheels ..
> > (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> > so not that relevant as comparison (they also need to build against an
> > odbc driver in addition to arrow).
> > But maybe Uwe has some more experience in this regard (and with
> > attempts building wheels for turbodbc, eg
> > https://github.com/blue-yonder/turbodbc/pull/108).
> >
> > Joris
> >
> > On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
> > >
> > >
> > > Hi Maarten,
> > >
> > > On 02/07/2020 at 10:53, Maarten Breddels wrote:
> > > >
> > > > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a
> > vaex
> > > > extension distributed as a 2010 wheel, and build with the pyarrow 2010
> > > > wheel, work in an environment where someone installed a pyarrow 2014
> > > > wheel, or build from source, or installed from conda-forge?
> > >
> > > I have no idea about the concrete answer, but it probably depends
> > > whether the libstdc++ ABI changed between those two versions.  I'm
> > > afraid you'll have to experiment yourself.
> > >
> > > (if you want to eschew C++ ABI issues, you may use the C Data Interface:
> > > https://arrow.apache.org/docs/format/CDataInterface.html
> > > though of course you won't have access to all the useful helpers in the
> > > Arrow C++ library)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >
>


Performance of ArrowJS in the DOM

2020-07-02 Thread Matthias Vallentin
Hi folks,

We are reaching out to better understand the performance of ArrowJS when it
comes to viewing large amounts of data (> 1M records) in the browser’s DOM.
Our backend (https://github.com/tenzir/vast) spits out record batches,
which we are accumulating in the frontend with a RecordBatchReader.

At first, we only want to render the data fast, line by line, with minimal
markup according to its types from the schema. We use a virtual scrolling
window to avoid overloading the DOM, that is, we lazily convert the record
batch data to DOM elements according to a scroll window defined by the
user. As the user scrolls, elements outside the window get removed and new
ones added.

The data consists of one or more Tables that we are pulling in through the
RecordBatchReader. We use the Async Iterator interface to go over the
record batches and convert them into rows. This API feels suboptimal for
our use cases, where we want random access to the data. Is there a
faster/better way to do this?

Does anyone have any experience worth sharing with doing something similar?
The DOM is the main bottleneck, but if there are some clever things we can
do with Arrow to pull out the data in the most efficient way, that would be
nice.

Matthias


Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
Hi Nick, all,

Thanks! I updated the blog post to specify the requirements better.

First, we plan to store the datasets in S3 (on min.io). I agree this works
nicely with Parquet.

Do you know whether there is any activity on supporting partial read/writes in
arrow or fastparquet? That would change things a lot.


> If doing the transform from/to various file formats is a feature you feel
> strongly about, I would suggest doing the transforms via out-of-band ETL
> jobs where the user can then request the files asynchronously later.


That's what we were thinking about, yes. We need a 'core' format to store
the data and write ETL jobs for, but secondary formats could be stored in
S3 and returned on demand.


> To your point of storing images with meta data such as tags. I haven’t
> actually tried it but I suppose you could in theory write the images in one
> Parquet binary type column and the tags in another.
>

Even then, there are different numbers of bounding boxes / tags per image.
Can you store different-length vectors in Parquet?
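
For what it's worth, variable-length values per row map naturally onto
Parquet/Arrow list columns; a minimal pyarrow sketch (made-up file and column
names):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One list<string> cell per image; the lists can have different lengths.
    tags = pa.array([["cat", "outdoor"], ["dog"], []],
                    type=pa.list_(pa.string()))
    table = pa.table({"image_id": [1, 2, 3], "tags": tags})

    pq.write_table(table, "images.parquet")
    print(pq.read_table("images.parquet").column("tags"))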


> Versioning is difficult and I believe there are many attempts at this right
> now. DeltaLake for example has the ability to query a dataset at a point
> in time. They basically have Parquet files with some extra json files on
> the side describing the changes.


I've looked at DeltaLake, but as far as I understand, its commit log
depends on spark operations done on the dataframe? Hence, any change to the
dataset has to be performed via spark? Is that correct?


> Straight up versions of files could be achieved with your underlying file
> system. S3 has file versioning.
>

Do you know of any tools to compute diffs between Parquet files? What I
could find was basically: export both files to CSV and run git diff.
DeltaLake would help here, but again, it seems that it only 'tracks' Spark
operations done directly on the file?

Thanks!
Joaquin

PS. Nick, would you like to be mentioned as a contributor in the blog post?
Your comments helped a lot to improve it ;).




On Tue,  Jun 30, 2020 at 6:46 AM Joaquin Vanschoren 
> wrote:
>
> > Hi all,
> >
> > Sorry for restarting an old thread, but we've had a _lot_ of discussions
> > over the past 9 months or so on how to store machine learning datasets
> > internally. We've written a blog post about it and would love to hear
> your
> > thoughts:
> >
> >
> https://openml.github.io/blog/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
> >
> > To be clear: what we need is a data format for archival storage on the
> > server, and preferably one that supports versioning/diff, multi-table
> > storage, and sparse data.
> > Hence, this is for *internal* storage. When OpenML users want to
> download a
> > dataset in parquet or arrow we can always convert it on the fly (or from
> a
> > cache). We already use Arrow/Feather to cache the datasets after it is
> > downloaded (when possible).
> >
> > One specific concern about parquet is that we are not entirely sure
> > whether a parquet file created by one parser (e.g. in R) can always be
> read
> > by another parser (e.g. in Python). We saw some github issues related to
> > this but we don't know whether this is still an issue. Do you know? Also,
> > it seems that none of the current python parsers support partial
> > read/writes, is that correct?
> >
> > Because of these issues, we are still considering a text-based format
> (e.g.
> > CSV) for our main dataset storage, mainly because of its broad native
> > support in all languages and easy versioning/diffs (we could use
> git-lfs),
> > and use parquet/arrow for later usage where possible. We're still
> doubting
> > between CSV and Parquet, though.
> >
> > Do you have any thoughts or comments?
> >
> > Thanks!
> > Joaquin
> >
> > On Thu, 20 Jun 2019 at 23:47, Wes McKinney  wrote:
> >
> > > hi Joaquin -- there would be no practical difference, primarily it
> > > would be for the preservation of APIs in Python and R related to the
> > > Feather format. Internally "read_feather" will invoke the same code
> > > paths as the Arrow protocol file reader
> > >
> > > - Wes
> > >
> > > On Thu, Jun 20, 2019 at 4:12 PM Joaquin Vanschoren
> > >  wrote:
> > > >
> > > > Thank you all for your very detailed answers! I also read in other
> > > threads
> > > > that the 1.0.0 release might be coming somewhere this fall? I'm
> really
> > > > looking forward to that.
> > > > @Wes: will there be any practical difference between Feather and
> Arrow
> > > > after the 1.0.0 release? It is just an alias? What would be the
> > benefits
> > > of
> > > > using Feather rather than Arrow at that point?
> > > >
> > > > Thanks!
> > > > Joaquin
> > > >
> > > >
> > > >
> > > > On Sun, 16 Jun 2019 at 18:25, Sebastien Binet  wrote:
> > > >
> > > > > hi there,
> > > > >
> > > > > On Sun, Jun 16, 2019 at 6:07 AM Micah Kornfield <
> > emkornfi...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > > *  Can Feather files already be read in 

Re: [CI] Reliability of s390x Travis CI build

2020-07-02 Thread Antoine Pitrou


In my experience, both the s390x and ARM builds are flaky on Travis CI,
for reasons which seem unrelated to Arrow.  The infrastructure seems a
bit unreliable.

Regards

Antoine.


On 02/07/2020 at 15:15, Wes McKinney wrote:
> I would be interested to know the empirical reliability of the s390x
> Travis CI build, but my guess is that it is flaking at least 20% of
> the time, maybe more than that. If that's the case, then I think it
> should be added back to allow_failures and at best we can look at it
> periodically to make sure it's passing some of the time including
> near releases. Thoughts?
> 


[RESULT] [VOTE] Add a "Feature" enum to Schema.fbs

2020-07-02 Thread Wes McKinney
Forwarding with [RESULT] subject line

On Wed, Jul 1, 2020 at 1:24 AM Micah Kornfield  wrote:
>
> The vote carries with 4 binding +1 votes and 0 non-binding +1. I will
> merge the change and open some JIRAs about reading/writing the new field
> from reference implementations (hopefully tomorrow).
>
> Thanks,
> Micah
>
> On Tue, Jun 30, 2020 at 11:21 PM Micah Kornfield 
> wrote:
>
> > +1 (binding)
> >
> > On Sun, Jun 28, 2020 at 4:39 PM Sutou Kouhei  wrote:
> >
> >> +1 (binding)
> >>
> >> In 
> >>   "[VOTE] Add a "Feature" enum to Schema.fbs" on Sat, 27 Jun 2020
> >> 20:46:55 -0700,
> >>   Micah Kornfield  wrote:
> >>
> >> > Hi,
> >> >
> >> > As discussed on the mailing list [1] I would like to add a feature enum
> >> to
> >> > enhance our ability to evolve Arrow in a forward compatible manner and
> >> > allow clients and servers to negotiate which features are supported
> >> before
> >> > finalizing the features used.
> >> >
> >> > The PR adds a new enum and a field on schema.fbs [2]. We may make
> >> > modifications to the language in comments but this vote is whether to
> >> > accept the addition of this enum and field.  Details for how this will
> >> be
> >> > used in flight, are not part of this change.
> >> >
> >> > For clarity, this change is non-breaking and fully backwards
> >> > compatible. The field ensures that current libraries will be able to
> >> > determine if a future library version used features that it doesn't
> >> support
> >> > (by checking for out of range enum values).  It will require libraries
> >> to
> >> > both populate the field on writing and check values when reading.
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Accept addition of Feature enum flatbuffers field
> >> > [ ] +0
> >> > [ ] -1 Do not accept addition because...
> >> >
> >> > [1]:
> >> >
> >> https://mail-archives.apache.org/mod_mbox/arrow-dev/202006.mbox/%3CCAK7Z5T-mUB1ipO7YGqwW%3DtcW7eA8_aYvrjWAzLmHw7ZtS09naQ%40mail.gmail.com%3E
> >> > [2]: https://github.com/apache/arrow/pull/7502
> >>
> >


[CI] Reliability of s390x Travis CI build

2020-07-02 Thread Wes McKinney
I would be interested to know the empirical reliability of the s390x
Travis CI build, but my guess is that it is flaking at least 20% of
the time, maybe more than that. If that's the case, then I think it
should be added back to allow_failures and at best we can look at it
periodically to make sure it's passing some of the time including
near releases. Thoughts?


Re: Sharing our experience adopting (py) Arrow in Vaex

2020-07-02 Thread Wes McKinney
On Thu, Jul 2, 2020 at 3:32 AM Maarten Breddels
 wrote:
>
> Hi,
>
> in the process of adding Arrow support in Vaex (natively, not converting to
> Numpy as we did before), one of our biggest pain points is (surprisingly)
> the name mismatch between NumPy's .tolist() and Arrow's .to_pylist().
> Especially in code that deals with both types of arrays, this is a bit of
> an annoyance. We actually use tolist() a lot in our unittests as well. I
> wonder if this was done purposely, or if this is something that
> could still be changed/added.

This particular function could be renamed or aliased, but in general
substitutability in code that currently uses NumPy has not been a goal
of the project.

> The difference in filter/take vs fancy indexing with [] is ok, it doesn't
> happen that often, but I was wondering if this will be added later, or if
> this stays as it is.

I personally wouldn't be thrilled about this -- I think adding too
many syntactic conveniences or trying to emulate NumPy would be a
slippery slope ("you emulate this, but why not that?").

> Another difficult thing is testing for string arrays, since there are two
> string types (utf8 and large_utf8) testing if something is of string type
> is a bit annoying. I don't plan to have a type system in Vaex itself, so we
> leak this to users.
> A similar issue is also array testing, testing if something is an arrow
> array (chunked or plain) is again a test against two types (e.g.
> isinstance(ar, (pa.Array, pa.ChunkedArray)).
> I could see some helper functions pa.is_array and pa.is_string (this is
> already taken, and I guess only tests for 32bit offset strings arrays)

Having some more helper type checking functions sounds fine.
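
For reference, a minimal user-side sketch of such helpers (hypothetical names,
not part of pyarrow):

    import pyarrow as pa

    def is_string_like(t):
        # True for both 32-bit- and 64-bit-offset string types.
        return pa.types.is_string(t) or pa.types.is_large_string(t)

    def is_any_array(obj):
        # True for both contiguous and chunked Arrow arrays.
        return isinstance(obj, (pa.Array, pa.ChunkedArray))

    print(is_string_like(pa.large_string()))             # True
    print(is_any_array(pa.chunked_array([["a", "b"]])))  # True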

> Overall, we're quite positive, and as you see, the pain points are not
> fundamental issue, but annoyances that might be easy to fix, and make
> adoption smoother/faster.
>
> cheers,
>
> Maarten Breddels
> Software engineer / consultant / data scientist
> Python / C++ / Javascript / Jupyter
> www.maartenbreddels.com / vaex.io
> maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>


Re: Developing a C++ Python extension

2020-07-02 Thread Maarten Breddels
Ok, thanks!

I'm setting up a repo with an example here, using pybind11:
https://github.com/vaexio/vaex-arrow-ext

and I'll just try all possible combinations and report back.

cheers,

Maarten Breddels
Software engineer / consultant / data scientist
Python / C++ / Javascript / Jupyter
www.maartenbreddels.com / vaex.io
maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>




Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
jorisvandenboss...@gmail.com>:

> Also no concrete answer, but one such example is turbodbc, I think.
> But it seems they only have conda binary packages, and don't
> distribute wheels ..
> (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> so not that relevant as comparison (they also need to build against an
> odbc driver in addition to arrow).
> But maybe Uwe has some more experience in this regard (and with
> attempts building wheels for turbodbc, eg
> https://github.com/blue-yonder/turbodbc/pull/108).
>
> Joris
>
> On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
> >
> >
> > Hi Maarten,
> >
> > On 02/07/2020 at 10:53, Maarten Breddels wrote:
> > >
> > > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a vaex
> > > extension distributed as a 2010 wheel, and built with the pyarrow 2010
> > > wheel, work in an environment where someone installed a pyarrow 2014
> > > wheel, or built from source, or installed from conda-forge?
> >
> > I have no idea about the concrete answer, but it probably depends
> > whether the libstdc++ ABI changed between those two versions.  I'm
> > afraid you'll have to experiment yourself.
> >
> > (if you want to eschew C++ ABI issues, you may use the C Data Interface:
> > https://arrow.apache.org/docs/format/CDataInterface.html
> > though of course you won't have access to all the useful helpers in the
> > Arrow C++ library)
> >
> > Regards
> >
> > Antoine.
> >
> >
>


Re: Developing a C++ Python extension

2020-07-02 Thread Joris Van den Bossche
Also no concrete answer, but one such example is turbodbc, I think.
But it seems they only have conda binary packages, and don't
distribute wheels ..
(https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
so not that relevant as comparison (they also need to build against an
odbc driver in addition to arrow).
But maybe Uwe has some more experience in this regard (and with
attempts building wheels for turbodbc, eg
https://github.com/blue-yonder/turbodbc/pull/108).

Joris

On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
>
>
> Hi Maarten,
>
> On 02/07/2020 at 10:53, Maarten Breddels wrote:
> >
> > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a vaex
> > extension distributed as a 2010 wheel, and built with the pyarrow 2010
> > wheel, work in an environment where someone installed a pyarrow 2014
> > wheel, or built from source, or installed from conda-forge?
>
> I have no idea about the concrete answer, but it probably depends
> whether the libstdc++ ABI changed between those two versions.  I'm
> afraid you'll have to experiment yourself.
>
> (if you want to eschew C++ ABI issues, you may use the C Data Interface:
> https://arrow.apache.org/docs/format/CDataInterface.html
> though of course you won't have access to all the useful helpers in the
> Arrow C++ library)
>
> Regards
>
> Antoine.
>
>


[NIGHTLY] Arrow Build Report for Job nightly-2020-07-02-0

2020-07-02 Thread Crossbow


Arrow Build Report for Job nightly-2020-07-02-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0

Failed Tasks:
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-test-conda-python-3.8-jpype
- ubuntu-xenial-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-travis-ubuntu-xenial-arm64
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-travis-wheel-osx-cp35m
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-appveyor-wheel-win-cp35m
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-centos-6-amd64
- centos-7-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-travis-centos-7-aarch64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-centos-7-amd64
- centos-8-aarch64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-travis-centos-8-aarch64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-clean
- conda-linux-gcc-py36-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-linux-gcc-py36-cpu
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-linux-gcc-py37-cpu
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-linux-gcc-py38-cuda
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-debian-buster-amd64
- debian-buster-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-travis-debian-buster-arm64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-debian-stretch-amd64
- debian-stretch-arm64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-travis-debian-stretch-arm64
- gandiva-jar-osx:
  URL: 

Re: Developing a C++ Python extension

2020-07-02 Thread Antoine Pitrou


Hi Maarten,

On 02/07/2020 at 10:53, Maarten Breddels wrote:
> 
> Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a vaex
> extension distributed as a 2010 wheel, and built with the pyarrow 2010
> wheel, work in an environment where someone installed a pyarrow 2014
> wheel, or built from source, or installed from conda-forge?

I have no idea about the concrete answer, but it probably depends
whether the libstdc++ ABI changed between those two versions.  I'm
afraid you'll have to experiment yourself.

(if you want to eschew C++ ABI issues, you may use the C Data Interface:
https://arrow.apache.org/docs/format/CDataInterface.html
though of course you won't have access to all the useful helpers in the
Arrow C++ library)

Regards

Antoine.




Developing a C++ Python extension

2020-07-02 Thread Maarten Breddels
Hi,

again, in the process of adopting Arrow in Vaex, we need to have some
legacy C++ code in Vaex itself, and we might want to add some new functions
in C++ that might not be suitable for core Apache Arrow, or that we need to
ship ourselves due to time constraints.

I am a bit worried about the C++ ABI compatibility for Linux binaries and
was wondering if there are already projects out there that have experience
with this.

Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a vaex
extension distributed as a 2010 wheel, and built with the pyarrow 2010
wheel, work in an environment where someone installed a pyarrow 2014
wheel, or built from source, or installed from conda-forge?

cheers,

Maarten Breddels
Software engineer / consultant / data scientist
Python / C++ / Javascript / Jupyter
www.maartenbreddels.com / vaex.io
maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>


Sharing our experience adopting (py) Arrow in Vaex

2020-07-02 Thread Maarten Breddels
Hi,

in the process of adding Arrow support in Vaex (natively, not converting to
Numpy as we did before), one of our biggest pain points is (surprisingly)
the name mismatch between NumPy's .tolist() and Arrow's .to_pylist().
Especially in code that deals with both types of arrays, this is a bit of
an annoyance. We actually use tolist() a lot in our unittests as well. I
wonder if this was done purposely, or if this is something that
could still be changed/added.
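
The mismatch in question is just the method name; a tiny sketch:

    import numpy as np
    import pyarrow as pa

    values = [1, 2, 3]
    print(np.array(values).tolist())     # NumPy spelling: .tolist()
    print(pa.array(values).to_pylist())  # Arrow spelling: .to_pylist()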

The difference in filter/take vs fancy indexing with [] is ok, it doesn't
happen that often, but I was wondering if this will be added later, or if
this stays as it is.

Another difficult thing is testing for string arrays: since there are two
string types (utf8 and large_utf8), testing if something is of string type
is a bit annoying. I don't plan to have a type system in Vaex itself, so we
leak this to users.
A similar issue is array testing: testing if something is an Arrow
array (chunked or plain) is again a test against two types (e.g.
isinstance(ar, (pa.Array, pa.ChunkedArray))).
I could see some helper functions like pa.is_array and pa.is_string (the
latter is already taken, and I guess only tests for 32-bit offset string
arrays).

Overall, we're quite positive, and as you see, the pain points are not
fundamental issues, but annoyances that might be easy to fix, and make
adoption smoother/faster.

cheers,

Maarten Breddels
Software engineer / consultant / data scientist
Python / C++ / Javascript / Jupyter
www.maartenbreddels.com / vaex.io
maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>