[jira] [Created] (ARROW-8228) [C++][Parquet] Support writing lists that have null elements that are non-empty.

2020-03-25 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-8228:
--

 Summary: [C++][Parquet] Support writing lists that have null 
elements that are non-empty.
 Key: ARROW-8228
 URL: https://issues.apache.org/jira/browse/ARROW-8228
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
 Fix For: 1.0.0


With the new V2 level-writing engine we can detect this case, but we fail with a 
not-implemented error.  Fixing this will require changes to the "core" Parquet API.
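
For reference, this is the kind of array in question: a list array whose null 
slot still spans a non-empty range of the values. A pyarrow sketch (for 
illustration only; it builds the array directly from buffers):

{code:python}
import numpy as np
import pyarrow as pa

values = pa.array([1, 2, 3], type=pa.int64())
# offsets: slot 1 covers values[1:3], but the validity bitmap marks it null
offsets = np.array([0, 1, 3, 3], dtype=np.int32)
validity = np.packbits([1, 0, 1], bitorder='little')

arr = pa.Array.from_buffers(
    pa.list_(pa.int64()), 3,
    [pa.py_buffer(validity.tobytes()), pa.py_buffer(offsets.tobytes())],
    children=[values])
# writing a table containing arr to Parquet currently hits the
# not-implemented path in the V2 level writer
{code}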



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8227) [C++] Propose refining SIMD code framework

2020-03-25 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8227:
---

 Summary: [C++] Propose refining SIMD code framework
 Key: ARROW-8227
 URL: https://issues.apache.org/jira/browse/ARROW-8227
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


Arrow supports a wide range of hardware (x86, Arm, PPC?), operating systems 
(Linux, Windows, macOS, others?), and compilers (gcc, clang, MSVC, others?). 
Managing platform-dependent code is non-trivial. This Jira aims to refine the 
SIMD-related code framework.

Some goals:
- Move SIMD feature definitions into one place, possibly in CMake, and reduce 
compiler-based ifdefs in source code.
- Manage SIMD code in one place, but leave non-SIMD default implementations 
where they are.
- Don't introduce any performance penalty; prefer direct inlining to a runtime 
dispatcher.
- Code should be easy to maintain and extend, and hard to get wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8226) [Go] Add binary builder that uses 64 bit offsets and make binary builders resettable

2020-03-25 Thread Richard (Jira)
Richard created ARROW-8226:
--

 Summary: [Go] Add binary builder that uses 64 bit offsets and make 
binary builders resettable
 Key: ARROW-8226
 URL: https://issues.apache.org/jira/browse/ARROW-8226
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Richard


I ran into overflow issues with the existing 32-bit binary builder. My changes 
add a new binary builder that uses 64-bit offsets, along with tests.

I also added a panic for when the 32-bit offset binary builder overflows.

Finally, I made both binary builders resettable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8224) [C++] Remove APIs deprecated prior to 0.16.0

2020-03-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8224:
---

 Summary: [C++] Remove APIs deprecated prior to 0.16.0
 Key: ARROW-8224
 URL: https://issues.apache.org/jira/browse/ARROW-8224
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8223) Schema.from_pandas breaks with pandas nullable integer dtype

2020-03-25 Thread Ged Steponavicius (Jira)
Ged Steponavicius created ARROW-8223:


 Summary: Schema.from_pandas breaks with pandas nullable integer 
dtype
 Key: ARROW-8223
 URL: https://issues.apache.org/jira/browse/ARROW-8223
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1, 0.16.0, 0.15.0
 Environment: pyarrow 0.16
Reporter: Ged Steponavicius


 
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([{'int_col': 1},
                   {'int_col': 2}])
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

schema = pa.Schema.from_pandas(df)
{code}
This raises: ArrowTypeError: Did not pass numpy.dtype object

However, this works fine:
{code:python}
schema = pa.Table.from_pandas(df).schema
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8222) [C++] Use bcp to make a slim boost for bundled build

2020-03-25 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8222:
--

 Summary: [C++] Use bcp to make a slim boost for bundled build
 Key: ARROW-8222
 URL: https://issues.apache.org/jira/browse/ARROW-8222
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson


We don't use much of Boost (just system, filesystem, and regex), but when we do 
a bundled build, we still download and extract all of Boost. The tarball itself 
is 113 MB, and it expands to over 700 MB. This can be slow, and it requires a 
lot of free disk space that we don't really need.

[bcp|https://www.boost.org/doc/libs/1_72_0/tools/bcp/doc/html/index.html] is a 
boost tool that lets you extract a subset of boost, resolving any of its 
necessary dependencies across boost. The savings for us could be huge:

{code}
mkdir test
./bcp system.hpp filesystem.hpp regex.hpp test
tar -czf test.tar.gz test/
{code}

The resulting tarball is 885K (kilobytes!). 

{{bcp}} also lets you re-namespace, so this would (IIUC) solve ARROW-4286 as 
well.

We would need a place to host this tarball, and we would have to update it 
whenever we (1) bump the Boost version or (2) add a new Boost library 
dependency. This patch would of course include a script that generates the 
tarball. Given the small size, we could also consider just vendoring it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Preparing for 0.17.0 Arrow release

2020-03-25 Thread Andy Grove
I just took a first pass at reviewing the Java and Rust issues and removed
some from the 0.17.0 release. There are a few small Rust issues that I am
actively working on for this release.

Thanks.


On Wed, Mar 25, 2020 at 1:13 PM Wes McKinney  wrote:

> hi Neal,
>
> Thanks for helping coordinate. I agree we should be in a position to
> release sometime next week.
>
> Can folks from the Rust and Java side review issues in the backlog?
> According to the dashboard there are 19 Rust issues open and 7 Java
> issues.
>
> Thanks
>
> On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson
>  wrote:
> >
> > Hi all,
> > A few weeks ago, there seemed to be consensus (lazy, at least) for a 0.17
> > release at the end of the month. Judging from
> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.17.0+Release,
> it
> > looks like we're getting closer.
> >
> > I'd encourage everyone to review their backlogs and (1) bump from 0.17
> > scope any tickets they don't plan to finish this week, and (2) if there
> are
> > any issues that should block release, make sure they are flagged as
> > "blockers".
> >
> > Neal
> >
> > On Tue, Mar 10, 2020 at 7:39 AM Wes McKinney 
> wrote:
> >
> > > It seems like the consensus is to push for a 0.17.0 major release
> > > sooner rather than doing a patch release, since releases in general
> > > are costly. This is fine with me. I see that a 0.17.0 milestone has
> > > been created in JIRA and some JIRA gardening has begun. Do you think
> > > we can be in a position to release by the week of March 23 or the week
> > > of March 30?
> > >
> > > On Thu, Mar 5, 2020 at 8:39 PM Wes McKinney 
> wrote:
> > > >
> > > > If people are generally on board with accelerating a 0.17.0 major
> > > > release, then I would suggest renaming "1.0.0" to "0.17.0" and
> > > > beginning to do issue gardening to whittle things down to
> > > > critical-looking bugs and high probability patches for the next
> couple
> > > > of weeks.
> > > >
> > > > On Thu, Mar 5, 2020 at 11:31 AM Wes McKinney 
> > > wrote:
> > > > >
> > > > > I recall there are some other issues that have been reported or
> fixed
> > > > > that are critical and not yet marked with 0.16.1.
> > > > >
> > > > > I'm also OK with doing a 0.17.0 release sooner
> > > > >
> > > > > On Thu, Mar 5, 2020 at 11:31 AM Neal Richardson
> > > > >  wrote:
> > > > > >
> > > > > > I would also be more supportive of doing 0.17 earlier instead of
> a
> > > patch
> > > > > > release.
> > > > > >
> > > > > > Neal
> > > > > >
> > > > > >
> > > > > > On Thu, Mar 5, 2020 at 9:29 AM Neal Richardson <
> > > neal.p.richard...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > If releases were costless to make, I'd be all for it, but it's
> not
> > > clear
> > > > > > > to me that it's worth the diversion from other priorities to
> make
> > > a release
> > > > > > > right now. Nothing on
> > > > > > >
> > >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.16.1
> > > > > > > jumps out to me as super urgent--what are you seeing as
> critical?
> > > > > > >
> > > > > > > If we did decide to go forward, would it be possible to do a
> > > release that
> > > > > > > is limited to the affected implementations (say, do a
> Python-only
> > > release)?
> > > > > > > That might reduce the cost of building and verifying enough to
> > > make it
> > > > > > > reasonable to consider.
> > > > > > >
> > > > > > > Neal
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Mar 5, 2020 at 8:19 AM Krisztián Szűcs <
> > > szucs.kriszt...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> On Thu, Mar 5, 2020 at 5:07 PM Wes McKinney <
> wesmck...@gmail.com>
> > > wrote:
> > > > > > >> >
> > > > > > >> > hi folks,
> > > > > > >> >
> > > > > > >> > There have been a number of critical issues reported (many
> of
> > > them
> > > > > > >> > fixed already) since 0.16.0 was released. Is there interest
> in
> > > > > > >> > preparing a patch 0.16.1 release (with backported patches
> onto a
> > > > > > >> > maint-0.16.x branch as with 0.15.1) since the next major
> > > release is a
> > > > > > >> > minimum of 6-8 weeks away from general availability?
> > > > > > >> >
> > > > > > >> > Did the 0.15.1 patch release helper script that Krisztian
> wrote
> > > get
> > > > > > >> > contributed as a PR?
> > > > > > >> Not yet, but it is available at
> > > > > > >>
> https://gist.github.com/kszucs/b2743546044ccd3215e5bb34fa0d76a0
> > > > > > >> >
> > > > > > >> > Thanks
> > > > > > >> > Wes
> > > > > > >>
> > > > > > >
> > >
>


Re: Preparing for 0.17.0 Arrow release

2020-03-25 Thread Wes McKinney
hi Neal,

Thanks for helping coordinate. I agree we should be in a position to
release sometime next week.

Can folks from the Rust and Java side review issues in the backlog?
According to the dashboard there are 19 Rust issues open and 7 Java
issues.

Thanks

On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson
 wrote:
>
> Hi all,
> A few weeks ago, there seemed to be consensus (lazy, at least) for a 0.17
> release at the end of the month. Judging from
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.17.0+Release, it
> looks like we're getting closer.
>
> I'd encourage everyone to review their backlogs and (1) bump from 0.17
> scope any tickets they don't plan to finish this week, and (2) if there are
> any issues that should block release, make sure they are flagged as
> "blockers".
>
> Neal
>
> On Tue, Mar 10, 2020 at 7:39 AM Wes McKinney  wrote:
>
> > It seems like the consensus is to push for a 0.17.0 major release
> > sooner rather than doing a patch release, since releases in general
> > are costly. This is fine with me. I see that a 0.17.0 milestone has
> > been created in JIRA and some JIRA gardening has begun. Do you think
> > we can be in a position to release by the week of March 23 or the week
> > of March 30?
> >
> > On Thu, Mar 5, 2020 at 8:39 PM Wes McKinney  wrote:
> > >
> > > If people are generally on board with accelerating a 0.17.0 major
> > > release, then I would suggest renaming "1.0.0" to "0.17.0" and
> > > beginning to do issue gardening to whittle things down to
> > > critical-looking bugs and high probability patches for the next couple
> > > of weeks.
> > >
> > > On Thu, Mar 5, 2020 at 11:31 AM Wes McKinney 
> > wrote:
> > > >
> > > > I recall there are some other issues that have been reported or fixed
> > > > that are critical and not yet marked with 0.16.1.
> > > >
> > > > I'm also OK with doing a 0.17.0 release sooner
> > > >
> > > > On Thu, Mar 5, 2020 at 11:31 AM Neal Richardson
> > > >  wrote:
> > > > >
> > > > > I would also be more supportive of doing 0.17 earlier instead of a
> > patch
> > > > > release.
> > > > >
> > > > > Neal
> > > > >
> > > > >
> > > > > On Thu, Mar 5, 2020 at 9:29 AM Neal Richardson <
> > neal.p.richard...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > If releases were costless to make, I'd be all for it, but it's not
> > clear
> > > > > > to me that it's worth the diversion from other priorities to make
> > a release
> > > > > > right now. Nothing on
> > > > > >
> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.16.1
> > > > > > jumps out to me as super urgent--what are you seeing as critical?
> > > > > >
> > > > > > If we did decide to go forward, would it be possible to do a
> > release that
> > > > > > is limited to the affected implementations (say, do a Python-only
> > release)?
> > > > > > That might reduce the cost of building and verifying enough to
> > make it
> > > > > > reasonable to consider.
> > > > > >
> > > > > > Neal
> > > > > >
> > > > > >
> > > > > > On Thu, Mar 5, 2020 at 8:19 AM Krisztián Szűcs <
> > szucs.kriszt...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> On Thu, Mar 5, 2020 at 5:07 PM Wes McKinney 
> > wrote:
> > > > > >> >
> > > > > >> > hi folks,
> > > > > >> >
> > > > > >> > There have been a number of critical issues reported (many of
> > them
> > > > > >> > fixed already) since 0.16.0 was released. Is there interest in
> > > > > >> > preparing a patch 0.16.1 release (with backported patches onto a
> > > > > >> > maint-0.16.x branch as with 0.15.1) since the next major
> > release is a
> > > > > >> > minimum of 6-8 weeks away from general availability?
> > > > > >> >
> > > > > >> > Did the 0.15.1 patch release helper script that Krisztian wrote
> > get
> > > > > >> > contributed as a PR?
> > > > > >> Not yet, but it is available at
> > > > > >> https://gist.github.com/kszucs/b2743546044ccd3215e5bb34fa0d76a0
> > > > > >> >
> > > > > >> > Thanks
> > > > > >> > Wes
> > > > > >>
> > > > > >
> >


[jira] [Created] (ARROW-8220) [Python] Make dataset FileFormat objects serializable

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8220:


 Summary: [Python] Make dataset FileFormat objects serializable
 Key: ARROW-8220
 URL: https://issues.apache.org/jira/browse/ARROW-8220
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Similar to ARROW-8060 and ARROW-8059, the FileFormat objects also need to be 
picklable.
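
A minimal sketch of the desired behaviour (illustrative only):

{code:python}
import pickle
import pyarrow.dataset as ds

fmt = ds.ParquetFileFormat()
# pickling should round-trip to an equivalent FileFormat object
restored = pickle.loads(pickle.dumps(fmt))
assert isinstance(restored, ds.ParquetFileFormat)
{code}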



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8219) [Rust] sqlparser crate needs to be bumped to version 0.2.5

2020-03-25 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-8219:
--

 Summary: [Rust] sqlparser crate needs to be bumped to version 0.2.5
 Key: ARROW-8219
 URL: https://issues.apache.org/jira/browse/ARROW-8219
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Affects Versions: 0.16.0
Reporter: Paddy Horan
Assignee: Paddy Horan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-25 Thread Micah Kornfield
If it isn't hard, could you run with batch sizes of 1024 or 2048 records? I
think a question was raised previously about whether there is any benefit for
smaller buffers.
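
Something like the following would make it easy to compare per-buffer ratios
across batch sizes (a rough sketch using pyarrow's compress helper, assuming
it matches the codecs in your patch; illustrative only):

import pyarrow as pa

def buffer_ratio(batch, codec="zstd"):
    # total compressed size vs. total raw size over every buffer
    raw = comp = 0
    for col in batch.columns:
        for buf in col.buffers():
            if buf is not None:
                raw += buf.size
                comp += len(pa.compress(buf, codec=codec, asbytes=True))
    return comp / raw

table = pa.table({"x": list(range(1 << 20))})
for n in (1024, 2048, 65536):
    batch = table.slice(0, n).to_batches()[0]
    print(n, round(buffer_ratio(batch), 3))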

Thanks,
Micah


On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney  wrote:

> On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield 
> wrote:
> >
> > >
> > > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> > > dataset. So that's a huge space savings
> >
> > One more question on this.  What was the average row-batch size used?  I
> > see in the proposal some buffers might not be compressed; did you use this
> > feature in the test?
>
> I used a 64K row batch size. I haven't implemented the optional
> non-compressed buffers (for cases where there is little space savings),
> so everything is compressed. I can check different batch sizes if you
> like.
>
>
> > On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney 
> wrote:
> >
> > > hi folks,
> > >
> > > Sorry it's taken me a little while to produce supporting benchmarks.
> > >
> > > * I implemented experimental trivial body buffer compression in
> > > https://github.com/apache/arrow/pull/6638
> > > * I hooked up the Arrow IPC file format with compression as the new
> > > Feather V2 format in
> > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> > >
> > > I tested a couple of real-world datasets from a prior blog post
> > > https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4
> > > codecs
> > >
> > > The complete results are here
> > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> > >
> > > Summary:
> > >
> > > * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> > > dataset. So that's a huge space savings
> > > * Single-threaded decompression times exceeding 2-4GByte/s with LZ4
> > > and 1.2-3GByte/s with ZSTD
> > >
> > > I would have to do some more engineering to test throughput changes
> > > with Flight, but given these results on slower networking (e.g. 1
> > > Gigabit) my guess is that the compression and decompression overhead
> > > is little compared with the time savings due to high compression
> > > ratios. If people would like to see these numbers to help make a
> > > decision I can take a closer look
> > >
> > > As far as what Micah said about having a limited number of
> > > compressors: I would be in favor of having just LZ4 and ZSTD. It seems
> > > anecdotally that these outperform Snappy in most real world scenarios
> > > and generally have > 1 GB/s decompression performance. Some Linux
> > > distributions (Arch at least) have already started adopting ZSTD over
> > > LZMA or GZIP [1]
> > >
> > > - Wes
> > >
> > > [1]:
> > >
> https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/
> > >
> > > On Fri, Mar 6, 2020 at 8:42 AM Fan Liya  wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > Thanks a lot for the additional information.
> > > > Looking forward to seeing the good results from your experiments.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney 
> > > wrote:
> > > >
> > > > > I see, thank you.
> > > > >
> > > > > For such a scenario, implementations would need to define a
> > > > > "UserDefinedCodec" interface to enable codecs to be registered from
> > > > > third party code, similar to what is done for extension types [1]
> > > > >
> > > > > I'll update this thread when I get my experimental C++ patch up to
> see
> > > > > what I'm thinking at least for the built-in codecs we have like
> ZSTD.
> > > > >
> > > > >
> > > > >
> > >
> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
> > > > >
> > > > > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya 
> wrote:
> > > > > >
> > > > > > Hi Wes,
> > > > > >
> > > > > > Thanks a lot for your further clarification.
> > > > > >
> > > > > > Some of my preliminary thoughts:
> > > > > >
> > > > > > 1. We assign a unique GUID to each pair of
> compression/decompression
> > > > > > strategies. The GUID is stored as part of the
> > > Message.custom_metadata.
> > > > > When
> > > > > > receiving the GUID, the receiver knows which decompression
> strategy
> > > to
> > > > > use.
> > > > > >
> > > > > > 2. We serialize the decompression strategy, and store it into the
> > > > > > Message.custom_metadata. The receiver can decompress data after
> > > > > > deserializing the strategy.
> > > > > >
> > > > > > Method 1 is generally used in static strategy scenarios while
> method
> > > 2 is
> > > > > > generally used in dynamic strategy scenarios.
> > > > > >
> > > > > > Best,
> > > > > > Liya Fan
> > > > > >
> > > > > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney <
> wesmck...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Okay, I guess my question is how the receiver is going to be
> able
> > > to
> 

[jira] [Created] (ARROW-8218) [C++] Parallelize decompression at field level in experimental IPC compression code

2020-03-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8218:
---

 Summary: [C++] Parallelize decompression at field level in 
experimental IPC compression code
 Key: ARROW-8218
 URL: https://issues.apache.org/jira/browse/ARROW-8218
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


This is follow-up work to ARROW-7979; a minor amount of refactoring will be 
required to move the decompression step out of {{ArrayLoader}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8217) [R][C++] Fix crashing data in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8217:
---

 Summary: [R][C++] Fix crashing data in test-dataset.R on 32-bit 
Windows from ARROW-7979
 Key: ARROW-8217
 URL: https://issues.apache.org/jira/browse/ARROW-8217
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Reporter: Wes McKinney
 Fix For: 0.17.0


If we can obtain a gdb backtrace from the failed test in 
https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8216) filter method for Dataset doesn't distinguish between empty strings and NAs

2020-03-25 Thread Sam Albers (Jira)
Sam Albers created ARROW-8216:
-

 Summary: filter method for Dataset doesn't distinguish between 
empty strings and NAs
 Key: ARROW-8216
 URL: https://issues.apache.org/jira/browse/ARROW-8216
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.16.0
 Environment: R 3.6.3, Windows 10
Reporter: Sam Albers


 

I have just noticed some slightly odd behaviour with the filter method for 
Dataset. 
{code}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'

## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, 'data.parquet')
write_parquet(starwars, fpath)

## df in memory
df_mem <- starwars %>%
  filter(hair_color == "")

## reading from the parquet
df_parquet <- read_parquet(fpath) %>%
  filter(hair_color == "")

## using open_dataset
df_dataset <- open_dataset(dir) %>%
  filter(hair_color == "") %>%
  collect()
{code}
I'm pretty sure all these should return the same data.frame. Am I missing 
something?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8215) [CI][Glib] Meson install fails in the macOS build

2020-03-25 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8215:
--

 Summary: [CI][Glib] Meson install fails in the macOS build
 Key: ARROW-8215
 URL: https://issues.apache.org/jira/browse/ARROW-8215
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, GLib
Reporter: Krisztian Szucs


It also happens in the pull request builds; see the build log: 
https://github.com/apache/arrow/runs/533168517#step:5:1230

cc @kou



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-25 Thread Wes McKinney
On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield  wrote:
>
> >
> > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> > dataset. So that's a huge space savings
>
> One more question on this.  What was the average row-batch size used?  I
> see in the proposal some buffers might not be compressed; did you use this
> feature in the test?

I used a 64K row batch size. I haven't implemented the optional
non-compressed buffers (for cases where there is little space savings),
so everything is compressed. I can check different batch sizes if you
like.
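
For reference, the proposed framing amounts to roughly the following per
buffer (an illustrative sketch; the int64 prefix is assumed here to carry
the uncompressed length):

import struct
import pyarrow as pa

def encode_buffer(buf, codec="zstd"):
    # compress the buffer body and prepend an int64 length prefix
    body = pa.compress(buf, codec=codec, asbytes=True)
    return struct.pack("<q", buf.size) + body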


> On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney  wrote:
>
> > hi folks,
> >
> > Sorry it's taken me a little while to produce supporting benchmarks.
> >
> > * I implemented experimental trivial body buffer compression in
> > https://github.com/apache/arrow/pull/6638
> > * I hooked up the Arrow IPC file format with compression as the new
> > Feather V2 format in
> > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> >
> > I tested a couple of real-world datasets from a prior blog post
> > https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4
> > codecs
> >
> > The complete results are here
> > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> >
> > Summary:
> >
> > * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> > dataset. So that's a huge space savings
> > * Single-threaded decompression times exceeding 2-4GByte/s with LZ4
> > and 1.2-3GByte/s with ZSTD
> >
> > I would have to do some more engineering to test throughput changes
> > with Flight, but given these results on slower networking (e.g. 1
> > Gigabit) my guess is that the compression and decompression overhead
> > is little compared with the time savings due to high compression
> > ratios. If people would like to see these numbers to help make a
> > decision I can take a closer look
> >
> > As far as what Micah said about having a limited number of
> > compressors: I would be in favor of having just LZ4 and ZSTD. It seems
> > anecdotally that these outperform Snappy in most real world scenarios
> > and generally have > 1 GB/s decompression performance. Some Linux
> > distributions (Arch at least) have already started adopting ZSTD over
> > LZMA or GZIP [1]
> >
> > - Wes
> >
> > [1]:
> > https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/
> >
> > On Fri, Mar 6, 2020 at 8:42 AM Fan Liya  wrote:
> > >
> > > Hi Wes,
> > >
> > > Thanks a lot for the additional information.
> > > Looking forward to seeing the good results from your experiments.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney 
> > wrote:
> > >
> > > > I see, thank you.
> > > >
> > > > For such a scenario, implementations would need to define a
> > > > "UserDefinedCodec" interface to enable codecs to be registered from
> > > > third party code, similar to what is done for extension types [1]
> > > >
> > > > I'll update this thread when I get my experimental C++ patch up to see
> > > > what I'm thinking at least for the built-in codecs we have like ZSTD.
> > > >
> > > >
> > > >
> > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
> > > >
> > > > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya  wrote:
> > > > >
> > > > > Hi Wes,
> > > > >
> > > > > Thanks a lot for your further clarification.
> > > > >
> > > > > Some of my preliminary thoughts:
> > > > >
> > > > > 1. We assign a unique GUID to each pair of compression/decompression
> > > > > strategies. The GUID is stored as part of the
> > Message.custom_metadata.
> > > > When
> > > > > receiving the GUID, the receiver knows which decompression strategy
> > to
> > > > use.
> > > > >
> > > > > 2. We serialize the decompression strategy, and store it into the
> > > > > Message.custom_metadata. The receiver can decompress data after
> > > > > deserializing the strategy.
> > > > >
> > > > > Method 1 is generally used in static strategy scenarios while method
> > 2 is
> > > > > generally used in dynamic strategy scenarios.
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > > >
> > > > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > Okay, I guess my question is how the receiver is going to be able
> > > > > > to determine how to "rehydrate" the record batch buffers:
> > > > > >
> > > > > > What I've proposed amounts to the following:
> > > > > >
> > > > > > * UNCOMPRESSED: the current behavior
> > > > > > * ZSTD/LZ4/...: each buffer is compressed and written with an int64
> > > > > > length prefix
> > > > > >
> > > > > > (I'm close to putting up a PR implementing an experimental version
> > > > > > of this that uses Message.custom_metadata to transmit the codec, so
> > > > > > this will make the implementation 

[jira] [Created] (ARROW-8213) [Python][Dataset] Opening a dataset with a local incorrect path gives confusing error message

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8213:


 Summary: [Python][Dataset] Opening a dataset with a local 
incorrect path gives confusing error message
 Key: ARROW-8213
 URL: https://issues.apache.org/jira/browse/ARROW-8213
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Even after the previous PRs related to local paths 
(https://github.com/apache/arrow/pull/6643, 
https://github.com/apache/arrow/pull/6655), I don't find the user experience 
optimal when you are working with local files and pass a wrong, non-existent 
path (e.g. due to a typo).

Currently, you get this error:

{code}
>>> dataset = ds.dataset("data_with_typo.parquet", format="parquet")
...
ArrowInvalid: URI has empty scheme: 'data_with_typo.parquet'
{code}

where "URI has empty scheme" is rather confusing for the user in case of a 
non-existent path.  I think ideally we should raise a "No such file or 
directory" error.

I am not fully sure what the best solution is, as {{FileSystem.from_uri}} can 
also give other errors that we do want to propagate to the user. 
The most straightforward that I am now thinking of is checking if "URI has 
empty scheme" is in the error message, and then rewording it, but that's not 
very clean ..
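
A rough sketch of an alternative ({{filesystem_from_path_or_uri}} is a 
hypothetical helper, not existing API):

{code:python}
import os

import pyarrow as pa
import pyarrow.fs as fs

def filesystem_from_path_or_uri(source):
    # hypothetical helper: existing local paths bypass URI parsing entirely
    if os.path.exists(source):
        return fs.LocalFileSystem(), source
    try:
        return fs.FileSystem.from_uri(source)
    except pa.ArrowInvalid:
        # not a parseable URI and not an existing file: likely a typo
        raise FileNotFoundError(
            "No such file or directory: '{}'".format(source))
{code}

This still folds other invalid-URI errors into {{FileNotFoundError}}, though, 
so it isn't a complete answer either.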



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8212) [Python][Dataset] Consider adding Cast like operation

2020-03-25 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8212:
--

 Summary: [Python][Dataset] Consider adding Cast like operation
 Key: ARROW-8212
 URL: https://issues.apache.org/jira/browse/ARROW-8212
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs


It would cast an expression to the data type of another expression.

Re-evaluate once the new LogicalPlan implementation is merged.
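
A sketch of possible usage (hypothetical API, for illustration only):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("data.parquet", format="parquet")
# hypothetical: cast field "a" before comparing it against field "b"
expr = ds.field("a").cast(pa.float64()) == ds.field("b")
table = dataset.to_table(filter=expr)
{code}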



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8211) [C++] Sanitize hdfs host when creating HadoopFileSystem from endpoint

2020-03-25 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8211:
--

 Summary: [C++] Sanitize hdfs host when creating HadoopFileSystem 
from endpoint
 Key: ARROW-8211
 URL: https://issues.apache.org/jira/browse/ARROW-8211
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs


Creating a HadoopFileSystem from a URI always 
[prepends|https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/hdfs.cc#L283]
 the scheme to the host, whereas configuring from an endpoint [does 
not|https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/hdfs.cc#L253].

This has caused issues during equality checks and serialization.
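
A sketch of the kind of normalization both code paths could share 
({{sanitize_host}} is a hypothetical helper; the real fix is in the C++ 
filesystem layer):

{code:python}
def sanitize_host(host, scheme="hdfs"):
    # hypothetical helper: strip the scheme prefix so that
    # "hdfs://namenode" and "namenode" yield the same stored host
    prefix = scheme + "://"
    if host.startswith(prefix):
        return host[len(prefix):]
    return host
{code}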



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8209) [Python] Accessing duplicate column of Table by name gives wrong error

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8209:


 Summary: [Python] Accessing duplicate column of Table by name 
gives wrong error
 Key: ARROW-8209
 URL: https://issues.apache.org/jira/browse/ARROW-8209
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


When you have a table with duplicate column names and you try to access this 
column, you get an error about the column not existing:

{code:python}
>>> table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]),
...                   pa.array([7, 8, 9])], names=['a', 'b', 'a'])
>>> table.column('a')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
----> 1 table.column('a')

~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.column()

KeyError: 'Column a does not exist in table'
{code}

It should instead give an error message saying that the column name is duplicated.
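
A sketch of the intended lookup behaviour ({{column_by_name}} is a 
hypothetical helper, not the actual implementation):

{code:python}
def column_by_name(table, name):
    indices = [i for i, field in enumerate(table.schema)
               if field.name == name]
    if len(indices) == 0:
        raise KeyError('Column {} does not exist in table'.format(name))
    if len(indices) > 1:
        raise KeyError(
            'Column {} is duplicate; use Table.column(i) with one of the '
            'indices {}'.format(name, indices))
    return table.column(indices[0])
{code}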



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-03-25-0

2020-03-25 Thread Crossbow


Arrow Build Report for Job nightly-2020-03-25-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0

Failed Tasks:
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-gandiva-jar-trusty
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-cpp-valgrind
- test-debian-10-go-1.12:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-debian-10-go-1.12
- test-r-linux-as-cran:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-test-r-linux-as-cran
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-appveyor-wheel-win-cp36m
- wheel-win-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-appveyor-wheel-win-cp37m
- wheel-win-cp38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-appveyor-wheel-win-cp38

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-centos-8
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-azure-conda-win-vs2015-py38
- debian-buster:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-debian-buster
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-github-debian-stretch
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-homebrew-cpp
- macos-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-travis-macos-r-autobrew
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-03-25-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 

[jira] [Created] (ARROW-8208) [PYTHON] RowGroup filtering with ParquetDataset

2020-03-25 Thread Christophe Clienti (Jira)
Christophe Clienti created ARROW-8208:
-

 Summary: [PYTHON] RowGroup filtering with ParquetDataset
 Key: ARROW-8208
 URL: https://issues.apache.org/jira/browse/ARROW-8208
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Christophe Clienti


Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround propose here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder whether it can work on a single file, as I get an exception with 
the following code:
{code:python}
ParquetDataset('data.parquet',
               filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
datasets. Moreover, I read some information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure whether a ParquetDataset can use row-group statistics to 
filter specific row groups within a file in a dataset.

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today, with pyarrow, I'm forced to filter the row groups in each file manually 
(as sketched below), which prevents me from using the ParquetDataset partition 
filtering functionality.
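
For reference, the manual filtering looks roughly like this (a sketch; it 
assumes min/max statistics are present and a simple equality predicate):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def read_matching_row_groups(path, column_index, value):
    # keep a row group unless its min/max statistics rule the value out
    pf = pq.ParquetFile(path)
    pieces = []
    for i in range(pf.num_row_groups):
        stats = pf.metadata.row_group(i).column(column_index).statistics
        if stats is None or stats.min <= value <= stats.max:
            pieces.append(pf.read_row_group(i))
    return pa.concat_tables(pieces)

table = read_matching_row_groups('data.parquet', 0, 'AAPL')
{code}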

Row groups are really useful because they prevent filling the filesystem 
with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-25 Thread Sebastien Binet
On Wed, Mar 25, 2020 at 2:32 AM Wes McKinney  wrote:

> From what I've found searching on the internet
>
> - Java:
> * ZSTD -- JNI-based library available
> * LZ4 -- both JNI and native Java available
>
> - Go: ZSTD is a C binding, while there is an LZ4 native Go implementation
>
AFAIK, one has access to pure-Go packages for both of these compressors:
- github.com/pierrec/lz4
- github.com/klauspost/compress

-s