[jira] [Created] (ARROW-8359) [C++/Python] Enable aarch64/ppc64le build in conda recipes

2020-04-06 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-8359:
---

 Summary: [C++/Python] Enable aarch64/ppc64le build in conda recipes
 Key: ARROW-8359
 URL: https://issues.apache.org/jira/browse/ARROW-8359
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging, Python
Reporter: Uwe Korn
 Fix For: 0.17.0


These two new arches were added in the conda recipes; we should also build them
as nightlies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-04-06 Thread Wes McKinney
I updated the Format proposal again, please have a look

https://github.com/apache/arrow/pull/6707

On Wed, Apr 1, 2020 at 10:15 AM Wes McKinney  wrote:
>
> For uncompressed, memory mapping is disabled, so all of the bytes are
> being read into RAM. I wanted to show that even when your IO pipe is
> very fast (in the case with an NVMe SSD like I have, > 1GB/s for read
> from disk) that you can still load faster with compressed files.
>
> Here were the prior Read results with
>
> * Single threaded decompression
> * Memory mapping enabled
>
> https://ibb.co/4ZncdF8
>
> You can see for larger chunksizes, because the IPC reconstruction
> overhead is about 60 microseconds per batch, that read time is very
> low (10s of milliseconds).
>
> On Wed, Apr 1, 2020 at 10:10 AM Antoine Pitrou  wrote:
> >
> >
> > The read times are still with memory mapping for the uncompressed case?
> >  If so, impressive!
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 01/04/2020 à 16:44, Wes McKinney a écrit :
> > > Several pieces of work got done in the last few days:
> > >
> > > * Changing from LZ4 raw to LZ4 frame format (what is recommended for
> > > interoperability)
> > > * Parallelizing both compression and decompression at the field level
> > >
> > > Here are the results (using 8 threads on an 8-core laptop). I disabled
> > > the "memory map" feature so that in the uncompressed case all of the
> > > data must be read off disk into memory. This helps illustrate the
> > > compression/IO tradeoff to wall clock load times
> > >
> > > File size (only LZ4 may be different): https://ibb.co/CP3VQkp
> > > Read time: https://ibb.co/vz9JZMx
> > > Write time: https://ibb.co/H7bb68T
> > >
> > > In summary, now with multicore compression and decompression,
> > > LZ4-compressed files are faster both to read and write even on a very
> > > fast SSD, as are ZSTD-compressed files with a low ZSTD compression
> > > level. I didn't notice a major difference between LZ4 raw and LZ4
> > > frame formats. The reads and writes could be made faster still by
> > > pipelining / making concurrent the disk read/write and
> > > compression/decompression steps -- the current implementation performs
> > > these tasks serially. We can improve this in the near future
> > >
> > > I'll update the Format proposal this week so we can move toward
> > > something we can vote on. I would recommend that we await
> > > implementations and integration tests for this before releasing this
> > > as stable, in line with prior discussions about adding stuff to the
> > > IPC protocol
> > >
> > > On Thu, Mar 26, 2020 at 4:57 PM Wes McKinney  wrote:
> > >>
> > >> Here are the results:
> > >>
> > >> File size: https://ibb.co/71sBsg3
> > >> Read time: https://ibb.co/4ZncdF8
> > >> Write time: https://ibb.co/xhNkRS2
> > >>
> > >> Code: 
> > >> https://github.com/wesm/notebooks/blob/master/20190919file_benchmarks/FeatherCompression.ipynb
> > >> (based on https://github.com/apache/arrow/pull/6694)
> > >>
> > >> High level summary:
> > >>
> > >> * Chunksize 1024 vs 64K has relatively limited impact on file sizes
> > >>
> > >> * Wall clock read time is impacted by chunksize, maybe 30-40%
> > >> difference between 1K row chunks versus 16K row chunks. One notable
> > >> thing is that you can see clearly the overhead associated with IPC
> > >> reconstruction even when the data is memory mapped. For example, in
> > >> the Fannie Mae dataset there are 21,661 batches (each batch has 31
> > >> fields) when the chunksize is 1024. So a read time of 1.3 seconds
> > >> indicates ~60 microseconds of overhead for each record batch. When you
> > >> consider the amount of business logic involved with reconstructing a
> > >> record batch, 60 microseconds is pretty good. This also shows that
> > >> every microsecond counts and we need to be carefully tracking
> > >> microperformance in this critical operation.
> > >>
> > >> * Small chunksize results in higher write times for "expensive" codecs
> > >> like ZSTD with a high compression ratio. For "cheap" codecs like LZ4
> > >> it doesn't make as much of a difference
> > >>
> > >> * Note that LZ4 compressor results in faster wall clock time to disk
> > >> presumably because the compression speed is faster than my SSD's write
> > >> speed
> > >>
> > >> Implementation notes:
> > >> * There is no parallelization or pipelining of reads or writes. For
> > >> example, on write, all of the buffers are compressed with a single
> > >> thread and then compression stops until the write to disk completes.
> > >> On read, buffers are decompressed serially
> > >>
> > >>
> > >> On Thu, Mar 26, 2020 at 12:24 PM Wes McKinney  
> > >> wrote:
> > >>>
> > >>> I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you
> > >>> know the read/write times and compression ratios. Shouldn't take too
> > >>> long
> > >>>
> > >>> On Wed, Mar 25, 2020 at 10:37 PM Fan Liya  wrote:
> > 
> >  Thanks a lot for sharing the good results.
> > 
> >  As investigated 

Re: Preparing for 0.17.0 Arrow release

2020-04-06 Thread Andy Grove
There are two trivial Rust PRs pending that I would like to see merged for
the release.

ARROW-7794: [Rust] Support releasing arrow-flight

https://github.com/apache/arrow/pull/6858

ARROW-8357: [Rust] [DataFusion] Dockerfile for CLI is missing format dir

https://github.com/apache/arrow/pull/6860

Thanks,

Andy.


On Mon, Apr 6, 2020 at 6:55 AM Antoine Pitrou  wrote:

>
> Also nice to have perhaps (PR available and several back-and-forths
> already):
>
> * ARROW-7610: [Java] Finish support for 64 bit int allocations
>
> Needs a Java committer to decide...
>
> Regards
>
> Antoine.
>
>
> Le 06/04/2020 à 00:24, Wes McKinney a écrit :
> > We are getting close to the 0.17.0 endgame.
> >
> > Here are the 18 JIRAs still in the 0.17.0 milestone. There are a few
> > issues without patches yet so we should decide quickly whether they
> > need to be included. Are there any blocking issues not accounted for in
> > the milestone?
> >
> > * ARROW-6947 [Rust] [DataFusion] Add support for scalar UDFs
> >
> > Patch available
> >
> > * ARROW-7794 [Rust] cargo publish fails for arrow-flight due to
> > relative path to Flight.proto
> >
> > No patch yet
> >
> > * ARROW-7222 [Python][Release] Wipe any existing generated Python API
> > documentation when updating website
> >
> > This issue needs to be addressed by the release manager and the
> > Confluence instructions must be updated.
> >
> > * ARROW-7891 [C++] RecordBatch->Equals should also have a
> > check_metadata argument
> >
> > Patch available that needs to be reviewed and approved
> >
> > * ARROW-8164: [C++][Dataset] Let datasets be viewable with non-identical
> schema
> >
> > Patch available, but failures to be resolved
> >
> > * ARROW-7965: [Python] Hold a reference to the dataset factory for later
> reuse
> >
> > Depends on ARROW-8164, will require rebase
> >
> > * ARROW-8039: [Python][Dataset] Support using dataset API in
> > pyarrow.parquet with a minimal ParquetDataset shim
> >
> > Patch pending
> >
> > * ARROW-8047: [Python][Documentation] Document migration from
> > ParquetDataset to pyarrow.datasets
> >
> > May be tackled beyond 0.17.0
> >
> > * ARROW-8063: [Python] Add user guide documentation for Datasets API
> >
> > May be tackled beyond 0.17.0
> >
> > * ARROW-8149 [C++/Python] Enable CUDA Support in conda recipes
> >
> > Does not seem strictly necessary for release, since it is a packaging issue
> >
> > * ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors
> >
> > Patch available, but needs review. May
> >
> > * ARROW-8213: [Python][Dataset] Opening a dataset with a local
> > incorrect path gives confusing error message
> >
> > Nice to have, but not essential
> >
> > * ARROW-8266: [C++] Add backup mirrors for external project source
> downloads
> >
> > Patch available, nice to have
> >
> > * ARROW-8275 [Python][Docs] Review Feather + IPC file documentation
> > per "Feather V2" changes
> >
> > Patch available
> >
> > * ARROW-8300 [R] Documentation and changelog updates for 0.17
> >
> > Patch available
> >
> > * ARROW-8320 [Documentation][Format] Clarify (lack of) alignment
> > requirements in C data interface
> >
> > Patch available
> >
> > * ARROW-8330: [Documentation] The post release script generates the
> > documentation with a development version
> >
> > Patch available
> >
> > * ARROW-8335: [Release] Add crossbow jobs to run release verification
> >
> > Patch in progress
> >
> > On Tue, Mar 31, 2020 at 11:23 PM Fan Liya  wrote:
> >>
> >> I see ARROW-6871 in the list.
> >> It seems it has some bugs, which are being fixed by ARROW-8239.
> >> So I have added ARROW-8239 to the list.
> >>
> >> The PR for ARROW-8239 is already approved, so it is expected to be
> resolved
> >> soon.
> >>
> >> Best,
> >> Liya Fan
> >>
> >> On Wed, Apr 1, 2020 at 12:01 PM Micah Kornfield 
> >> wrote:
> >>
> >>> I moved the Java issues out of 0.17.0, they seem complex enough or not
> of
> >>> enough significance to make them blockers for 0.17.0 release.  If
> owners of
> >>> the issues disagree please move them back in.
> >>>
> >>> On Tue, Mar 31, 2020 at 6:05 PM Wes McKinney 
> wrote:
> >>>
>  We've made good progress, but there are still 35 issues in the
>  backlog. Some of them are documentation related, but there are some
>  functionality-related patches that could be at risk. If all could
>  review again to trim out anything that isn't going to make the cut for
>  0.17.0, please do
> 
>  On Wed, Mar 25, 2020 at 2:39 PM Andy Grove 
> >>> wrote:
> >
> > I just took a first pass at reviewing the Java and Rust issues and
>  removed
> > some from the 0.17.0 release. There are a few small Rust issues that
> I
> >>> am
> > actively working on for this release.
> >
> > Thanks.
> >
> >
> > On Wed, Mar 25, 2020 at 1:13 PM Wes McKinney 
>  wrote:
> >
> >> hi Neal,
> >>
> >> Thanks for helping coordinate. I agree we should be in a position to
> >> release sometime next week.
> 

[jira] [Created] (ARROW-8358) [C++] Fix -Wrange-loop-construct warnings in clang-11

2020-04-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8358:
---

 Summary: [C++] Fix -Wrange-loop-construct warnings in clang-11 
 Key: ARROW-8358
 URL: https://issues.apache.org/jira/browse/ARROW-8358
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


We might change one of our CI entries to use clang-11 so we get some more 
bleeding edge compiler warnings, to get out ahead of things





[jira] [Created] (ARROW-8357) [Rust] [DataFusion] Dockerfile for CLI is missing format dir

2020-04-06 Thread Andy Grove (Jira)
Andy Grove created ARROW-8357:
-

 Summary: [Rust] [DataFusion] Dockerfile for CLI is missing format 
dir
 Key: ARROW-8357
 URL: https://issues.apache.org/jira/browse/ARROW-8357
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.17.0


{code:java}
error: failed to run custom build command for `arrow-flight v1.0.0-SNAPSHOT 
(/arrow/rust/arrow-flight)`Caused by:
  process didn't exit successfully: 
`/arrow/rust/target/release/build/arrow-flight-a0fb14daffea70f5/build-script-build`
 (exit code: 1)
--- stderr
Error: Custom { kind: Other, error: "protoc failed: ../../format: warning: 
directory does not exist.\nCould not make proto path relative: 
../../format/Flight.proto: No such file or directory\n" }warning: build failed, 
waiting for other jobs to finish...
error: failed to compile `datafusion v1.0.0-SNAPSHOT (/arrow/rust/datafusion)`, 
intermediate artifacts can be found at `/arrow/rust/target`Caused by:
  build failed
 {code}





[jira] [Created] (ARROW-8356) [Developer] Support * wildcards with "crossbow submit" via GitHub actions

2020-04-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8356:
---

 Summary: [Developer] Support * wildcards with "crossbow submit" 
via GitHub actions
 Key: ARROW-8356
 URL: https://issues.apache.org/jira/browse/ARROW-8356
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney


While the "group" feature can be useful, sometimes there is a set of builds
that does not fit neatly into any particular group





[jira] [Created] (ARROW-8355) [Python] Reduce the number of pandas dependent test cases in test_feather

2020-04-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8355:
--

 Summary: [Python] Reduce the number of pandas dependent test cases 
in test_feather
 Key: ARROW-8355
 URL: https://issues.apache.org/jira/browse/ARROW-8355
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Krisztian Szucs
 Fix For: 1.0.0


See comment https://github.com/apache/arrow/pull/6849#discussion_r404160096





[jira] [Created] (ARROW-8353) [C++] is_nullable maybe not initialized in parquet writer

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8353:
--

 Summary: [C++] is_nullable maybe not initialized in parquet writer
 Key: ARROW-8353
 URL: https://issues.apache.org/jira/browse/ARROW-8353
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson


From the Rtools build:

{code}
[ 84%] Building CXX object 
src/parquet/CMakeFiles/parquet_static.dir/column_reader.cc.obj
In file included from D:/a/arrow/arrow/cpp/src/arrow/io/concurrency.h:23:0,
 from D:/a/arrow/arrow/cpp/src/arrow/io/memory.h:25,
 from D:/a/arrow/arrow/cpp/src/parquet/platform.h:25,
 from D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.h:23,
 from D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.cc:18:
D:/a/arrow/arrow/cpp/src/arrow/result.h: In member function 'virtual 
arrow::Status parquet::arrow::FileWriterImpl::WriteColumnChunk(const 
std::shared_ptr&, int64_t, int64_t)':
D:/a/arrow/arrow/cpp/src/arrow/result.h:428:28: warning: 'is_nullable' may be 
used uninitialized in this function [-Wmaybe-uninitialized]
   auto result_name = (rexpr);   \
^
D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.cc:430:10: note: 'is_nullable' 
was declared here
 bool is_nullable;
  ^
{code}

I'd give it a default value, but IDK that it's that simple.





[jira] [Created] (ARROW-8352) [R] Add install_pyarrow()

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8352:
--

 Summary: [R] Add install_pyarrow()
 Key: ARROW-8352
 URL: https://issues.apache.org/jira/browse/ARROW-8352
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neal Richardson
Assignee: Neal Richardson


To facilitate installing for use with reticulate, including handling how to use 
the nightly packages.





Re: building arrow with CMake 2.8 on CentOS

2020-04-06 Thread Wes McKinney
Newer versions of CMake are also available from PyPI

pip install cmake

https://pypi.org/project/cmake/

On Mon, Apr 6, 2020 at 1:11 AM Sutou Kouhei  wrote:
>
> Hi,
>
> We don't support CMake 2.8. Please use CMake 3.2 or later.
>
> Are you using CentOS 6? You can install CMake 3.6 with EPEL
> on CentOS 6:
>
>   % sudo yum install -y epel-release
>   % sudo yum install -y cmake3
>   % cmake3 --version
>   cmake3 version 3.6.1
>
>   CMake suite maintained and supported by Kitware (kitware.com/cmake).
>
>
> Thanks,
> ---
> kou
>
> In "building arrow with CMake 2.8 on CentOS" on Mon, 6 Apr 2020 05:39:56 +,
>   "Lekshmi Narayanan, Arun Balajiee"  wrote:
>
> > Hi
> >
> > I am looking to build Arrow on CentOS with cmake version 2.8. It is a 
> > shared server, so the server admin at my school doesn't want to update the 
> > version of cmake. I checked these two issues,
> > https://issues.apache.org/jira/browse/ARROW-73
> > https://issues.apache.org/jira/browse/ARROW-66
> >
> > but I couldn't arrive at a resolution on how to build on my machine.
> >
> > I also couldn't find documentation for building on CentOS with these cmake
> > settings. Could you help me with this?
> >
> >
> > Regards,
> > Arun  Balajiee


Re: C interface clarifications

2020-04-06 Thread Wes McKinney
On Mon, Apr 6, 2020 at 12:22 PM Todd Lipcon  wrote:
>
> On Mon, Apr 6, 2020 at 9:57 AM Antoine Pitrou  wrote:
>
> >
> > Hello Todd,
> >
> > Le 06/04/2020 à 18:18, Todd Lipcon a écrit :
> > >
> > > I had a couple questions / items that should be clarified in the spec.
> > Wes
> > > suggested I raise them here on dev@:
> > >
> > > *1) Should producers expect callers to zero-init structs?*
> >
> > IMO, they shouldn't.  They should fill the structure exhaustively.
> > Though ultimately it's a decision made by the implementer of the
> > producer API.
> >
> > > I suppose since it's the "C interface" it's
> > > probably best to follow the C-style "producer assumes the argument
> > contains
> > > uninitialized memory" convention.
> >
> > Exactly.
> >
> > > *2) Clarify lifetime semantics for nested structures*
> > >
> > > In my application, I'd like to allocate all of the children structures of
> > > an ArrowSchema or ArrowArray out of a memory pool which is stored in the
> > > private_data field of the top-level struct. As such, my initial
> > > implementation was to make the 'release' callback on the top-level struct
> > > delete that memory pool, and set the 'release' callback of all children
> > > structs to null, since their memory was totally owned by the top-level
> > > struct.
> > >
> > > I figured this approach was OK because the spec says:
> > >
> > >>  Consumers MUST call a base structure's release callback when they won't
> > > be using it anymore, but they MUST not call any of its children's release
> > > callbacks (including the optional dictionary). The producer is
> > responsible
> > > for releasing the children.
> > >
> > > That advice seems to indicate that I can do whatever I want with the
> > > release callback of the children, including not setting it.
> >
> > ... Except that in this case, moving a child wouldn't be possible.
> >
> > > This section of the spec also seems to be a bit in conflict with the
> > > following:
> > >
> > >> It is possible to move a child array, but the parent array MUST be
> > > released immediately afterwards, as it won't point to a valid child array
> > > anymore. This satisfies the use case of keeping only a subset of child
> > > arrays, while releasing the others.
> > >
> > > ... because if you have a parent array which owns the memory referred to
> > by
> > > the child, then moving the child (with a no-op release callback) followed
> > > by releasing the parent, you'll end up with an invalid or deallocated
> > child
> > > as well.
> >
> > I think the solution here is for your memory pool to be
> > reference-counted, and have each release callback in the array tree
> > decrement the reference count.  Does that sound reasonable to you?
> >
>
> Sure, I can do that. But I imagine consumers may also be a bit surprised by
> this behavior, that releasing the children doesn't actually free up any
> memory.

This should be documented in the producer API, I think: even if you
move the children out of the parent, the common resource persists
beyond the parent's release() invocation. It may vary between
producer implementations of the C interface.

> The spec should also probably cover thread-safety: if the consumer gets an
> ArrowArray, is it safe to pass off the children to multiple threads and
> have them call release() concurrently? In other words, do I need to use a
> thread-safe reference count? I would guess so.

Yes, e.g. std::shared_ptr or similar. I agree that the spec should
address or at least provide a strong recommendation regarding
thread-safety of resources shared by distinct ArrowArray structures in
their private_data. While in many scenarios the top-level release
callback will handle destruction and any related thread-safety issues,
as soon as any children are moved out of the parent, their release
callbacks could be called at any time.

So in the spec it says

"It is possible to move a child array, but the parent array MUST be
released immediately afterwards, as it won't point to a valid child
array anymore. This satisfies the use case of keeping only a subset of
child arrays, while releasing the others."

we should add a sentence like "It is recommended that producers take
thread-safety into consideration and ensure that moved child arrays'
release callbacks can be called in a concurrent setting."

> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
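The reference-counted pool suggested above can be modeled in a few lines.
The following Python sketch uses hypothetical names (SharedPool, owners) to
show the invariant only; a real producer would keep an atomic counter in the
C structures' private_data:

```python
import threading

class SharedPool:
    """Model of a producer-owned memory pool shared by a parent
    ArrowArray and its children (all names here are hypothetical)."""
    def __init__(self, owners):
        self._count = owners            # one reference per structure
        self._lock = threading.Lock()   # stands in for an atomic counter in C
        self.freed = False

    def release(self):
        # Each structure's release callback calls this exactly once,
        # possibly from different threads.
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self.freed = True       # a real producer would free() here

# A parent with two children: three structures share one pool.
pool = SharedPool(owners=3)
pool.release()          # child 1 released (possibly after being moved)
pool.release()          # child 2 released
assert not pool.freed   # parent still holds a reference
pool.release()          # parent released: pool is actually freed
assert pool.freed
```

The point of the lock (an atomic in C) is exactly the thread-safety concern
raised above: once children are moved out of the parent, their release
callbacks may run concurrently.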


Re: Attn: Wes, Re: Masked Arrays

2020-04-06 Thread Wes McKinney
For the sake of others reading, this discussion might be a bit
confusing to happen upon because the scope isn't clear. It seems that
we are discussing the C++ implementation and not the columnar format,
is that right?

Adding any additional metadata about this to the columnar format /
Flatbuffers files / C interface is probably a non-starter. We've
discussed the contents of data "underneath" a null and consistently
the consensus is that it is unspecified.

Applications (as well as internal details of some implementations and
their interactions with external libraries) are free to set
custom_metadata fields in schemas to indicate otherwise. However, one
must take care to not propagate this metadata inappropriately from one
realization of a schema (as an Array or RecordBatch) where it is true
to another where it is not true. Similarly, one should also be careful
not to use such metadata on data whose provenance is unknown.

- Wes

On Mon, Apr 6, 2020 at 11:37 AM Felix Benning  wrote:
>
> In that case it is probably necessary to have a "has_sentinel" flag and a
> "sentinel_value" variable. Since other algorithms might benefit from not
> having to set these values to zero. Which is probably the reason why the
> value "underneath" was set to unspecified in the first place. Alternatively
> a "sentinel_enum" could specify whether the sentinel is 0, or the R
> sentinel value is used. This would sacrifice flexibility for size. Although
> size probably does not matter, when meta data for entire columns are
> concerned. So the first approach is probably better.
>
> Felix
>
> On Mon, 6 Apr 2020 at 17:59, Francois Saint-Jacques 
> wrote:
>
> > It does make sense, I would go a little further and make this
> > field/property a single value of the same type as the array. This
> > would allow using any arbitrary sentinel value for unknown values (0
> > in your suggested case). The end result is zero-copy for R bindings
> > (if stars are aligned). I created ARROW-8348 [1] for this.
> >
> > François
> >
> > [1] https://jira.apache.org/jira/browse/ARROW-8348
> >
> > On Mon, Apr 6, 2020 at 11:02 AM Felix Benning 
> > wrote:
> > >
> > > Would it make sense to have an `na_are_zero` flag? Since null checking is
> > > not without cost, it might be helpful to some algorithms, if the content
> > > "underneath" the nulls is zero. For example in means, or scalar products
> > > and thus matrix multiplication, knowing that the array has zeros where
> > the
> > > na's are, would allow these algorithms to pretend that there are no na's.
> > > Since setting all nulls to zero in a matrix of n columns and n rows costs
> > > O(n^2), it would make sense to set them all to zero before matrix
> > > multiplication i.e. O(n^3) and similarly expensive algorithms. If there
> > was
> > > a `na_are_zero` flag, other algorithms could later utilize this work
> > > already being done. Algorithms which change the data and violate this
> > > contract, would only need to reset the flag. And in some use cases, it
> > > might be possible to use idle time of the computer to "clean up" the
> > na's,
> > > preparing for the next query.
> > >
> > > Felix
> > >
> > > -- Forwarded message -
> > > From: Wes McKinney 
> > > Date: Sun, 5 Apr 2020 at 22:31
> > > Subject: Re: Attn: Wes, Re: Masked Arrays
> > > To: 
> > >
> > >
> > > As I recall the contents "underneath" have been discussed before and
> > > the consensus was that the contents are not specified. If you'd like
> > > to make a proposal to change something I would suggest raising it on
> > > dev@arrow.apache.org
> > >
> > > On Sun, Apr 5, 2020 at 1:56 PM Felix Benning 
> > > wrote:
> > > >
> > > > Follow up: Do you think it would make sense to have an `na_are_zero`
> > > flag? Since it appears that the baseline (naively assuming there are no
> > > null values) is still a bit faster than equally optimized null value
> > > handling algorithms. So you might want to make the assumption, that all
> > > null values are set to zero in the array (instead of undefined). This
> > would
> > > allow for very fast means, scalar products and thus matrix multiplication
> > > which ignore nas. And in case of matrix multiplication, you might prefer
> > > sacrificing an O(n^2) effort to set all null entries to zero before
> > > multiplying. And assuming you do not overwrite this data, you would be
> > able
> > > to reuse that assumption in later computations with such a flag.
> > > > In some use cases, you might even be able to utilize unused computing
> > > resources for this task. I.e. clean up the nulls while the computer is
> > not
> > > used, preparing for the next query.
> > > >
> > > >
> > > > On Sun, 5 Apr 2020 at 18:34, Felix Benning 
> > > wrote:
> > > >>
> > > >> Awesome, that was exactly what I was looking for, thank you!
> > > >>
> > > >> On Sun, 5 Apr 2020 at 00:40, Wes McKinney 
> > wrote:
> > > >>>
> > > >>> I wrote a blog post a couple of years about this
> > > >>>
> > > >>> 
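The proposed na_are_zero flag is hypothetical, but the invariant it would
guarantee can be sketched with plain Python lists standing in for the value
buffer and validity bitmap:

```python
# Illustration of the proposed (hypothetical) na_are_zero invariant:
# `values` is the data buffer, `valid` the validity bitmap (True = not null).
values = [3.0, 0.0, 5.0, 0.0, 2.0]
valid  = [True, False, True, False, True]

# Without the invariant, a null-aware sum must consult the mask per element.
masked_sum = sum(v for v, ok in zip(values, valid) if ok)

# With na_are_zero guaranteed (every null slot holds 0), the mask can be
# ignored entirely: summation becomes a straight, vectorizable reduction.
fast_sum = sum(values)

assert masked_sum == fast_sum == 10.0
```

This is the O(n^2) zero-fill / O(n^3) matrix-multiply trade-off described
above: pay once to zero the null slots, then reuse the invariant in later
computations.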

[jira] [Created] (ARROW-8351) [R][CI] Store the Rtools-built Arrow C++ library as a build artifact

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8351:
--

 Summary: [R][CI] Store the Rtools-built Arrow C++ library as a 
build artifact
 Key: ARROW-8351
 URL: https://issues.apache.org/jira/browse/ARROW-8351
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neal Richardson
Assignee: Neal Richardson


To help with debugging unexplained segfaults.





[jira] [Created] (ARROW-8350) [Python] Implement to_numpy on ChunkedArray

2020-04-06 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-8350:
---

 Summary: [Python] Implement to_numpy on ChunkedArray
 Key: ARROW-8350
 URL: https://issues.apache.org/jira/browse/ARROW-8350
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe Korn


We support {{to_numpy}} on {{Array}} instances but not on {{ChunkedArray}}
instances. It would be quite useful to have it there as well, e.g. to support
returning non-nanosecond datetime instances.





Re: C interface clarifications

2020-04-06 Thread Todd Lipcon
On Mon, Apr 6, 2020 at 9:57 AM Antoine Pitrou  wrote:

>
> Hello Todd,
>
> Le 06/04/2020 à 18:18, Todd Lipcon a écrit :
> >
> > I had a couple questions / items that should be clarified in the spec.
> Wes
> > suggested I raise them here on dev@:
> >
> > *1) Should producers expect callers to zero-init structs?*
>
> IMO, they shouldn't.  They should fill the structure exhaustively.
> Though ultimately it's a decision made by the implementer of the
> producer API.
>
> > I suppose since it's the "C interface" it's
> > probably best to follow the C-style "producer assumes the argument
> contains
> > uninitialized memory" convention.
>
> Exactly.
>
> > *2) Clarify lifetime semantics for nested structures*
> >
> > > In my application, I'd like to allocate all of the children structures of
> > an ArrowSchema or ArrowArray out of a memory pool which is stored in the
> > private_data field of the top-level struct. As such, my initial
> > implementation was to make the 'release' callback on the top-level struct
> > delete that memory pool, and set the 'release' callback of all children
> > structs to null, since their memory was totally owned by the top-level
> > struct.
> >
> > I figured this approach was OK because the spec says:
> >
> >>  Consumers MUST call a base structure's release callback when they won't
> > be using it anymore, but they MUST not call any of its children's release
> > callbacks (including the optional dictionary). The producer is
> responsible
> > for releasing the children.
> >
> > That advice seems to indicate that I can do whatever I want with the
> > release callback of the children, including not setting it.
>
> ... Except that in this case, moving a child wouldn't be possible.
>
> > This section of the spec also seems to be a bit in conflict with the
> > following:
> >
> >> It is possible to move a child array, but the parent array MUST be
> > released immediately afterwards, as it won't point to a valid child array
> > anymore. This satisfies the use case of keeping only a subset of child
> > arrays, while releasing the others.
> >
> > ... because if you have a parent array which owns the memory referred to
> by
> > the child, then moving the child (with a no-op release callback) followed
> > by releasing the parent, you'll end up with an invalid or deallocated
> child
> > as well.
>
> I think the solution here is for your memory pool to be
> reference-counted, and have each release callback in the array tree
> decrement the reference count.  Does that sound reasonable to you?
>

Sure, I can do that. But I imagine consumers may also be a bit surprised by
this behavior, that releasing the children doesn't actually free up any
memory.

The spec should also probably cover thread-safety: if the consumer gets an
ArrowArray, is it safe to pass off the children to multiple threads and
have them call release() concurrently? In other words, do I need to use a
thread-safe reference count? I would guess so.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: C interface clarifications

2020-04-06 Thread Antoine Pitrou


Hello Todd,

Le 06/04/2020 à 18:18, Todd Lipcon a écrit :
> 
> I had a couple questions / items that should be clarified in the spec. Wes
> suggested I raise them here on dev@:
> 
> *1) Should producers expect callers to zero-init structs?*

IMO, they shouldn't.  They should fill the structure exhaustively.
Though ultimately it's a decision made by the implementer of the
producer API.

> I suppose since it's the "C interface" it's
> probably best to follow the C-style "producer assumes the argument contains
> uninitialized memory" convention.

Exactly.

> *2) Clarify lifetime semantics for nested structures*
> 
> > In my application, I'd like to allocate all of the children structures of
> an ArrowSchema or ArrowArray out of a memory pool which is stored in the
> private_data field of the top-level struct. As such, my initial
> implementation was to make the 'release' callback on the top-level struct
> delete that memory pool, and set the 'release' callback of all children
> structs to null, since their memory was totally owned by the top-level
> struct.
> 
> I figured this approach was OK because the spec says:
> 
>>  Consumers MUST call a base structure's release callback when they won't
> be using it anymore, but they MUST not call any of its children's release
> callbacks (including the optional dictionary). The producer is responsible
> for releasing the children.
> 
> That advice seems to indicate that I can do whatever I want with the
> release callback of the children, including not setting it.

... Except that in this case, moving a child wouldn't be possible.

> This section of the spec also seems to be a bit in conflict with the
> following:
> 
>> It is possible to move a child array, but the parent array MUST be
> released immediately afterwards, as it won't point to a valid child array
> anymore. This satisfies the use case of keeping only a subset of child
> arrays, while releasing the others.
> 
> ... because if you have a parent array which owns the memory referred to by
> the child, then moving the child (with a no-op release callback) followed
> by releasing the parent, you'll end up with an invalid or deallocated child
> as well.

I think the solution here is for your memory pool to be
reference-counted, and have each release callback in the array tree
decrement the reference count.  Does that sound reasonable to you?

Best regards

Antoine.


Re: Attn: Wes, Re: Masked Arrays

2020-04-06 Thread Felix Benning
In that case it is probably necessary to have a "has_sentinel" flag and a
"sentinel_value" variable, since other algorithms might benefit from not
having to set these values to zero; that is probably the reason why the
value "underneath" was left unspecified in the first place. Alternatively,
a "sentinel_enum" could specify whether the sentinel is 0 or the R
sentinel value. That would sacrifice flexibility for size, although
size probably does not matter when metadata for entire columns is
concerned. So the first approach is probably better.
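Purely as an illustration of the two alternatives being weighed (all names are hypothetical, not an Arrow API), the per-column metadata could be sketched as:

```cpp
#include <cassert>
#include <cstdint>

// Alternative 1: a flag plus an arbitrary sentinel value (more flexible).
struct SentinelInfo {
  bool has_sentinel = false;
  int64_t sentinel_value = 0;
};

// Alternative 2: an enum restricted to known conventions (smaller, less
// flexible); kRIntegerNA would stand for R's INT32_MIN integer NA.
enum class SentinelKind : uint8_t { kNone, kZero, kRIntegerNA };
```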

Felix

On Mon, 6 Apr 2020 at 17:59, Francois Saint-Jacques 
wrote:

> It does make sense, I would go a little further and make this
> field/property a single value of the same type than the array. This
> would allow using any arbitrary sentinel value for unknown values (0
> in your suggested case). The end result is zero-copy for R bindings
> (if stars are aligned). I created ARROW-8348 [1] for this.
>
> François
>
> [1] https://jira.apache.org/jira/browse/ARROW-8348
>
> On Mon, Apr 6, 2020 at 11:02 AM Felix Benning 
> wrote:
> >
> > Would it make sense to have an `na_are_zero` flag? Since null checking is
> > not without cost, it might be helpful to some algorithms, if the content
> > "underneath" the nulls is zero. For example in means, or scalar products
> > and thus matrix multiplication, knowing that the array has zeros where
> the
> > na's are, would allow these algorithms to pretend that there are no na's.
> > Since setting all nulls to zero in a matrix of n columns and n rows costs
> > O(n^2), it would make sense to set them all to zero before matrix
> > multiplication i.e. O(n^3) and similarly expensive algorithms. If there
> was
> > a `na_are_zero` flag, other algorithms could later utilize this work
> > already being done. Algorithms which change the data and violate this
> > contract, would only need to reset the flag. And in some use cases, it
> > might be possible to use idle time of the computer to "clean up" the
> na's,
> > preparing for the next query.
> >
> > Felix
> >
> > -- Forwarded message -
> > From: Wes McKinney 
> > Date: Sun, 5 Apr 2020 at 22:31
> > Subject: Re: Attn: Wes, Re: Masked Arrays
> > To: 
> >
> >
> > As I recall the contents "underneath" have been discussed before and
> > the consensus was that the contents are not specified. If you'd like
> > to make a proposal to change something I would suggest raising it on
> > dev@arrow.apache.org
> >
> > On Sun, Apr 5, 2020 at 1:56 PM Felix Benning 
> > wrote:
> > >
> > > Follow up: Do you think it would make sense to have an `na_are_zero`
> > flag? Since it appears that the baseline (naively assuming there are no
> > null values) is still a bit faster than equally optimized null value
> > handling algorithms. So you might want to make the assumption that all
> > null values are set to zero in the array (instead of undefined). This
> would
> > allow for very fast means, scalar products and thus matrix multiplication
> > which ignore nas. And in case of matrix multiplication, you might prefer
> > sacrificing an O(n^2) effort to set all null entries to zero before
> > multiplying. And assuming you do not overwrite this data, you would be
> able
> > to reuse that assumption in later computations with such a flag.
> > > In some use cases, you might even be able to utilize unused computing
> > resources for this task. I.e. clean up the nulls while the computer is
> not
> > used, preparing for the next query.
> > >
> > >
> > > On Sun, 5 Apr 2020 at 18:34, Felix Benning 
> > wrote:
> > >>
> > >> Awesome, that was exactly what I was looking for, thank you!
> > >>
> > >> On Sun, 5 Apr 2020 at 00:40, Wes McKinney 
> wrote:
> > >>>
> > >>> I wrote a blog post a couple of years ago about this
> > >>>
> > >>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> > >>>
> > >>> Pasha Stetsenko did a follow-up analysis that showed that my
> > >>> "sentinel" code could be significantly improved, see:
> > >>>
> > >>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
> > >>>
> > >>> Generally speaking in Apache Arrow we've been happy to have a uniform
> > >>> representation of nullness across all types, both primitive
> (booleans,
> > >>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
> > >>> computational operations (like elementwise functions) need not
> concern
> > >>> themselves with the nulls at all, for example, since the bitmap from
> > >>> the input array can be passed along (with zero copy even) to the
> > >>> output array.
> > >>>
> > >>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <
> felix.benn...@gmail.com>
> > wrote:
> > >>> >
> > >>> > Does anyone have an opinion (or links) about Bitpattern vs Masked
> > Arrays for NA implementations? There seems to have been a discussion
> about
> > that in the numpy community in 2012
> > https://numpy.org/neps/nep-0026-missing-data-summary.html without an
> > apparent result.
> > >>> >
> > 

C interface clarifications

2020-04-06 Thread Todd Lipcon
Hey folks,

I've started working on a patch to make Apache Kudu's C++ client able to
expose batches of data in Arrow's new C-style interface (
https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst
)

I had a couple questions / items that should be clarified in the spec. Wes
suggested I raise them here on dev@:

*1) Should producers expect callers to zero-init structs?*

The spec suggests that producers have an interface like:

Status Produce(ArrowArray* array) {
  ...
}

In the case of Arrow's own producer implementation, it doesn't assume that
'array' has been initialized in any way prior to this call, and the first
thing it does is zero the memory of 'array'. This is pretty standard
behavior in C-style APIs (e.g. stat(2) doesn't assume that its out-argument
is initialized in any way).

An alternate approach would be to assume that 'array' is in some valid
state, and call array->release() if it is non-null prior to filling in the
array with new data. This is a more C++-style API: in C++ it's rare to have
uninitialized structures floating around because constructors usually put
objects into some kind of valid state before the object gets passed
anywhere.

The answer here is probably "up to you", but might be good to have some
guidance here in the spec doc. I suppose since it's the "C interface" it's
probably best to follow the C-style "producer assumes the argument contains
uninitialized memory" convention.
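A minimal sketch of that C-style convention (stand-in struct, illustrative values, not actual Kudu or Arrow code): like stat(2), the producer overwrites the out-argument without ever reading its prior contents.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Minimal stand-in for the ArrowArray struct; the real members are elided.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

static void NoopRelease(struct ArrowArray* array) { array->release = nullptr; }

// C-style convention: the out-argument may contain arbitrary garbage, so the
// producer zeroes it first and fills it in, never reading what was there.
void Produce(struct ArrowArray* out) {
  std::memset(out, 0, sizeof(*out));
  out->length = 3;  // illustrative values only
  out->null_count = 0;
  out->release = NoopRelease;
}
```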

*2) Clarify lifetime semantics for nested structures*

In my application, I'd like to allocate all of the children structures of
an ArrowSchema or ArrowArray out of a memory pool which is stored in the
private_data field of the top-level struct. As such, my initial
implementation was to make the 'release' callback on the top-level struct
delete that memory pool, and set the 'release' callback of all children
structs to null, since their memory was totally owned by the top-level
struct.

I figured this approach was OK because the spec says:

>  Consumers MUST call a base structure's release callback when they won't
be using it anymore, but they MUST not call any of its children's release
callbacks (including the optional dictionary). The producer is responsible
for releasing the children.

That advice seems to indicate that I can do whatever I want with the
release callback of the children, including not setting it.

However, I found that arrow's ImportArray function would fail a check
because the child structures had no release callbacks set. I had to set the
release callbacks to a no-op function to work around this.

This section of the spec also seems to be a bit in conflict with the
following:

> It is possible to move a child array, but the parent array MUST be
released immediately afterwards, as it won't point to a valid child array
anymore. This satisfies the use case of keeping only a subset of child
arrays, while releasing the others.

... because if you have a parent array which owns the memory referred to by
the child, then moving the child (with a no-op release callback) followed
by releasing the parent, you'll end up with an invalid or deallocated child
as well.

In other words, I think the spec should be explicit that either:
(a) every allocated structure should "stand alone" and be individually
releasable (and thus moveable)
(b) a produced struct must have the same lifetime as all children.
Consumers should not release children, and if they release the original
base, all children are invalidated regardless of whether they have been
moved.
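For reference, a sketch of what "moving" a structure means in the C data interface (stand-in struct, not library code): a bitwise copy of the struct followed by marking the source released. This is why option (a) requires every structure, children included, to be releasable on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Minimal stand-in for the ArrowArray struct; the real members are elided.
struct ArrowArray {
  int64_t length;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

static void NoopRelease(struct ArrowArray* array) { array->release = nullptr; }

// Per the spec, a move copies the struct bitwise and marks the source as
// released; the destination then solely owns whatever the source owned.
void MoveArray(struct ArrowArray* src, struct ArrowArray* dst) {
  std::memcpy(dst, src, sizeof(struct ArrowArray));
  src->release = nullptr;
}
```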


Thanks
Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


[jira] [Created] (ARROW-8349) [CI][NIGHTLY:gandiva-jar-osx] Use latest pygit2

2020-04-06 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8349:
---

 Summary: [CI][NIGHTLY:gandiva-jar-osx] Use latest pygit2
 Key: ARROW-8349
 URL: https://issues.apache.org/jira/browse/ARROW-8349
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


Now that Homebrew provides a compatible libgit2 version, we can use the latest pygit2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Attn: Wes, Re: Masked Arrays

2020-04-06 Thread Francois Saint-Jacques
It does make sense, I would go a little further and make this
field/property a single value of the same type as the array. This
would allow using any arbitrary sentinel value for unknown values (0
in your suggested case). The end result is zero-copy for R bindings
(if stars are aligned). I created ARROW-8348 [1] for this.

François

[1] https://jira.apache.org/jira/browse/ARROW-8348

On Mon, Apr 6, 2020 at 11:02 AM Felix Benning  wrote:
>
> Would it make sense to have an `na_are_zero` flag? Since null checking is
> not without cost, it might be helpful to some algorithms, if the content
> "underneath" the nulls is zero. For example in means, or scalar products
> and thus matrix multiplication, knowing that the array has zeros where the
> na's are, would allow these algorithms to pretend that there are no na's.
> Since setting all nulls to zero in a matrix of n columns and n rows costs
> O(n^2), it would make sense to set them all to zero before matrix
> multiplication i.e. O(n^3) and similarly expensive algorithms. If there was
> a `na_are_zero` flag, other algorithms could later utilize this work
> already being done. Algorithms which change the data and violate this
> contract, would only need to reset the flag. And in some use cases, it
> might be possible to use idle time of the computer to "clean up" the na's,
> preparing for the next query.
>
> Felix
>
> -- Forwarded message -
> From: Wes McKinney 
> Date: Sun, 5 Apr 2020 at 22:31
> Subject: Re: Attn: Wes, Re: Masked Arrays
> To: 
>
>
> As I recall the contents "underneath" have been discussed before and
> the consensus was that the contents are not specified. If you'd like
> to make a proposal to change something I would suggest raising it on
> dev@arrow.apache.org
>
> On Sun, Apr 5, 2020 at 1:56 PM Felix Benning 
> wrote:
> >
> > Follow up: Do you think it would make sense to have an `na_are_zero`
> flag? Since it appears that the baseline (naively assuming there are no
> null values) is still a bit faster than equally optimized null value
> handling algorithms. So you might want to make the assumption that all
> null values are set to zero in the array (instead of undefined). This would
> allow for very fast means, scalar products and thus matrix multiplication
> which ignore nas. And in case of matrix multiplication, you might prefer
> sacrificing an O(n^2) effort to set all null entries to zero before
> multiplying. And assuming you do not overwrite this data, you would be able
> to reuse that assumption in later computations with such a flag.
> > In some use cases, you might even be able to utilize unused computing
> resources for this task. I.e. clean up the nulls while the computer is not
> used, preparing for the next query.
> >
> >
> > On Sun, 5 Apr 2020 at 18:34, Felix Benning 
> wrote:
> >>
> >> Awesome, that was exactly what I was looking for, thank you!
> >>
> >> On Sun, 5 Apr 2020 at 00:40, Wes McKinney  wrote:
> >>>
> >>> I wrote a blog post a couple of years ago about this
> >>>
> >>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >>>
> >>> Pasha Stetsenko did a follow-up analysis that showed that my
> >>> "sentinel" code could be significantly improved, see:
> >>>
> >>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
> >>>
> >>> Generally speaking in Apache Arrow we've been happy to have a uniform
> >>> representation of nullness across all types, both primitive (booleans,
> >>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
> >>> computational operations (like elementwise functions) need not concern
> >>> themselves with the nulls at all, for example, since the bitmap from
> >>> the input array can be passed along (with zero copy even) to the
> >>> output array.
> >>>
> >>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning 
> wrote:
> >>> >
> >>> > Does anyone have an opinion (or links) about Bitpattern vs Masked
> Arrays for NA implementations? There seems to have been a discussion about
> that in the numpy community in 2012
> https://numpy.org/neps/nep-0026-missing-data-summary.html without an
> apparent result.
> >>> >
> >>> > Summary of the Summary:
> >>> > - The Bitpattern approach reserves one bitpattern of any type as na,
> the only type not having spare bitpatterns are integers which means this
> decreases their range by one. This approach is taken by R and was regarded
> as more performant in 2012.
> >>> > - The Mask approach was deemed more flexible, since it would allow
> "degrees of missingness", and also cleaner/easier implementation.
> >>> >
> >>> > Since bitpattern checks would probably disrupt SIMD, I feel like some
> calculations (e.g. mean) would actually benefit more, from setting na
> values to zero, proceeding as if they were not there, and using the number
> of nas in the metadata to adjust the result. This of course does not work
> if two columns are used (e.g. scalar product), which is probably more
> important.
> >>> >
> >>> > Was using Bitmasks 

[jira] [Created] (ARROW-8348) [C++] Support optional sentinel values in primitive Array for nulls

2020-04-06 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8348:
-

 Summary: [C++] Support optional sentinel values in primitive Array 
for nulls
 Key: ARROW-8348
 URL: https://issues.apache.org/jira/browse/ARROW-8348
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


This is an optional feature where a sentinel value is stored in null cells and 
is exposed via an accessor method, e.g. `optional Array::HasSentinel() 
const;`. This would allow zero-copy bi-directional conversion with R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8347) [C++] Add Result to Array methods

2020-04-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8347:
-

 Summary: [C++] Add Result to Array methods
 Key: ARROW-8347
 URL: https://issues.apache.org/jira/browse/ARROW-8347
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield


Buffers, array builders (anything in the src/arrow root directory)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8346) [CI][Ruby] GLib/Ruby macOS build fails on zlib

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8346:
--

 Summary: [CI][Ruby] GLib/Ruby macOS build fails on zlib
 Key: ARROW-8346
 URL: https://issues.apache.org/jira/browse/ARROW-8346
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, GLib
Reporter: Neal Richardson
 Fix For: 0.17.0


See https://github.com/apache/arrow/runs/564610412 for example.

{code}
Using 'PKG_CONFIG_PATH' from environment with value: '/usr/local/lib/pkgconfig'
Run-time dependency gobject-2.0 found: YES 2.64.1
Run-time dependency gio-2.0 found: NO (tried framework and cmake)

c_glib/arrow-glib/meson.build:210:0: ERROR: Could not generate cargs for 
gio-2.0:
Package zlib was not found in the pkg-config search path.
Perhaps you should add the directory containing `zlib.pc'
to the PKG_CONFIG_PATH environment variable
Package 'zlib', required by 'gio-2.0', not found


A full log can be found at 
/Users/runner/runners/2.168.0/work/arrow/arrow/build/c_glib/meson-logs/meson-log.txt
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Fwd: Attn: Wes, Re: Masked Arrays

2020-04-06 Thread Felix Benning
Would it make sense to have an `na_are_zero` flag? Since null checking is
not without cost, it might be helpful to some algorithms, if the content
"underneath" the nulls is zero. For example in means, or scalar products
and thus matrix multiplication, knowing that the array has zeros where the
na's are, would allow these algorithms to pretend that there are no na's.
Since setting all nulls to zero in a matrix of n columns and n rows costs
O(n^2), it would make sense to set them all to zero before matrix
multiplication i.e. O(n^3) and similarly expensive algorithms. If there was
a `na_are_zero` flag, other algorithms could later utilize this work
already being done. Algorithms which change the data and violate this
contract, would only need to reset the flag. And in some use cases, it
might be possible to use idle time of the computer to "clean up" the na's,
preparing for the next query.
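As a minimal sketch of the optimization this flag would enable (not Arrow code; names are made up): with nulls guaranteed to be stored as zero, a null-ignoring mean needs no per-element null check, only the null count from metadata to adjust the divisor.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// With nulls guaranteed to be zero, the sum is a branch-free (and therefore
// SIMD-friendly) loop; the metadata null count corrects the divisor.
double MeanIgnoringNulls(const std::vector<double>& values, int64_t null_count) {
  double sum = 0.0;
  for (double v : values) sum += v;
  const int64_t valid = static_cast<int64_t>(values.size()) - null_count;
  return valid > 0 ? sum / static_cast<double>(valid) : 0.0;
}
```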

Felix

-- Forwarded message -
From: Wes McKinney 
Date: Sun, 5 Apr 2020 at 22:31
Subject: Re: Attn: Wes, Re: Masked Arrays
To: 


As I recall the contents "underneath" have been discussed before and
the consensus was that the contents are not specified. If you'd like
to make a proposal to change something I would suggest raising it on
dev@arrow.apache.org

On Sun, Apr 5, 2020 at 1:56 PM Felix Benning 
wrote:
>
> Follow up: Do you think it would make sense to have an `na_are_zero`
flag? Since it appears that the baseline (naively assuming there are no
null values) is still a bit faster than equally optimized null value
handling algorithms. So you might want to make the assumption that all
null values are set to zero in the array (instead of undefined). This would
allow for very fast means, scalar products and thus matrix multiplication
which ignore nas. And in case of matrix multiplication, you might prefer
sacrificing an O(n^2) effort to set all null entries to zero before
multiplying. And assuming you do not overwrite this data, you would be able
to reuse that assumption in later computations with such a flag.
> In some use cases, you might even be able to utilize unused computing
resources for this task. I.e. clean up the nulls while the computer is not
used, preparing for the next query.
>
>
> On Sun, 5 Apr 2020 at 18:34, Felix Benning 
wrote:
>>
>> Awesome, that was exactly what I was looking for, thank you!
>>
>> On Sun, 5 Apr 2020 at 00:40, Wes McKinney  wrote:
>>>
>>> I wrote a blog post a couple of years ago about this
>>>
>>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
>>>
>>> Pasha Stetsenko did a follow-up analysis that showed that my
>>> "sentinel" code could be significantly improved, see:
>>>
>>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
>>>
>>> Generally speaking in Apache Arrow we've been happy to have a uniform
>>> representation of nullness across all types, both primitive (booleans,
>>> numbers, or strings) and nested (lists, structs, unions, etc.). Many
>>> computational operations (like elementwise functions) need not concern
>>> themselves with the nulls at all, for example, since the bitmap from
>>> the input array can be passed along (with zero copy even) to the
>>> output array.
>>>
>>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning 
wrote:
>>> >
>>> > Does anyone have an opinion (or links) about Bitpattern vs Masked
Arrays for NA implementations? There seems to have been a discussion about
that in the numpy community in 2012
https://numpy.org/neps/nep-0026-missing-data-summary.html without an
apparent result.
>>> >
>>> > Summary of the Summary:
>>> > - The Bitpattern approach reserves one bitpattern of any type as na,
the only type not having spare bitpatterns are integers which means this
decreases their range by one. This approach is taken by R and was regarded
as more performant in 2012.
>>> > - The Mask approach was deemed more flexible, since it would allow
"degrees of missingness", and also cleaner/easier implementation.
>>> >
>>> > Since bitpattern checks would probably disrupt SIMD, I feel like some
calculations (e.g. mean) would actually benefit more, from setting na
values to zero, proceeding as if they were not there, and using the number
of nas in the metadata to adjust the result. This of course does not work
if two columns are used (e.g. scalar product), which is probably more
important.
>>> >
>>> > Was using Bitmasks in Arrow a conscious performance decision? Or was
the decision only based on the fact that R and Bitpattern implementations
in general are a niche, which means that Bitmasks are more compatible with
other languages?
>>> >
>>> > I am curious about this topic, since the "lack of proper na support"
was cited as the reason, why Python would never replace R in statistics.
>>> >
>>> > Thanks,
>>> >
>>> > Felix
>>> >
>>> >
>>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
>>> >
>>> > Note that pandas is starting to use a notion of "masked arrays" as
well, for example for its nullable integer data type, but also not using
the 

[jira] [Created] (ARROW-8345) [Python] feather.read_table should not require pandas

2020-04-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8345:


 Summary: [Python] feather.read_table should not require pandas
 Key: ARROW-8345
 URL: https://issues.apache.org/jira/browse/ARROW-8345
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0


We still check the pandas version, while pandas is not actually needed. Will do 
a quick fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Preparing for 0.17.0 Arrow release

2020-04-06 Thread Antoine Pitrou


Also nice to have perhaps (PR available and several back-and-forths
already):

* ARROW-7610: [Java] Finish support for 64 bit int allocations

Needs a Java committer to decide...

Regards

Antoine.


On 06/04/2020 at 00:24, Wes McKinney wrote:
> We are getting close to the 0.17.0 endgame.
> 
> Here are the 18 JIRAs still in the 0.17.0 milestone. There are a few
> issues without patches yet so we should decide quickly whether they
> need to be included. Are there any blocking issues not accounted for in
> the milestone?
> 
> * ARROW-6947 [Rust] [DataFusion] Add support for scalar UDFs
> 
> Patch available
> 
> * ARROW-7794 [Rust] cargo publish fails for arrow-flight due to
> relative path to Flight.proto
> 
> No patch yet
> 
> * ARROW-7222 [Python][Release] Wipe any existing generated Python API
> documentation when updating website
> 
> This issue needs to be addressed by the release manager and the
> Confluence instructions must be updated.
> 
> * ARROW-7891 [C++] RecordBatch->Equals should also have a
> check_metadata argument
> 
> Patch available that needs to be reviewed and approved
> 
> * ARROW-8164: [C++][Dataset] Let datasets be viewable with non-identical 
> schema
> 
> Patch available, but failures to be resolved
> 
> * ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse
> 
> Depends on ARROW-8164, will require rebase
> 
> * ARROW-8039: [Python][Dataset] Support using dataset API in
> pyarrow.parquet with a minimal ParquetDataset shim
> 
> Patch pending
> 
> * ARROW-8047: [Python][Documentation] Document migration from
> ParquetDataset to pyarrow.datasets
> 
> May be tackled beyond 0.17.0
> 
> * ARROW-8063: [Python] Add user guide documentation for Datasets API
> 
> May be tackled beyond 0.17.0
> 
> * ARROW-8149 [C++/Python] Enable CUDA Support in conda recipes
> 
> Does not seem strictly necessary for release, since a packaging issue
> 
> * ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors
> 
> Patch available, but needs review. May
> 
> * ARROW-8213: [Python][Dataset] Opening a dataset with a local
> incorrect path gives confusing error message
> 
> Nice to have, but not essential
> 
> * ARROW-8266: [C++] Add backup mirrors for external project source downloads
> 
> Patch available, nice to have
> 
> * ARROW-8275 [Python][Docs] Review Feather + IPC file documentation
> per "Feather V2" changes
> 
> Patch available
> 
> * ARROW-8300 [R] Documentation and changelog updates for 0.17
> 
> Patch available
> 
> * ARROW-8320 [Documentation][Format] Clarify (lack of) alignment
> requirements in C data interface
> 
> Patch available
> 
> * ARROW-8330: [Documentation] The post release script generates the
> documentation with a development version
> 
> Patch available
> 
> * ARROW-8335: [Release] Add crossbow jobs to run release verification
> 
> Patch in progress
> 
> On Tue, Mar 31, 2020 at 11:23 PM Fan Liya  wrote:
>>
>> I see ARROW-6871 in the list.
>> It seems it has some bugs, which are being fixed by ARROW-8239.
>> So I have added ARROW-8239 to the list.
>>
>> The PR for ARROW-8239 is already approved, so it is expected to be resolved
>> soon.
>>
>> Best,
>> Liya Fan
>>
>> On Wed, Apr 1, 2020 at 12:01 PM Micah Kornfield 
>> wrote:
>>
>>> I moved the Java issues out of 0.17.0; they seem either too complex or not of
>>> enough significance to make them blockers for the 0.17.0 release.  If owners of
>>> the issues disagree, please move them back in.
>>>
>>> On Tue, Mar 31, 2020 at 6:05 PM Wes McKinney  wrote:
>>>
 We've made good progress, but there are still 35 issues in the
 backlog. Some of them are documentation related, but there are some
 functionality-related patches that could be at risk. If all could
 review again to trim out anything that isn't going to make the cut for
 0.17.0, please do

 On Wed, Mar 25, 2020 at 2:39 PM Andy Grove 
>>> wrote:
>
> I just took a first pass at reviewing the Java and Rust issues and
 removed
> some from the 0.17.0 release. There are a few small Rust issues that I
>>> am
> actively working on for this release.
>
> Thanks.
>
>
> On Wed, Mar 25, 2020 at 1:13 PM Wes McKinney 
 wrote:
>
>> hi Neal,
>>
>> Thanks for helping coordinate. I agree we should be in a position to
>> release sometime next week.
>>
>> Can folks from the Rust and Java side review issues in the backlog?
>> According to the dashboard there are 19 Rust issues open and 7 Java
>> issues.
>>
>> Thanks
>>
>> On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson
>>  wrote:
>>>
>>> Hi all,
>>> A few weeks ago, there seemed to be consensus (lazy, at least) for
>>> a
 0.17
>>> release at the end of the month. Judging from
>>>
 https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.17.0+Release,
>> it
>>> looks like we're getting closer.
>>>
>>> I'd encourage everyone to review their backlogs and 

Re: Preparing for 0.17.0 Arrow release

2020-04-06 Thread Antoine Pitrou


Hi,

I added the following issue to the cpp-1.6.0 milestone:

* PARQUET-1835 [C++] Fix crashes on invalid input (OSS-Fuzz)

There's a PR up for it and it's simple enough to be validated quickly, IMHO.

Regards

Antoine.


On 06/04/2020 at 00:24, Wes McKinney wrote:
> We are getting close to the 0.17.0 endgame.
> 
> Here are the 18 JIRAs still in the 0.17.0 milestone. There are a few
> issues without patches yet so we should decide quickly whether they
> need to be included. Are there any blocking issues not accounted for in
> the milestone?
> 
> * ARROW-6947 [Rust] [DataFusion] Add support for scalar UDFs
> 
> Patch available
> 
> * ARROW-7794 [Rust] cargo publish fails for arrow-flight due to
> relative path to Flight.proto
> 
> No patch yet
> 
> * ARROW-7222 [Python][Release] Wipe any existing generated Python API
> documentation when updating website
> 
> This issue needs to be addressed by the release manager and the
> Confluence instructions must be updated.
> 
> * ARROW-7891 [C++] RecordBatch->Equals should also have a
> check_metadata argument
> 
> Patch available that needs to be reviewed and approved
> 
> * ARROW-8164: [C++][Dataset] Let datasets be viewable with non-identical 
> schema
> 
> Patch available, but failures to be resolved
> 
> * ARROW-7965: [Python] Hold a reference to the dataset factory for later reuse
> 
> Depends on ARROW-8164, will require rebase
> 
> * ARROW-8039: [Python][Dataset] Support using dataset API in
> pyarrow.parquet with a minimal ParquetDataset shim
> 
> Patch pending
> 
> * ARROW-8047: [Python][Documentation] Document migration from
> ParquetDataset to pyarrow.datasets
> 
> May be tackled beyond 0.17.0
> 
> * ARROW-8063: [Python] Add user guide documentation for Datasets API
> 
> May be tackled beyond 0.17.0
> 
> * ARROW-8149 [C++/Python] Enable CUDA Support in conda recipes
> 
> Does not seem strictly necessary for release, since a packaging issue
> 
> * ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors
> 
> Patch available, but needs review. May
> 
> * ARROW-8213: [Python][Dataset] Opening a dataset with a local
> incorrect path gives confusing error message
> 
> Nice to have, but not essential
> 
> * ARROW-8266: [C++] Add backup mirrors for external project source downloads
> 
> Patch available, nice to have
> 
> * ARROW-8275 [Python][Docs] Review Feather + IPC file documentation
> per "Feather V2" changes
> 
> Patch available
> 
> * ARROW-8300 [R] Documentation and changelog updates for 0.17
> 
> Patch available
> 
> * ARROW-8320 [Documentation][Format] Clarify (lack of) alignment
> requirements in C data interface
> 
> Patch available
> 
> * ARROW-8330: [Documentation] The post release script generates the
> documentation with a development version
> 
> Patch available
> 
> * ARROW-8335: [Release] Add crossbow jobs to run release verification
> 
> Patch in progress
> 
> On Tue, Mar 31, 2020 at 11:23 PM Fan Liya  wrote:
>>
>> I see ARROW-6871 in the list.
>> It seems it has some bugs, which are being fixed by ARROW-8239.
>> So I have added ARROW-8239 to the list.
>>
>> The PR for ARROW-8239 is already approved, so it is expected to be resolved
>> soon.
>>
>> Best,
>> Liya Fan
>>
>> On Wed, Apr 1, 2020 at 12:01 PM Micah Kornfield 
>> wrote:
>>
>>> I moved the Java issues out of 0.17.0; they seem either too complex or not of
>>> enough significance to make them blockers for the 0.17.0 release.  If owners of
>>> the issues disagree, please move them back in.
>>>
>>> On Tue, Mar 31, 2020 at 6:05 PM Wes McKinney  wrote:
>>>
 We've made good progress, but there are still 35 issues in the
 backlog. Some of them are documentation related, but there are some
 functionality-related patches that could be at risk. If all could
 review again to trim out anything that isn't going to make the cut for
 0.17.0, please do

 On Wed, Mar 25, 2020 at 2:39 PM Andy Grove 
>>> wrote:
>
> I just took a first pass at reviewing the Java and Rust issues and
 removed
> some from the 0.17.0 release. There are a few small Rust issues that I
>>> am
> actively working on for this release.
>
> Thanks.
>
>
> On Wed, Mar 25, 2020 at 1:13 PM Wes McKinney 
 wrote:
>
>> hi Neal,
>>
>> Thanks for helping coordinate. I agree we should be in a position to
>> release sometime next week.
>>
>> Can folks from the Rust and Java side review issues in the backlog?
>> According to the dashboard there are 19 Rust issues open and 7 Java
>> issues.
>>
>> Thanks
>>
>> On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson
>>  wrote:
>>>
>>> Hi all,
>>> A few weeks ago, there seemed to be consensus (lazy, at least) for
>>> a
 0.17
>>> release at the end of the month. Judging from
>>>
 https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.17.0+Release,
>> it
>>> looks like we're getting closer.
>>>
>>> I'd encourage everyone to 

Re: Preparing for 0.17.0 Arrow release

2020-04-06 Thread Wes McKinney
That may be so. If we do partially revert it (the dict return value is
probably the only thing that needs to be changed), we need to get the
downstream libraries to make changes to allow us to make this change.
Another option is returning the KV wrapper via another attribute.

On Mon, Apr 6, 2020, 3:06 AM Antoine Pitrou  wrote:

>
> Hmm, if downstream libraries were expecting a dict, perhaps we'll need
> to revert that change...
>
> Regards
>
> Antoine.
>
>
> Le 06/04/2020 à 08:50, Joris Van den Bossche a écrit :
> > We also have a recent regression related to the KeyValueMetadata wrapping
> > in Python that is causing failures in downstream libraries; it seems like a
> > blocker for the release:
> https://issues.apache.org/jira/browse/ARROW-8342
> >
> > On Mon, 6 Apr 2020 at 00:25, Wes McKinney  wrote:
> >
> >> We are getting close to the 0.17.0 endgame.
> >>
> >> Here are the 18 JIRAs still in the 0.17.0 milestone. There are a few
> >> issues without patches yet so we should decide quickly whether they
> >> need to be included. Are there any blocking issues not accounted for in
> >> the milestone?
> >>
> >> * ARROW-6947 [Rust] [DataFusion] Add support for scalar UDFs
> >>
> >> Patch available
> >>
> >> * ARROW-7794 [Rust] cargo publish fails for arrow-flight due to
> >> relative path to Flight.proto
> >>
> >> No patch yet
> >>
> >> * ARROW-7222 [Python][Release] Wipe any existing generated Python API
> >> documentation when updating website
> >>
> >> This issue needs to be addressed by the release manager and the
> >> Confluence instructions must be updated.
> >>
> >> * ARROW-7891 [C++] RecordBatch->Equals should also have a
> >> check_metadata argument
> >>
> >> Patch available that needs to be reviewed and approved
> >>
> >> * ARROW-8164: [C++][Dataset] Let datasets be viewable with non-identical
> >> schema
> >>
> >> Patch available, but failures to be resolved
> >>
> >> * ARROW-7965: [Python] Hold a reference to the dataset factory for later
> >> reuse
> >>
> >> Depends on ARROW-8164, will require rebase
> >>
> >> * ARROW-8039: [Python][Dataset] Support using dataset API in
> >> pyarrow.parquet with a minimal ParquetDataset shim
> >>
> >> Patch pending
> >>
> >> * ARROW-8047: [Python][Documentation] Document migration from
> >> ParquetDataset to pyarrow.datasets
> >>
> >> May be tackled beyond 0.17.0
> >>
> >> * ARROW-8063: [Python] Add user guide documentation for Datasets API
> >>
> >> May be tackled beyond 0.17.0
> >>
> >> * ARROW-8149 [C++/Python] Enable CUDA Support in conda recipes
> >>
> >> Does not seem strictly necessary for release, since it is a packaging issue
> >>
> >> * ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors
> >>
> >> Patch available, but needs review. May
> >>
> >> * ARROW-8213: [Python][Dataset] Opening a dataset with a local
> >> incorrect path gives confusing error message
> >>
> >> Nice to have, but not essential
> >>
> >> * ARROW-8266: [C++] Add backup mirrors for external project source
> >> downloads
> >>
> >> Patch available, nice to have
> >>
> >> * ARROW-8275 [Python][Docs] Review Feather + IPC file documentation
> >> per "Feather V2" changes
> >>
> >> Patch available
> >>
> >> * ARROW-8300 [R] Documentation and changelog updates for 0.17
> >>
> >> Patch available
> >>
> >> * ARROW-8320 [Documentation][Format] Clarify (lack of) alignment
> >> requirements in C data interface
> >>
> >> Patch available
> >>
> >> * ARROW-8330: [Documentation] The post release script generates the
> >> documentation with a development version
> >>
> >> Patch available
> >>
> >> * ARROW-8335: [Release] Add crossbow jobs to run release verification
> >>
> >> Patch in progress
> >>
> >> On Tue, Mar 31, 2020 at 11:23 PM Fan Liya  wrote:
> >>>
> >>> I see ARROW-6871 in the list.
> >>> It seems it has some bugs, which are being fixed by ARROW-8239.
> >>> So I have added ARROW-8239 to the list.
> >>>
> >>> The PR for ARROW-8239 is already approved, so it is expected to be
> >> resolved
> >>> soon.
> >>>
> >>> Best,
> >>> Liya Fan
> >>>
> >>> On Wed, Apr 1, 2020 at 12:01 PM Micah Kornfield  >
> >>> wrote:
> >>>
>  I moved the Java issues out of 0.17.0; they seem either too complex or not
>  significant enough to be blockers for the 0.17.0 release.  If owners of
>  the issues disagree, please move them back in.
> 
>  On Tue, Mar 31, 2020 at 6:05 PM Wes McKinney 
> >> wrote:
> 
> > We've made good progress, but there are still 35 issues in the
> > backlog. Some of them are documentation related, but there are some
> > functionality-related patches that could be at risk. If all could
> > review again to trim out anything that isn't going to make the cut
> >> for
> > 0.17.0, please do
> >
> > On Wed, Mar 25, 2020 at 2:39 PM Andy Grove 
>  wrote:
> >>
> >> I just took a first pass at reviewing the Java and Rust issues and
> > removed
> >> some from the 0.17.0 release. There are a few small Rust issues
> >> 

[jira] [Created] (ARROW-8344) [C#] StringArray.Builder.Clear() corrupts subsequent array contents

2020-04-06 Thread Adam Szmigin (Jira)
Adam Szmigin created ARROW-8344:
---

 Summary: [C#] StringArray.Builder.Clear() corrupts subsequent 
array contents
 Key: ARROW-8344
 URL: https://issues.apache.org/jira/browse/ARROW-8344
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Affects Versions: 0.16.0
 Environment: Windows 10 x64
Reporter: Adam Szmigin


h1. Summary

Using the {{Clear()}} method on a {{StringArray.Builder}} causes all 
subsequently built arrays to contain strings consisting solely of whitespace.  
The below minimal example illustrates:
{code:java}
namespace ArrowStringArrayBuilderBug
{
using Apache.Arrow;
using Apache.Arrow.Memory;

public class Program
{
private static readonly NativeMemoryAllocator Allocator
= new NativeMemoryAllocator();

public static void Main()
{
var builder = new StringArray.Builder();
AppendBuildPrint(builder, "Hello", "World");
builder.Clear();
AppendBuildPrint(builder, "Foo", "Bar");
}

private static void AppendBuildPrint(
StringArray.Builder builder, params string[] strings)
{
foreach (var elem in strings)
builder.Append(elem);

var arr = builder.Build(Allocator);
System.Console.Write("Array contents: [");
for (var i = 0; i < arr.Length; i++)
{
if (i > 0) System.Console.Write(", ");
System.Console.Write($"'{arr.GetString(i)}'");
}
System.Console.WriteLine("]");
}
}
{code}
h2. Expected Output
{noformat}
Array contents: ['Hello', 'World']
Array contents: ['Foo', 'Bar']
{noformat}
h2. Actual Output
{noformat}
Array contents: ['Hello', 'World']
Array contents: ['   ', '   ']
{noformat}
h1. Workaround

The bug can be trivially worked around by constructing a new 
{{StringArray.Builder}} instead of calling {{Clear()}}.
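
As a rough sketch (assuming the same {{Apache.Arrow}} API and {{Allocator}} as in 
the repro above; not verified against the library), the workaround looks like:
{code:java}
// Sketch of the workaround: discard the builder and construct a fresh
// one instead of calling Clear(). Uses the same Allocator as the repro.
var builder = new StringArray.Builder();
builder.Append("Hello");
builder.Append("World");
var first = builder.Build(Allocator);

// Instead of builder.Clear(), start over with a new builder:
builder = new StringArray.Builder();
builder.Append("Foo");
builder.Append("Bar");
var second = builder.Build(Allocator);  // contents are correct this time
{code}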

The issue ARROW-7040 mentions other issues with string arrays in C#, but I'm 
not sure if this is related or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8343) [GLib] Add GArrowRecordBatchIterator

2020-04-06 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-8343:
---

 Summary: [GLib] Add GArrowRecordBatchIterator
 Key: ARROW-8343
 URL: https://issues.apache.org/jira/browse/ARROW-8343
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata








Re: Preparing for 0.17.0 Arrow release

2020-04-06 Thread Antoine Pitrou


Hmm, if downstream libraries were expecting a dict, perhaps we'll need
to revert that change...

Regards

Antoine.


Le 06/04/2020 à 08:50, Joris Van den Bossche a écrit :
> We also have a recent regression related to the KeyValueMetadata wrapping
> in Python that is causing failures in downstream libraries; it seems like a
> blocker for the release: https://issues.apache.org/jira/browse/ARROW-8342
> 
> On Mon, 6 Apr 2020 at 00:25, Wes McKinney  wrote:
> 
>> We are getting close to the 0.17.0 endgame.
>>
>> Here are the 18 JIRAs still in the 0.17.0 milestone. There are a few
>> issues without patches yet so we should decide quickly whether they
>> need to be included. Are there any blocking issues not accounted for in
>> the milestone?
>>
>> * ARROW-6947 [Rust] [DataFusion] Add support for scalar UDFs
>>
>> Patch available
>>
>> * ARROW-7794 [Rust] cargo publish fails for arrow-flight due to
>> relative path to Flight.proto
>>
>> No patch yet
>>
>> * ARROW-7222 [Python][Release] Wipe any existing generated Python API
>> documentation when updating website
>>
>> This issue needs to be addressed by the release manager and the
>> Confluence instructions must be updated.
>>
>> * ARROW-7891 [C++] RecordBatch->Equals should also have a
>> check_metadata argument
>>
>> Patch available that needs to be reviewed and approved
>>
>> * ARROW-8164: [C++][Dataset] Let datasets be viewable with non-identical
>> schema
>>
>> Patch available, but failures to be resolved
>>
>> * ARROW-7965: [Python] Hold a reference to the dataset factory for later
>> reuse
>>
>> Depends on ARROW-8164, will require rebase
>>
>> * ARROW-8039: [Python][Dataset] Support using dataset API in
>> pyarrow.parquet with a minimal ParquetDataset shim
>>
>> Patch pending
>>
>> * ARROW-8047: [Python][Documentation] Document migration from
>> ParquetDataset to pyarrow.datasets
>>
>> May be tackled beyond 0.17.0
>>
>> * ARROW-8063: [Python] Add user guide documentation for Datasets API
>>
>> May be tackled beyond 0.17.0
>>
>> * ARROW-8149 [C++/Python] Enable CUDA Support in conda recipes
>>
>> Does not seem strictly necessary for release, since it is a packaging issue
>>
>> * ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors
>>
>> Patch available, but needs review. May
>>
>> * ARROW-8213: [Python][Dataset] Opening a dataset with a local
>> incorrect path gives confusing error message
>>
>> Nice to have, but not essential
>>
>> * ARROW-8266: [C++] Add backup mirrors for external project source
>> downloads
>>
>> Patch available, nice to have
>>
>> * ARROW-8275 [Python][Docs] Review Feather + IPC file documentation
>> per "Feather V2" changes
>>
>> Patch available
>>
>> * ARROW-8300 [R] Documentation and changelog updates for 0.17
>>
>> Patch available
>>
>> * ARROW-8320 [Documentation][Format] Clarify (lack of) alignment
>> requirements in C data interface
>>
>> Patch available
>>
>> * ARROW-8330: [Documentation] The post release script generates the
>> documentation with a development version
>>
>> Patch available
>>
>> * ARROW-8335: [Release] Add crossbow jobs to run release verification
>>
>> Patch in progress
>>
>> On Tue, Mar 31, 2020 at 11:23 PM Fan Liya  wrote:
>>>
>>> I see ARROW-6871 in the list.
>>> It seems it has some bugs, which are being fixed by ARROW-8239.
>>> So I have added ARROW-8239 to the list.
>>>
>>> The PR for ARROW-8239 is already approved, so it is expected to be
>> resolved
>>> soon.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> On Wed, Apr 1, 2020 at 12:01 PM Micah Kornfield 
>>> wrote:
>>>
 I moved the Java issues out of 0.17.0; they seem either too complex or not
 significant enough to be blockers for the 0.17.0 release.  If owners of
 the issues disagree, please move them back in.

 On Tue, Mar 31, 2020 at 6:05 PM Wes McKinney 
>> wrote:

> We've made good progress, but there are still 35 issues in the
> backlog. Some of them are documentation related, but there are some
> functionality-related patches that could be at risk. If all could
> review again to trim out anything that isn't going to make the cut
>> for
> 0.17.0, please do
>
> On Wed, Mar 25, 2020 at 2:39 PM Andy Grove 
 wrote:
>>
>> I just took a first pass at reviewing the Java and Rust issues and
> removed
>> some from the 0.17.0 release. There are a few small Rust issues
>> that I
 am
>> actively working on for this release.
>>
>> Thanks.
>>
>>
>> On Wed, Mar 25, 2020 at 1:13 PM Wes McKinney 
> wrote:
>>
>>> hi Neal,
>>>
>>> Thanks for helping coordinate. I agree we should be in a
>> position to
>>> release sometime next week.
>>>
>>> Can folks from the Rust and Java side review issues in the
>> backlog?
>>> According to the dashboard there are 19 Rust issues open and 7
>> Java
>>> issues.
>>>
>>> Thanks
>>>
>>> On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson
>>>  wrote:

 Hi all,

Re: Preparing for 0.17.0 Arrow release

2020-04-06 Thread Joris Van den Bossche
We also have a recent regression related to the KeyValueMetadata wrapping
in Python that is causing failures in downstream libraries; it seems like a
blocker for the release: https://issues.apache.org/jira/browse/ARROW-8342

On Mon, 6 Apr 2020 at 00:25, Wes McKinney  wrote:

> We are getting close to the 0.17.0 endgame.
>
> Here are the 18 JIRAs still in the 0.17.0 milestone. There are a few
> issues without patches yet so we should decide quickly whether they
> need to be included. Are there any blocking issues not accounted for in
> the milestone?
>
> * ARROW-6947 [Rust] [DataFusion] Add support for scalar UDFs
>
> Patch available
>
> * ARROW-7794 [Rust] cargo publish fails for arrow-flight due to
> relative path to Flight.proto
>
> No patch yet
>
> * ARROW-7222 [Python][Release] Wipe any existing generated Python API
> documentation when updating website
>
> This issue needs to be addressed by the release manager and the
> Confluence instructions must be updated.
>
> * ARROW-7891 [C++] RecordBatch->Equals should also have a
> check_metadata argument
>
> Patch available that needs to be reviewed and approved
>
> * ARROW-8164: [C++][Dataset] Let datasets be viewable with non-identical
> schema
>
> Patch available, but failures to be resolved
>
> * ARROW-7965: [Python] Hold a reference to the dataset factory for later
> reuse
>
> Depends on ARROW-8164, will require rebase
>
> * ARROW-8039: [Python][Dataset] Support using dataset API in
> pyarrow.parquet with a minimal ParquetDataset shim
>
> Patch pending
>
> * ARROW-8047: [Python][Documentation] Document migration from
> ParquetDataset to pyarrow.datasets
>
> May be tackled beyond 0.17.0
>
> * ARROW-8063: [Python] Add user guide documentation for Datasets API
>
> May be tackled beyond 0.17.0
>
> * ARROW-8149 [C++/Python] Enable CUDA Support in conda recipes
>
> Does not seem strictly necessary for release, since it is a packaging issue
>
> * ARROW-8162: [Format][Python] Add serialization for CSF sparse tensors
>
> Patch available, but needs review. May
>
> * ARROW-8213: [Python][Dataset] Opening a dataset with a local
> incorrect path gives confusing error message
>
> Nice to have, but not essential
>
> * ARROW-8266: [C++] Add backup mirrors for external project source
> downloads
>
> Patch available, nice to have
>
> * ARROW-8275 [Python][Docs] Review Feather + IPC file documentation
> per "Feather V2" changes
>
> Patch available
>
> * ARROW-8300 [R] Documentation and changelog updates for 0.17
>
> Patch available
>
> * ARROW-8320 [Documentation][Format] Clarify (lack of) alignment
> requirements in C data interface
>
> Patch available
>
> * ARROW-8330: [Documentation] The post release script generates the
> documentation with a development version
>
> Patch available
>
> * ARROW-8335: [Release] Add crossbow jobs to run release verification
>
> Patch in progress
>
> On Tue, Mar 31, 2020 at 11:23 PM Fan Liya  wrote:
> >
> > I see ARROW-6871 in the list.
> > It seems it has some bugs, which are being fixed by ARROW-8239.
> > So I have added ARROW-8239 to the list.
> >
> > The PR for ARROW-8239 is already approved, so it is expected to be
> resolved
> > soon.
> >
> > Best,
> > Liya Fan
> >
> > On Wed, Apr 1, 2020 at 12:01 PM Micah Kornfield 
> > wrote:
> >
> > > I moved the Java issues out of 0.17.0; they seem either too complex or not
> > > significant enough to be blockers for the 0.17.0 release.  If owners of
> > > the issues disagree, please move them back in.
> > >
> > > On Tue, Mar 31, 2020 at 6:05 PM Wes McKinney 
> wrote:
> > >
> > > > We've made good progress, but there are still 35 issues in the
> > > > backlog. Some of them are documentation related, but there are some
> > > > functionality-related patches that could be at risk. If all could
> > > > review again to trim out anything that isn't going to make the cut
> for
> > > > 0.17.0, please do
> > > >
> > > > On Wed, Mar 25, 2020 at 2:39 PM Andy Grove 
> > > wrote:
> > > > >
> > > > > I just took a first pass at reviewing the Java and Rust issues and
> > > > removed
> > > > > some from the 0.17.0 release. There are a few small Rust issues
> that I
> > > am
> > > > > actively working on for this release.
> > > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > > On Wed, Mar 25, 2020 at 1:13 PM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > hi Neal,
> > > > > >
> > > > > > Thanks for helping coordinate. I agree we should be in a
> position to
> > > > > > release sometime next week.
> > > > > >
> > > > > > Can folks from the Rust and Java side review issues in the
> backlog?
> > > > > > According to the dashboard there are 19 Rust issues open and 7
> Java
> > > > > > issues.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Tue, Mar 24, 2020 at 10:01 AM Neal Richardson
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > > A few weeks ago, there seemed to be consensus (lazy, at least)
> for
> > > a
> > > > 0.17
> > > > > > > release at the end of the month. Judging from
> > > 

[jira] [Created] (ARROW-8342) [Python] dask and kartothek integration tests are failing

2020-04-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8342:


 Summary: [Python] dask and kartothek integration tests are failing
 Key: ARROW-8342
 URL: https://issues.apache.org/jira/browse/ARROW-8342
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


The integration tests for both dask and kartothek, for both their master and 
latest released versions, started failing over the last few days.

Dask latest: 
https://circleci.com/gh/ursa-labs/crossbow/10629?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link
 
Kartothek latest: 
https://circleci.com/gh/ursa-labs/crossbow/10604?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link

I think both are related to the KeyValueMetadata changes (ARROW-8079).

The kartothek one is clearly related, as it gives: TypeError: 
'pyarrow.lib.KeyValueMetadata' object does not support item assignment

And I think the dask one is related to the "pandas" key now being present 
twice, and therefore it is using the "wrong" one.






Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-04-05-0

2020-04-06 Thread Joris Van den Bossche
I opened https://issues.apache.org/jira/browse/ARROW-8342 for the
dask/kartothek integration failures.

On Mon, 6 Apr 2020 at 02:54, Crossbow  wrote:

>
> Arrow Build Report for Job nightly-2020-04-05-0
>
> All tasks:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0
>
> Failed Tasks:
> - gandiva-jar-osx:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-travis-gandiva-jar-osx
> - gandiva-jar-trusty:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-travis-gandiva-jar-trusty
> - test-conda-python-3.6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.6
> - test-conda-python-3.7-dask-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.7-dask-latest
> - test-conda-python-3.7-kartothek-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.7-kartothek-latest
> - test-conda-python-3.7-kartothek-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.7-kartothek-master
> - test-conda-python-3.7-turbodbc-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.7-turbodbc-latest
> - test-conda-python-3.7-turbodbc-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.7-turbodbc-master
> - test-conda-python-3.8-dask-master:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.8-dask-master
> - test-ubuntu-18.04-docs:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-ubuntu-18.04-docs
> - ubuntu-eoan:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-github-ubuntu-eoan
> - ubuntu-focal:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-github-ubuntu-focal
>
> Succeeded Tasks:
> - centos-6:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-github-centos-6
> - centos-7:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-github-centos-7
> - centos-8:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-github-centos-8
> - conda-linux-gcc-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-linux-gcc-py37
> - conda-linux-gcc-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-linux-gcc-py38
> - conda-osx-clang-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-osx-clang-py38
> - conda-win-vs2015-py36:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-win-vs2015-py36
> - conda-win-vs2015-py37:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-win-vs2015-py37
> - conda-win-vs2015-py38:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-azure-conda-win-vs2015-py38
> - debian-buster:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-github-debian-buster
> - debian-stretch:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-github-debian-stretch
> - homebrew-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-travis-homebrew-cpp
> - macos-r-autobrew:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-travis-macos-r-autobrew
> - test-conda-cpp-valgrind:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-cpp-valgrind
> - test-conda-cpp:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-cpp
> - test-conda-python-3.7-hdfs-2.9.2:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.7-hdfs-2.9.2
> - test-conda-python-3.7-pandas-latest:
>   URL:
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-05-0-circle-test-conda-python-3.7-pandas-latest
> - 

Re: building arrow with CMake 2.8 on CentOS

2020-04-06 Thread Sutou Kouhei
Hi,

We don't support CMake 2.8. Please use CMake 3.2 or later.

Are you using CentOS 6? You can install CMake 3.6 with EPEL
on CentOS 6:

  % sudo yum install -y epel-release
  % sudo yum install -y cmake3
  % cmake3 --version
  cmake3 version 3.6.1

  CMake suite maintained and supported by Kitware (kitware.com/cmake).
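
As a rough sketch of then building Arrow C++ with it (the paths and flags
below are illustrative assumptions, not a verified recipe for your server):

  % git clone https://github.com/apache/arrow.git
  % cd arrow/cpp
  % mkdir build
  % cd build
  % cmake3 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/arrow ..
  % make -j4
  % make install

Installing under $HOME avoids needing root on the shared machine.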


Thanks,
---
kou

In 
  "building arrow with CMake 2.8 on CentOS" on Mon, 6 Apr 2020 05:39:56 +,
  "Lekshmi Narayanan, Arun Balajiee" wrote:

> Hi
> 
> I am looking to build Arrow on CentOS with cmake version 2.8. It is a shared 
> server, so the server admin at my school doesn't want to update the version 
> of cmake. I checked these two issues,
> https://issues.apache.org/jira/browse/ARROW-73
> https://issues.apache.org/jira/browse/ARROW-66
> 
> but I couldn't arrive at a resolution on how to build on my machine.
> 
> I can't find any documentation for building on CentOS with this cmake version 
> either. Could you help me with this?
> 
> 
> Regards,
> Arun  Balajiee