[DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

2019-05-16 Thread Wes McKinney
hi folks,

In a prior mailing list thread from February [1] I brought up some
work I'd done in C++ to create an API to define custom data types that
can be embedded in built-in Arrow logical types. These are serialized
through IPC by adding special fields to the `custom_metadata` member
of Field in the Flatbuffers metadata [2]. The idea is that if an
implementation does not understand the custom type, it can still
interact with the underlying data if need be, or pass the extension
metadata along in subsequent IPC messages.
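
To make that concrete, here is a minimal sketch of how the embedding
could look from Python; the key names follow the current C++
implementation and the spec PR, so treat them as not yet finalized:

import pyarrow as pa

# A hypothetical "uuid" extension whose physical storage is fixed_size_binary(16).
field = pa.field(
    "id",
    pa.binary(16),
    metadata={
        b"ARROW:extension:name": b"example.uuid",   # identifies the extension type
        b"ARROW:extension:metadata": b"",           # optional serialized type parameters
    },
)
schema = pa.schema([field])

# A consumer that does not know "example.uuid" still sees a plain
# fixed_size_binary(16) column and can pass the metadata through unchanged.
print(field.metadata)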

David Li has put up a WIP PR to implement this for Java [4], so to
help the project move forward I think it's a good time to formalize
this and, if there are disagreements, to hash them out now. I have just
opened a PR to the Arrow specification documents [3] that describes
the current state of C++ as well as the WIP Java PR.

Any thoughts on this? If there is consensus on this solution
approach, then I can hold a vote.

Thanks
Wes

[1]: 
https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
[2]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
[3]: https://github.com/apache/arrow/pull/4332
[4]: https://github.com/apache/arrow/pull/4251


[jira] [Created] (ARROW-5359) timestamp_as_object support for pa.Table.to_pandas in pyarrow

2019-05-16 Thread Joe Muruganandam (JIRA)
Joe Muruganandam created ARROW-5359:
---

 Summary: timestamp_as_object support for pa.Table.to_pandas in 
pyarrow
 Key: ARROW-5359
 URL: https://issues.apache.org/jira/browse/ARROW-5359
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
 Environment: Ubuntu
Reporter: Joe Muruganandam


Creating ticket for the issue reported on GitHub
(https://github.com/apache/arrow/issues/4284).
h2. pyarrow (Issue with timestamp conversion from Arrow to pandas)

pyarrow Table.to_pandas has a date_as_object option but no similar
option for timestamps. When a timestamp column in an Arrow table is converted to
pandas, the target data type is pd.Timestamp, and pd.Timestamp cannot represent
times later than 2262-04-11 23:47:16.854775807; hence, in the scenario below, the
dates are transformed to incorrect values. Adding a timestamp_as_object option to
pa.Table.to_pandas would help in this scenario.

#Python(3.6.8)

import pandas as pd

import pyarrow as pa

pd.__version__
'0.24.1'

pa.__version__
'0.13.0'

import datetime

df = pd.DataFrame({"test_date": 
[datetime.datetime(3000,12,31,12,0),datetime.datetime(3100,12,31,12,0)]})

df
test_date
0 3000-12-31 12:00:00
1 3100-12-31 12:00:00

pa_table = pa.Table.from_pandas(df)

pa_table[0]
Column name='test_date' type=TimestampType(timestamp[us])
[
[
325351728,
356908464
]
]

pa_table.to_pandas()
test_date
0 1831-11-22 12:50:52.580896768
1 1931-11-22 12:50:52.580896768
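
For reference, this is roughly how the requested option would be used;
timestamp_as_object does not exist in pyarrow 0.13.0, so the call below is a
sketch of the proposal rather than current API.

# Hypothetical usage of the proposed option (not available in pyarrow 0.13.0):
# out-of-range values would come back as datetime.datetime objects instead of
# silently overflowing pd.Timestamp.
df_roundtrip = pa_table.to_pandas(timestamp_as_object=True)
df_roundtrip["test_date"].iloc[0]  # expected: datetime.datetime(3000, 12, 31, 12, 0)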



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5358) [Rust] Implement equality check for ArrayData and Array

2019-05-16 Thread Chao Sun (JIRA)
Chao Sun created ARROW-5358:
---

 Summary: [Rust] Implement equality check for ArrayData and Array
 Key: ARROW-5358
 URL: https://issues.apache.org/jira/browse/ARROW-5358
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Chao Sun


Currently {{Array}} doesn't implement the {{Eq}} trait. Although {{ArrayData}} 
derives {{PartialEq}}, the derived implementation is not suitable here. 
Instead, we should implement a customized equality comparison.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5357) [Rust] change Buffer::len to represent total bytes instead of used bytes

2019-05-16 Thread Chao Sun (JIRA)
Chao Sun created ARROW-5357:
---

 Summary: [Rust] change Buffer::len to represent total bytes 
instead of used bytes
 Key: ARROW-5357
 URL: https://issues.apache.org/jira/browse/ARROW-5357
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Chao Sun
Assignee: Chao Sun


Currently {{Buffer::len}} records the number of used bytes, as opposed to the 
total number of bytes. This poses a problem when converting from buffers 
defined in the Flatbuffers metadata, where the length is actually the number of 
allocated bytes for the buffer. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5356) [JS] Implement Duration type, integration test support for Interval and Duration types

2019-05-16 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5356:
---

 Summary: [JS] Implement Duration type, integration test support 
for Interval and Duration types
 Key: ARROW-5356
 URL: https://issues.apache.org/jira/browse/ARROW-5356
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Wes McKinney


Follow-on work to ARROW-835



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] [Python] protocol for conversion to pyarrow Array

2019-05-16 Thread Wes McKinney
hi Joris,

Somewhat related to this, I want to also point out that we have C++
extension types [1]. As part of this, it would also be good to define
and document a public API for users to create ExtensionArray
subclasses that can be serialized and deserialized using this
machinery.

As a motivating example, suppose that a Java application has a special
data type that can be serialized as a Binary value in Arrow, and we
want to be able to receive this special object as a pandas
ExtensionArray column, unboxing it into a Python user-space type.

The ExtensionType can be implemented in Java, and then on the Python
side the implementation can occur either in C++ or Python. An API will
need to be defined for serializer functions for the pandas
ExtensionArray to map the pandas-space type onto the Arrow-space
type. Does this seem like a project you might be able to help drive
forward? As a matter of sequencing, we do not yet have the capability
to interact with the C++ ExtensionType in Python, so we might need to
first create callback machinery to enable Arrow extension types to be
defined in Python (that call into the C++ ExtensionType registry).
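
To sketch what that could look like (class and method names below are
illustrative placeholders for the proposed machinery, not an existing
pyarrow API):

import pyarrow as pa

class UuidType(pa.ExtensionType):                 # hypothetical Python-side subclass
    def __init__(self):
        # storage type is the built-in Arrow type the data is embedded in
        super().__init__(pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):            # metadata written into IPC messages
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()                              # rebuild the type on the receiving side

    def to_pandas_dtype(self):                    # hook mapping to a pandas extension dtype
        from my_package import UuidDtype          # hypothetical pandas ExtensionDtype
        return UuidDtype()

# a registry call like this would let deserialization find the Python type
pa.register_extension_type(UuidType())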

- Wes

[1]: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc

On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche
 wrote:
>
> > On Thu, May 9, 2019 at 21:38, Uwe L. Korn  wrote:
>
> > +1 to the idea of adding a protocol to let other objects define their way
> > to Arrow structures. For pandas.Series I would expect that they return an
> > Arrow Column.
> >
> > For the Arrow->pandas conversion I have a bit mixed feelings. In the
> > normal Fletcher case I would expect that we don't convert anything as we
> > represent anything from Arrow with it.
>
>
> Yes, you don't want to convert anything (apart from wrapping the arrow
> array into a FletcherArray). But how does Table.to_pandas know that?
> Maybe it doesn't need to know that. And then you might write a function in
> fletcher to convert a pyarrow Table to a pandas DataFrame with
> fletcher-backed columns. But if you want to have this roundtrip
> automatically, without each project that defines an ExtensionArray and
> wants to interact with arrow (e.g. GeoPandas as well) needing its own
> "arrow-table-to-pandas-dataframe" converter, pyarrow
> needs to have some notion of how to convert back to a pandas ExtensionArray.
>
>
> > For the case where we want to restore the exact pandas DataFrame we had
> > before this will become a bit more complicated as we either would need to
> > have all third-party libraries to support Arrow via a hook as proposed or
> > we also define some kind of other protocol on the pandas side to
> > reconstruct ExtensionArrays from Arrow data.
> >
>
> That last one is basically what I proposed in
> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>
> Thanks Antoine and Uwe for the discussion!
>
> Joris


[jira] [Created] (ARROW-5355) [C++] DictionaryBuilder provides information to determine array builder type at run-time

2019-05-16 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-5355:
---

 Summary: [C++] DictionaryBuilder provides information to determine 
array builder type at run-time
 Key: ARROW-5355
 URL: https://issues.apache.org/jira/browse/ARROW-5355
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou


This is needed for Arrow GLib. In Arrow GLib, we need to determine how to wrap an 
Arrow C++ `ArrayBuilder` at run-time. An `ArrayBuilder` may be passed as a generic 
`ArrayBuilder` instead of a concrete `DictionaryBuilder` (e.g. from 
`RecordBatchBuilder::GetField()`).

See also: https://github.com/apache/arrow/pull/4316#issuecomment-492995395



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] PR Backlog reduction

2019-05-16 Thread Wes McKinney
hi Micah,

This sounds like a reasonable proposal, and I agree in particular that
for regular contributors it makes sense to close PRs that are not
close to merge-readiness, to thin the noise of the patch queue.
We have some short-term issues such as various reviewers being busy
lately (e.g. I was on vacation in April, then heads down working on
ARROW-3144) but I agree that there are some structural issues with how
we're organizing code review efforts.

Note that Apache Spark, with ~500 open PRs, created this dashboard
application to help manage the insanity

https://spark-prs.appspot.com/

Ultimately (in the next few years as the number of active contributors
grows) I expect that we'll have to do something similar.

- Wes

On Thu, May 16, 2019 at 2:34 PM Micah Kornfield  wrote:
>
> Our backlog of open PRs is slowly creeping up.  This isn't great because it
> allows contributions to slip through the cracks (which in turn possibly
> turns off new contributors).  Perusing PRs I think things roughly fall into
> the following categories.
>
>
> 1.  PRs are work in progress that never got completed but were left open
> (mostly by regular arrow contributors).
>
> 2.  PR stalled because changes were requested and the PR author never
> responded.
>
> 3.  PR stalled due to lack of consensus on approach/design.
>
> 4.  PR is blocked on some external dependency (mostly these are PRs by
> regular arrow contributors).
>
>
> A straw-man proposal for handling these:
>
> 1.  Regular arrow contributors, please close the PR if it isn't close to
> being ready and you aren't actively working on it.
>
> 2.  I think we should start assigning reviewers who will have the
> responsibility of:
>
>a.  Pinging contributor and working through the review with them.
>
>b.  Closing out the PR in some form if there hasn't been activity in a
> 30 day period (either merging as is, making the necessary changes or
> closing the PR, and removing the tag from JIRA).
>
> 3.  Same as 2, but bring the discussion to the mailing list and try to have
> a formal vote if necessary.
>
> 4.  Same as 2, but tag the PR as blocked and the time window expands.
>
>
> The question comes up of how to manage assignment of PRs to reviewers.  I
> am happy to try to triage any PRs older than a week (assuming some PRs will
> be closed quickly with the current ad-hoc process) and load balance between
> volunteers (it would be great to have a doc someplace where people can
> express their available bandwidth and which languages they feel comfortable
> with).
>
>
> Thoughts/other proposals?
>
>
> Thanks,
>
> Micah
>
>
>
> P.S. A very rough analysis of PR tags gives the following counts.
>
>   29 C++
>
>   17 Python
>
>8 Rust
>
>7 WIP
>
>7 Plasma
>
>7 Java
>
>5 R
>
>4 Go
>
>4 Flight


Re: Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Joris Van den Bossche
Missed Wes's email, but yeah, I think we basically said the same thing.

Answer to another question you raised in the notebook:

> [about writing a _common_metadata file] ... uses the schema object for
> the 0th partition. This actually means that not *all* information in
> _common_metadata will be true for the entire dataset. More specifically,
> the "start" and "stop" values of the "index_columns" [in the pandas_metadata]
> will correspond to the 0th partition, rather than the global dataset.
>
That's indeed a problem with storing the index information not as a column.
We have seen some other related issues about this, such as ARROW-5138 (when
reading a single row group of a parquet file).
In those cases, I think the only solution is to ignore this part of the
metadata. But, specifically for dask, I think the idea actually is to not
write the index at all (based on discussion in
https://github.com/dask/dask/pull/4336), so then you would not have this
problem.

However, note that writing the _common_metadata file like that from the
schema of the first partition might not be fully correct: it might have the
correct schema, but it will not have the correct dataset size (e.g. number of
row groups). Although I am not sure what the common practice is on this
aspect of the _common_metadata file.

Joris



On Thu, May 16, 2019 at 20:50, Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Rick,
>
> Thanks for exploring this!
>
> I am still quite new to Parquet myself, so the following might not be
> fully correct, but based on my current understanding, to enable projects
> like dask to write the different pieces of a Parquet dataset using pyarrow,
> we need the following functionalities:
>
> - Write a single Parquet file (for one piece / partition) and get the
> metadata of that file
> -> Writing has long been possible, and ARROW-5258 (GH4236) enabled
> getting the metadata
> - Update and combine this list of metadata objects
> -> Dask needs a way to update the metadata (e.g. the exact file path
> where each piece is placed inside the partitioned dataset): I opened ARROW-5349
> for this.
> -> We need to combine the metadata, discussed in ARROW-1983
> - Write a metadata object (for both the _metadata and _common_metadata
> files)
> -> Also discussed in ARROW-1983. The Python interface could also
> combine (the step above) and write them in one go.
>
> But it would be good if some people more familiar with Parquet could chime
> in here.
>
> Best,
> Joris
>
> On Thu, May 16, 2019 at 16:37, Richard Zamora  wrote:
>
>> Note that I was asked to post here after making a similar comment on
>> GitHub (https://github.com/apache/arrow/pull/4236)…
>>
>> I am hoping to help improve the use of pyarrow.parquet within dask (
>> https://github.com/dask/dask). To this end, I put together a simple
>> notebook to explore how pyarrow.parquet can be used to read/write a
>> partitioned dataset without dask (see:
>> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>> If you search for "Assuming that a single-file metadata solution is
>> currently missing" in that notebook, you will see where I am unsure of the
>> best way to write/read metadata to/from a centralized location using
>> pyarrow.parquet.
>>
>> I believe that it would be best for dask to have a way to read/write a
>> single metadata file for a partitioned dataset using pyarrow (perhaps a
>> ‘_metadata’ file?).   Am I correct to assume that: (1) this functionality
>> is missing in pyarrow, and (2) this  approach is the best way to process a
>> partitioned dataset in parallel?
>>
>> Best,
>> Rick
>>
>> --
>> Richard J. Zamora
>> NVIDA
>>
>


[DISCUSS] PR Backlog reduction

2019-05-16 Thread Micah Kornfield
Our backlog of open PRs is slowly creeping up.  This isn't great because it
allows contributions to slip through the cracks (which in turn possibly
turns off new contributors).  Perusing PRs I think things roughly fall into
the following categories.


1.  PRs are work in progress that never got completed but were left open
(mostly by regular arrow contributors).

2.  PR stalled because changes were requested and the PR author never
responded.

3.  PR stalled due to lack of consensus on approach/design.

4.  PR is blocked on some external dependency (mostly these are PRs by
regular arrow contributors).


A straw-man proposal for handling these:

1.  Regular arrow contributors, please close the PR if it isn't close to
being ready and you aren't actively working on it.

2.  I think we should start assigning reviewers who will have the
responsibility of:

   a.  Pinging contributor and working through the review with them.

   b.  Closing out the PR in some form if there hasn't been activity in a
30 day period (either merging as is, making the necessary changes or
closing the PR, and removing the tag from JIRA).

3.  Same as 2, but bring the discussion to the mailing list and try to have
a formal vote if necessary.

4.  Same as 2, but tag the PR as blocked and the time window expands.


The question comes up of how to manage assignment of PRs to reviewers.  I
am happy to try to triage any PRs older than a week (assuming some PRs will
be closed quickly with the current ad-hoc process) and load balance between
volunteers (it would be great to have a doc someplace where people can
express their available bandwidth and which languages they feel comfortable
with).


Thoughts/other proposals?


Thanks,

Micah



P.S. A very rough analysis of PR tags gives the following counts.

  29 C++

  17 Python

   8 Rust

   7 WIP

   7 Plasma

   7 Java

   5 R

   4 Go

   4 Flight


Re: Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Joris Van den Bossche
Hi Rick,

Thanks for exploring this!

I am still quite new to Parquet myself, so the following might not be fully
correct, but based on my current understanding, to enable projects like
dask to write the different pieces of a Parquet dataset using pyarrow, we
need the following functionalities:

- Write a single Parquet file (for one piece / partition) and get the
metadata of that file
-> Writing has long been possible, and ARROW-5258 (GH4236) enabled
getting the metadata
- Update and combine this list of metadata objects
-> Dask needs a way to update the metadata (e.g. the exact file path
where each piece is placed inside the partitioned dataset): I opened ARROW-5349 for
this.
-> We need to combine the metadata, discussed in ARROW-1983
- Write a metadata object (for both the _metadata and _common_metadata
files)
-> Also discussed in ARROW-1983. The Python interface could also
combine (the step above) and write them in one go.

But it would be good if some people more familiar with Parquet could chime
in here.

Best,
Joris

On Thu, May 16, 2019 at 16:37, Richard Zamora  wrote:

> Note that I was asked to post here after making a similar comment on
> GitHub (https://github.com/apache/arrow/pull/4236)…
>
> I am hoping to help improve the use of pyarrow.parquet within dask (
> https://github.com/dask/dask). To this end, I put together a simple
> notebook to explore how pyarrow.parquet can be used to read/write a
> partitioned dataset without dask (see:
> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
> If you search for "Assuming that a single-file metadata solution is
> currently missing" in that notebook, you will see where I am unsure of the
> best way to write/read metadata to/from a centralized location using
> pyarrow.parquet.
>
> I believe that it would be best for dask to have a way to read/write a
> single metadata file for a partitioned dataset using pyarrow (perhaps a
> ‘_metadata’ file?).   Am I correct to assume that: (1) this functionality
> is missing in pyarrow, and (2) this  approach is the best way to process a
> partitioned dataset in parallel?
>
> Best,
> Rick
>
> --
> Richard J. Zamora
> NVIDA
>
>


[jira] [Created] (ARROW-5354) [C++] allow Array to have null buffers when all elements are null

2019-05-16 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5354:


 Summary: [C++] allow Array to have null buffers when all elements 
are null
 Key: ARROW-5354
 URL: https://issues.apache.org/jira/browse/ARROW-5354
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


In the case of all elements of an array being null, no buffers whatsoever 
*need* to be allocated (similar to NullArray). This is a more extreme case of 
the optimization which allows the null bitmap buffer to be null if all elements 
are valid. Currently {{arrow::Array}} requires at least a null bitmap buffer to 
be allocated (and all bits set to 0).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Wes McKinney
hi Richard,

We have been discussing this in

https://issues.apache.org/jira/browse/ARROW-1983

All that is currently missing is (AFAICT):

* A C++ function to write a vector of FileMetaData as a _metadata file
(make sure the file path is set in the metadata objects)
* A Python binding for this

This is a relatively low-complexity patch and does not require deep
understanding of the Parquet codebase; would someone like to submit a
pull request?
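
For orientation, the end-to-end Python workflow could look roughly like
the sketch below; the set_file_path call and a combining write_metadata
binding are exactly the missing pieces, so treat those names as
assumptions rather than existing API:

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Two toy partitions standing in for dask-written pieces
os.makedirs("dataset", exist_ok=True)
pieces = {
    "part-0.parquet": pa.Table.from_pandas(pd.DataFrame({"x": [1, 2]})),
    "part-1.parquet": pa.Table.from_pandas(pd.DataFrame({"x": [3, 4]})),
}

collected = []
for path, table in pieces.items():
    pq.write_table(table, "dataset/" + path)
    md = pq.read_metadata("dataset/" + path)  # stand-in for collecting at write time
    md.set_file_path(path)                    # missing piece, tracked in ARROW-5349
    collected.append(md)

schema = next(iter(pieces.values())).schema
pq.write_metadata(schema, "dataset/_common_metadata")   # schema-only file, works today
pq.write_metadata(schema, "dataset/_metadata",
                  metadata_collector=collected)         # missing combining binding (ARROW-1983)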

Thanks

On Thu, May 16, 2019 at 9:37 AM Richard Zamora  wrote:
>
> Note that I was asked to post here after making a similar comment on GitHub 
> (https://github.com/apache/arrow/pull/4236)…
>
> I am hoping to help improve the use of pyarrow.parquet within dask 
> (https://github.com/dask/dask). To this end, I put together a simple notebook 
> to explore how pyarrow.parquet can be used to read/write a partitioned 
> dataset without dask (see: 
> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>   If you search for "Assuming that a single-file metadata solution is 
> currently missing" in that notebook, you will see where I am unsure of the 
> best way to write/read metadata to/from a centralized location using 
> pyarrow.parquet.
>
> I believe that it would be best for dask to have a way to read/write a single 
> metadata file for a partitioned dataset using pyarrow (perhaps a ‘_metadata’ 
> file?).   Am I correct to assume that: (1) this functionality is missing in 
> pyarrow, and (2) this  approach is the best way to process a partitioned 
> dataset in parallel?
>
> Best,
> Rick
>
> --
> Richard J. Zamora
> NVIDA
>


[jira] [Created] (ARROW-5353) 0-row table can be written but not read

2019-05-16 Thread Thomas Buhrmann (JIRA)
Thomas Buhrmann created ARROW-5353:
--

 Summary: 0-row table can be written but not read
 Key: ARROW-5353
 URL: https://issues.apache.org/jira/browse/ARROW-5353
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.13.0, 0.12.0, 0.11.0
Reporter: Thomas Buhrmann


I can serialize a table with 0 rows, but not read it. The following code
{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [0,1,2]})[:0]
fnm = "tbl.arr"

tbl = pa.Table.from_pandas(df)
print(tbl.schema)

writer = pa.RecordBatchStreamWriter(fnm, tbl.schema)
writer.write_table(tbl)

reader = pa.RecordBatchStreamReader(fnm)
tbl2 = reader.read_all()
{code}
...results in the following output:
{code}
x: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 0, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "x", "field_name": "x", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
---
ArrowInvalid  Traceback (most recent call last)
 in 
 11 writer.write_table(tbl)
 12 
---> 13 reader = pa.RecordBatchStreamReader(fnm)
 14 tbl2 = reader.read_all()

~/anaconda/envs/grapy/lib/python3.6/site-packages/pyarrow/ipc.py in 
__init__(self, source)
 56 """
 57 def __init__(self, source):
---> 58 self._open(source)
 59 
 60 

~/anaconda/envs/grapy/lib/python3.6/site-packages/pyarrow/ipc.pxi in 
pyarrow.lib._RecordBatchStreamReader._open()

~/anaconda/envs/grapy/lib/python3.6/site-packages/pyarrow/error.pxi in 
pyarrow.lib.check_status()

ArrowInvalid: Expected schema message in stream, was null or length 0
{code}
Since the schema should be sufficient to build a table, even though it may not 
have any actual data, I wouldn't expect this to fail but rather to return the same 
0-row input table.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5352) [Rust] BinaryArray filter replaces nulls with empty strings

2019-05-16 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5352:
-

 Summary: [Rust] BinaryArray filter replaces nulls with empty strings
 Key: ARROW-5352
 URL: https://issues.apache.org/jira/browse/ARROW-5352
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.13.0
Reporter: Neville Dipale


The filter implementation for BinaryArray discards the nullness of the data. 
BinaryArray slots that are null (seem to) always return an empty string slice when 
getting a value, so the way filter works might be a bug depending on what Arrow 
developers' or users' intentions are.

I think we should either preserve nulls (and their count) or document this as 
intended behaviour.

Below is a test case that reproduces the bug.
{code:java}
#[test]
fn test_filter_binary_array_with_nulls() {
let mut a: BinaryBuilder = BinaryBuilder::new(100);
a.append_null().unwrap();
a.append_string("a string").unwrap();
a.append_null().unwrap();
a.append_string("with nulls").unwrap();
let array = a.finish();
let b = BooleanArray::from(vec![true, true, true, true]);
let c = filter(&array, &b).unwrap();
let d: &BinaryArray = c.as_any().downcast_ref::<BinaryArray>().unwrap();
// I didn't expect this behaviour
assert_eq!("", d.get_string(0));
// fails here
assert!(d.is_null(0));
assert_eq!(4, d.len());
// fails here
assert_eq!(2, d.null_count());
assert_eq!("a string", d.get_string(1));
// fails here
assert!(d.is_null(2));
assert_eq!("with nulls", d.get_string(3));
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5351) [Rust] Add support for take kernel functions

2019-05-16 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5351:
-

 Summary: [Rust] Add support for take kernel functions
 Key: ARROW-5351
 URL: https://issues.apache.org/jira/browse/ARROW-5351
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


Similar to https://issues.apache.org/jira/browse/ARROW-772, a take function 
would give us random access on arrays, which is useful for sorting and 
(potentially) filtering.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5350) [Rust] Support filtering on nested array types

2019-05-16 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5350:
-

 Summary: [Rust] Support filtering on nested array types
 Key: ARROW-5350
 URL: https://issues.apache.org/jira/browse/ARROW-5350
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


We can currently only filter on primitive types, not on lists and structs. Add 
the ability to filter on nested array types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Metadata for partitioned datasets in pyarrow.parquet

2019-05-16 Thread Richard Zamora
Note that I was asked to post here after making a similar comment on GitHub 
(https://github.com/apache/arrow/pull/4236)…

I am hoping to help improve the use of pyarrow.parquet within dask 
(https://github.com/dask/dask). To this end, I put together a simple notebook 
to explore how pyarrow.parquet can be used to read/write a partitioned dataset 
without dask (see: 
https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb). 
 If you search for "Assuming that a single-file metadata solution is currently 
missing" in that notebook, you will see where I am unsure of the best way to 
write/read metadata to/from a centralized location using pyarrow.parquet.

I believe that it would be best for dask to have a way to read/write a single 
metadata file for a partitioned dataset using pyarrow (perhaps a ‘_metadata’ 
file?).   Am I correct to assume that: (1) this functionality is missing in 
pyarrow, and (2) this  approach is the best way to process a partitioned 
dataset in parallel?

Best,
Rick

--
Richard J. Zamora
NVIDA





[jira] [Created] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-16 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5349:


 Summary: [Python/C++] Provide a way to specify the file path in 
parquet ColumnChunkMetaData
 Key: ARROW-5349
 URL: https://issues.apache.org/jira/browse/ARROW-5349
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now possible 
to collect the file metadata while writing different files (how to write that 
combined metadata was not yet addressed there; see the original issue ARROW-1983).

However, currently, the {{file_path}} information in the ColumnChunkMetaData 
object is not set. This is, I think, expected / correct for the metadata as 
included within the single file; but for using the metadata in the combined 
dataset `_metadata`, it needs a file path set.

So if you want to use this metadata for a partitioned dataset, there needs to 
be a way to specify this file path. 
Ideas I am thinking of currently: either we could specify a file path to be 
used when writing, or we could expose the `set_file_path` method on the Python 
side so you can create an updated version of the metadata after collecting it.
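
A minimal sketch of what these two ideas could look like from Python (neither a 
{{file_path}} keyword nor an exposed {{set_file_path}} exists in pyarrow today; 
both are placeholders for the proposal):
{code}
# Sketch only: neither call below exists yet in pyarrow.
import pyarrow.parquet as pq

# Idea 1: let the writer record the dataset-relative path at write time
# pq.write_table(table, "year=2019/part-0.parquet", file_path="year=2019/part-0.parquet")

# Idea 2: expose set_file_path so collected metadata can be fixed up afterwards
md = pq.read_metadata("year=2019/part-0.parquet")
md.set_file_path("year=2019/part-0.parquet")   # path relative to the dataset root
# md can then be combined with the other pieces into the dataset-level _metadata
{code}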

cc [~pearu] [~mdurant]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5348) [CI] [Java] Gandiva checkstyle failure

2019-05-16 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5348:
-

 Summary: [CI] [Java] Gandiva checkstyle failure
 Key: ARROW-5348
 URL: https://issues.apache.org/jira/browse/ARROW-5348
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva, Continuous Integration, Java
Reporter: Antoine Pitrou


This is failing Travis-CI builds now:
{code}
[WARNING] 
src/main/java/org/apache/arrow/gandiva/evaluator/Projector.java:[145,3] 
(javadoc) JavadocMethod: Missing a Javadoc comment.
[WARNING] 
src/main/java/org/apache/arrow/gandiva/evaluator/DecimalTypeUtil.java:[38,3] 
(javadoc) JavadocMethod: Missing a Javadoc comment.

[...]

[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on 
project arrow-gandiva: You have 2 Checkstyle violations. -> [Help 1]
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5347) [C++] Building fails on Windows with gtest symbol issue

2019-05-16 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5347:
-

 Summary: [C++] Building fails on Windows with gtest symbol issue
 Key: ARROW-5347
 URL: https://issues.apache.org/jira/browse/ARROW-5347
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


I get the following on my Windows VM:
{code}
compute-test.cc.obj : error LNK2001: unresolved external symbol "class testing::
internal::Mutex testing::internal::g_gmock_mutex" (?g_gmock_mutex@internal@testi
ng@@3VMutex@12@A)
release\arrow-compute-test.exe : fatal error LNK1120: 1 unresolved externals
{code}

It's probably caused by something like 
https://github.com/google/googletest/issues/292, except that our CMake code 
already seems to handle this issue, so I'm not sure what is happening.

Here is my build script:
{code}
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Debug ^
-DARROW_USE_CLCACHE=ON ^
-DARROW_BOOST_USE_SHARED=OFF ^
-DBOOST_ROOT=C:\boost_1_67_0 ^
-DARROW_DEPENDENCY_SOURCE=BUNDLED ^
-DARROW_PYTHON=OFF ^
-DARROW_PLASMA=OFF ^
-DARROW_BUILD_TESTS=ON ^
..
cmake --build . --config Debug
{code}

I'm using conda for dependencies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)