[DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol
hi folks,

In a prior mailing list thread from February [1] I brought up some work I'd done in C++ to create an API for defining custom data types that can be embedded in built-in Arrow logical types. These are serialized through IPC by adding special fields to the `custom_metadata` member of Field in the Flatbuffers metadata [2]. The idea is that if an implementation does not understand the custom type, it can still interact with the underlying data if need be, or pass the extension metadata along in subsequent IPC messages.

David Li has put up a WIP PR to implement this for Java [4], so to help the project move forward I think it's a good time to formalize this and, if there are disagreements, to hash them out now. I have just opened a PR against the Arrow specification documents [3] that describes the current state of C++ and also the WIP Java PR.

Any thoughts about this? If there is consensus on this approach then I can hold a vote.

Thanks
Wes

[1]: https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
[2]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
[3]: https://github.com/apache/arrow/pull/4332
[4]: https://github.com/apache/arrow/pull/4251
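To make the mechanism concrete, here is a minimal pyarrow sketch of a field carrying extension metadata in its `custom_metadata`. The `ARROW:extension:*` key names follow the current C++ implementation and should be treated as an assumption here, and the `uuid` type is just a hypothetical example:

{code}
import pyarrow as pa

# A plain fixed-size binary field whose custom_metadata announces an
# extension type. An implementation that does not recognize "uuid" can
# still read the underlying binary values and forward the metadata
# unchanged in subsequent IPC messages.
uuid_field = pa.field(
    "id",
    pa.binary(16),
    metadata={
        b"ARROW:extension:name": b"uuid",      # registered extension type name
        b"ARROW:extension:metadata": b"",      # serialized type parameters, if any
    },
)

schema = pa.schema([uuid_field])
print(uuid_field.metadata)
{code}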
[jira] [Created] (ARROW-5359) timestamp_as_object support for pa.Table.to_pandas in pyarrow
Joe Muruganandam created ARROW-5359:
---
Summary: timestamp_as_object support for pa.Table.to_pandas in pyarrow
Key: ARROW-5359
URL: https://issues.apache.org/jira/browse/ARROW-5359
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.13.0
Environment: Ubuntu
Reporter: Joe Muruganandam

Creating a ticket for the issue reported on GitHub (https://github.com/apache/arrow/issues/4284).

h2. pyarrow: issue with timestamp conversion from Arrow to pandas

pyarrow's Table.to_pandas has a date_as_object option but no similar option for timestamps. When a timestamp column in an Arrow table is converted to pandas, the target type is pd.Timestamp, and pd.Timestamp cannot represent times beyond 2262-04-11 23:47:16.854775807, so in the scenario below the dates are silently transformed into incorrect values. Adding a timestamp_as_object option to pa.Table.to_pandas would help in this scenario.

{code}
# Python 3.6.8
>>> import pandas as pd
>>> import pyarrow as pa
>>> pd.__version__
'0.24.1'
>>> pa.__version__
'0.13.0'
>>> import datetime
>>> df = pd.DataFrame({"test_date": [datetime.datetime(3000, 12, 31, 12, 0),
...                                  datetime.datetime(3100, 12, 31, 12, 0)]})
>>> df
            test_date
0 3000-12-31 12:00:00
1 3100-12-31 12:00:00
>>> pa_table = pa.Table.from_pandas(df)
>>> pa_table[0]
Column name='test_date' type=TimestampType(timestamp[us])
[
  [
    325351728,
    356908464
  ]
]
>>> pa_table.to_pandas()
                      test_date
0 1831-11-22 12:50:52.580896768
1 1931-11-22 12:50:52.580896768
{code}
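For context, the pandas limitation behind this request can be checked directly; a quick illustrative sketch (exact error text may vary by pandas version):

{code}
>>> import datetime
>>> import pandas as pd
>>> pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
>>> pd.Timestamp(datetime.datetime(3000, 12, 31, 12, 0))
Traceback (most recent call last):
  ...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-12-31 12:00:00
{code}

Keeping such values as Python datetime objects, which is what a timestamp_as_object option would do, sidesteps the nanosecond range limit.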
[jira] [Created] (ARROW-5358) [Rust] Implement equality check for ArrayData and Array
Chao Sun created ARROW-5358:
---
Summary: [Rust] Implement equality check for ArrayData and Array
Key: ARROW-5358
URL: https://issues.apache.org/jira/browse/ARROW-5358
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Chao Sun

Currently {{Array}} doesn't implement the {{Eq}} trait. Although {{ArrayData}} derives {{PartialEq}}, the derived implementation is not suitable here. Instead, we should implement a customized equality comparison.
[jira] [Created] (ARROW-5357) [Rust] change Buffer::len to represent total bytes instead of used bytes
Chao Sun created ARROW-5357:
---
Summary: [Rust] change Buffer::len to represent total bytes instead of used bytes
Key: ARROW-5357
URL: https://issues.apache.org/jira/browse/ARROW-5357
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Chao Sun
Assignee: Chao Sun

Currently {{Buffer::len}} records the number of used bytes, as opposed to the total number of bytes. This poses a problem when converting from buffers defined in the Flatbuffers metadata, where the length is actually the number of allocated bytes for the buffer.
[jira] [Created] (ARROW-5356) [JS] Implement Duration type, integration test support for Interval and Duration types
Wes McKinney created ARROW-5356:
---
Summary: [JS] Implement Duration type, integration test support for Interval and Duration types
Key: ARROW-5356
URL: https://issues.apache.org/jira/browse/ARROW-5356
Project: Apache Arrow
Issue Type: Improvement
Components: JavaScript
Reporter: Wes McKinney

Follow on work to ARROW-835
Re: [Discuss] [Python] protocol for conversion to pyarrow Array
hi Joris,

Somewhat related to this, I want to also point out that we have C++ extension types [1]. As part of this, it would also be good to define and document a public API for users to create ExtensionArray subclasses that can be serialized and deserialized using this machinery.

As a motivating example, suppose that a Java application has a special data type that can be serialized as a Binary value in Arrow, and we want to be able to receive this special object as a pandas ExtensionArray column which unboxes into a Python user-space type. The ExtensionType can be implemented in Java, and then on the Python side the implementation can occur either in C++ or Python. An API will need to be defined for serializer functions so that the pandas ExtensionArray can map the pandas-space type onto the Arrow-space type.

Does this seem like a project you might be able to help drive forward?

As a matter of sequencing, we do not yet have the capability to interact with C++ ExtensionType in Python, so we might need to first create callback machinery to enable Arrow extension types to be defined in Python (calling into the C++ ExtensionType registry).

- Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/extension_type-test.cc

On Fri, May 10, 2019 at 2:11 AM Joris Van den Bossche wrote:
>
> Op do 9 mei 2019 om 21:38 schreef Uwe L. Korn :
>
> > +1 to the idea of adding a protocol to let other objects define their way
> > to Arrow structures. For pandas.Series I would expect that they return an
> > Arrow Column.
> >
> > For the Arrow->pandas conversion I have a bit mixed feelings. In the
> > normal Fletcher case I would expect that we don't convert anything as we
> > represent anything from Arrow with it.
>
> Yes, you don't want to convert anything (apart from wrapping the arrow
> array into a FletcherArray). But how does Table.to_pandas know that?
> Maybe it doesn't need to know that. And then you might write a function in
> fletcher to convert a pyarrow Table to a pandas DataFrame with
> fletcher-backed columns. But if you want to have this roundtrip
> automatically, without the need that each project that defines an
> ExtensionArray and wants to interact with arrow (eg in GeoPandas as well)
> needs to have its own "arrow-table-to-pandas-dataframe" converter, pyarrow
> needs to have some notion of how to convert back to a pandas ExtensionArray.
>
> > For the case where we want to restore the exact pandas DataFrame we had
> > before this will become a bit more complicated as we either would need to
> > have all third-party libraries support Arrow via a hook as proposed or
> > we also define some kind of other protocol on the pandas side to
> > reconstruct ExtensionArrays from Arrow data.
>
> That last one is basically what I proposed in
> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>
> Thanks Antoine and Uwe for the discussion!
>
> Joris
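To make the kind of protocol being discussed concrete, here is a rough Python sketch of what a conversion hook on a third-party array type could look like; the `__arrow_array__` name, its signature, and the dispatching helper are assumptions for illustration, not an agreed-upon API:

{code}
import pyarrow as pa


class PeriodArray:
    """A toy third-party array that stores periods as int64 ordinals."""

    def __init__(self, ordinals):
        self._ordinals = list(ordinals)

    def __arrow_array__(self, type=None):
        # Hypothetical hook that pyarrow could call from pa.array() or
        # Table.from_pandas() to let this object define its own conversion
        # to an Arrow array.
        return pa.array(self._ordinals, type=type or pa.int64())


def to_arrow(obj):
    # Sketch of what the pyarrow side of such a protocol might do.
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__()
    return pa.array(obj)


print(to_arrow(PeriodArray([0, 1, 2])))
{code}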
[jira] [Created] (ARROW-5355) [C++] DictionaryBuilder provides information to determine array builder type at run-time
Kouhei Sutou created ARROW-5355:
---
Summary: [C++] DictionaryBuilder provides information to determine array builder type at run-time
Key: ARROW-5355
URL: https://issues.apache.org/jira/browse/ARROW-5355
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Kouhei Sutou

This is needed for Arrow GLib. In Arrow GLib, we need to determine how to wrap an Arrow C++ `ArrayBuilder` at run-time. An `ArrayBuilder` may be passed as a generic `ArrayBuilder` instead of a concrete `DictionaryBuilder` (e.g. from `RecordBatchBuilder::GetField()`).

See also: https://github.com/apache/arrow/pull/4316#issuecomment-492995395
Re: [DISCUSS] PR Backlog reduction
hi Micah,

This sounds like a reasonable proposal, and I agree in particular for regular contributors that it makes sense to close PRs that are not close to merge-readiness, to thin the noise of the patch queue.

We have some short-term issues, such as various reviewers being busy lately (e.g. I was on vacation in April, then heads down working on ARROW-3144), but I agree that there are some structural issues with how we're organizing code review efforts. Note that Apache Spark, with ~500 open PRs, created this dashboard application to help manage the insanity: https://spark-prs.appspot.com/. Ultimately (in the next few years, as the number of active contributors grows) I expect that we'll have to do something similar.

- Wes

On Thu, May 16, 2019 at 2:34 PM Micah Kornfield wrote:
>
> Our backlog of open PRs is slowly creeping up. This isn't great because it
> allows contributions to slip through the cracks (which in turn possibly
> turns off new contributors). Perusing PRs, I think things roughly fall into
> the following categories:
>
> 1. PRs that are work in progress that never got completed but were left open
> (mostly by regular Arrow contributors).
> 2. PRs stalled because changes were requested and the PR author never
> responded.
> 3. PRs stalled due to lack of consensus on approach/design.
> 4. PRs blocked on some external dependency (mostly these are PRs by
> regular Arrow contributors).
>
> A straw-man proposal for handling these:
>
> 1. Regular Arrow contributors: please close the PR if it isn't close to
> being ready and you aren't actively working on it.
> 2. I think we should start assigning reviewers who will have the
> responsibility of:
>    a. Pinging the contributor and working through the review with them.
>    b. Closing out the PR in some form if there hasn't been activity in a
>    30-day period (either merging as is, making the necessary changes, or
>    closing the PR and removing the tag from JIRA).
> 3. Same as 2, but bring the discussion to the mailing list and try to have
> a formal vote if necessary.
> 4. Same as 2, but tag the PR as blocked and the time window expands.
>
> The question comes up of how to manage assignment of PRs to reviewers. I
> am happy to try to triage any PRs older than a week (assuming some PRs will
> be closed quickly with the current ad-hoc process) and load balance between
> volunteers (it would be great to have a doc someplace where people can
> express their available bandwidth and which languages they feel comfortable
> with).
>
> Thoughts/other proposals?
>
> Thanks,
>
> Micah
>
> P.S. A very rough analysis of PR tags gives the following counts:
>
> 29 C++
> 17 Python
>  8 Rust
>  7 WIP
>  7 Plasma
>  7 Java
>  5 R
>  4 Go
>  4 Flight
Re: Metadata for partitioned datasets in pyarrow.parquet
Missed Wes's email, but yeah, I think we basically said the same thing.

An answer to another question you raised in the notebook:

> [about writing a _common_metadata file] ... uses the schema object for
> the 0th partition. This actually means that not *all* information in
> _common_metadata will be true for the entire dataset. More specifically,
> the "index_columns" [in the pandas_metadata] its "start" and "stop"
> values will correspond to the 0th partition, rather than the global dataset.

That's indeed a problem with not storing the index information as a column. We have seen some other related issues about this, such as ARROW-5138 (when reading a single row group of a Parquet file). In those cases, I think the only solution is to ignore this part of the metadata. But, specifically for dask, I think the idea actually is to not write the index at all (based on the discussion in https://github.com/dask/dask/pull/4336), so then you would not have this problem.

However, note that writing the _common_metadata file like that, from the schema of the first partition, might not be fully correct: it might have the correct schema, but it will not have the correct dataset size (eg number of row groups). Although I am not sure what the "common practice" is on this aspect of the _common_metadata file.

Joris

Op do 16 mei 2019 om 20:50 schreef Joris Van den Bossche <
jorisvandenboss...@gmail.com>:

> Hi Rick,
>
> Thanks for exploring this!
>
> I am still quite new to Parquet myself, so the following might not be
> fully correct, but based on my current understanding, to enable projects
> like dask to write the different pieces of a Parquet dataset using pyarrow,
> we need the following functionalities:
>
> - Write a single Parquet file (for one piece / partition) and get the
>   metadata of that file
>   -> Writing has long been possible, and ARROW-5258 (GH4236) enabled
>   getting the metadata
> - Update and combine this list of metadata objects
>   -> Dask needs a way to update the metadata (eg the exact file path
>   where they put it inside the partitioned dataset): I opened ARROW-5349
>   for this.
>   -> We need to combine the metadata, discussed in ARROW-1983
> - Write a metadata object (for both the _metadata and _common_metadata
>   files)
>   -> Also discussed in ARROW-1983. The Python interface could also
>   combine (step above) and write together.
>
> But it would be good if some people more familiar with Parquet could chime
> in here.
>
> Best,
> Joris
>
> Op do 16 mei 2019 om 16:37 schreef Richard Zamora :
>
>> Note that I was asked to post here after making a similar comment on
>> GitHub (https://github.com/apache/arrow/pull/4236)…
>>
>> I am hoping to help improve the use of pyarrow.parquet within dask (
>> https://github.com/dask/dask). To this end, I put together a simple
>> notebook to explore how pyarrow.parquet can be used to read/write a
>> partitioned dataset without dask (see:
>> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>> If you search for "Assuming that a single-file metadata solution is
>> currently missing" in that notebook, you will see where I am unsure of the
>> best way to write/read metadata to/from a centralized location using
>> pyarrow.parquet.
>>
>> I believe that it would be best for dask to have a way to read/write a
>> single metadata file for a partitioned dataset using pyarrow (perhaps a
>> ‘_metadata’ file?). Am I correct to assume that: (1) this functionality
>> is missing in pyarrow, and (2) this approach is the best way to process a
>> partitioned dataset in parallel?
>>
>> Best,
>> Rick
>>
>> --
>> Richard J. Zamora
>> NVIDIA
[DISCUSS] PR Backlog reduction
Our backlog of open PRs is slowly creeping up. This isn't great because it allows contributions to slip through the cracks (which in turn possibly turns off new contributors). Perusing PRs, I think things roughly fall into the following categories:

1. PRs that are work in progress that never got completed but were left open (mostly by regular Arrow contributors).
2. PRs stalled because changes were requested and the PR author never responded.
3. PRs stalled due to lack of consensus on approach/design.
4. PRs blocked on some external dependency (mostly these are PRs by regular Arrow contributors).

A straw-man proposal for handling these:

1. Regular Arrow contributors: please close the PR if it isn't close to being ready and you aren't actively working on it.
2. I think we should start assigning reviewers who will have the responsibility of:
   a. Pinging the contributor and working through the review with them.
   b. Closing out the PR in some form if there hasn't been activity in a 30-day period (either merging as is, making the necessary changes, or closing the PR and removing the tag from JIRA).
3. Same as 2, but bring the discussion to the mailing list and try to have a formal vote if necessary.
4. Same as 2, but tag the PR as blocked and the time window expands.

The question comes up of how to manage assignment of PRs to reviewers. I am happy to try to triage any PRs older than a week (assuming some PRs will be closed quickly with the current ad-hoc process) and load balance between volunteers (it would be great to have a doc someplace where people can express their available bandwidth and which languages they feel comfortable with).

Thoughts/other proposals?

Thanks,

Micah

P.S. A very rough analysis of PR tags gives the following counts:

29 C++
17 Python
 8 Rust
 7 WIP
 7 Plasma
 7 Java
 5 R
 4 Go
 4 Flight
Re: Metadata for partitioned datasets in pyarrow.parquet
Hi Rick,

Thanks for exploring this!

I am still quite new to Parquet myself, so the following might not be fully correct, but based on my current understanding, to enable projects like dask to write the different pieces of a Parquet dataset using pyarrow, we need the following functionalities:

- Write a single Parquet file (for one piece / partition) and get the metadata of that file
  -> Writing has long been possible, and ARROW-5258 (GH4236) enabled getting the metadata
- Update and combine this list of metadata objects
  -> Dask needs a way to update the metadata (eg the exact file path where they put it inside the partitioned dataset): I opened ARROW-5349 for this.
  -> We need to combine the metadata, discussed in ARROW-1983
- Write a metadata object (for both the _metadata and _common_metadata files)
  -> Also discussed in ARROW-1983. The Python interface could also combine (step above) and write together.

But it would be good if some people more familiar with Parquet could chime in here.

Best,
Joris

Op do 16 mei 2019 om 16:37 schreef Richard Zamora :

> Note that I was asked to post here after making a similar comment on
> GitHub (https://github.com/apache/arrow/pull/4236)…
>
> I am hoping to help improve the use of pyarrow.parquet within dask (
> https://github.com/dask/dask). To this end, I put together a simple
> notebook to explore how pyarrow.parquet can be used to read/write a
> partitioned dataset without dask (see:
> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
> If you search for "Assuming that a single-file metadata solution is
> currently missing" in that notebook, you will see where I am unsure of the
> best way to write/read metadata to/from a centralized location using
> pyarrow.parquet.
>
> I believe that it would be best for dask to have a way to read/write a
> single metadata file for a partitioned dataset using pyarrow (perhaps a
> ‘_metadata’ file?). Am I correct to assume that: (1) this functionality
> is missing in pyarrow, and (2) this approach is the best way to process a
> partitioned dataset in parallel?
>
> Best,
> Rick
>
> --
> Richard J. Zamora
> NVIDIA
[jira] [Created] (ARROW-5354) [C++] allow Array to have null buffers when all elements are null
Benjamin Kietzman created ARROW-5354:
---
Summary: [C++] allow Array to have null buffers when all elements are null
Key: ARROW-5354
URL: https://issues.apache.org/jira/browse/ARROW-5354
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman

In the case of all elements of an array being null, no buffers whatsoever *need* to be allocated (similar to NullArray). This is a more extreme case of the optimization which allows the null bitmap buffer to be null if all elements are valid. Currently {{arrow::Array}} requires at least a null bitmap buffer to be allocated (and all bits set to 0).
Re: Metadata for partitioned datasets in pyarrow.parquet
hi Richard,

We have been discussing this in https://issues.apache.org/jira/browse/ARROW-1983

All that is currently missing is (AFAICT):

* A C++ function to write a vector of FileMetaData as a _metadata file (make sure the file path is set in the metadata objects)
* A Python binding for this

This is a relatively low-complexity patch and does not require a deep understanding of the Parquet codebase. Would someone like to submit a pull request?

Thanks

On Thu, May 16, 2019 at 9:37 AM Richard Zamora wrote:
>
> Note that I was asked to post here after making a similar comment on GitHub
> (https://github.com/apache/arrow/pull/4236)…
>
> I am hoping to help improve the use of pyarrow.parquet within dask
> (https://github.com/dask/dask). To this end, I put together a simple notebook
> to explore how pyarrow.parquet can be used to read/write a partitioned
> dataset without dask (see:
> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
> If you search for "Assuming that a single-file metadata solution is
> currently missing" in that notebook, you will see where I am unsure of the
> best way to write/read metadata to/from a centralized location using
> pyarrow.parquet.
>
> I believe that it would be best for dask to have a way to read/write a single
> metadata file for a partitioned dataset using pyarrow (perhaps a ‘_metadata’
> file?). Am I correct to assume that: (1) this functionality is missing in
> pyarrow, and (2) this approach is the best way to process a partitioned
> dataset in parallel?
>
> Best,
> Rick
>
> --
> Richard J. Zamora
> NVIDIA
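As a sketch of what the resulting Python-level usage could look like once that binding exists (the `write_metadata` helper and its `metadata_collector` keyword are assumptions about the eventual API, used here purely for illustration):

{code}
import os
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("dataset", exist_ok=True)

# Write one piece of the dataset and collect its FileMetaData.
table = pa.table({"x": [0, 1, 2]})
pq.write_table(table, "dataset/part-0.parquet")
collected = [pq.read_metadata("dataset/part-0.parquet")]
# (each collected entry would still need its file path set relative to the
# dataset root, which is the subject of ARROW-5349)

# The missing piece discussed above: write the combined FileMetaData objects
# out as a single dataset-level _metadata file.
pq.write_metadata(table.schema, "dataset/_metadata", metadata_collector=collected)
{code}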
[jira] [Created] (ARROW-5353) 0-row table can be written but not read
Thomas Buhrmann created ARROW-5353:
---
Summary: 0-row table can be written but not read
Key: ARROW-5353
URL: https://issues.apache.org/jira/browse/ARROW-5353
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.13.0, 0.12.0, 0.11.0
Reporter: Thomas Buhrmann

I can serialize a table with 0 rows, but not read it. The following code

{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [0,1,2]})[:0]
fnm = "tbl.arr"

tbl = pa.Table.from_pandas(df)
print(tbl.schema)

writer = pa.RecordBatchStreamWriter(fnm, tbl.schema)
writer.write_table(tbl)

reader = pa.RecordBatchStreamReader(fnm)
tbl2 = reader.read_all()
{code}

...results in the following output:

{code}
x: int64
metadata
OrderedDict([(b'pandas',
              b'{"index_columns": [{"kind": "range", "name": null, "start": '
              b'0, "stop": 0, "step": 1}], "column_indexes": [{"name": null,'
              b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
              b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
              b'{"name": "x", "field_name": "x", "pandas_type": "int64", "nu'
              b'mpy_type": "int64", "metadata": null}], "creator": {"library'
              b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])

---
ArrowInvalid                              Traceback (most recent call last)
in
     11 writer.write_table(tbl)
     12
---> 13 reader = pa.RecordBatchStreamReader(fnm)
     14 tbl2 = reader.read_all()

~/anaconda/envs/grapy/lib/python3.6/site-packages/pyarrow/ipc.py in __init__(self, source)
     56     """
     57     def __init__(self, source):
---> 58         self._open(source)
     59
     60

~/anaconda/envs/grapy/lib/python3.6/site-packages/pyarrow/ipc.pxi in pyarrow.lib._RecordBatchStreamReader._open()

~/anaconda/envs/grapy/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Expected schema message in stream, was null or length 0
{code}

Since the schema should be sufficient to build a table, even though it may not have any actual data, I wouldn't expect this to fail but return the same 0-row input table.
[jira] [Created] (ARROW-5352) [Rust] BinaryArray filter replaces nulls with empty strings
Neville Dipale created ARROW-5352:
---
Summary: [Rust] BinaryArray filter replaces nulls with empty strings
Key: ARROW-5352
URL: https://issues.apache.org/jira/browse/ARROW-5352
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Affects Versions: 0.13.0
Reporter: Neville Dipale

The filter implementation for BinaryArray discards the nullness of the data. BinaryArray slots that are null (seem to) always return an empty string slice when getting a value, so the way filter works might be a bug depending on what Arrow developers' or users' intentions are. I think we should either preserve nulls (and their count) or document this as intended behaviour.

Below is a test case that reproduces the bug.

{code:java}
#[test]
fn test_filter_binary_array_with_nulls() {
    let mut a: BinaryBuilder = BinaryBuilder::new(100);
    a.append_null().unwrap();
    a.append_string("a string").unwrap();
    a.append_null().unwrap();
    a.append_string("with nulls").unwrap();
    let array = a.finish();
    let b = BooleanArray::from(vec![true, true, true, true]);
    let c = filter(&array, &b).unwrap();
    let d: &BinaryArray = c.as_any().downcast_ref::<BinaryArray>().unwrap();
    // I didn't expect this behaviour
    assert_eq!("", d.get_string(0));
    // fails here
    assert!(d.is_null(0));
    assert_eq!(4, d.len());
    // fails here
    assert_eq!(2, d.null_count());
    assert_eq!("a string", d.get_string(1));
    // fails here
    assert!(d.is_null(2));
    assert_eq!("with nulls", d.get_string(3));
}
{code}
[jira] [Created] (ARROW-5351) [Rust] Add support for take kernel functions
Neville Dipale created ARROW-5351:
---
Summary: [Rust] Add support for take kernel functions
Key: ARROW-5351
URL: https://issues.apache.org/jira/browse/ARROW-5351
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Neville Dipale

Similar to https://issues.apache.org/jira/browse/ARROW-772, a take function would give us random access on arrays, which is useful for sorting and (potentially) filtering.
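For reference, the semantics of a take kernel are roughly the following; this is a plain-Python illustration of the behaviour (including null indices yielding null outputs), not the proposed Rust implementation:

{code}
def take(values, indices):
    # Gather values[i] for each i in indices, in index order;
    # a null (None) index produces a null output slot.
    return [None if i is None else values[i] for i in indices]

assert take(["a", "b", "c", "d"], [3, 0, 1]) == ["d", "a", "b"]
assert take(["a", "b", "c", "d"], [0, None, 2]) == ["a", None, "c"]
{code}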
[jira] [Created] (ARROW-5350) [Rust] Support filtering on nested array types
Neville Dipale created ARROW-5350:
---
Summary: [Rust] Support filtering on nested array types
Key: ARROW-5350
URL: https://issues.apache.org/jira/browse/ARROW-5350
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Neville Dipale

We currently only filter on primitive types, but not on lists and structs. Add the ability to filter on nested array types.
Metadata for partitioned datasets in pyarrow.parquet
Note that I was asked to post here after making a similar comment on GitHub (https://github.com/apache/arrow/pull/4236)…

I am hoping to help improve the use of pyarrow.parquet within dask (https://github.com/dask/dask). To this end, I put together a simple notebook to explore how pyarrow.parquet can be used to read/write a partitioned dataset without dask (see: https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb). If you search for "Assuming that a single-file metadata solution is currently missing" in that notebook, you will see where I am unsure of the best way to write/read metadata to/from a centralized location using pyarrow.parquet.

I believe that it would be best for dask to have a way to read/write a single metadata file for a partitioned dataset using pyarrow (perhaps a ‘_metadata’ file?). Am I correct to assume that: (1) this functionality is missing in pyarrow, and (2) this approach is the best way to process a partitioned dataset in parallel?

Best,
Rick

--
Richard J. Zamora
NVIDIA
[jira] [Created] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
Joris Van den Bossche created ARROW-5349:
---
Summary: [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
Key: ARROW-5349
URL: https://issues.apache.org/jira/browse/ARROW-5349
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Joris Van den Bossche
Fix For: 0.14.0

After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now possible to collect the file metadata while writing different files (how to then write that combined metadata was not yet addressed; see the original issue ARROW-1983).

However, currently the {{file_path}} information in the ColumnChunkMetaData object is not set. This is, I think, expected / correct for the metadata as included within the single file, but for using the metadata in the combined dataset `_metadata`, it needs a file path set. So if you want to use this metadata for a partitioned dataset, there needs to be a way to specify this file path.

Ideas I am currently thinking of: either we could specify a file path to be used when writing, or we could expose the `set_file_path` method on the Python side so you can create an updated version of the metadata after collecting it.

cc [~pearu] [~mdurant]
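A rough Python sketch of the two ideas above (the `file_path` keyword in option 1 and the exposed `set_file_path` method in option 2 are hypothetical here; neither is available in pyarrow at this point):

{code}
import os
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs("dataset", exist_ok=True)
table = pa.table({"x": [0, 1, 2]})

# Option 1 (hypothetical keyword): tell the writer the dataset-relative path
# up front, so the collected metadata already carries the right file_path.
# pq.write_table(table, "dataset/part-0.parquet", file_path="part-0.parquet")

# Option 2 (method not yet exposed in Python): patch the collected metadata
# afterwards with the path of the piece relative to the dataset root.
pq.write_table(table, "dataset/part-0.parquet")
md = pq.read_metadata("dataset/part-0.parquet")
md.set_file_path("part-0.parquet")
{code}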
[jira] [Created] (ARROW-5348) [CI] [Java] Gandiva checkstyle failure
Antoine Pitrou created ARROW-5348:
---
Summary: [CI] [Java] Gandiva checkstyle failure
Key: ARROW-5348
URL: https://issues.apache.org/jira/browse/ARROW-5348
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Gandiva, Continuous Integration, Java
Reporter: Antoine Pitrou

This is failing Travis-CI builds now:

{code}
[WARNING] src/main/java/org/apache/arrow/gandiva/evaluator/Projector.java:[145,3] (javadoc) JavadocMethod: Missing a Javadoc comment.
[WARNING] src/main/java/org/apache/arrow/gandiva/evaluator/DecimalTypeUtil.java:[38,3] (javadoc) JavadocMethod: Missing a Javadoc comment.
[...]
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on project arrow-gandiva: You have 2 Checkstyle violations. -> [Help 1]
{code}
[jira] [Created] (ARROW-5347) [C++] Building fails on Windows with gtest symbol issue
Antoine Pitrou created ARROW-5347:
---
Summary: [C++] Building fails on Windows with gtest symbol issue
Key: ARROW-5347
URL: https://issues.apache.org/jira/browse/ARROW-5347
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou

I get the following on my Windows VM:

{code}
compute-test.cc.obj : error LNK2001: unresolved external symbol "class testing::internal::Mutex testing::internal::g_gmock_mutex" (?g_gmock_mutex@internal@testing@@3VMutex@12@A)
release\arrow-compute-test.exe : fatal error LNK1120: 1 unresolved externals
{code}

It's probably caused by something like https://github.com/google/googletest/issues/292, except that our CMake code already seems to handle this issue, so I'm not sure what is happening. Here is my build script:

{code}
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Debug ^
      -DARROW_USE_CLCACHE=ON ^
      -DARROW_BOOST_USE_SHARED=OFF ^
      -DBOOST_ROOT=C:\boost_1_67_0 ^
      -DARROW_DEPENDENCY_SOURCE=BUNDLED ^
      -DARROW_PYTHON=OFF ^
      -DARROW_PLASMA=OFF ^
      -DARROW_BUILD_TESTS=ON ^
      ..
cmake --build . --config Debug
{code}

I'm using conda for dependencies.