[jira] [Created] (ARROW-9105) [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field
Joris Van den Bossche created ARROW-9105: Summary: [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field Key: ARROW-9105 URL: https://issues.apache.org/jira/browse/ARROW-9105 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0

When splitting a fragment into row group fragments, filtering on the partition field raises an error. Python reproducer:

```
df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols="part", engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_partitioned_filter", format="parquet", partitioning="hive")
fragment = list(dataset.get_fragments())[0]
```

```
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]:
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-32-...> in <module>
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
```

This is probably a "strange" thing to do, since a fragment of a partitioned dataset already comes from a single partition (so it will always satisfy only a single equality expression). But it's still convenient if, as a user, you don't have to take care to pass only part of the filter down to {{split_by_row_group}}.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9103) [Python] Clarify behaviour of CSV reader for non-UTF8 text data
Joris Van den Bossche created ARROW-9103: Summary: [Python] Clarify behaviour of CSV reader for non-UTF8 text data Key: ARROW-9103 URL: https://issues.apache.org/jira/browse/ARROW-9103 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche See https://stackoverflow.com/questions/62153229/how-does-pyarrow-read-csv-handle-different-file-encodings/62321673#62321673 -- This message was sent by Atlassian Jira (v8.3.4#803005)
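For context, the CSV reader decodes the input as UTF-8 unless told otherwise; a non-UTF8 file can be read by passing the encoding through {{ReadOptions}}. A minimal sketch of what the docs could show (the file name is made up):

{code:python}
import pyarrow.csv as csv

# the reader assumes UTF-8 by default; for e.g. a Latin-1 encoded file,
# pass the encoding explicitly so the text columns decode correctly
opts = csv.ReadOptions(encoding="latin-1")
table = csv.read_csv("data_latin1.csv", read_options=opts)
{code}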
[jira] [Created] (ARROW-9089) [Python] A PyFileSystem handler for fsspec-based filesystems
Joris Van den Bossche created ARROW-9089: Summary: [Python] A PyFileSystem handler for fsspec-based filesystems Key: ARROW-9089 URL: https://issues.apache.org/jira/browse/ARROW-9089 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-8766 to use this machinery to add an FSSpecHandler -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails
Joris Van den Bossche created ARROW-9078: Summary: [C++] Parquet writing of extension type with nested storage type fails Key: ARROW-9078 URL: https://issues.apache.org/jira/browse/ARROW-9078 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche

A reproducer in Python:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq


class MyStructType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(
            self, pa.struct([('left', pa.int64()), ('right', pa.int64())]))

    def __reduce__(self):
        return MyStructType, ()


struct_array = pa.StructArray.from_arrays(
    [
        pa.array([0, 1], type="int64", from_pandas=True),
        pa.array([1, 2], type="int64", from_pandas=True),
    ],
    names=["left", "right"],
)

# works
table = pa.table({'a': struct_array})
pq.write_table(table, "test_struct.parquet")

# doesn't work
mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
table = pa.table({'a': mystruct_array})
pq.write_table(table, "test_struct.parquet")
{code}

Writing the simple StructArray nowadays works (and reading it back in as well). But when the struct array is the storage array of an ExtensionType, it fails with the following error:

{code}
ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9027) [Python] Split in multiple files + clean-up pyarrow.parquet tests
Joris Van den Bossche created ARROW-9027: Summary: [Python] Split in multiple files + clean-up pyarrow.parquet tests Key: ARROW-9027 URL: https://issues.apache.org/jira/browse/ARROW-9027 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche

The current {{test_parquet.py}} file is already above 4000 lines of code, and it is becoming a bit unwieldy to work with. Structuring it better, and maybe splitting it into multiple files, would help (separate test files could cover tests for basic reading/writing, tests for the metadata/statistics objects, and tests for multi-file datasets).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9021) [Python] The filesystem keyword in parquet.read_table is not documented
Joris Van den Bossche created ARROW-9021: Summary: [Python] The filesystem keyword in parquet.read_table is not documented Key: ARROW-9021 URL: https://issues.apache.org/jira/browse/ARROW-9021 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9017) [Python] Refactor the Scalar classes
Joris Van den Bossche created ARROW-9017: Summary: [Python] Refactor the Scalar classes Key: ARROW-9017 URL: https://issues.apache.org/jira/browse/ARROW-9017 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche

The situation regarding scalars in Python is currently not optimal. We have two different "types" of scalars:

- {{ArrayValue(Scalar)}} (and subclasses of that for all types): this is used when you access a single element of an array (eg {{arr[0]}})
- {{ScalarValue(Scalar)}} (and subclasses of that for _some_ types): this is used when wrapping a C++ scalar into a Python scalar, eg when you get back a scalar from a reduction like {{arr.sum()}}.

And while we have two versions of scalars, neither of them can easily be used as a scalar, as neither can be constructed from a Python scalar (there is no {{scalar(1)}} function to use when calling a kernel, for example).

I think we should try to unify those scalar classes? (which probably means getting rid of the ArrayValue scalar)

In addition, there is the issue of trying to re-use the Python scalar <-> Arrow conversion code, as there is also logic for this in the {{python_to_arrow.cc}} code. But this is probably a bigger change.

cc [~kszucs]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
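To illustrate the two hierarchies described above, a minimal session sketch (assuming the pyarrow version current at the time of this issue; exact class names/reprs may differ):

{code:python}
import pyarrow as pa

arr = pa.array([1, 2, 3])

# indexing into an array returns an ArrayValue subclass
type(arr[0])     # e.g. pyarrow.lib.Int64Value

# a reduction wraps the resulting C++ scalar in a ScalarValue subclass,
# a different class hierarchy than type(arr[0])
type(arr.sum())

# and there is no public constructor going the other way:
# a pa.scalar(1) factory function does not exist yet
{code}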
[jira] [Created] (ARROW-9009) [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files
Joris Van den Bossche created ARROW-9009: Summary: [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files Key: ARROW-9009 URL: https://issues.apache.org/jira/browse/ARROW-9009 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche

When reading a parquet file (which was written by Arrow) with the datasets API, it preserves the "ARROW:schema" field in the metadata:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")
dataset = ds.dataset("test.parquet", format="parquet")
{code}

{code}
In [7]: dataset.schema
Out[7]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114

In [8]: dataset.to_table().schema
Out[8]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114
{code}

while when reading with the {{parquet}} module reader, we do not preserve this metadata:

{code}
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

Since the "ARROW:schema" information is used to properly reconstruct the Arrow schema from the ParquetSchema, it is no longer needed once you already have the arrow schema, so it's probably not needed to keep it as metadata in the arrow schema.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8946) [Python] Add tests for parquet.write_metadata metadata_collector
Joris Van den Bossche created ARROW-8946: Summary: [Python] Add tests for parquet.write_metadata metadata_collector Key: ARROW-8946 URL: https://issues.apache.org/jira/browse/ARROW-8946 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche

Follow-up on ARROW-8062: the PR added functionality to {{parquet.write_metadata}} to pass a collection of metadata objects to be concatenated. We should add some specific tests for this.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
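For reference, a minimal sketch of the functionality under test (the file names here are made up):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})

# collect the FileMetaData of each written file
collector = []
pq.write_table(table, "part-0.parquet", metadata_collector=collector)
pq.write_table(table, "part-1.parquet", metadata_collector=collector)

# a real dataset writer would also record the relative paths, eg
# collector[0].set_file_path("part-0.parquet")

# concatenate the collected metadata into a single _metadata file
pq.write_metadata(table.schema, "_metadata", metadata_collector=collector)
{code}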
[jira] [Created] (ARROW-8943) [C++] Add support for Partitioning to ParquetDatasetFactory
Joris Van den Bossche created ARROW-8943: Summary: [C++] Add support for Partitioning to ParquetDatasetFactory Key: ARROW-8943 URL: https://issues.apache.org/jira/browse/ARROW-8943 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0 Follow-up on ARROW-8062: the ParquetDatasetFactory currently does not yet support specifying a Partitioning / inferring with a PartitioningFactory. -- This message was sent by Atlassian Jira (v8.3.4#803005)
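What this could look like from Python; the {{ds.parquet_dataset}} spelling and its {{partitioning}} keyword are assumptions for illustration, not existing API:

{code:python}
import pyarrow.dataset as ds

# hypothetical: a dataset factory driven by the _metadata file that also
# applies hive partitioning to the recorded file paths
dataset = ds.parquet_dataset("dataset_root/_metadata", partitioning="hive")
{code}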
[jira] [Created] (ARROW-8860) [C++] Compressed Feather file with struct array roundtrips incorrectly
Joris Van den Bossche created ARROW-8860: Summary: [C++] Compressed Feather file with struct array roundtrips incorrectly Key: ARROW-8860 URL: https://issues.apache.org/jira/browse/ARROW-8860 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche

When writing a table with a Struct typed column, this is read back with garbage values when using compression (which is the default):

{code:python}
>>> table = pa.table({'col': pa.StructArray.from_arrays(
...     [[0, 1, 2], [1, 2, 3]], names=["f1", "f2"])})
>>> table.column("col")
[
  -- is_valid: all not null
  -- child 0 type: int64
    [
      0,
      1,
      2
    ]
  -- child 1 type: int64
    [
      1,
      2,
      3
    ]
]

# roundtrip through feather
>>> feather.write_feather(table, "test_struct.feather")
>>> table2 = feather.read_table("test_struct.feather")
>>> table2.column("col")
[
  -- is_valid: all not null
  -- child 0 type: int64
    [
      24,
      1261641627085906436,
      1369095386551025664
    ]
  -- child 1 type: int64
    [
      24,
      1405756815161762308,
      281479842103296
    ]
]
{code}

When not using compression, it is read back correctly:

{code:python}
>>> feather.write_feather(table, "test_struct.feather", compression="uncompressed")
>>> table2 = feather.read_table("test_struct.feather")
>>> table2.column("col")
[
  -- is_valid: all not null
  -- child 0 type: int64
    [
      0,
      1,
      2
    ]
  -- child 1 type: int64
    [
      1,
      2,
      3
    ]
]
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8802) [C++][Dataset] Schema metadata are lost when reading a subset of columns
Joris Van den Bossche created ARROW-8802: Summary: [C++][Dataset] Schema metadata are lost when reading a subset of columns Key: ARROW-8802 URL: https://issues.apache.org/jira/browse/ARROW-8802 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche

Python example:

{code}
import pandas as pd
import pyarrow.dataset as ds

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_parquet("test_metadata.parquet")
dataset = ds.dataset("test_metadata.parquet")
{code}

gives:

{code}
>>> dataset.to_table().schema
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
ARROW:schema: '/4ACAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwAAA' + 806

>>> dataset.to_table(columns=['a']).schema
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

So when specifying a subset of the columns, the additional metadata entries are lost (while those can still be informative, eg for the conversion to pandas)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8799) [C++][Dataset] Reading list column as nested dictionary segfaults
Joris Van den Bossche created ARROW-8799: Summary: [C++][Dataset] Reading list column as nested dictionary segfaults Key: ARROW-8799 URL: https://issues.apache.org/jira/browse/ARROW-8799 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche

Python example:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.tests import util

repeats = 10
nunique = 5

data = [
    [[util.rands(10)] for i in range(nunique)] * repeats,
]
table = pa.table(data, names=['f0'])
pq.write_table(table, "test_dictionary.parquet")
{code}

Reading with the parquet code works:

{code}
>>> pq.read_table("test_dictionary.parquet", read_dictionary=['f0.list.item'])
pyarrow.Table
f0: list<item: dictionary<values: string, indices: int32, ordered: 0>>
  child 0, item: dictionary<values: string, indices: int32, ordered: 0>
{code}

but doing the same with the datasets API segfaults:

{code}
>>> fmt = ds.ParquetFileFormat(read_options=dict(dictionary_columns=["f0.list.item"]))
>>> dataset = ds.dataset("test_dictionary.parquet", format=fmt)
>>> dataset.to_table()
Segmentation fault (core dumped)
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8780) [Python] A fsspec-compatible wrapper for pyarrow.fs filesystems
Joris Van den Bossche created ARROW-8780: Summary: [Python] A fsspec-compatible wrapper for pyarrow.fs filesystems Key: ARROW-8780 URL: https://issues.apache.org/jira/browse/ARROW-8780 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche

The new {{pyarrow.fs}} FileSystem objects have a limited Python API (currently mimicking the C++ API). In Python, [fsspec|https://filesystem-spec.readthedocs.io/en/latest] defines a common API for a variety of filesystem implementations. We could try to implement an fsspec-compatible class wrapping the {{pyarrow.fs}} native filesystems. Such a class would provide the methods expected according to fsspec, and implement those using the actual {{pyarrow.fs.FileSystem}} under the hood.

This might be mainly useful for two use cases:

- {{pyarrow.fs}} filesystems can be used in settings that expect an fsspec-compatible filesystem object
- it provides a way to have a "richer" API around our {{pyarrow.fs}} filesystems (which has been requested before, cf. ARROW-7584), without expanding the core filesystem objects

-- This message was sent by Atlassian Jira (v8.3.4#803005)
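A minimal sketch of what such a wrapper could look like, assuming fsspec is installed; the class name and choice of methods are illustrative, only the {{pyarrow.fs}} calls are existing API (using the current {{get_file_info}} spelling):

{code:python}
from fsspec import AbstractFileSystem

from pyarrow import fs


class ArrowFSWrapper(AbstractFileSystem):
    """Expose a pyarrow.fs.FileSystem through the fsspec interface."""

    def __init__(self, arrow_fs, **kwargs):
        super().__init__(**kwargs)
        self._fs = arrow_fs

    def ls(self, path, detail=False, **kwargs):
        # list a directory via a pyarrow FileSelector
        infos = self._fs.get_file_info(fs.FileSelector(path))
        if not detail:
            return [info.path for info in infos]
        return [{"name": info.path,
                 "size": info.size,
                 "type": ("directory" if info.type == fs.FileType.Directory
                          else "file")}
                for info in infos]

    def _open(self, path, mode="rb", **kwargs):
        if mode == "rb":
            return self._fs.open_input_file(path)
        if mode == "wb":
            return self._fs.open_output_stream(path)
        raise NotImplementedError(mode)
{code}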
[jira] [Created] (ARROW-8766) [Python] A FileSystem implementation based on Python callbacks
Joris Van den Bossche created ARROW-8766: Summary: [Python] A FileSystem implementation based on Python callbacks Key: ARROW-8766 URL: https://issues.apache.org/jira/browse/ARROW-8766 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche

The new {{pyarrow.fs}} filesystems are now actual C++ objects, and no longer "just" a Python interface. So they can't easily be extended from the Python side, and the existing integration with {{fsspec}} filesystems therefore no longer works.

One possible solution is to have a C++ filesystem that calls back into a Python object for each of its methods (possibly similar to how you can implement a Flight server in Python, I suppose). Such a FileSystem implementation would make it possible to write a {{pyarrow.fs}} wrapper for {{fsspec}} filesystems, and thus allow such filesystems to be used in pyarrow wherever the new filesystems are expected.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
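A sketch of the idea only; the handler protocol and class names below are hypothetical, not an existing pyarrow API:

{code:python}
class FsspecHandler:
    """Implement each filesystem operation by delegating to an fsspec filesystem."""

    def __init__(self, fsspec_fs):
        self.fs = fsspec_fs

    def open_input_stream(self, path):
        return self.fs.open(path, "rb")

    def open_output_stream(self, path):
        return self.fs.open(path, "wb")

    def delete_file(self, path):
        self.fs.rm(path)


# hypothetical usage: the C++ side would wrap such a handler so it can be
# passed wherever a native filesystem is expected, e.g.
# py_fs = PyFileSystem(FsspecHandler(fsspec_fs))
# ds.dataset("bucket/path", filesystem=py_fs)
{code}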
[jira] [Created] (ARROW-8733) [C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata
Joris Van den Bossche created ARROW-8733: Summary: [C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata Key: ARROW-8733 URL: https://issues.apache.org/jira/browse/ARROW-8733 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Joris Van den Bossche Fix For: 1.0.0

Related to ARROW-8062 (as there we will also need a way to expose the global FileMetadata). But independently, it would be useful to get access to the FileMetadata on each {{ParquetFileFragment}} (eg to get access to the statistics).

This would be relatively simple to code on the Python/R side, since we have access to the file path: we could read the metadata from the file backing the fragment, and return this as a FileMetadata object.

I am wondering if we want to integrate this with ARROW-8062, since when the fragments were created from a {{_metadata}} file, a {{ParquetFileFragment.metadata}} attribute would not need to read from the parquet file in that case, but could take the information from the global metadata (at least for eg the row group data).

Another question: what should this return for a ParquetFileFragment that maps to a single row group?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
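A minimal Python-side sketch under the assumptions above; the free function stands in for the eventual {{ParquetFileFragment.metadata}} attribute:

{code:python}
import pyarrow.parquet as pq


def fragment_metadata(fragment, filesystem):
    # read the FileMetaData from the file backing the fragment
    with filesystem.open_input_file(fragment.path) as f:
        return pq.ParquetFile(f).metadata
{code}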
[jira] [Created] (ARROW-8729) [C++][Dataset] Only selecting a partition column results in empty table
Joris Van den Bossche created ARROW-8729: Summary: [C++][Dataset] Only selecting a partition column results in empty table Key: ARROW-8729 URL: https://issues.apache.org/jira/browse/ARROW-8729 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0

Python reproducer:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

path = "test_dataset"
table = pa.table({'part': ['a', 'a', 'b', 'b'], 'col': [1, 2, 3, 4]})
pq.write_to_dataset(table, str(path), partition_cols=["part"])
{code}

gives

{code}
In [38]: ds.dataset(str(path), partitioning="hive").to_table().num_rows
Out[38]: 4

In [39]: ds.dataset(str(path), partitioning="hive").to_table(columns=["part"]).num_rows
Out[39]: 0
{code}

The schema correctly only includes the "part" column, but there are no rows.

cc [~bkietz]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8693) [Python] Dataset.get_fragments is missing an implicit cast when filtering
Joris Van den Bossche created ARROW-8693: Summary: [Python] Dataset.get_fragments is missing an implicit cast when filtering Key: ARROW-8693 URL: https://issues.apache.org/jira/browse/ARROW-8693 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

This currently segfaults:

{code}
dataset.get_fragments(filter=ds.field("col") > 1)
{code}

in case "col" is not int64 (for example, default inferred partition columns are int32)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8690) [Python] Clean-up dataset+parquet tests now that order is deterministic
Joris Van den Bossche created ARROW-8690: Summary: [Python] Clean-up dataset+parquet tests now that order is deterministic Key: ARROW-8690 URL: https://issues.apache.org/jira/browse/ARROW-8690 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0

Follow-up on ARROW-8447, we should now be able to clean up some tests.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset
Joris Van den Bossche created ARROW-8655: Summary: [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset Key: ARROW-8655 URL: https://issues.apache.org/jira/browse/ARROW-8655 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0

Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} classes that describe a partitioning used in the discovery phase. But once a dataset object is created, it no longer knows about this; it just has partition expressions for the fragments. And while the partition keys are added to the schema, you can't directly tell which columns of the schema originated from the partitions.

However, there are use cases where it would be useful if a dataset still "knew" from what kind of partitioning it was created:

- The "read CSV, write back Parquet" use case, where the CSV was already partitioned and you want to automatically preserve that partitioning for parquet (kind of roundtripping the partitioning on read/write)
- When converting the dataset to another representation, eg to pandas, it can be useful to know which columns were partition columns (for pandas, those columns might be good candidates to be set as the index of the pandas/dask DataFrame). I can imagine conversions to other systems could use similar information.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files
Joris Van den Bossche created ARROW-8652: Summary: [Python] Test error message when discovering dataset with invalid files Key: ARROW-8652 URL: https://issues.apache.org/jira/browse/ARROW-8652 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche

There is a comment in test_parquet.py about the Dataset API needing a better error message for invalid files: https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648

Although, this seems to work now:

{code}
import tempfile
import pathlib
import pyarrow.dataset as ds

tempdir = pathlib.Path(tempfile.mkdtemp())
with open(str(tempdir / "data.parquet"), 'wb') as f:
    pass

In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet")
...
OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': Invalid: Parquet file size is 0 bytes
{code}

So we need to update the test to actually test it instead of skipping it. The only difference with the Python ParquetDataset implementation is that the datasets API raises an OSError and not an ArrowInvalid error.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects
Joris Van den Bossche created ARROW-8651: Summary: [Python][Dataset] Support pickling of Dataset objects Key: ARROW-8651 URL: https://issues.apache.org/jira/browse/ARROW-8651 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0

We already made several parts of a Dataset serializable (the formats, the expressions, the filesystem). With those, it should also be possible to pickle FileFragments, and with that also Dataset.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
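A sketch of the desired behaviour (the file path is made up):

{code:python}
import pickle

import pyarrow.dataset as ds

dataset = ds.dataset("test.parquet", format="parquet")

# the goal: a Dataset roundtrips through pickle unchanged
restored = pickle.loads(pickle.dumps(dataset))
assert restored.to_table().equals(dataset.to_table())
{code}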
[jira] [Created] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
Joris Van den Bossche created ARROW-8647: Summary: [C++][Dataset] Optionally encode partition field values as dictionary type Key: ARROW-8647 URL: https://issues.apache.org/jira/browse/ARROW-8647 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0

In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. In the new Dataset API, we now use a plain type (integer or string when inferred). But you can already manually specify that the partition keys should be dictionary type, by specifying the partitioning schema (in the {{Partitioning}} passed to the dataset factory).

Because using a dictionary type can be more efficient (partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
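For illustration, the manual spelling that already works today; the ask is a convenience option on the factory so this schema doesn't have to be written out by hand (the dataset path is made up):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# explicitly declare the partition field as dictionary-encoded
partitioning = ds.partitioning(
    pa.schema([("part", pa.dictionary(pa.int32(), pa.string()))]),
    flavor="hive")
dataset = ds.dataset("test_dataset", partitioning=partitioning)
{code}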
[jira] [Created] (ARROW-8644) [Python] Dask integration tests failing due to change in not including partition columns
Joris Van den Bossche created ARROW-8644: Summary: [Python] Dask integration tests failing due to change in not including partition columns Key: ARROW-8644 URL: https://issues.apache.org/jira/browse/ARROW-8644 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

In ARROW-3861 (https://github.com/apache/arrow/pull/7050), I "fixed" a bug where the partition columns were always included, even when the user did a manual column selection. But apparently, this behaviour was being relied upon by dask. See the failing nightly integration tests: https://circleci.com/gh/ursa-labs/crossbow/11854?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

So the best option might be to just keep the "old" behaviour for the legacy ParquetDataset; when using the new datasets API ({{use_legacy_dataset=False}}), you get the new / correct behaviour.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8643) [Python] Tests with pandas master failing due to freq assertion
Joris Van den Bossche created ARROW-8643: Summary: [Python] Tests with pandas master failing due to freq assertion Key: ARROW-8643 URL: https://issues.apache.org/jira/browse/ARROW-8643 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Nightly pandas master tests are failing, eg https://circleci.com/gh/ursa-labs/crossbow/11858?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link This is caused by a change in pandas, see https://github.com/pandas-dev/pandas/pull/33815#issuecomment-620820134 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection
Joris Van den Bossche created ARROW-8641: Summary: [Python] Regression in feather: no longer supports permutation in column selection Key: ARROW-8641 URL: https://issues.apache.org/jira/browse/ARROW-8641 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche

A quite annoying regression (original report: https://github.com/pandas-dev/pandas/issues/33878): when specifying {{columns}} to read, this now fails if the order of the columns is not exactly the same as in the file:

{code:python}
In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', 'c'])

In [29]: from pyarrow import feather

In [30]: feather.write_feather(table, "test.feather")

# this works fine
In [32]: feather.read_table("test.feather", columns=['a', 'b'])
Out[32]:
pyarrow.Table
a: int64
b: int64

In [33]: feather.read_table("test.feather", columns=['b', 'a'])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-33-...> in <module>
----> 1 feather.read_table("test.feather", columns=['b', 'a'])

~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, memory_map)
    237             return reader.read_indices(columns)
    238         elif all(map(lambda t: t == str, column_types)):
--> 239             return reader.read_names(columns)
    240
    241     column_type_names = [t.__name__ for t in column_types]

~/scipy/repos/arrow/python/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different:
b: int64
a: int64
vs
a: int64
b: int64
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8613) [C++][Dataset] Raise error for unparsable partition value
Joris Van den Bossche created ARROW-8613: Summary: [C++][Dataset] Raise error for unparsable partition value Key: ARROW-8613 URL: https://issues.apache.org/jira/browse/ARROW-8613 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 1.0.0

Currently, when specifying a partitioning schema, but one of the partition field values cannot be parsed according to the specified type, you silently get null values for that partition field.

Python example:

{code:python}
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

path = pathlib.Path(".") / "dataset_partition_schema_errors"
path.mkdir(exist_ok=True)

table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
pq.write_to_dataset(table, str(path), partition_cols=["part"])
{code}

{code}
In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas()
Out[17]:
   values part
0       0  1_2
1       1  1_2
2       2  3_4
3       3  3_4

In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")

In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()
Out[19]:
   values  part
0       0   NaN
1       1   NaN
2       2   NaN
3       3   NaN
{code}

Silently ignoring such a parse error doesn't seem the best default to me (since partition keys are quite essential). I think raising an error might be better?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8446) [Python][Dataset] Detect and use _metadata file in a list of file paths
Joris Van den Bossche created ARROW-8446: Summary: [Python][Dataset] Detect and use _metadata file in a list of file paths Key: ARROW-8446 URL: https://issues.apache.org/jira/browse/ARROW-8446 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

From https://github.com/dask/dask/pull/6047#discussion_r402391318

When specifying a directory to {{ParquetDataset}}, we will detect if a {{_metadata}} file is present in the directory and use that to populate the {{metadata}} attribute (and not include this file in the list of "pieces", since it does not include any data). However, when passing a list of files to {{ParquetDataset}}, with one being "_metadata", the metadata attribute is not populated, and the "_metadata" path is included as one of the ParquetDatasetPiece objects instead (which leads to an ArrowIOError during the read of that piece).

We _could_ detect it in a list of paths as well.

Note, I mentioned {{ParquetDataset}}, but if working on this, we should probably directly do it in the datasets-API-based version. Also, I labeled this as Python and not C++ for now, as this might be something that can be handled on the Python side (once the C++ side knows how to process this kind of metadata -> ARROW-8062)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
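A minimal sketch of the Python-side detection, assuming {{paths}} is the user-provided list (the helper name is made up):

{code:python}
import posixpath


def split_metadata_path(paths):
    """Separate a "_metadata" entry from the data file paths, if present."""
    data_paths = [p for p in paths if posixpath.basename(p) != "_metadata"]
    metadata = [p for p in paths if posixpath.basename(p) == "_metadata"]
    return data_paths, (metadata[0] if metadata else None)
{code}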
[jira] [Created] (ARROW-8442) [Python] NullType.to_pandas_dtype inconsistent with dtype returned in to_pandas/to_numpy
Joris Van den Bossche created ARROW-8442: Summary: [Python] NullType.to_pandas_dtype inconsistent with dtype returned in to_pandas/to_numpy Key: ARROW-8442 URL: https://issues.apache.org/jira/browse/ARROW-8442 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

There is this behaviour of {{to_pandas_dtype}} returning float, while all actual conversions to numpy or pandas use object dtype:

{code}
In [23]: pa.null().to_pandas_dtype()
Out[23]: numpy.float64

In [24]: pa.array([], pa.null()).to_pandas()
Out[24]: Series([], dtype: object)

In [25]: pa.array([], pa.null()).to_numpy(zero_copy_only=False)
Out[25]: array([], dtype=object)
{code}

So we should probably fix {{NullType.to_pandas_dtype}} to return object, which is what is used in practice.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8439) [Python] Filesystem docs are outdated
Joris Van den Bossche created ARROW-8439: Summary: [Python] Filesystem docs are outdated Key: ARROW-8439 URL: https://issues.apache.org/jira/browse/ARROW-8439 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8427) [C++][Dataset] Do not ignore file paths with underscore/dot when full path was specified
Joris Van den Bossche created ARROW-8427: Summary: [C++][Dataset] Do not ignore file paths with underscore/dot when full path was specified Key: ARROW-8427 URL: https://issues.apache.org/jira/browse/ARROW-8427 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 0.17.0

Currently, when passing a list of file paths to FileSystemDatasetFactory, files where one of the parent directories starts with an underscore or dot are skipped. Since the file paths were passed as an explicit list, we should maybe not skip them. For comparison, when specifying a directory (Selector), it will only check child directories for skipping, not parent directories.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8416) [Python] Provide a "feather" alias in the dataset API
Joris Van den Bossche created ARROW-8416: Summary: [Python] Provide a "feather" alias in the dataset API Key: ARROW-8416 URL: https://issues.apache.org/jira/browse/ARROW-8416 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0 I don't know what the plans are on the C++ side (ARROW-7586), but for 0.17, I think it would be nice if we can at least support {{ds.dataset(..., format="feather")}} (instead of needing to tell people to do {{ds.dataset(..., format="ipc")}} to read feather files). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8414) [Python] Non-deterministic row order failure in test_parquet.py
Joris Van den Bossche created ARROW-8414: Summary: [Python] Non-deterministic row order failure in test_parquet.py Key: ARROW-8414 URL: https://issues.apache.org/jira/browse/ARROW-8414 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8345) [Python] feather.read_table should not require pandas
Joris Van den Bossche created ARROW-8345: Summary: [Python] feather.read_table should not require pandas Key: ARROW-8345 URL: https://issues.apache.org/jira/browse/ARROW-8345 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.17.0 We still check the pandas version, while pandas is not actually needed. Will do a quick fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8342) [Python] dask and kartothek integration tests are failing
Joris Van den Bossche created ARROW-8342: Summary: [Python] dask and kartothek integration tests are failing Key: ARROW-8342 URL: https://issues.apache.org/jira/browse/ARROW-8342 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0

The integration tests for both dask and kartothek, for both master and the latest released version of each, started failing in the last few days.

Dask latest: https://circleci.com/gh/ursa-labs/crossbow/10629?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
Kartothek latest: https://circleci.com/gh/ursa-labs/crossbow/10604?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

I think both are related to the KeyValueMetadata changes (ARROW-8079). The kartothek one is clearly related, as it gives: TypeError: 'pyarrow.lib.KeyValueMetadata' object does not support item assignment

And I think the dask one is related to the "pandas" key now being present twice, and therefore it is using the "wrong" one.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8314) [Python] Provide a method to select a subset of columns of a Table
Joris Van den Bossche created ARROW-8314: Summary: [Python] Provide a method to select a subset of columns of a Table Key: ARROW-8314 URL: https://issues.apache.org/jira/browse/ARROW-8314 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Joris Van den Bossche

I looked through the open issues and in our API, but didn't directly find something about selecting a subset of columns of a table. Assume you have a table like:

{code}
table = pa.table({'a': [1, 2], 'b': [.1, .2], 'c': ['a', 'b']})
{code}

You can select a single column with {{table.column('a')}} or {{table['a']}} to get a chunked array. You can add, append, remove and replace columns (with {{add_column}}, {{append_column}}, {{remove_column}}, {{set_column}}). But an easy way to get a subset of the columns (without manually removing the ones you don't want one by one) doesn't seem possible.

I would propose something like:

{code}
table.select(['a', 'c'])
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
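For illustration, what the proposed method would do, sketched with the existing API (handling of table-level metadata is left out of this sketch):

{code:python}
import pyarrow as pa

table = pa.table({'a': [1, 2], 'b': [.1, .2], 'c': ['a', 'b']})


def select(table, names):
    # build a new table from the requested columns, in the given order
    return pa.table({name: table.column(name) for name in names})


subset = select(table, ['a', 'c'])
{code}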
[jira] [Created] (ARROW-8292) [Python][Dataset] Passthrough schema to Factory.finish() in dataset() function
Joris Van den Bossche created ARROW-8292: Summary: [Python][Dataset] Passthrough schema to Factory.finish() in dataset() function Key: ARROW-8292 URL: https://issues.apache.org/jira/browse/ARROW-8292 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Joris Van den Bossche

Passing the schema through to {{Factory.finish()}} is a very simple fix that allows manually specifying the schema, without exposing any other options.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8290) [Python][Dataset] Improve ergonomics of the FileSystemDataset constructor
Joris Van den Bossche created ARROW-8290: Summary: [Python][Dataset] Improve ergonomics of the FileSystemDataset constructor Key: ARROW-8290 URL: https://issues.apache.org/jira/browse/ARROW-8290 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche

Currently, to manually create a FileSystemDataset, you can do something like:

{code}
dataset = ds.FileSystemDataset(
    schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    ["data_file1.parquet", "data_file2.parquet"],
    [ds.field('file') == 1, ds.field('file') == 2])
{code}

There are some usability improvements we can do though:

- Allow passing the arguments by name to improve readability of the calling code (now they all need to be passed positionally, due to the way they are implemented in Cython as {{not None}})
- I would maybe change the order of the arguments (eg start with the paths; we don't need to match the order of the C++ constructor)
- Potentially allow {{partitions}} to be optional, in which case it needs to be set to a list of {{ScalarExpression(True)}} values.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8286) [Python] Creating dataset from pathlib results in UnionDataset instead of FileSystemDataset
Joris Van den Bossche created ARROW-8286: Summary: [Python] Creating dataset from pathlib results in UnionDataset instead of FileSystemDataset Key: ARROW-8286 URL: https://issues.apache.org/jira/browse/ARROW-8286 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.17.0

{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': np.random.randn(10), 'b': range(10), 'c': ['a', 'b'] * 5})
pq.write_table(table, "test.parquet")

import pathlib
ds.dataset(pathlib.Path("./test.parquet"))       # gives UnionDataset
ds.dataset(str(pathlib.Path("./test.parquet")))  # correctly gives FileSystemDataset
{code}

and since those two dataset classes have a different API, it is important that this gives a FileSystemDataset

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8276) [C++][Dataset] Scanning a Fragment does not take into account the partition columns
Joris Van den Bossche created ARROW-8276: Summary: [C++][Dataset] Scanning a Fragment does not take into account the partition columns Key: ARROW-8276 URL: https://issues.apache.org/jira/browse/ARROW-8276 Project: Apache Arrow Issue Type: Bug Components: C++, C++ - Dataset Reporter: Joris Van den Bossche Fix For: 0.17.0

Follow-up on ARROW-8061: the {{to_table}} method doesn't work for fragments created from a partitioned dataset. (will add a reproducer later)

cc [~bkietz]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8221) [Python][Dataset] Expose schema inference / validation options in the factory
Joris Van den Bossche created ARROW-8221: Summary: [Python][Dataset] Expose schema inference / validation options in the factory Key: ARROW-8221 URL: https://issues.apache.org/jira/browse/ARROW-8221 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0

ARROW-8058 added options related to schema inference / validation for the Dataset factory. We should expose these in Python in the {{dataset(..)}} factory function:

- Add the ability to pass a user-specified schema with a {{schema}} keyword, instead of inferring the schema from (one of) the files (to be passed to the factory finish method)
- Add a {{validate_schema}} option to toggle whether the schema is validated against the actual files or not.
- Expose in some way the number of fragments to be inspected when inferring the schema. Not sure yet what the best API for this would be.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8220) [Python] Make dataset FileFormat objects serializable
Joris Van den Bossche created ARROW-8220: Summary: [Python] Make dataset FileFormat objects serializable Key: ARROW-8220 URL: https://issues.apache.org/jira/browse/ARROW-8220 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0 Similar to ARROW-8060, ARROW-8059, also the FileFormats need to be pickleable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8213) [Python][Dataset] Opening a dataset with an incorrect local path gives a confusing error message
Joris Van den Bossche created ARROW-8213: Summary: [Python][Dataset] Opening a dataset with an incorrect local path gives a confusing error message Key: ARROW-8213 URL: https://issues.apache.org/jira/browse/ARROW-8213 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0

Even after the previous PRs related to local paths (https://github.com/apache/arrow/pull/6643, https://github.com/apache/arrow/pull/6655), I don't find the user experience optimal in case you are working with local files and pass a wrong, non-existent path (eg due to a typo).

Currently, you get this error:

{code}
>>> dataset = ds.dataset("data_with_typo.parquet", format="parquet")
...
ArrowInvalid: URI has empty scheme: 'data_with_typo.parquet'
{code}

where "URI has empty scheme" is rather confusing for the user in case of a non-existent path. I think ideally we should raise a "No such file or directory" error.

I am not fully sure what the best solution is, as {{FileSystem.from_uri}} can also give other errors that we do want to propagate to the user. The most straightforward approach I can think of is checking whether "URI has empty scheme" is in the error message and then rewording it, but that's not very clean...

-- This message was sent by Atlassian Jira (v8.3.4#803005)
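One possible approach, sketched; the helper and its exact error handling are illustrative only:

{code:python}
import os

from pyarrow.fs import FileSystem, LocalFileSystem


def ensure_filesystem(path):
    try:
        # returns a (filesystem, path) tuple for real URIs
        return FileSystem.from_uri(path)
    except Exception as exc:
        if "empty scheme" in str(exc):
            # not a URI -> treat it as a local path and give a clearer error
            if not os.path.exists(path):
                raise FileNotFoundError(path)
            return LocalFileSystem(), path
        raise
{code}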
[jira] [Created] (ARROW-8210) [C++]
Joris Van den Bossche created ARROW-8210: Summary: [C++] Key: ARROW-8210 URL: https://issues.apache.org/jira/browse/ARROW-8210 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8209) [Python] Accessing duplicate column of Table by name gives wrong error
Joris Van den Bossche created ARROW-8209: Summary: [Python] Accessing duplicate column of Table by name gives wrong error Key: ARROW-8209 URL: https://issues.apache.org/jira/browse/ARROW-8209 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

When you have a table with duplicate column names and you try to access this column, you get an error about the column not existing:

{code}
>>> table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
...                  names=['a', 'b', 'a'])
>>> table.column('a')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 table.column('a')

~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.column()

KeyError: 'Column a does not exist in table'
{code}

It should rather give an error message about the column name being duplicated.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8196) [Python] Empty table creation from schema with nested dictionary type
Joris Van den Bossche created ARROW-8196: Summary: [Python] Empty table creation from schema with nested dictionary type Key: ARROW-8196 URL: https://issues.apache.org/jira/browse/ARROW-8196 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

Follow-up on ARROW-6872 / https://github.com/apache/arrow/pull/6698: creating an empty table from a schema in Python ({{Schema.empty_table()}}) still fails with a nested dictionary type (eg a list of dictionary type).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8186) [Python] Dataset expression != returns bool instead of expression for invalid value
Joris Van den Bossche created ARROW-8186: Summary: [Python] Dataset expression != returns bool instead of expression for invalid value Key: ARROW-8186 URL: https://issues.apache.org/jira/browse/ARROW-8186 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche

It's a bit of a strange case, but eg when comparing with a set like {{!= {3}}} you get a boolean result instead of an expression:

{code}
In [8]: ds.field('col') != 3
Out[8]: <pyarrow.dataset.Expression ...>

In [9]: ds.field('col') != {3}
Out[9]: True
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8136) [C++][Python] Creating dataset from relative path no longer working
Joris Van den Bossche created ARROW-8136: Summary: [C++][Python] Creating dataset from relative path no longer working Key: ARROW-8136 URL: https://issues.apache.org/jira/browse/ARROW-8136 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche Fix For: 0.17.0

Since https://github.com/apache/arrow/pull/6597, local relative paths don't work anymore:

{code}
In [1]: import pyarrow.dataset as ds

In [2]: ds.dataset("test.parquet")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-2-...> in <module>
----> 1 ds.dataset("test.parquet")

~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
    327
    328     if isinstance(paths_or_factories, str):
--> 329         return factory(paths_or_factories, **kwargs).finish()
    330
    331     if not isinstance(paths_or_factories, list):

~/scipy/repos/arrow/python/pyarrow/dataset.py in factory(path_or_paths, filesystem, partitioning, format)
    246     factories = []
    247     for path in path_or_paths:
--> 248         fs, paths_or_selector = _ensure_fs_and_paths(path, filesystem)
    249         factories.append(FileSystemDatasetFactory(fs, paths_or_selector,
    250                                                   format, options))

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs_and_paths(path, filesystem)
    165     from pyarrow.fs import FileType, FileSelector
    166
--> 167     filesystem, path = _ensure_fs(filesystem, _stringify_path(path))
    168     infos = filesystem.get_target_infos([path])[0]
    169     if infos.type == FileType.Directory:

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs(filesystem, path)
    158     if filesystem is not None:
    159         return filesystem, path
--> 160     return FileSystem.from_uri(path)
    161
    162

~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: URI has empty scheme: 'test.parquet'
{code}

[~apitrou] Is this something that should be fixed in {{FileSystemFromUriOrPath}}, or rather on the Python side? ({{FileSystem.from_uri}} ensures to get the absolute path for pathlib objects, but not for strings)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls
Joris Van den Bossche created ARROW-8088: Summary: [C++][Dataset] Partition columns with specified dictionary type result in all nulls Key: ARROW-8088 URL: https://issues.apache.org/jira/browse/ARROW-8088 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Joris Van den Bossche

When specifying an explicit schema for the Partitioning, and when using a dictionary type, the materialization of the partition keys goes wrong: you don't get an error, but you get columns with all nulls.

Python example:

{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30

df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})

pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
{code}

When reading with discovery, all is fine:

{code}
>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
values: double
bar: string
foo: int32

>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
     values bar  foo
0  2.505903   a    0
1 -1.760135   a    0
{code}

But when specifying the partition columns to be dictionary type with an explicit {{HivePartitioning}}, you get no error but all null values:

{code}
>>> partitioning = ds.HivePartitioning(pa.schema([
...     ("foo", pa.dictionary(pa.int32(), pa.int64())),
...     ("bar", pa.dictionary(pa.int32(), pa.string()))
... ]))

>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
values: double
foo: dictionary<...>
bar: dictionary<...>

>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
     values  foo  bar
0  2.505903  NaN  NaN
1 -1.760135  NaN  NaN
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8087) [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema
Joris Van den Bossche created ARROW-8087: Summary: [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema Key: ARROW-8087 URL: https://issues.apache.org/jira/browse/ARROW-8087 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Joris Van den Bossche

Currently, when reading a partitioned dataset with hive partitioning, the partition columns seem to get sorted alphabetically when appended to the schema (while the old ParquetDataset implementation keeps the order as it is present in the paths). For a regular partitioning this order is consistent for all fragments. So for example for the typical NYC Taxi data example, with datasets the schema ends with columns "month, year", while ParquetDataset appends them as "year, month".

Python example:

{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30

df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})

pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
{code}

{code}
>>> pq.read_table("test_order").schema
values: double
foo: dictionary<...>
bar: dictionary<...>

>>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
values: double
bar: string
foo: int32
{code}

so "foo, bar" vs "bar, foo" (the fact that these are dictionaries is a separate issue)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8074) [C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset?
Joris Van den Bossche created ARROW-8074: Summary: [C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset? Key: ARROW-8074 URL: https://issues.apache.org/jira/browse/ARROW-8074 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Joris Van den Bossche

The current {{pyarrow.parquet.read_table}}/{{ParquetFile}} can work with buffer (reader) objects (file-like objects, pyarrow.Buffer, pyarrow.BufferReader) as input when dealing with single files. This functionality is for example being used by pandas and kartothek (in addition to being used extensively in our own tests).

While we could keep the old implementation to handle single files (which is different from the ParquetDataset logic), there are also advantages to being able to handle this in the Datasets API. For example, this would enable the filtering functionality of the datasets API for this single-file buffer use case, which would be a nice enhancement (currently, {{read_table}} does not support {{filters}} in case of single files, which is eg why kartothek implements this itself).

Would this be possible to support? The {{arrow::dataset::FileSource}} already has PATH and BUFFER enum types (https://github.com/apache/arrow/blob/08f8bff05af37921ff1e5a2b630ce1e7ec1c0ede/cpp/src/arrow/dataset/file_base.h#L46-L49), so it seems in principle possible to create a FileSource (for a FileSystemDataset / FileFragment) from a buffer instead of from a path?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
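The kind of usage this would enable, sketched; passing a buffer reader to {{ds.dataset}} is the proposal here, not existing behaviour, and the file/column names are made up:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

with open("test.parquet", "rb") as f:
    buf = pa.py_buffer(f.read())

# hypothetical: a single-file dataset backed by an in-memory buffer,
# which would make the datasets filter machinery available as well
dataset = ds.dataset(pa.BufferReader(buf), format="parquet")
table = dataset.to_table(filter=ds.field("values") > 0)
{code}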
[jira] [Created] (ARROW-8063) [Python] Add user guide documentation for Datasets API
Joris Van den Bossche created ARROW-8063: Summary: [Python] Add user guide documentation for Datasets API Key: ARROW-8063 URL: https://issues.apache.org/jira/browse/ARROW-8063 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche Fix For: 0.17.0

Currently, we only have API docs (https://arrow.apache.org/docs/python/api/dataset.html), but we also need prose docs explaining what the dataset module does, with examples. These can also include guidelines on how to use this instead of the ParquetDataset API (depending on how we end up doing ARROW-8039); this aspect is also covered by ARROW-8047.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
Joris Van den Bossche created ARROW-8062: Summary: [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file Key: ARROW-8062 URL: https://issues.apache.org/jira/browse/ARROW-8062 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Joris Van den Bossche

Partitioned parquet datasets sometimes come with {{_metadata}} / {{_common_metadata}} files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for {{_metadata}}).

Using those files during the creation of a parquet {{Dataset}} can give a more efficient factory (using the stored schema instead of inferring the schema from unioning the schemas of all files + using the paths to the individual parquet files instead of crawling the directory).

Basically, based on those files, the schema, list of paths and partition expressions (the information that is needed to create a Dataset) could be constructed. Such logic could be put in a different factory class, eg {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
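A sketch of the information a {{_metadata}} file already exposes through the existing pyarrow.parquet API (the factory itself would live in C++; the path is made up):

{code:python}
import pyarrow.parquet as pq

meta = pq.read_metadata("dataset_root/_metadata")
schema = meta.schema.to_arrow_schema()

# each row group records the file it lives in, so the list of data file
# paths can be reconstructed without crawling the directory
paths = sorted({
    meta.row_group(i).column(0).file_path
    for i in range(meta.num_row_groups)
})
{code}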
[jira] [Created] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)
Joris Van den Bossche created ARROW-8061: Summary: [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups) Key: ARROW-8061 URL: https://issues.apache.org/jira/browse/ARROW-8061 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Joris Van den Bossche

Specifically for parquet (not sure whether it will be relevant for other file formats as well; for IPC/feather potentially the record batch), it would be useful to target row groups instead of files as fragments.

Quoting the original design documents: _"In datasets consisting of many fragments, the dataset API must expose the granularity of fragments in a public way to enable parallel processing, if desired."_ And a comment from Wes on that: _"a single Parquet file can "export" one or more fragments based on settings. The default might be to split fragments based on row group"_

Currently, the level on which fragments are defined (at least in the typical partitioned parquet dataset) is "1 file == 1 fragment". Would it be possible or desirable to make this more fine-grained, where you could also opt to have a fragment per row group? We could have a ParquetFragment with this option, and a ParquetFileFormat-specific option to say what the granularity of a fragment is (file vs row group)?

cc [~fsaintjacques] [~bkietz]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
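For illustration, the spelling used later in this digest (see ARROW-9105 above) exposes exactly this: a per-file fragment can be expanded into per-row-group fragments. A sketch, with a made-up dataset path:

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("partitioned_dataset", format="parquet")

for file_fragment in dataset.get_fragments():
    # one fragment per row group instead of one per file
    for rg_fragment in file_fragment.split_by_row_group():
        ...
{code}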
[jira] [Created] (ARROW-8060) [Python] Make dataset Expression objects serializable
Joris Van den Bossche created ARROW-8060: Summary: [Python] Make dataset Expression objects serializable Key: ARROW-8060 URL: https://issues.apache.org/jira/browse/ARROW-8060 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche It would be good to be able to pickle pyarrow.dataset.Expression objects (eg for use in dask.distributed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
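A sketch of the round-trip this issue asks for (this is what dask.distributed needs when shipping a filter expression to workers):
{code:python}
import pickle
import pyarrow.dataset as ds

expr = ds.field("part") == "A"

# the desired behaviour; the round-trip is not supported yet
restored = pickle.loads(pickle.dumps(expr))
{code}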
[jira] [Created] (ARROW-8059) [Python] Make FileSystem objects serializable
Joris Van den Bossche created ARROW-8059: Summary: [Python] Make FileSystem objects serializable Key: ARROW-8059 URL: https://issues.apache.org/jira/browse/ARROW-8059 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche It would be good to be able to pickle {{pyarrow.fs.FileSystem}} objects (eg for use in dask.distributed) cc [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
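The same kind of round-trip, sketched for filesystems:
{code:python}
import pickle
from pyarrow.fs import LocalFileSystem

fs = LocalFileSystem()

# the desired behaviour; the round-trip is not supported yet
restored = pickle.loads(pickle.dumps(fs))
{code}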
[jira] [Created] (ARROW-7963) [C++][Python][Dataset] Expose listing fragments
Joris Van den Bossche created ARROW-7963: Summary: [C++][Python][Dataset] Expose listing fragments Key: ARROW-7963 URL: https://issues.apache.org/jira/browse/ARROW-7963 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Joris Van den Bossche Assignee: Ben Kietzman It would be useful to be able to list the fragments, to get their paths / partition expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
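A sketch of what such an API could look like (the method and attribute names are assumptions, not the final API):
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/partitioned_dataset", partitioning="hive")

# hypothetical: iterate the fragments backing the dataset
for fragment in dataset.get_fragments():
    print(fragment.path, fragment.partition_expression)
{code}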
[jira] [Created] (ARROW-7907) [Python] Conversion to pandas of empty table with timestamp type aborts
Joris Van den Bossche created ARROW-7907: Summary: [Python] Conversion to pandas of empty table with timestamp type aborts Key: ARROW-7907 URL: https://issues.apache.org/jira/browse/ARROW-7907 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.16.1 Creating an empty table:
{code}
In [1]: table = pa.table({'a': pa.array([], type=pa.timestamp('us'))})

In [2]: table['a']
Out[2]:
[
  []
]

In [3]: table.to_pandas()
Out[3]:
Empty DataFrame
Columns: [a]
Index: []
{code}
The above works, but the ChunkedArray still has 1 empty chunk. When filtering data, you can actually end up with no chunks at all, and then it fails:
{code}
In [4]: table2 = table.slice(0, 0)

In [5]: table2['a']
Out[5]:
[
]

In [6]: table2.to_pandas()
../src/arrow/table.cc:48: Check failed: (chunks.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
...
Aborted (core dumped)
{code}
This seems to happen specifically for timestamp type, and specifically with a non-ns unit (eg us as above, which is the default in arrow). I noticed this when reading a parquet file of the taxi dataset, where the filter I used resulted in an empty batch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7892) [Python] Expose FilesystemSource.format attribute
Joris Van den Bossche created ARROW-7892: Summary: [Python] Expose FilesystemSource.format attribute Key: ARROW-7892 URL: https://issues.apache.org/jira/browse/ARROW-7892 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type
Joris Van den Bossche created ARROW-7858: Summary: [C++][Python] Support casting an Extension type to its storage type Key: ARROW-7858 URL: https://issues.apache.org/jira/browse/ARROW-7858 Project: Apache Arrow Issue Type: Test Components: C++, Python Reporter: Joris Van den Bossche Currently, casting an extension type will always fail: "No cast implemented from extension to ...". However, for casting, we could fall back to the storage array's casting rules? -- This message was sent by Atlassian Jira (v8.3.4#803005)
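A minimal reproducer sketch of the current behaviour, using a hypothetical 16-byte binary extension type:
{code:python}
import pyarrow as pa

class UuidType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        return UuidType, ()

storage = pa.array([b"0123456789abcdef"], pa.binary(16))
arr = pa.ExtensionArray.from_storage(UuidType(), storage)

# currently raises ArrowNotImplementedError ("No cast implemented from extension ...");
# the proposal is to fall back to the storage type's casting rules
arr.cast(pa.binary(16))
{code}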
[jira] [Created] (ARROW-7857) [Python] Failing test with pandas master for extension type conversion
Joris Van den Bossche created ARROW-7857: Summary: [Python] Failing test with pandas master for extension type conversion Key: ARROW-7857 URL: https://issues.apache.org/jira/browse/ARROW-7857 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche The pandas master test build has one failure {code} ___ test_conversion_extensiontype_to_extensionarray monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcd6c580bd0> def test_conversion_extensiontype_to_extensionarray(monkeypatch): # converting extension type to linked pandas ExtensionDtype/Array import pandas.core.internals as _int storage = pa.array([1, 2, 3, 4], pa.int64()) arr = pa.ExtensionArray.from_storage(MyCustomIntegerType(), storage) table = pa.table({'a': arr}) if LooseVersion(pd.__version__) < "0.26.0.dev": # ensure pandas Int64Dtype has the protocol method (for older pandas) monkeypatch.setattr( pd.Int64Dtype, '__from_arrow__', _Int64Dtype__from_arrow__, raising=False) # extension type points to Int64Dtype, which knows how to create a # pandas ExtensionArray > result = table.to_pandas() opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:3560: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pyarrow/ipc.pxi:559: in pyarrow.lib.read_message ??? pyarrow/table.pxi:1369: in pyarrow.lib.Table._to_pandas ??? opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:764: in table_to_blockmanager blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes) opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: in _table_to_blocks for item in result] opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: in for item in result] opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:723: in _reconstruct_block pd_ext_arr = pandas_dtype.__from_arrow__(arr) opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/arrays/integer.py:108: in __from_arrow__ array = array.cast(pyarrow_type) pyarrow/table.pxi:240: in pyarrow.lib.ChunkedArray.cast ??? _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > ??? E pyarrow.lib.ArrowNotImplementedError: No cast implemented from extension to int64 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
Joris Van den Bossche created ARROW-7854: Summary: [C++][Dataset] Option to memory map when reading IPC format Key: ARROW-7854 URL: https://issues.apache.org/jira/browse/ARROW-7854 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Joris Van den Bossche For the IPC format, it would be interesting to have the option to memory map the files when reading? cc [~fsaintjacques] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
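For comparison, this is what memory mapping looks like when reading an IPC file directly, outside the datasets API; the request is to optionally get the same behaviour when a dataset scans IPC files:
{code:python}
import pyarrow as pa

# read an IPC (Arrow/Feather) file through a memory map instead of regular reads
with pa.memory_map("data.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()
{code}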
[jira] [Created] (ARROW-7839) [Python][Dataset] Add IPC format to python bindings
Joris Van den Bossche created ARROW-7839: Summary: [Python][Dataset] Add IPC format to python bindings Key: ARROW-7839 URL: https://issues.apache.org/jira/browse/ARROW-7839 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche The C++ / R side was done in ARROW-7415; we should add bindings for it in Python as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7781) [C++][Dataset] Filtering on a non-existent column gives a segfault
Joris Van den Bossche created ARROW-7781: Summary: [C++][Dataset] Filtering on a non-existent column gives a segfault Key: ARROW-7781 URL: https://issues.apache.org/jira/browse/ARROW-7781 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Joris Van den Bossche Fix For: 1.0.0 Example with python code: {code} In [1]: import pandas as pd In [2]: df = pd.DataFrame({'a': [1, 2, 3]}) In [3]: df.to_parquet("test-filter-crash.parquet") In [4]: import pyarrow.dataset as ds In [5]: dataset = ds.dataset("test-filter-crash.parquet") In [6]: dataset.to_table(filter=ds.field('a') > 1).to_pandas() Out[6]: a 0 2 1 3 In [7]: dataset.to_table(filter=ds.field('b') > 1).to_pandas() ../src/arrow/dataset/filter.cc:929: Check failed: _s.ok() Operation failed: maybe_value.status() Bad status: Invalid: attempting to cast non-null scalar to NullScalar /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f744c)[0x7fb1390f444c] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ca)[0x7fb1390f43ca] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ec)[0x7fb1390f43ec] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(_ZN5arrow4util8ArrowLogD1Ev+0x57)[0x7fb1390f4759] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x169fc6)[0x7fb145594fc6] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x16b9be)[0x7fb1455969be] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset15VisitExpressionINS0_23InsertImplicitCastsImplEEEDTclfp0_fp_EERKNS0_10ExpressionEOT_+0x2ae)[0x7fb1455a0dee] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset19InsertImplicitCastsERKNS0_10ExpressionERKNS_6SchemaE+0x44)[0x7fb145596d4e] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x48286)[0x7fb1456dd286] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x49220)[0x7fb1456de220] /home/joris/miniconda3/envs/arrow-dev/bin/python(+0x170f37)[0x55e5127e1f37] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x22bd6)[0x7fb1456b7bd6] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x33b81)[0x7fb1456c8b81] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x305)[0x55e5127d9c75] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x5460)[0x55e512847c40] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9] /home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCodeEx+0x44)[0x55e512789064] /home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCode+0x1c)[0x55e51278908c] /home/joris/miniconda3/envs/arrow-dev/bin/python(+0x1e1650)[0x55e512852650] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9)[0x55e5127d9a59] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x48e4)[0x55e5128470c4] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] 
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c)[0x55e5127d99fc] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x55e5127e1fdf] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x4ddc)[0x55e5128475bc] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x416)[0x55e512842bf6] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x6f3)[0x55e512842ed3] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0x387)[0x55e5127d93e7] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x14e4)[0x55e512843cc4] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFu
[jira] [Created] (ARROW-7762) [Python] Exceptions in ParquetWriter get ignored
Joris Van den Bossche created ARROW-7762: Summary: [Python] Exceptions in ParquetWriter get ignored Key: ARROW-7762 URL: https://issues.apache.org/jira/browse/ARROW-7762 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche For example:
{code:python}
In [43]: table = pa.table({'a': [1, 2, 3]})

In [44]: pq.write_table(table, "test.parquet", version="2.2")
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
ArrowException: Unsupported Parquet format version
Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_version'
pyarrow.lib.ArrowException: Unsupported Parquet format version
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7703) [C++][Dataset] Give more informative error message for mismatching schemas for FileSystemSources
Joris Van den Bossche created ARROW-7703: Summary: [C++][Dataset] Give more informative error message for mismatching schemas for FileSystemSources Key: ARROW-7703 URL: https://issues.apache.org/jira/browse/ARROW-7703 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche Currently, if you try to create a dataset from files with different schemas, you get this error:
{code}
ArrowInvalid: Unable to merge: Field a has incompatible types: int64 vs int32
{code}
If you are reading a directory of files, it would be very helpful if the error message could indicate which files are involved (eg if you have a lot of files and only one has an error). You can already inspect the schemas if you first make a SourceFactory manually, but that also only gives a list of schemas, not mapped to the original files (this last item probably relates to ARROW-7608). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches
Joris Van den Bossche created ARROW-7702: Summary: [C++][Dataset] Provide (optional) deterministic order of batches Key: ARROW-7702 URL: https://issues.apache.org/jira/browse/ARROW-7702 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset, Python Reporter: Joris Van den Bossche Example with python:
{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)})
pq.write_table(table, "test_chunks.parquet", chunk_size=3)

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}
gives a non-deterministic result (the order of the row groups in the parquet file varies between runs):
{code}
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7677) [C++] Handle Windows file paths with backslashes in GetTargetStats
Joris Van den Bossche created ARROW-7677: Summary: [C++] Handle Windows file paths with backslashes in GetTargetStats Key: ARROW-7677 URL: https://issues.apache.org/jira/browse/ARROW-7677 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Currently, if the base path passed to {{GetTargetStats}} has backslashes, the produced FileStats also include them, resulting in some other functionality (such as splitting the path) not working. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7652) [Python] Insert implicit cast in ScannerBuilder.filter
Joris Van den Bossche created ARROW-7652: Summary: [Python] Insert implicit cast in ScannerBuilder.filter Key: ARROW-7652 URL: https://issues.apache.org/jira/browse/ARROW-7652 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7649) [Python] Expose dataset PartitioningFactory.inspect ?
Joris Van den Bossche created ARROW-7649: Summary: [Python] Expose dataset PartitioningFactory.inspect ? Key: ARROW-7649 URL: https://issues.apache.org/jira/browse/ARROW-7649 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche In C++, the PartitioningFactory has an {{Inspect}} method which, given a path, will infer the schema. We could expose this in Python as well; it could eg be used to easily explore or illustrate what types are inferred from a path (int32, string). -- This message was sent by Atlassian Jira (v8.3.4#803005)
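A sketch of what the Python binding could look like (the {{inspect}} name mirrors the C++ method; the exact signature and return are assumptions):
{code:python}
import pyarrow.dataset as ds

# a factory that infers the types of the 'year' and 'month' fields from paths
factory = ds.partitioning(field_names=["year", "month"])

# hypothetical binding of C++ PartitioningFactory::Inspect
schema = factory.inspect(["2019/10/data.parquet"])
# could infer eg year: int32, month: int32
{code}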
[jira] [Created] (ARROW-7638) [Python] Segfault when inspecting dataset.Source with invalid file/partitioning
Joris Van den Bossche created ARROW-7638: Summary: [Python] Segfault when inspecting dataset.Source with invalid file/partitioning Key: ARROW-7638 URL: https://issues.apache.org/jira/browse/ARROW-7638 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche Getting a segfault with:
{code}
In [1]: import pyarrow.dataset as ds

In [2]: !touch test_empty.txt

In [3]: source_factory = ds.source("test_empty.txt", partitioning=ds.partitioning(field_names=['a', 'b']))

In [4]: source_factory.inspect()
Segmentation fault (core dumped)
{code}
I didn't yet investigate further what the reason might be (there are several "wrong" things here: it's not a valid file for the parquet format, the partitioning does not match the files, etc.). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7636) [Python] Clean-up the pyarrow.dataset.partitioning() API
Joris Van den Bossche created ARROW-7636: Summary: [Python] Clean-up the pyarrow.dataset.partitioning() API Key: ARROW-7636 URL: https://issues.apache.org/jira/browse/ARROW-7636 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.16.0 A left-over review comment at https://github.com/apache/arrow/pull/6022#discussion_r367016454 on the API of {{partitioning()}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7634) [Python] Dataset tests failing on Windows to parse file path
Joris Van den Bossche created ARROW-7634: Summary: [Python] Dataset tests failing on Windows to parse file path Key: ARROW-7634 URL: https://issues.apache.org/jira/browse/ARROW-7634 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.16.0 See eg https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=5217&view=logs&j=4c86bc1b-1091-5192-4404-c74dfaad23e7&t=ec99a26b-0264-5e86-36fb-9cfd0ca0f9f3&l=4066 The tests fail on the backslashes of the pathlib file paths, and this was clearly not run on CI since it was not caught. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7593) [CI][Python] Python datasets failing on master / not run on CI
Joris Van den Bossche created ARROW-7593: Summary: [CI][Python] Python datasets failing on master / not run on CI Key: ARROW-7593 URL: https://issues.apache.org/jira/browse/ARROW-7593 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7591) [Python] DictionaryArray.to_numpy returns dict of parts instead of numpy array
Joris Van den Bossche created ARROW-7591: Summary: [Python] DictionaryArray.to_numpy returns dict of parts instead of numpy array Key: ARROW-7591 URL: https://issues.apache.org/jira/browse/ARROW-7591 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Currently, the {{to_numpy}} method doesn't return an ndarray in case of dictionary type data:
{code}
In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

In [55]: a
Out[55]:
-- dictionary: ["a", "b"]
-- indices: [0, 1, 0]

In [57]: a.to_numpy(zero_copy_only=False)
Out[57]:
{'indices': array([0, 1, 0], dtype=int8),
 'dictionary': array(['a', 'b'], dtype=object),
 'ordered': False}
{code}
This is actually just an internal representation that is passed from C++ to Python so that on the Python side a {{pd.Categorical}} / {{CategoricalBlock}} can be constructed, but it's not something we should return as such to the user. Rather, I think we should return a decoded / dense numpy array (or at least raise an error instead of returning this dict). (Also, if the user wants those parts, they are already available from the dictionary array as {{a.indices}}, {{a.dictionary}} and {{a.type.ordered}}.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7569) [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions
Joris Van den Bossche created ARROW-7569: Summary: [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions Key: ARROW-7569 URL: https://issues.apache.org/jira/browse/ARROW-7569 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.16.0 ARROW-2428 was about adding such a mapping, and described three use cases (see this [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231] for details):

* Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check if the pandas_metadata specify pandas extension dtypes, and if so, use this as the target dtype for that column)
* Conversion for pyarrow extension types that can define their equivalent pandas extension dtype
* A way to override the default conversion (eg for the built-in types, or in absence of pandas_metadata in the schema). This would require the user to be able to specify some mapping of pyarrow type or column name to the pandas extension dtype to use.

The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) only covered the first two cases, not the third. I think it is still interesting to also cover the third case in some way. An example use case are the new nullable dtypes introduced in pandas (eg the nullable integer dtype). Assume I want to read a parquet file into a pandas DataFrame using this nullable integer dtype. The pyarrow Table has no pandas_metadata indicating to use this dtype (unless it was created from a pandas DataFrame that was already using this dtype, but that will often not be the case), and the pyarrow.int64() type is also not an extension type that can define its equivalent pandas extension dtype. Currently, the only solution is to first read it into a pandas DataFrame (which will use floats for the integers if there are nulls), and afterwards convert those floats back to a nullable integer dtype. A possible API for this could look like:
{code}
table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
{code}
to indicate that you want to convert all columns of the pyarrow table with int64 type to a pandas column using the nullable Int64 dtype. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7547) [C++] [Python] [Dataset] Additional reader options in ParquetFileFormat
Joris Van den Bossche created ARROW-7547: Summary: [C++] [Python] [Dataset] Additional reader options in ParquetFileFormat Key: ARROW-7547 URL: https://issues.apache.org/jira/browse/ARROW-7547 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Joris Van den Bossche [looking into using the datasets machinery in the current python parquet code] In the current python API, we expose several options that influence reading the parquet file (eg {{read_dictionary}} to indicate to read certain BYTE_ARRAY columns directly into a dictionary type, or {{memory_map}}, {{buffer_size}}). Those could be added to {{ParquetFileFormat}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
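For reference, these are the kinds of options the current single-file API exposes and that could be forwarded to {{ParquetFileFormat}} (a sketch using {{pq.read_table}}):
{code:python}
import pyarrow.parquet as pq

table = pq.read_table(
    "data.parquet",
    read_dictionary=["col1"],  # read these BYTE_ARRAY columns as dictionary type
    memory_map=True,           # memory map the file instead of regular reads
    buffer_size=64 * 1024,     # buffered stream reads of this size
)
{code}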
[jira] [Created] (ARROW-7545) [C++] Scanning dataset with dictionary type hangs
Joris Van den Bossche created ARROW-7545: Summary: [C++] Scanning dataset with dictionary type hangs Key: ARROW-7545 URL: https://issues.apache.org/jira/browse/ARROW-7545 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Joris Van den Bossche I assume it is an issue on the C++ side of the datasets code, but here is a reproducer in Python. I create a small parquet file with a single column of dictionary type. Reading it with {{pq.read_table}} works fine, but reading it with the datasets machinery hangs when scanning:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
arrow_table = pa.Table.from_pandas(df)

filename = "test.parquet"
pq.write_table(arrow_table, filename)

from pyarrow.fs import LocalFileSystem
from pyarrow.dataset import ParquetFileFormat, Dataset, FileSystemDataSourceDiscovery, FileSystemDiscoveryOptions

filesystem = LocalFileSystem()
format = ParquetFileFormat()
options = FileSystemDiscoveryOptions()
discovery = FileSystemDataSourceDiscovery(filesystem, [filename], format, options)
inspected_schema = discovery.inspect()
dataset = Dataset([discovery.finish()], inspected_schema)

# dataset.schema works fine and gives the correct schema
dataset.schema

scanner_builder = dataset.new_scan()
scanner = scanner_builder.finish()

# this hangs
scanner.to_table()
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7528) [Python] The pandas.datetime class (import of datetime.datetime) is deprecated
Joris Van den Bossche created ARROW-7528: Summary: [Python] The pandas.datetime class (import of datetime.datetime) is deprecated Key: ARROW-7528 URL: https://issues.apache.org/jira/browse/ARROW-7528 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.16.0 The {{pd.datetime}} class was actually just an import of {{datetime.datetime}}, and is being removed from pandas (to use the stdlib one directly). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7527) [Python] pandas/feather tests failing on pandas master
Joris Van den Bossche created ARROW-7527: Summary: [Python] pandas/feather tests failing on pandas master Key: ARROW-7527 URL: https://issues.apache.org/jira/browse/ARROW-7527 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Because I merged a PR in pandas to support Period dtype, some tests in pyarrow are now failing (they were using period dtype to test "unsupported" dtypes) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7497) [Python] pandas master failures: pandas.util.testing is deprecated
Joris Van den Bossche created ARROW-7497: Summary: [Python] pandas master failures: pandas.util.testing is deprecated Key: ARROW-7497 URL: https://issues.apache.org/jira/browse/ARROW-7497 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche The nightly pandas-master tests are failing (eg https://circleci.com/gh/ursa-labs/crossbow/6815?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link) due to the deprecation of {{pandas.util.testing}} in pandas. This deprecation gives a lot of warnings (which we should address), but also some errors, because the deprecation was not done fully properly on the pandas side; opened https://github.com/pandas-dev/pandas/issues/30735 for this (will be fixed shortly). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7432) [Python] Add higher-level datasets functions
Joris Van den Bossche created ARROW-7432: Summary: [Python] Add higher-level datasets functions Key: ARROW-7432 URL: https://issues.apache.org/jira/browse/ARROW-7432 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 From [~kszucs]: We need to define a more pythonic API for the dataset bindings, because the current one is pretty low-level. One option is to provide an "open_dataset" function similar to what is available in R. A short-cut to go from a Dataset to a Table might also be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
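A sketch of what such a higher-level entry point could look like (names modelled on R's {{open_dataset}}; everything here is a proposal, not the existing API):
{code:python}
import pyarrow.dataset as ds

# a single convenience function replacing the low-level
# discovery/factory/scanner dance:
dataset = ds.dataset("path/to/data", format="parquet", partitioning="hive")

# and a short-cut from Dataset to Table:
table = dataset.to_table(columns=["a"], filter=ds.field("part") == "A")
{code}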
[jira] [Created] (ARROW-7431) [Python] Add dataset API to reference docs
Joris Van den Bossche created ARROW-7431: Summary: [Python] Add dataset API to reference docs Key: ARROW-7431 URL: https://issues.apache.org/jira/browse/ARROW-7431 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Add dataset to python API docs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7430) [Python] Add more docstrings to dataset bindings
Joris Van den Bossche created ARROW-7430: Summary: [Python] Add more docstrings to dataset bindings Key: ARROW-7430 URL: https://issues.apache.org/jira/browse/ARROW-7430 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
Joris Van den Bossche created ARROW-7365: Summary: [Python] Support FixedSizeList type in conversion to numpy/pandas Key: ARROW-7365 URL: https://issues.apache.org/jira/browse/ARROW-7365 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-7261, still need to add support for FixedSizeListType in the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet
Joris Van den Bossche created ARROW-7273: Summary: [Python] Non-nullable null field is allowed / crashes when writing to parquet Key: ARROW-7273 URL: https://issues.apache.org/jira/browse/ARROW-7273 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche It seems to be possible to create a "non-nullable null field". While this does not make any sense (already a reason to disallow it, I think), it can also lead to crashes in further operations, such as writing to parquet:
{code}
In [18]: table = pa.table([pa.array([None, None], pa.null())],
    ...:                  schema=pa.schema([pa.field('a', pa.null(), nullable=False)]))

In [19]: table
Out[19]:
pyarrow.Table
a: null not null

In [20]: pq.write_table(table, "test_null.parquet")
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1128 14:08:30.267439 27560 column_writer.cc:837] Check failed: (nullptr) != (values)
*** Check failure stack trace: ***
Aborted (core dumped)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type
Joris Van den Bossche created ARROW-7261: Summary: [Python] Python support for fixed size list type Key: ARROW-7261 URL: https://issues.apache.org/jira/browse/ARROW-7261 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is not yet exposed in Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2
Joris Van den Bossche created ARROW-7220: Summary: [CI] Docker compose (github actions) Mac Python 3 build is using Python 2 Key: ARROW-7220 URL: https://issues.apache.org/jira/browse/ARROW-7220 Project: Apache Arrow Issue Type: Test Reporter: Joris Van den Bossche The "AMD64 MacOS 10.15 Python 3" build is also running in python 2. Possibly related to how brew is installing python 2 / 3, or because it is using the system python, ... (not familiar with mac) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7218) [Python] Conversion from boolean numpy scalars not working
Joris Van den Bossche created ARROW-7218: Summary: [Python] Conversion from boolean numpy scalars not working Key: ARROW-7218 URL: https://issues.apache.org/jira/browse/ARROW-7218 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche In general, we are fine to accept a list of numpy scalars:
{code}
In [12]: type(list(np.array([1, 2]))[0])
Out[12]: numpy.int64

In [13]: pa.array(list(np.array([1, 2])))
Out[13]:
[
  1,
  2
]
{code}
But for booleans, this doesn't work:
{code}
In [14]: pa.array(list(np.array([True, False])))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
----> 1 pa.array(list(np.array([True, False])))

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

ArrowInvalid: Could not convert True with type numpy.bool_: tried to convert to boolean
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7217) Docker compose / github actions ignores PYTHON env
Joris Van den Bossche created ARROW-7217: Summary: Docker compose / github actions ignores PYTHON env Key: ARROW-7217 URL: https://issues.apache.org/jira/browse/ARROW-7217 Project: Apache Arrow Issue Type: Test Components: CI Reporter: Joris Van den Bossche The "AMD64 Conda Python 2.7" build is actually using Python 3.6. This Python 3.6 version is written in the conda-python.dockerfile: https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24 and I am not fully sure whether the {{PYTHON}} env variable overrides that or not. cc [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas
Joris Van den Bossche created ARROW-7209: Summary: [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas Key: ARROW-7209 URL: https://issues.apache.org/jira/browse/ARROW-7209 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche I implemented the pandas <-> arrow roundtrip for pandas' integer+string dtype in https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our tests were assuming this did not yet work in pandas, and thus need to be updated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions
Joris Van den Bossche created ARROW-7167: Summary: [CI][Python] Add nightly tests for older pandas versions to Github Actions Key: ARROW-7167 URL: https://issues.apache.org/jira/browse/ARROW-7167 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7154) [C++] Build error when building tests but not with snappy
Joris Van den Bossche created ARROW-7154: Summary: [C++] Build error when building tests but not with snappy Key: ARROW-7154 URL: https://issues.apache.org/jira/browse/ARROW-7154 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Since the docker-compose PR landed, I am having build errors like:
{code:java}
[361/376] Linking CXX executable debug/arrow-python-test
FAILED: debug/arrow-python-test
: && /home/joris/miniconda3/envs/arrow-dev/bin/ccache /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++ -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2 -g -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -rdynamic src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o -o debug/arrow-python-test -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread -ldl -lutil -lrt -ldl /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::system::detail::generic_category_ncx()'
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::filesystem::path::operator/=(boost::filesystem::path const&)'
collect2: error: ld returned 1 exit status
{code}
which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found" (although this library is certainly present). The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to OFF, it works fine. It also seems to be related to this specific change in the docker compose PR:
{code:java}
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index c80ac3310..3b3c9eb8f 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -266,6 +266,15 @@ endif(UNIX)
 # Set up various options
 #
 
-if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
-  # Currently the compression tests require at least these libraries; bz2 and
-  # zstd are optional. See ARROW-3984
-  set(ARROW_WITH_BROTLI ON)
-  set(ARROW_WITH_LZ4 ON)
-  set(ARROW_WITH_SNAPPY ON)
-  set(ARROW_WITH_ZLIB ON)
-endif()
-
 if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
   set(ARROW_JSON ON)
 endif()
{code}
If I add that back, the build works. With only `set(ARROW_WITH_BROTLI ON)`, it still fails. With only `set(ARROW_WITH_LZ4 ON)`, it also fails, but with an error about liblz4 instead of libboost (liblz4 is also actually present). With only `set(ARROW_WITH_SNAPPY ON)`, it works. With only `set(ARROW_WITH_ZLIB ON)`, it also fails, but with an error about libz.so.1 not being found. So it seems that the absence of snappy causes the others to fail. In the recommended build settings in the development docs (https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test), the compression libraries are enabled. But I was still building without them (stemming from the time they were enabled by default). So I was using:
{code}
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON \
      -DARROW_BUILD_TESTS=ON \
      ..
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7068) [C++] Expose the offsets of a ListArray as a Int32Array
Joris Van den Bossche created ARROW-7068: Summary: [C++] Expose the offsets of a ListArray as a Int32Array Key: ARROW-7068 URL: https://issues.apache.org/jira/browse/ARROW-7068 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche As a follow-up on ARROW-7031 (https://github.com/apache/arrow/pull/5759), we can move this into C++ and use that implementation from Python. Cf. https://github.com/apache/arrow/pull/5759#discussion_r342244521, this could be a {{ListArray::value_offsets_array}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?
Joris Van den Bossche created ARROW-7066: Summary: [Python] support returning ChunkedArray from __arrow_array__ ? Key: ARROW-7066 URL: https://issues.apache.org/jira/browse/ARROW-7066 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 The {{__arrow_array__}} protocol was added so that custom objects can define how they should be converted to a pyarrow Array (similar to numpy's {{__array__}}). This is then also used to support converting pandas DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if the pandas ExtensionArray, such as the nullable integer type, implements this {{__arrow_array__}} method). This last use case could also be useful for fletcher (https://github.com/xhochy/fletcher/, a package that implements pandas ExtensionArrays that wrap pyarrow arrays, so they can be stored as-is in a pandas DataFrame). However, fletcher stores ChunkedArrays in the ExtensionArray / the columns of a pandas DataFrame (to have a better mapping with a Table, where the columns also consist of chunked arrays), while we currently require that the return value of {{__arrow_array__}} is a pyarrow.Array. So I was wondering: could we relax this constraint and also allow ChunkedArray as a return value? However, this protocol is currently called in the {{pa.array(..)}} function, which probably should keep returning an Array (and not a ChunkedArray in certain cases). cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
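For reference, a minimal sketch of how the protocol is consumed today (the return value must be a {{pyarrow.Array}}; allowing a ChunkedArray here is exactly the question raised above):
{code:python}
import pyarrow as pa

class MyData:
    def __init__(self, values):
        self.values = values

    def __arrow_array__(self, type=None):
        # must currently return a pyarrow.Array;
        # fletcher would like to return a pa.chunked_array(...) instead
        return pa.array(self.values, type=type)

arr = pa.array(MyData([1, 2, 3]))  # uses the protocol under the hood
{code}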
[jira] [Created] (ARROW-7031) [Python] Expose the offsets of a ListArray in python
Joris Van den Bossche created ARROW-7031: Summary: [Python] Expose the offsets of a ListArray in python Key: ARROW-7031 URL: https://issues.apache.org/jira/browse/ARROW-7031 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assume the following ListArray:
{code}
In [1]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])

In [2]: arr
Out[2]:
[
  [1, 2, 3],
  [4, 5]
]
{code}
You can get the actual values as a flat array through {{.values}} / {{.flatten()}}, but there is currently no easy way to get back to the offsets (except by interpreting the buffers manually). We should probably add an {{offsets}} attribute (there is actually also a TODO comment for that). -- This message was sent by Atlassian Jira (v8.3.4#803005)
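The manual workaround referred to above looks roughly like this (a sketch; it assumes the second buffer of a ListArray is the int32 offsets buffer):
{code:python}
import numpy as np
import pyarrow as pa

arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])

# buffers() = [validity bitmap, offsets, child buffers...]
offsets_buf = arr.buffers()[1]
offsets = np.frombuffer(offsets_buf, dtype=np.int32)[:len(arr) + 1]
# -> array([0, 3, 5], dtype=int32)
{code}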
[jira] [Created] (ARROW-7027) [Python] pa.table(..) returns instead of raises error if passing invalid object
Joris Van den Bossche created ARROW-7027: Summary: [Python] pa.table(..) returns instead of raises error if passing invalid object Key: ARROW-7027 URL: https://issues.apache.org/jira/browse/ARROW-7027 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 When passing eg a Series instead of a DataFrame, you get:
{code}
In [4]: df = pd.DataFrame({'a': [1, 2, 3]})

In [5]: table = pa.table(df['a'])

In [6]: table
Out[6]: TypeError('Expected pandas DataFrame or python dictionary')

In [7]: type(table)
Out[7]: TypeError
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7023) [Python] pa.array does not use "from_pandas" semantics for pd.Index
Joris Van den Bossche created ARROW-7023: Summary: [Python] pa.array does not use "from_pandas" semantics for pd.Index Key: ARROW-7023 URL: https://issues.apache.org/jira/browse/ARROW-7023 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 1.0.0
{code}
In [15]: idx = pd.Index([1, 2, np.nan], dtype=object)

In [16]: pa.array(idx)
Out[16]:
[
  1,
  2,
  nan
]

In [17]: pa.array(idx, from_pandas=True)
Out[17]:
[
  1,
  2,
  null
]

In [18]: pa.array(pd.Series(idx))
Out[18]:
[
  1,
  2,
  null
]
{code}
We should probably handle Series and Index the same in this regard. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7022) [Python] __arrow_array__ does not work for ExtensionTypes in Table.from_pandas
Joris Van den Bossche created ARROW-7022: Summary: [Python] __arrow_array__ does not work for ExtensionTypes in Table.from_pandas Key: ARROW-7022 URL: https://issues.apache.org/jira/browse/ARROW-7022 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 When someone has a custom ExtensionType defined in Python, and an array class that gets converted to that (through {{__arrow_array__}}), the conversion in pyarrow works with the array class, but not yet for the array stored in a pandas DataFrame. Eg using my definition of ArrowPeriodType in https://github.com/pandas-dev/pandas/pull/28371, I see:
{code}
In [15]: pd_array = pd.period_range("2012-01-01", periods=3, freq="D").array

In [16]: pd_array
Out[16]:
['2012-01-01', '2012-01-02', '2012-01-03']
Length: 3, dtype: period[D]

In [17]: pa.array(pd_array)
Out[17]:
[
  15340,
  15341,
  15342
]

In [18]: df = pd.DataFrame({'periods': pd_array})

In [19]: pa.table(df)
...
ArrowInvalid: ('Could not convert 2012-01-01 with type Period: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column periods with type period[D]')
{code}
(This is working correctly for array objects whose {{__arrow_array__}} returns a built-in pyarrow Array.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6974) [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern
Joris Van den Bossche created ARROW-6974: Summary: [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern Key: ARROW-6974 URL: https://issues.apache.org/jira/browse/ARROW-6974 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently, the casting for time-like data is done with the {{ShiftTime}} function. It _might_ be possible to simplify this with ArrayDataVisitor (to avoid looping / checking the bitmap). -- This message was sent by Atlassian Jira (v8.3.4#803005)