[jira] [Created] (ARROW-9782) [C++][Dataset] Ability to write ".feather" files with IpcFileFormat
Joris Van den Bossche created ARROW-9782: Summary: [C++][Dataset] Ability to write ".feather" files with IpcFileFormat Key: ARROW-9782 URL: https://issues.apache.org/jira/browse/ARROW-9782 Project: Apache Arrow Issue Type: Improvement Components: C++, Python, R Reporter: Joris Van den Bossche With the new dataset writing bindings, one can do {{ds.write_dataset(data, format="feather")}} (Python) or {{write_dataset(data, format = "feather")}} (R) to write a dataset to feather files. However, because "feather" is just an alias for the IpcFileFormat, it will currently write all files with the {{.ipc}} extension. I think this can be a bit confusing, since many people will be more familiar with "feather" and expect such an extension. (More generally, ".ipc" is maybe not the best default, since it's not a very descriptive extension. Something like ".arrow" might be better?) cc [~npr] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
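For ARROW-9782 above, a minimal workaround sketch, assuming the {{basename_template}} keyword of {{ds.write_dataset}} can be used to spell out the desired extension explicitly (this sidesteps rather than fixes the default):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": [1, 2, 3]})
# Without basename_template the files currently come out as part-{i}.ipc;
# overriding the template is an assumed workaround, not the proposed fix.
ds.write_dataset(table, "my_dataset", format="feather",
                 basename_template="part-{i}.feather")
{code}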
[jira] [Created] (ARROW-9864) [Python] pathlib.Path not supported in write_to_dataset with partition columns
Joris Van den Bossche created ARROW-9864: Summary: [Python] pathlib.Path not supported in write_to_dataset with partition columns Key: ARROW-9864 URL: https://issues.apache.org/jira/browse/ARROW-9864 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Copying over from https://github.com/pandas-dev/pandas/issues/35902 {code:python} import pathlib df = pd.DataFrame({'A':[1,2,3,4], 'B':'C'}) df.to_parquet('tmp_path1.parquet') # OK df.to_parquet(pathlib.Path('tmp_path2.parquet')) # OK df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B']) # TypeError {code} The {{to_parquet}} method raises a TypeError when using {{pathlib.Path()}} as an argument when the {{partition_cols}} argument is not None. If no partition cols are provided, then {{pathlib.Path()}} is properly accepted. {code} --- TypeError Traceback (most recent call last) in 3 4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK ----> 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B']) # TypeError ... ~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in write_to_dataset(table, root_path, partition_cols, partition_filename_cb, filesystem, **kwargs) 1790 subtable = pa.Table.from_pandas(subgroup, schema=subschema, 1791 safe=False) -> 1792 _mkdir_if_not_exists(fs, '/'.join([root_path, subdir])) 1793 if partition_filename_cb: 1794 outfile = partition_filename_cb(keys) TypeError: sequence item 0: expected str instance, PosixPath found {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
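A hedged workaround sketch until this is fixed: stringify the path before handing it over, since plain string paths take the working code path.
{code:python}
import pathlib
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': 'C'})
# str() avoids the PosixPath ending up in the '/'.join(...) call
df.to_parquet(str(pathlib.Path('tmp_path4.parquet')), partition_cols=['B'])
{code}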
[jira] [Created] (ARROW-9875) [Python] Let FileSystem.get_file_info accept a single path
Joris Van den Bossche created ARROW-9875: Summary: [Python] Let FileSystem.get_file_info accept a single path Key: ARROW-9875 URL: https://issues.apache.org/jira/browse/ARROW-9875 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Currently you need to do {{fs.get_file_info([path])[0]}} to get the info of a single path. We can make the function also accept a single path directly (instead of only a list). -- This message was sent by Atlassian Jira (v8.3.4#803005)
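To illustrate, current usage versus the proposed convenience (the single-path form is the suggestion, not an existing API):
{code:python}
from pyarrow import fs

local = fs.LocalFileSystem()
info = local.get_file_info(["/tmp"])[0]  # today: wrap in a list, index out
# info = local.get_file_info("/tmp")     # proposed: accept a single path
{code}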
[jira] [Created] (ARROW-9893) [Python] Bindings for writing datasets to Parquet
Joris Van den Bossche created ARROW-9893: Summary: [Python] Bindings for writing datasets to Parquet Key: ARROW-9893 URL: https://issues.apache.org/jira/browse/ARROW-9893 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Added to C++ in ARROW-9646, follow-up on Python bindings of ARROW-9658 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)
Joris Van den Bossche created ARROW-9906: Summary: [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem) Key: ARROW-9906 URL: https://issues.apache.org/jira/browse/ARROW-9906 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
Joris Van den Bossche created ARROW-9920: Summary: [Python] pyarrow.concat_arrays segfaults when passing it a chunked array Key: ARROW-9920 URL: https://issues.apache.org/jira/browse/ARROW-9920 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche One can concat the chunks of a ChunkedArray with {{concat_arrays}} by passing it the list of chunks: {code} In [1]: arr = pa.chunked_array([[0, 1], [3, 4]]) In [2]: pa.concat_arrays(arr.chunks) Out[2]: [ 0, 1, 3, 4 ] {code} but when passing the chunked array itself, you get a segfault: {code} In [4]: pa.concat_arrays(arr) Segmentation fault (core dumped) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
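Until this is fixed, a sketch of a safe alternative, assuming {{ChunkedArray.combine_chunks}} is available in your pyarrow version (otherwise, pass {{arr.chunks}} explicitly as above):
{code:python}
import pyarrow as pa

arr = pa.chunked_array([[0, 1], [3, 4]])
combined = arr.combine_chunks()  # flattens into a single contiguous chunk
{code}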
[jira] [Created] (ARROW-9936) [Python] Fix / test relative file paths in pyarrow.parquet
Joris Van den Bossche created ARROW-9936: Summary: [Python] Fix / test relative file paths in pyarrow.parquet Key: ARROW-9936 URL: https://issues.apache.org/jira/browse/ARROW-9936 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 2.0.0 It seems that I broke writing parquet to relative file paths in ARROW-9718 (again; something similar happened before in the pyarrow.dataset reading), so we should fix that and properly test it. {code} In [3]: pq.write_table(table, "test_relative.parquet") ... ~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri() ArrowInvalid: URI has empty scheme: 'test_relative.parquet' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
Joris Van den Bossche created ARROW-9938: Summary: [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)? Key: ARROW-9938 URL: https://issues.apache.org/jira/browse/ARROW-9938 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by: - passing a URI (eg {{pq.read_table("s3://bucket/data.parquet")}}) - specifying the filesystem keyword (eg {{pq.read_table("bucket/data.parquet", filesystem=S3FileSystem(...))}}) On the other hand, for other file formats such as feather, we only support local files. So for those, you need to do something more manual (I _suppose_ this works?): {code:python} from pyarrow import fs, feather s3 = fs.S3FileSystem() with s3.open_input_file("bucket/data.arrow") as file: table = feather.read_table(file) {code} So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)? cc [~apitrou] [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9952) [Python] Use pyarrow.dataset writing for pq.write_to_dataset
Joris Van den Bossche created ARROW-9952: Summary: [Python] Use pyarrow.dataset writing for pq.write_to_dataset Key: ARROW-9952 URL: https://issues.apache.org/jira/browse/ARROW-9952 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 2.0.0 Now that ARROW-9658 and ARROW-9893 are in, we can explore using the {{pyarrow.dataset}} writing capabilities in {{parquet.write_to_dataset}}. Similarly to what was done in {{pq.read_table}}, we could initially have a keyword to switch between both implementations, eventually defaulting to the new datasets one and deprecating the old (inefficient) python implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
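A rough sketch of what {{write_to_dataset}} could delegate to, assuming the {{pyarrow.dataset}} writing API from ARROW-9658/ARROW-9893 (a sketch, not the final implementation):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"part": ["a", "a", "b"], "col": [1, 2, 3]})
# hive-style directory partitioning on the "part" column
part = ds.partitioning(table.select(["part"]).schema, flavor="hive")
ds.write_dataset(table, "dataset_root", format="parquet", partitioning=part)
{code}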
[jira] [Created] (ARROW-9962) [Python] Conversion to pandas with index column using fixed timezone fails
Joris Van den Bossche created ARROW-9962: Summary: [Python] Conversion to pandas with index column using fixed timezone fails Key: ARROW-9962 URL: https://issues.apache.org/jira/browse/ARROW-9962 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche From https://github.com/pandas-dev/pandas/issues/35997: it seems we are handling a normal column and an index column differently in the conversion to pandas. {code} In [5]: import pandas as pd ...: from datetime import datetime, timezone ...: ...: df = pd.DataFrame([[datetime.now(timezone.utc), datetime.now(timezone.utc)]], columns=['date_index', 'date_column']) ...: table = pa.Table.from_pandas(df.set_index('date_index')) ...: In [6]: table Out[6]: pyarrow.Table date_column: timestamp[ns, tz=+00:00] date_index: timestamp[ns, tz=+00:00] In [7]: table.to_pandas() ... UnknownTimeZoneError: '+00:00' {code} So this happens specifically for "fixed offset" timezones, and only for index columns (eg {{table.select(["date_column"]).to_pandas()}} works fine). It seems this is because for columns we use our helper {{make_tz_aware}} to convert the string "+01:00" to a python timezone, which is then understood by pandas (the bare string is not handled by pandas). But for the index column we fail to do this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9963) [Python] Recognize datetime.timezone.utc as UTC on conversion python->pyarrow
Joris Van den Bossche created ARROW-9963: Summary: [Python] Recognize datetime.timezone.utc as UTC on conversion python->pyarrow Key: ARROW-9963 URL: https://issues.apache.org/jira/browse/ARROW-9963 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Related to ARROW-5248, but specifically for the stdlib {{datetime.timezone.utc}}, I think it would be nice to "recognize" this as UTC. Currently it is converted to "+00:00", while for pytz this is not the case: {code} from datetime import datetime, timezone import pytz print(pa.array([datetime.now(timezone.utc)]).type) print(pa.array([datetime.now(pytz.utc)]).type) {code} gives {code} timestamp[us, tz=+00:00] timestamp[us, tz=UTC] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10091) [C++][Dataset] Support isin filter for row group (statistics-based) filtering
Joris Van den Bossche created ARROW-10091: - Summary: [C++][Dataset] Support isin filter for row group (statistics-based) filtering Key: ARROW-10091 URL: https://issues.apache.org/jira/browse/ARROW-10091 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently the {{isin}} filter works for partition-based filtering, but not for row group (statistics)-based filtering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10099) [C++][Dataset] Also allow integer partition fields to be dictionary encoded
Joris Van den Bossche created ARROW-10099: - Summary: [C++][Dataset] Also allow integer partition fields to be dictionary encoded Key: ARROW-10099 URL: https://issues.apache.org/jira/browse/ARROW-10099 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 2.0.0 In ARROW-8647, we added the option to indicate that your partition field columns should be dictionary encoded, but it currently only does this for string type, and not for integer type (with the reasoning that for integers, dictionary encoding does not give any memory efficiency gains). In dask, they have been using categorical dtypes for _all_ partition fields, also if they are integers. They would like to keep doing this (apart from memory efficiency, using a categorical/dictionary type also gives information about all unique values of the column, without having to calculate this), so it would be nice to enable this use case. So I think we could either simply always dictionary encode integers as well when {{max_partition_dictionary_size}} indicates partition fields should be dictionary encoded, or have an additional option to indicate that integer partition fields should also be encoded (if the other option indicates dictionary encoding should be used). cc [~rjzamora] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10100) [C++]
Joris Van den Bossche created ARROW-10100: - Summary: [C++] Key: ARROW-10100 URL: https://issues.apache.org/jira/browse/ARROW-10100 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10130) [C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve "complete_metadata" status
Joris Van den Bossche created ARROW-10130: - Summary: [C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve "complete_metadata" status Key: ARROW-10130 URL: https://issues.apache.org/jira/browse/ARROW-10130 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 2.0.0 Splitting a ParquetFileFragment into multiple fragments per row group ({{SplitByRowGroup}}) calls {{EnsureCompleteMetadata}} initially, but the created fragments afterwards don't have the {{has_complete_metadata_}} property set. This means that when calling {{EnsureCompleteMetadata}} on the split fragments, it will read/parse the metadata again, instead of using the cached ones (which are already present). Small example to illustrate: {code:python} In [1]: import pyarrow.dataset as ds In [2]: dataset = ds.parquet_dataset("nyc-taxi-data/dask-partitioned/_metadata", partitioning="hive") In [3]: rg_fragments = [rg for frag in dataset.get_fragments() for rg in frag.split_by_row_group()] In [4]: len(rg_fragments) Out[4]: 4520 # row group fragments actually have statistics In [7]: rg_fragments[0].row_groups[0].statistics Out[7]: {'vendor_id': {'min': '1', 'max': '4'}, 'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 5, 51), 'max': datetime.datetime(2018, 12, 26, 14, 48, 54)}, ... # but calling ensure_complete_metadata still takes a lot of time on the first call In [8]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments] CPU times: user 1.72 s, sys: 203 ms, total: 1.92 s Wall time: 1.9 s In [9]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments] CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms Wall time: 1.35 ms {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10131) [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment
Joris Van den Bossche created ARROW-10131: - Summary: [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment Key: ARROW-10131 URL: https://issues.apache.org/jira/browse/ARROW-10131 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Related to ARROW-9730, parsing of the statistics in parquet metadata is expensive, and therefore should be avoided when possible. For example, the {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in python) parses all statistics of all files and all columns. But when doing a filtered read, you might only need the statistics of certain files (eg if a filter on a partition field already excluded many files) and certain columns (eg only the columns on which you are actually filtering). The current API is a bit all-or-nothing (both {{ParquetDatasetFactory}} and a later {{EnsureCompleteMetadata}} parse all statistics; they don't allow parsing only a subset, or parsing only the other (non-statistics) metadata, ...), so I think we should try to think of better abstractions. cc [~rjzamora] [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10134) [C++][Dataset] Add ParquetFileFragment::num_row_groups property
Joris Van den Bossche created ARROW-10134: - Summary: [C++][Dataset] Add ParquetFileFragment::num_row_groups property Key: ARROW-10134 URL: https://issues.apache.org/jira/browse/ARROW-10134 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 2.0.0 From https://github.com/dask/dask/pull/6534#issuecomment-699512602, comment by [~rjzamora]: bq. it would be great to have access to the total row-group count for the fragment from a {{num_row_groups}} attribute (which pyarrow should be able to get without parsing all row-group metadata/statistics - I think?). One question is: does this attribute correspond to the row groups in the parquet file, or to the (subset of) row groups represented by the fragment? I expect the second (so if you do SplitByRowGroup, you would get a fragment with num_row_groups==1), but this might be a potentially confusing aspect of the attribute. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading
Joris Van den Bossche created ARROW-10145: - Summary: [C++][Dataset] Integer-like partition field values outside int32 range error on reading Key: ARROW-10145 URL: https://issues.apache.org/jira/browse/ARROW-10145 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche From https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset Small reproducer: {code} import pyarrow as pa import pyarrow.parquet as pq table = pa.table({'part': [3760212050]*10, 'col': range(10)}) pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part']) In [35]: pq.read_table("test_int64_partition/") ... ArrowInvalid: error parsing '3760212050' as scalar of type int32 In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this) In ../src/arrow/dataset/partition.cc, line 218, code: (_error_or_value26).status() In ../src/arrow/dataset/partition.cc, line 229, code: (_error_or_value27).status() In ../src/arrow/dataset/discovery.cc, line 256, code: (_error_or_value17).status() In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True) Out[36]: pyarrow.Table col: int64 part: dictionary {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
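A possible workaround sketch: override the inferred int32 partition type by passing an explicit partitioning schema (assuming the hive-flavored {{ds.partitioning}} API):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# declare the partition field as int64 instead of letting it be inferred
part = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")
table = ds.dataset("test_int64_partition/", partitioning=part).to_table()
{code}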
[jira] [Created] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset
Joris Van den Bossche created ARROW-10244: - Summary: [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset Key: ARROW-10244 URL: https://issues.apache.org/jira/browse/ARROW-10244 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field
Joris Van den Bossche created ARROW-10247: - Summary: [C++][Dataset] Cannot write dataset with dictionary column as partition field Key: ARROW-10247 URL: https://issues.apache.org/jira/browse/ARROW-10247 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 2.0.0 When the column to use for partitioning is dictionary encoded, we get this error: {code} In [9]: import pyarrow.dataset as ds In [10]: part = ["xxx"] * 3 + ["yyy"] * 3 ...: table = pa.table([ ...: pa.array(range(len(part))), ...: pa.array(part).dictionary_encode(), ...: ], names=['col', 'part']) In [11]: part = ds.partitioning(table.select(["part"]).schema) In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part) --- ArrowTypeError Traceback (most recent call last) in ----> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part) ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads) 773 _filesystemdataset_write( 774 data, base_dir, basename_template, schema, --> 775 filesystem, partitioning, file_options, use_threads, 776 ) ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write() ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowTypeError: scalar xxx (of type string) is invalid for part: dictionary In ../src/arrow/dataset/filter.cc, line 1082, code: VisitConjunctionMembers(*and_.left_operand(), visitor) In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const std::string& name, const std::shared_ptr& value) { auto&& _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { ::arrow::Status __s = ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, "(_error_or_value28).status()"); return _st; } } while (0); } while (false); auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const auto& field = schema_->field(match[0]); if (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", value->ToString(), " (of type ", *value->type, ") is invalid for ", field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); }) In ../src/arrow/dataset/file_base.cc, line 321, code: (_error_or_value24).status() In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish() {code} This seems a quite normal use case, as such a column will typically have many repeated values (and we also support reading partition fields back as dictionary type, so a roundtrip is currently not possible in that case). I tagged it for 2.0.0 for the moment in case a fix is possible today, but I didn't yet look into how easy it would be to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata
Joris Van den Bossche created ARROW-10248: - Summary: [C++][Dataset] Dataset writing does not write schema metadata Key: ARROW-10248 URL: https://issues.apache.org/jira/browse/ARROW-10248 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 2.0.0 Not sure if this is related to the writing refactor that landed yesterday, but {{write_dataset}} does not preserve the schema metadata (eg used for pandas metadata): {code} In [20]: df = pd.DataFrame({'a': [1, 2, 3]}) In [21]: table = pa.Table.from_pandas(df) In [22]: table.schema Out[22]: a: int64 -- schema metadata -- pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 396 In [23]: ds.write_dataset(table, "test_write_dataset_pandas", format="parquet") In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema Out[24]: a: int64 -- field metadata -- PARQUET:field_id: '1' {code} I tagged it for 2.0.0 for the moment in case a fix is possible today, but I didn't yet look into how easy it would be to fix. cc [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10264) [C++][Python] Parquet test failing with HadoopFileSystem URI
Joris Van den Bossche created ARROW-10264: - Summary: [C++][Python] Parquet test failing with HadoopFileSystem URI Key: ARROW-10264 URL: https://issues.apache.org/jira/browse/ARROW-10264 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche Fix For: 3.0.0 Follow-up on ARROW-10175. In the HDFS integration tests, there is a test using a URI that fails if we use the new filesystem / dataset implementation: {code} FAILED opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_multiple_parquet_files_with_uri {code} It fails with {code} pyarrow.lib.ArrowInvalid: Path '/tmp/pyarrow-test-838/multi-parquet-uri-48569714efc74397816722c9c6723191/0.parquet' is not relative to '/user/root' {code} even though it is passing a URI (and not a filesystem object) to {{parquet.read_table}}, and the new filesystem/dataset implementation should be able to handle URIs. cc [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10281) [Python] Fix warning when running tests
Joris Van den Bossche created ARROW-10281: - Summary: [Python] Fix warning when running tests Key: ARROW-10281 URL: https://issues.apache.org/jira/browse/ARROW-10281 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche We have accumulated quite a few warnings when running the tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10282) [Python] Conversion from custom types (eg decimal) to int dtype raises warning
Joris Van den Bossche created ARROW-10282: - Summary: [Python] Conversion from custom types (eg decimal) to int dtype raises warning Key: ARROW-10282 URL: https://issues.apache.org/jira/browse/ARROW-10282 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche {code:python} In [2]: import decimal In [3]: pa.array([decimal.Decimal("123456")], pa.int32()) DeprecationWarning: an integer is required (got type decimal.Decimal). Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python. Out[3]: [ 123456, ] {code} cc [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10283) [Python] Python deprecation warning for "PY_SSIZE_T_CLEAN will be required for '#' formats"
Joris Van den Bossche created ARROW-10283: - Summary: [Python] Python deprecation warning for "PY_SSIZE_T_CLEAN will be required for '#' formats" Key: ARROW-10283 URL: https://issues.apache.org/jira/browse/ARROW-10283 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 3.0.0 We have a few cases that run into this python deprecation warning: {code} pyarrow/tests/test_pandas.py: 9 warnings pyarrow/tests/test_parquet.py: 7790 warnings sys:1: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats pyarrow/tests/test_pandas.py::TestConvertDecimalTypes::test_decimal_with_None_explicit_type pyarrow/tests/test_pandas.py::TestConvertDecimalTypes::test_decimal_with_None_infer_type /buildbot/AMD64_Conda_Python_3_8/python/pyarrow/tests/test_pandas.py:114: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats result = pd.Series(arr.to_pandas(), name=s.name) pyarrow/tests/test_pandas.py::TestConvertDecimalTypes::test_strided_objects /buildbot/AMD64_Conda_Python_3_8/python/pyarrow/pandas_compat.py:1127: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats result = pa.lib.table_to_blocks(options, block_table, categories, {code} Related to https://bugs.python.org/issue36381 I think one such usage example is at https://github.com/apache/arrow/blob/0b481523b7502a984788d93b822a335894ffe648/cpp/src/arrow/python/decimal.cc#L106 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10284) [Python] Pyarrow is raising deprecation warning about filesystems on import
Joris Van den Bossche created ARROW-10284: - Summary: [Python] Pyarrow is raising deprecation warning about filesystems on import Key: ARROW-10284 URL: https://issues.apache.org/jira/browse/ARROW-10284 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche This happens on import (when setting the warning to be visible), so even when the user doesn't use the deprecated filesystems: {code} In [1]: import warnings In [2]: warnings.simplefilter("always") In [3]: import pyarrow /home/joris/scipy/repos/arrow/python/pyarrow/filesystem.py:255: DeprecationWarning: pyarrow.filesystem.LocalFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead. cls._instance = LocalFileSystem() {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10285) [Python] pyarrow.orc submodule is using deprecated functionality
Joris Van den Bossche created ARROW-10285: - Summary: [Python] pyarrow.orc submodule is using deprecated functionality Key: ARROW-10285 URL: https://issues.apache.org/jira/browse/ARROW-10285 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10347) [Python][Dataset] Test behaviour in case of duplicate partition field / data column
Joris Van den Bossche created ARROW-10347: - Summary: [Python][Dataset] Test behaviour in case of duplicate partition field / data column Key: ARROW-10347 URL: https://issues.apache.org/jira/browse/ARROW-10347 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10423) [C++] Filter compute function seems slow compared to numpy nonzero + take
Joris Van den Bossche created ARROW-10423: - Summary: [C++] Filter compute function seems slow compared to numpy nonzero + take Key: ARROW-10423 URL: https://issues.apache.org/jira/browse/ARROW-10423 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche From https://stackoverflow.com/questions/64581590/is-there-a-more-efficient-way-to-select-rows-from-a-pyarrow-table-based-on-conte I made a smaller, simplified example: {code:python} arr = pa.array(np.random.randn(1_000_000)) # mask with only few True values mask1 = np.zeros(len(arr), dtype=bool) mask1[np.random.randint(len(arr), size=100)] = True mask1_pa = pa.array(mask1) # mask with larger proportion of True values mask2 = np.zeros(len(arr), dtype=bool) mask2[np.random.randint(len(arr), size=10_000)] = True mask2_pa = pa.array(mask2) {code} Timing the Arrow {{Filter}} kernel vs using numpy to convert the mask into indices and then using a {{Take}} kernel: {code} # mask 1 In [3]: %timeit arr.filter(mask1_pa) 132 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 1 loops each) In [4]: %%timeit ...: indices = np.nonzero(mask1)[0] ...: arr.take(indices) 114 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 1 loops each) # mask 2 In [8]: %timeit arr.filter(mask2_pa) 711 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [9]: %%timeit ...: indices = np.nonzero(mask2)[0] ...: arr.take(indices) 333 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) {code} So in the first case, both are quite similar in timing, but in the second case the numpy+take version is faster. I know this might depend a lot on the actual proportion of True values and how they are positioned in the array (random vs concentrated) etc, so there is probably not a general rule of what should be faster. But it still seems a potential indication that things can be optimized in the Filter kernel. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10425) [Python] Support reading (compressed) CSV file from remote file / binary blob
Joris Van den Bossche created ARROW-10425: - Summary: [Python] Support reading (compressed) CSV file from remote file / binary blob Key: ARROW-10425 URL: https://issues.apache.org/jira/browse/ARROW-10425 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche From https://stackoverflow.com/questions/64588076/how-can-i-read-a-csv-gz-file-with-pyarrow-from-a-file-object Currently {{pyarrow.csv.read_csv}} happily takes a path to a compressed file and automatically decompresses it, but AFAIK this only works for local paths. It would be nice to in general support reading CSV from remote files (with a URI / specifying a filesystem), and in that case also support compression. In addition, we could also read a compressed file from a BytesIO / file-like object, but I'm not sure we want that (as it would require a keyword to indicate the used compression). -- This message was sent by Atlassian Jira (v8.3.4#803005)
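A manual workaround sketch in the meantime: open the remote file yourself and wrap it in a decompressing stream before handing it to the CSV reader (the bucket path here is a made-up example, assuming a gzip-compressed file on S3):
{code:python}
import pyarrow as pa
from pyarrow import fs, csv

s3 = fs.S3FileSystem()
with s3.open_input_stream("bucket/data.csv.gz") as raw:
    # decompress on the fly; the CSV reader accepts any file-like input
    stream = pa.CompressedInputStream(raw, "gzip")
    table = csv.read_csv(stream)
{code}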
[jira] [Created] (ARROW-10432) [C++] CSV reader: support for multi-character / whitespace delimiter?
Joris Van den Bossche created ARROW-10432: - Summary: [C++] CSV reader: support for multi-character / whitespace delimiter? Key: ARROW-10432 URL: https://issues.apache.org/jira/browse/ARROW-10432 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche I don't know how useful general "multi-character" delimiter support is, but one specific type of it that seems useful is "whitespace delimited", meaning any whitespace (possibly multiple / different whitespace characters). In pandas you can achieve this either by passing {{delimiter="\s+"}} or specifying {{delim_whitespace=True}} (and both are equivalent; pandas special-cases {{delimiter="\s+"}}, as any other multi-character delimiter is interpreted as an actual regex and triggers the slower python engine instead of using the default c engine). cc [~apitrou] [~npr] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10462) [Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows
Joris Van den Bossche created ARROW-10462: - Summary: [Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows Key: ARROW-10462 URL: https://issues.apache.org/jira/browse/ARROW-10462 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 2.0.1 Dask reported some failures starting with the pyarrow 2.0 release, and specifically on Windows: https://github.com/dask/dask/issues/6754 After some investigation, it seems that this is due to the {{path}} attribute of {{ParquetDatasetPiece}} now returning a path with a mixture of {{\\}} and {{/}} in it. It specifically happens when dask is passing a posix-style base path pointing to the dataset base directory (so using all {{/}}), and passing an fsspec-based (local) filesystem. From a debugging output during one of the dask tests: {code} (Pdb) dataset (Pdb) dataset.paths 'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0' (Pdb) dataset.pieces[0].path 'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet' {code} So you can see that the result here has a mix of {{\\}} and {{/}}. Using pyarrow 1.0, this was consistently using {{/}}. The reason for the change is that in pyarrow 2.0 we started to replace the fsspec LocalFileSystem with our own LocalFileSystem (assuming that, for a local filesystem, those should be equivalent). But it seems that our own LocalFileSystem has a {{pathsep}} property that equals {{os.path.sep}}, which is {{\\}} on Windows (https://github.com/apache/arrow/blob/9231976609d352b7050f5c706b86c15e8c604927/python/pyarrow/filesystem.py#L304-L306). So note that while this started being broken in pyarrow 2.0 when using an fsspec filesystem, this was already "broken" before when using our own local filesystem (or when not passing any filesystem). But, 1) dask always passes an fsspec filesystem, and 2) dask uses the piece's path as a dictionary key and is thus especially sensitive to the change (when using it as a file path to read something in, it will probably still work even with the mixture of path separators). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10469) [CI][Python] Run dask integration tests on Windows
Joris Van den Bossche created ARROW-10469: - Summary: [CI][Python] Run dask integration tests on Windows Key: ARROW-10469 URL: https://issues.apache.org/jira/browse/ARROW-10469 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche So we can catch bugs like ARROW-10462 in advance -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10471) [CI][Python] Ensure we have a test build with s3fs
Joris Van den Bossche created ARROW-10471: - Summary: [CI][Python] Ensure we have a test build with s3fs Key: ARROW-10471 URL: https://issues.apache.org/jira/browse/ARROW-10471 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10473) [Python] FSSpecHandler get_file_info with recursive selector not working with s3fs
Joris Van den Bossche created ARROW-10473: - Summary: [Python] FSSpecHandler get_file_info with recursive selector not working with s3fs Key: ARROW-10473 URL: https://issues.apache.org/jira/browse/ARROW-10473 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 2.0.1 The partitioned ParquetDataset tests are failing when using s3fs filesystem (I am adding tests in https://github.com/apache/arrow/pull/8573). I need to come up with a more minimal test isolating the FileSystem.get_file_info behaviour, but from debugging the parquet tests it seems that it is only listing the first level (and not further nested directories/files). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10482) [Python] Specifying compression type on a column basis when writing Parquet not working
Joris Van den Bossche created ARROW-10482: - Summary: [Python] Specifying compression type on a column basis when writing Parquet not working Key: ARROW-10482 URL: https://issues.apache.org/jira/browse/ARROW-10482 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche From https://stackoverflow.com/questions/64666270/using-per-column-compression-codec-in-parquet-write-table According to the docs, you can specify the compression type on a column-by-column basis, but that doesn't seem to be working: {code} In [5]: table = pa.table([[1, 2], [3, 4], [5, 6]], names=["foo", "bar", "baz"]) In [6]: pq.write_table(table, 'test1.parquet', compression=dict(foo='zstd',bar='snappy',baz='brotli')) ... ~/scipy/repos/arrow/python/pyarrow/_parquet.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string() TypeError: expected bytes, str found {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10546) [Python] Deprecate the S3FSWrapper class
Joris Van den Bossche created ARROW-10546: - Summary: [Python] Deprecate the S3FSWrapper class Key: ARROW-10546 URL: https://issues.apache.org/jira/browse/ARROW-10546 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-10433 / discussion at https://github.com/apache/arrow/pull/8557#issuecomment-724225124 The {{S3FSWrapper}} class was used in the past to wrap s3fs filesystems, before fsspec subclassed the {{pyarrow.filesystem}} filesystems. That is, however, already more than 2 years ago, and AFAIK nobody should still be using {{S3FSWrapper}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10558) [Python] Filesystem S3 tests not independent (native s3 influences s3fs)
Joris Van den Bossche created ARROW-10558: - Summary: [Python] Filesystem S3 tests not independent (native s3 influences s3fs) Key: ARROW-10558 URL: https://issues.apache.org/jira/browse/ARROW-10558 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche The filesystem tests in {{test_fs.py}} that are parametrized with all the tested filesystems share some "state" between them, at least in the case of S3. When a test is first run with our own S3FileSystem, which eg creates a directory, this directory is still present when we test the s3fs-wrapped filesystem, which causes some tests to pass that would otherwise fail if run in isolation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10578) [C++] Comparison kernels crashing for string array with null string scalar
Joris Van den Bossche created ARROW-10578: - Summary: [C++] Comparison kernels crashing for string array with null string scalar Key: ARROW-10578 URL: https://issues.apache.org/jira/browse/ARROW-10578 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Comparing a string array with a string scalar works: {code} In [1]: import pyarrow.compute as pc In [2]: pc.equal(pa.array(["a", None, "b"]), pa.scalar("a", type="string")) Out[2]: [ true, null, false ] {code} but if the scalar is a null (of the proper string type), it crashes: {code} In [4]: pc.equal(pa.array(["a", None, "b"]), pa.scalar(None, type="string")) Segmentation fault (core dumped) {code} (and not even a debug message ...) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10640) [C++] A "where" kernel to combine two arrays based on a mask
Joris Van den Bossche created ARROW-10640: - Summary: [C++] A "where" kernel to combine two arrays based on a mask Key: ARROW-10640 URL: https://issues.apache.org/jira/browse/ARROW-10640 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Joris Van den Bossche (from discussion in ARROW-9489 with [~maartenbreddels]) A general "where" kernel like {{np.where}} (https://numpy.org/doc/stable/reference/generated/numpy.where.html) seems a generally useful kernel to have, and could also help mimic some other python (setitem-like) operations. The concrete use case in ARROW-9489 is to basically do a {{fill_null(array[string], array[string])}}, which could be expressed as {{where(is_null(arr), arr2, arr)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
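The requested semantics, illustrated with numpy ({{np.where}} is the model for the proposed kernel; a pyarrow "where" kernel itself does not exist yet):
{code:python}
import numpy as np

mask = np.array([True, False, True])
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
# pick from a where mask is True, from b where it is False
np.where(mask, a, b)  # -> array([ 1, 20,  3])
{code}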
[jira] [Created] (ARROW-10641) [C++] A "replace" or "map" kernel to replace values in array based on mapping
Joris Van den Bossche created ARROW-10641: - Summary: [C++] A "replace" or "map" kernel to replace values in array based on mapping Key: ARROW-10641 URL: https://issues.apache.org/jira/browse/ARROW-10641 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Joris Van den Bossche A "replace" or "map" kernel to replace values in an array based on a mapping. This would be similar to the pandas {{Series.replace}} (or {{Series.map}}) kernel, and as a small illustration of what is meant: {code} In [41]: s = pd.Series(["Yes", "Y", "No", "N"]) In [42]: s Out[42]: 0 Yes 1 Y 2 No 3 N dtype: object In [43]: s.replace({"Y": "Yes", "N": "No"}) Out[43]: 0 Yes 1 Yes 2 No 3 No dtype: object {code} Note: in pandas the difference between "replace" and "map" is that replace will only replace a value if it is present in the mapping, while map will replace every value in the input array with the corresponding value in the mapping and return null if not present in the mapping. Note: this is different from ARROW-10306, which is about string replacement _within_ array elements (replacing a substring in each string element of the array), while here it is about replacing full elements of the array. cc [~maartenbreddels] -- This message was sent by Atlassian Jira (v8.3.4#803005)
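A hedged workaround sketch using existing kernels: "map" semantics can be emulated by looking up each element in a key array and taking from a value array (values not found in the mapping come back as null):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["Yes", "Y", "No", "N"])
keys = pa.array(["Y", "N"])
values = pa.array(["Yes", "No"])
indices = pc.index_in(arr, value_set=keys)  # null where not found
mapped = values.take(indices)               # -> [null, "Yes", null, "No"]
{code}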
[jira] [Created] (ARROW-10643) [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe
Joris Van den Bossche created ARROW-10643: - Summary: [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe Key: ARROW-10643 URL: https://issues.apache.org/jira/browse/ARROW-10643 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Joris Van den Bossche From https://github.com/pandas-dev/pandas/issues/37897 The roundtrip of an empty pandas.DataFrame _with_ an index (so no columns, but a non-zero shape for the rows) isn't faithful: {code} In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1)) In [34]: df Out[34]: Empty DataFrame Columns: [] Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] In [35]: df.shape Out[35]: (10, 0) In [36]: table = pa.table(df) In [37]: table.to_pandas() Out[37]: Empty DataFrame Columns: [] Index: [] In [38]: table.to_pandas().shape Out[38]: (0, 0) {code} Since the pandas metadata in the Table actually has this RangeIndex information: {code} In [39]: table.schema.pandas_metadata Out[39]: {'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'stop': 10, 'step': 1}], 'column_indexes': [{'name': None, 'field_name': None, 'pandas_type': 'empty', 'numpy_type': 'object', 'metadata': None}], 'columns': [], 'creator': {'library': 'pyarrow', 'version': '3.0.0.dev162+g305160495'}, 'pandas_version': '1.2.0.dev0+1225.g91f5bfcdc4'} {code} we should in principle be able to correctly roundtrip this case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10644) [Python] Consolidate path/filesystem handling in pyarrow.dataset and pyarrow.fs
Joris Van den Bossche created ARROW-10644: - Summary: [Python] Consolidate path/filesystem handling in pyarrow.dataset and pyarrow.fs Key: ARROW-10644 URL: https://issues.apache.org/jira/browse/ARROW-10644 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche The {{pyarrow.dataset}} module grew some custom code to deal with paths and filesystems, but the {{pyarrow.fs}} package also has some general utilities for this. These should be consolidated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10663) [C++/Doc] The IsIn kernel ignores the skip_nulls option of SetLookupOptions
Joris Van den Bossche created ARROW-10663: - Summary: [C++/Doc] The IsIn kernel ignores the skip_nulls option of SetLookupOptions Key: ARROW-10663 URL: https://issues.apache.org/jira/browse/ARROW-10663 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 3.0.0 The C++ docs of {{SetLookupOptions}} have this explanation of the {{skip_nulls}} option: {code} /// Whether nulls in `value_set` count for lookup. /// /// If true, any null in `value_set` is ignored and nulls in the input /// produce null (IndexIn) or false (IsIn) values in the output. /// If false, any null in `value_set` is successfully matched in /// the input. bool skip_nulls; {code} (from https://github.com/apache/arrow/blob/8b9f6b9d28b4524724e60fac589fb1a3552a32b4/cpp/src/arrow/compute/api_scalar.h#L78-L84) However, for {{IsIn}} this explanation doesn't seem to hold in practice: {code} In [16]: arr = pa.array([1, 2, None]) In [17]: pc.is_in(arr, value_set=pa.array([1, None]), skip_null=True) Out[17]: [ true, false, true ] In [18]: pc.is_in(arr, value_set=pa.array([1, None]), skip_null=False) Out[18]: [ true, false, true ] {code} This documentation was added in https://github.com/apache/arrow/pull/7695 (ARROW-8989). BTW, for "index_in", it works as documented: {code} In [19]: pc.index_in(arr, value_set=pa.array([1, None]), skip_null=True) Out[19]: [ 0, null, null ] In [20]: pc.index_in(arr, value_set=pa.array([1, None]), skip_null=False) Out[20]: [ 0, null, 1 ] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
Joris Van den Bossche created ARROW-10695: - Summary: [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset Key: ARROW-10695 URL: https://issues.apache.org/jira/browse/ARROW-10695 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently we allow the user to specify a {{basename_template}}, and this can include a {{"\{i\}"}} part that is replaced with an automatically incremented integer (so each generated file written to a single partition is unique): https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717 It _might_ be useful to also have the ability to use a UUID, to ensure the file is unique in general (not only within a single write) and to mimic the behaviour of the old {{write_to_dataset}} implementation. For example, we could look for a {{"\{uuid\}"}} in the template string, and if present replace it for each file with a new UUID. -- This message was sent by Atlassian Jira (v8.3.4#803005)
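A workaround sketch that is possible today: generate one UUID per {{write_dataset}} call and embed it in the {{basename_template}} (this makes the *write* unique, not each individual file, which is what the proposed {{"\{uuid\}"}} placeholder would improve on):
{code:python}
import uuid
import pyarrow.dataset as ds

# "{i}" is still expanded per file by write_dataset; the uuid part is fixed
# for the whole call (table/dataset_root are placeholders for illustration)
template = "part-{}-{{i}}.parquet".format(uuid.uuid4().hex)
# ds.write_dataset(table, "dataset_root", format="parquet",
#                  basename_template=template)
{code}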
[jira] [Created] (ARROW-10726) [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data
Joris Van den Bossche created ARROW-10726: - Summary: [Python] Reading multiple parquet files with different index column dtype (originating pandas) reads wrong data Key: ARROW-10726 URL: https://issues.apache.org/jira/browse/ARROW-10726 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 3.0.0 See https://github.com/pandas-dev/pandas/issues/38058 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10805) [C++] CSV reader: option to ignore trailing delimiters
Joris Van den Bossche created ARROW-10805: - Summary: [C++] CSV reader: option to ignore trailing delimiters Key: ARROW-10805 URL: https://issues.apache.org/jira/browse/ARROW-10805 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche It is not uncommon to have a CSV file that has "trailing" delimiters. For example, I ran into something like this: {code} 1|2|3| 4|5|6| {code} where we currently detect 4 columns. If you want to properly read this in while passing the column names, you need to add a "dummy" column name for the non-existing last column (and specify the actual column names to {{include_columns}} to drop it again): {code:python} column_names = [...] csv.read_csv( "path/to/file.csv", read_options=csv.ReadOptions(column_names=column_names + ["dummy"]), parse_options=csv.ParseOptions(delimiter="|"), convert_options=csv.ConvertOptions(include_columns=column_names) ) {code} Pandas has indirect support for it through the {{index_col=False}} option (see https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#index-columns-and-trailing-delimiters, i.e. when the length of the names is 1 shorter than the detected number of columns and this last column is all empty, it will drop it). Although the above provides a workaround, it might be nice to have out of the box support for it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10845) [Python][CI] Add python CI build using numpy nightly
Joris Van den Bossche created ARROW-10845: - Summary: [Python][CI] Add python CI build using numpy nightly Key: ARROW-10845 URL: https://issues.apache.org/jira/browse/ARROW-10845 Project: Apache Arrow Issue Type: Improvement Components: CI, Python Reporter: Joris Van den Bossche Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10849) [Python] Handle numpy deprecation warnings for builtin type aliases
Joris Van den Bossche created ARROW-10849: - Summary: [Python] Handle numpy deprecation warnings for builtin type aliases Key: ARROW-10849 URL: https://issues.apache.org/jira/browse/ARROW-10849 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche See https://numpy.org/devdocs/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7063) [C++] Schema print method prints too much metadata
[ https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967438#comment-16967438 ] Joris Van den Bossche commented on ARROW-7063: -- I also ran into this recently when looking at the reports involving a huge number of columns (although that was in Python, and I see that we don't use the exact same code as the C++ pretty printer: https://github.com/apache/arrow/blob/e0cc9c43276840579a29332aca7348bbc415c051/python/pyarrow/types.pxi#L1245-L1264). We should probably at least truncate the metadata. Personally I would prefer truncating them (so they don't get annoying) instead of not showing them at all, as IMO it is useful to see that the table has metadata. We could for example truncate each entry to a max of 50 characters (adding {{...}}) while still showing all entries (all keys). {quote}And IDK what to do with this {{ARROW:schema: }} business but it's clearly not readable as is.{quote} It's the original arrow schema in serialized format. Example in Python of how it is created when writing a parquet file, and retrieving it again: {code} In [33]: import pyarrow as pa In [34]: table = pa.table(pd.DataFrame({'a': [1, 2, 3]})) In [35]: table Out[35]: pyarrow.Table a: int64 metadata {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "' b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_' b'name": null, "pandas_type": "unicode", "numpy_type": "object", "' b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f' b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", ' b'"metadata": null}], "creator": {"library": "pyarrow", "version":' b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691' b'.g157495696.dirty"}'} In [36]: import pyarrow.parquet as pq In [37]: pq.write_table(table, 'test.parquet') In [39]: schema = pq.read_schema('test.parquet') In [40]: schema Out[40]: a: int64 metadata {b'ARROW:schema': b'/4ACAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwA' b'AAAEAAgACggCAAAEAQwIAAwABAAIAAgI' b'EAYAAABwYW5kYXMAANMBAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJr' b'aW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAi' b'c3RvcCI6IDMsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBb' b'eyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFz' b'X3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIs' b'ICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29s' b'dW1ucyI6IFt7Im5hbWUiOiAiYSIsICJmaWVsZF9uYW1lIjogImEiLCAi' b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2' b'NCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJh' b'cnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjAuMTUuMS5kZXYyMTIr' b'ZzRhZmU5ZjBlYSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMC4yNi4wLmRl' b'djArNjkxLmcxNTc0OTU2OTYuZGlydHkifQABFBAAFAAIAAYA' b'BwAMEAAQAAABAiQUBAAIAAwACAAHAAgA' b'AAABQAEAAABh', b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "' b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_' b'name": null, "pandas_type": "unicode", "numpy_type": "object", "' b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f' b'ield_name": "a", "pandas_type": "int64", "numpy_type":
[jira] [Created] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?
Joris Van den Bossche created ARROW-7066: Summary: [Python] support returning ChunkedArray from __arrow_array__ ? Key: ARROW-7066 URL: https://issues.apache.org/jira/browse/ARROW-7066 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 The {{\_\_arrow_array\_\_}} protocol was added so that custom objects can define how they should be converted to a pyarrow Array (similar to numpy's {{\_\_array\_\_}}). This is then also used to support converting pandas DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if the pandas ExtensionArray, such as the nullable integer type, implements this {{\_\_arrow_array\_\_}} method). This last use case could also be useful for fletcher (https://github.com/xhochy/fletcher/, a package that implements pandas ExtensionArrays that wrap pyarrow arrays, so they can be stored as is in a pandas DataFrame). However, fletcher stores ChunkedArrays in the ExtensionArray / the columns of a pandas DataFrame (to have a better mapping with a Table, where the columns also consist of chunked arrays), while we currently require that the return value of {{\_\_arrow_array\_\_}} is a pyarrow.Array. So I was wondering: could we relax this constraint and also allow ChunkedArray as return value? However, this protocol is currently called in the {{pa.array(..)}} function, which probably should keep returning an Array (and not a ChunkedArray in certain cases). cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
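A minimal sketch of the protocol as it works today, with a hypothetical wrapper class (the proposal above is to additionally allow returning a {{pa.ChunkedArray}}):
{code:python}
import pyarrow as pa

class MyColumn:
    """Hypothetical container that knows how to convert itself to Arrow."""
    def __init__(self, data):
        self.data = list(data)

    def __arrow_array__(self, type=None):
        # called by pa.array(...) to control the conversion
        return pa.array(self.data, type=type)

arr = pa.array(MyColumn([1, 2, 3]))  # -> pyarrow.Array of int64
{code}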
[jira] [Resolved] (ARROW-7023) [Python] pa.array does not use "from_pandas" semantics for pd.Index
[ https://issues.apache.org/jira/browse/ARROW-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-7023. -- Resolution: Fixed Issue resolved by pull request 5753 [https://github.com/apache/arrow/pull/5753] > [Python] pa.array does not use "from_pandas" semantics for pd.Index > --- > > Key: ARROW-7023 > URL: https://issues.apache.org/jira/browse/ARROW-7023 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Reporter: Joris Van den Bossche > Assignee: Joris Van den Bossche > Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > {code} > In [15]: idx = pd.Index([1, 2, np.nan], dtype=object) > In [16]: pa.array(idx) > Out[16]: [ 1, 2, nan ] > In [17]: pa.array(idx, from_pandas=True) > Out[17]: [ 1, 2, null ] > In [18]: pa.array(pd.Series(idx)) > Out[18]: [ 1, 2, null ] > {code} > We should probably handle Series and Index the same in this regard. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7068) [C++] Expose the offsets of a ListArray as a Int32Array
Joris Van den Bossche created ARROW-7068: Summary: [C++] Expose the offsets of a ListArray as a Int32Array Key: ARROW-7068 URL: https://issues.apache.org/jira/browse/ARROW-7068 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche As a follow-up to ARROW-7031 (https://github.com/apache/arrow/pull/5759), we can move this into C++ and use that implementation from Python. Cf. [https://github.com/apache/arrow/pull/5759#discussion_r342244521], this could be a {{ListArray::value_offsets_array}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
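Until such an accessor exists, the offsets can already be inspected from Python through the raw buffers; a rough sketch (assuming a non-sliced ListArray, whose second buffer holds the int32 offsets):
{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array([[1, 2], [3], [4, 5, 6]])
# Buffer layout of a ListArray: [validity bitmap, int32 offsets, <child buffers>].
offsets = np.frombuffer(arr.buffers()[1], dtype=np.int32)[:len(arr) + 1]
print(offsets)  # [0 2 3 6]
{code}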
[jira] [Commented] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap
[ https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968345#comment-16968345 ] Joris Van den Bossche commented on ARROW-7071: -- > NB: I'm not sure what kind of pitfalls there might be if replacing an > existing validity bitmap and exposing some previously-null values I would say this is the responsibility of the user then? What could happen? Are there potentially cases where interpreting the memory of a previously-null value as a value leads to segfaults? Like if you would do: {code} In [62]: a = pa.array([1, None, 3]) In [63]: np.frombuffer(a.buffers()[1], dtype="int64") Out[63]: array([1, 0, 3]) {code} > [Python] Add Array convenience method to create "masked" view with different > validity bitmap > > > Key: ARROW-7071 > URL: https://issues.apache.org/jira/browse/ARROW-7071 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > NB: I'm not sure what kind of pitfalls there might be if replacing an > existing validity bitmap and exposing some previously-null values -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap
[ https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968354#comment-16968354 ] Joris Van den Bossche commented on ARROW-7071: -- Now, I think the main question is: what API could we offer for this?
* A method on Array? Something like {{array.set_validity_bitmap(..)}} or {{array.set_null_bitmap(..)}} (but not sure if it needs to be that clearly exposed)
* A settable attribute like {{array.null_bitmap}}
* A function to create a new array from a given array + bitmap? This could be similar to {{Array.from_buffers}}, but a bit more convenient to use (currently you can already use that to achieve this purpose)
* An alternative could be to expand {{pa.array(values, mask=[..])}} to accept a pyarrow array as values, and then use the {{mask}} keyword to specify the nulls as a boolean mask (although the current behaviour here is to have the final bitmap be a combination of the nulls in the values and the mask, so this is not a way to override the bitmap, but maybe that's actually good)
A way to avoid the issue of "previously-null values" could also be to only allow setting the bitmap if there was not yet one before. That would be enough for my original use case, where I want to create a StructArray from two pyarrow arrays, but also give it a null bitmap (which {{from_arrays}} currently does not support): {code} pa.StructArray.from_arrays([pa.array([1, 2, 3]), pa.array([2, 3, 4])], names=['a', 'b']) {code} For this very specific case, an option could also be to allow passing a bitmap or mask keyword to {{pa.StructArray.from_arrays}}, but that's of course not a general solution for other types. > [Python] Add Array convenience method to create "masked" view with different > validity bitmap > > > Key: ARROW-7071 > URL: https://issues.apache.org/jira/browse/ARROW-7071 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > NB: I'm not sure what kind of pitfalls there might be if replacing an > existing validity bitmap and exposing some previously-null values -- This message was sent by Atlassian Jira (v8.3.4#803005)
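The {{Array.from_buffers}} route mentioned above already works today; a sketch for a primitive array (the validity bitmap is bit-packed, least-significant bit first, so 0b101 marks elements 0 and 2 as valid):
{code:python}
import pyarrow as pa

arr = pa.array([1, 2, 3])
# Validity bitmap 0b101: elements 0 and 2 valid, element 1 null.
validity = pa.py_buffer(bytes([0b00000101]))
masked = pa.Array.from_buffers(arr.type, len(arr), [validity, arr.buffers()[1]])
print(masked.to_pylist())  # [1, None, 3]
{code}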
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968370#comment-16968370 ] Joris Van den Bossche commented on ARROW-6820: -- To see the description in the (old) docs, this link can be used: https://github.com/apache/arrow/blob/apache-arrow-0.14.0/docs/source/format/Layout.rst#map-type The https://arrow.apache.org/docs/format/Layout.html#map-type link from the description no longer works, and a similar section is not available in https://arrow.apache.org/docs/format/Columnar.html. I suppose it was removed in the format docs refactor (ARROW-6820) because the map type is considered a logical type and not a physical type? > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968373#comment-16968373 ] Joris Van den Bossche commented on ARROW-6820: -- Another inconsistency is that Schema.fbs speaks about "entry", not "entries" > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly
[ https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968398#comment-16968398 ] Joris Van den Bossche commented on ARROW-7076: -- There are no binary wheels available yet for Python 3.8, so pip is trying to build from source. And then it appears something goes wrong with installing/finding numpy, which seems similar to the error reported in ARROW-5210. As I mentioned there, this is an error in the pyproject.toml: we do not list numpy as a build dependency (pip will create a new environment with all build dependencies, so installing numpy beforehand does not solve it). Now, even if the pyproject.toml correctly listed this, it is quite likely that installing from source with just {{pip install pyarrow}} is not going to work, as there are a lot of other (non-python) dependencies that you would need to ensure are available. If you do want to install from source, see https://arrow.apache.org/docs/developers/python.html#python-development for detailed instructions; otherwise you will need to wait until there are wheels available or use Python 3.7 instead of 3.8. > `pip install pyarrow` with python 3.8 fail with message : Could not build > wheels for pyarrow which use PEP 517 and cannot be installed directly > --- > > Key: ARROW-7076 > URL: https://issues.apache.org/jira/browse/ARROW-7076 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 > Environment: Ubuntu 19.10 / Python 3.8.0 >Reporter: Fabien >Priority: Minor > > When I install pyarrow in python 3.7.5 with `pip install pyarrow` it works. > However with python 3.8.0 it fails with the following error : > {noformat} > 14:06 $ pip install pyarrow > Collecting pyarrow > Using cached > https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz > Installing build dependencies ... done > Getting requirements to build wheel ... done > Preparing wheel metadata ... done > Collecting numpy>=1.14 > Using cached > https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl > Collecting six>=1.0.0 > Using cached > https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl > Building wheels for collected packages: pyarrow > Building wheel for pyarrow (PEP 517) ... 
error > ERROR: Command errored out with exit status 1: > command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 > /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py > build_wheel /tmp/tmp4gpyu82j > cwd: /tmp/pip-install-cj5ucedq/pyarrow > Complete output (490 lines): > running bdist_wheel > running build > running build_py > creating build > creating build/lib.linux-x86_64-3.8 > creating build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.8/pyarrow > creating build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_strategies.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_array.py -> > build/lib.linux-x8
[jira] [Comment Edited] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly
[ https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968398#comment-16968398 ] Joris Van den Bossche edited comment on ARROW-7076 at 11/6/19 2:31 PM: --- There are no binary wheels available yet for Python 3.8, so pip is trying to build from source. And then it appears something goes wrong with installing/finding numpy, which seems similar to the error reported in ARROW-5210. As I mentioned there, this is an error in the pyproject.toml: we do not list numpy as a build dependency (pip will create a new environment with all build dependencies, so installing numpy beforehand does not solve it). Now, even if the pyproject.toml correctly listed this, it is quite likely that installing from source with just {{pip install pyarrow}} is not going to work, as there are a lot of other (non-python) dependencies that you would need to ensure are available. If you do want to install from source, see https://arrow.apache.org/docs/developers/python.html#python-development for detailed instructions; otherwise you will need to wait until there are wheels available, or use Python 3.7 instead of 3.8, or use conda instead (conda-forge already has binary packages of pyarrow for Python 3.8). was (Author: jorisvandenbossche): There are not yet binary wheels available for Python 3.8, so therefore pip is trying to build from source. And then it appears something goes wrong with installing/finding numpy, which seems similar to the error reported in ARROW-5210. As I mentioned there, this is an error in the pyproject.toml that we do not list numpy as a build dependency (pip will create a new environment with all build dependencies, therefore installing numpy before hand does not solve it). Now, even if the pyproject.toml would correctly list this, it is quite likely that installing from source with just {{pip install pyarrow}} is not going to work, as there are a lot of other (non-python) dependencies that you would need to ensure are available. If you do want to install from source, see https://arrow.apache.org/docs/developers/python.html#python-development for detailed instructions), otherwise you will need to wait until there are wheels available or use Python 3.7 instead of 3.8. > `pip install pyarrow` with python 3.8 fail with message : Could not build > wheels for pyarrow which use PEP 517 and cannot be installed directly > --- > > Key: ARROW-7076 > URL: https://issues.apache.org/jira/browse/ARROW-7076 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 > Environment: Ubuntu 19.10 / Python 3.8.0 >Reporter: Fabien >Priority: Minor > > When I install pyarrow in python 3.7.5 with `pip install pyarrow` it works. > However with python 3.8.0 it fails with the following error : > {noformat} > 14:06 $ pip install pyarrow > Collecting pyarrow > Using cached > https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz > Installing build dependencies ... done > Getting requirements to build wheel ... done > Preparing wheel metadata ... 
done > Collecting numpy>=1.14 > Using cached > https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl > Collecting six>=1.0.0 > Using cached > https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl > Building wheels for collected packages: pyarrow > Building wheel for pyarrow (PEP 517) ... error > ERROR: Command errored out with exit status 1: > command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 > /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py > build_wheel /tmp/tmp4gpyu82j > cwd: /tmp/pip-install-cj5ucedq/pyarrow > Complete output (490 lines): > running bdist_wheel > running build > running build_py > creating build > creating build/lib.linux-x86_64-3.8 > creating build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/py
[jira] [Commented] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly
[ https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968417#comment-16968417 ] Joris Van den Bossche commented on ARROW-7076: -- See ARROW-6920 for wheels for Python 3.8 (I suppose they will only get added for the latest pyarrow release, 0.15.1) > `pip install pyarrow` with python 3.8 fail with message : Could not build > wheels for pyarrow which use PEP 517 and cannot be installed directly > --- > > Key: ARROW-7076 > URL: https://issues.apache.org/jira/browse/ARROW-7076 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 > Environment: Ubuntu 19.10 / Python 3.8.0 >Reporter: Fabien >Priority: Minor > > When I install pyarrow in python 3.7.5 with `pip install pyarrow` it works. > However with python 3.8.0 it fails with the following error : > {noformat} > 14:06 $ pip install pyarrow > Collecting pyarrow > Using cached > https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz > Installing build dependencies ... done > Getting requirements to build wheel ... done > Preparing wheel metadata ... done > Collecting numpy>=1.14 > Using cached > https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl > Collecting six>=1.0.0 > Using cached > https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl > Building wheels for collected packages: pyarrow > Building wheel for pyarrow (PEP 517) ... error > ERROR: Command errored out with exit status 1: > command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 > /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py > build_wheel /tmp/tmp4gpyu82j > cwd: /tmp/pip-install-cj5ucedq/pyarrow > Complete output (490 lines): > running bdist_wheel > running build > running build_py > creating build > creating build/lib.linux-x86_64-3.8 > creating build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow > copying pyarrow/__init__.py -> 
build/lib.linux-x86_64-3.8/pyarrow > creating build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_strategies.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_array.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_tensor.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_json.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_cython.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_deprecations.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/conftest.py -> build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_memory.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_io.py -> build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/pandas_examples.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/test_compute.py -> > build/lib.linux-x86_64-3.8/pyarrow/tests > copying pyarrow/tests/util.py -> build/lib.linux-x86_64-3.8/py
[jira] [Assigned] (ARROW-3444) [Python] Table.nbytes attribute
[ https://issues.apache.org/jira/browse/ARROW-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-3444: Assignee: Joris Van den Bossche > [Python] Table.nbytes attribute > --- > > Key: ARROW-3444 > URL: https://issues.apache.org/jira/browse/ARROW-3444 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Dave Hirschfeld >Assignee: Joris Van den Bossche >Priority: Minor > Fix For: 1.0.0 > > > As it says in the title, I think this would be a very handy attribute to have > available in Python. You can get it by converting to pandas and using > `DataFrame.nbytes` but this is wasteful of both time and memory so it would > be good to have this information on the `pyarrow.Table` object itself. > This could be implemented using the > [__sizeof__|https://docs.python.org/3/library/sys.html#sys.getsizeof] protocol -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7071) [Python] Add Array convenience method to create "masked" view with different validity bitmap
[ https://issues.apache.org/jira/browse/ARROW-7071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972151#comment-16972151 ] Joris Van den Bossche commented on ARROW-7071: -- Would it then be OK to say that "it is the responsibility of the user to not expose undefined values" (so that you are only adding nulls)? Or do we need to guard against this? > [Python] Add Array convenience method to create "masked" view with different > validity bitmap > > > Key: ARROW-7071 > URL: https://issues.apache.org/jira/browse/ARROW-7071 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > NB: I'm not sure what kind of pitfalls there might be if replacing an > existing validity bitmap and exposing some previously-null values -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?
[ https://issues.apache.org/jira/browse/ARROW-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972155#comment-16972155 ] Joris Van den Bossche commented on ARROW-7066: -- I still don't fully like returning a chunked array from {{pa.array}}, but I also don't see another easy solution to get the roundtrip working for e.g. fletcher, which uses chunked arrays (an alternative would be to have an "internal" version of {{pa.array(..)}} that allows this, and keep the public one strict, but that is also rather ugly). I will add some documentation update to the current open PR. > [Python] support returning ChunkedArray from __arrow_array__ ? > -- > > Key: ARROW-7066 > URL: https://issues.apache.org/jira/browse/ARROW-7066 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The {{\_\_arrow_array\_\_}} protocol was added so that custom objects can > define how they should be converted to a pyarrow Array (similar to numpy's > {{\_\_array\_\_}}). This is then also used to support converting pandas > DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if > the pandas ExtensionArray, such as nullable integer type, implements this > {{\_\_arrow_array\_\_}} method). > This last use case could also be useful for fletcher > (https://github.com/xhochy/fletcher/, a package that implements pandas > ExtensionArrays that wrap pyarrow arrays, so they can be stored as is in a > pandas DataFrame). > However, fletcher stores ChunkedArrays in the ExtensionArray / the columns of a > pandas DataFrame (to have a better mapping with a Table, where the columns > also consist of chunked arrays), while we currently require that the return > value of {{\_\_arrow_array\_\_}} is a pyarrow.Array. > So I was wondering: could we relax this constraint and also allow > ChunkedArray as return value? > However, this protocol is currently called in the {{pa.array(..)}} function, > which probably should keep returning an Array (and not a ChunkedArray in > certain cases). > cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973192#comment-16973192 ] Joris Van den Bossche commented on ARROW-6820: -- If both C++ and Java use "entries", we can also update the format spec? (since it is not a required name and only a recommendation, I would think it is not really a "format change" to update that description?) > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7154) [C++] Build error when building tests but not with snappy
Joris Van den Bossche created ARROW-7154: Summary: [C++] Build error when building tests but not with snappy Key: ARROW-7154 URL: https://issues.apache.org/jira/browse/ARROW-7154 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Since the docker-compose PR landed, I am having build errors like: {code:java} [361/376] Linking CXX executable debug/arrow-python-test FAILED: debug/arrow-python-test : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++ -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2 -g -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -rdynamic src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o -o debug/arrow-python-test -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread -ldl -lutil -lrt -ldl /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && : /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link) /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link) /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::system::detail::generic_category_ncx()' /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::filesystem::path::operator/=(boost::filesystem::path const&)' collect2: error: ld returned 1 exit status {code} which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found" (although this is certainly present). The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to OFF, it works fine. It also seems to be related to this specific change in the docker compose PR: {code:java} diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index c80ac3310..3b3c9eb8f 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -266,6 +266,15 @@ endif(UNIX) # Set up various options # -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) - # Currently the compression tests require at least these libraries; bz2 and - # zstd are optional. 
See ARROW-3984 - set(ARROW_WITH_BROTLI ON) - set(ARROW_WITH_LZ4 ON) - set(ARROW_WITH_SNAPPY ON) - set(ARROW_WITH_ZLIB ON) -endif() - if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION) set(ARROW_JSON ON) endif() {code} If I add that back, the build works.
With only `set(ARROW_WITH_BROTLI ON)`, it still fails.
With only `set(ARROW_WITH_LZ4 ON)`, it also fails, but with an error about liblz4 instead of libboost (even though liblz4 is also actually present).
With only `set(ARROW_WITH_SNAPPY ON)`, it works.
With only `set(ARROW_WITH_ZLIB ON)`, it also fails, but with an error about libz.so.1 not found.
So it seems that the absence of snappy causes the others to fail. In the recommended build settings in the development docs ([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]), the compression libraries are enabled. But I was still building without them (stemming from the time they were enabled by default). So I was using: {code} cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \ -DCMAKE_INSTALL_LIBDIR=lib \ -DARROW_PARQUET=ON \ -DARROW_PYTHON=ON \ -DARROW_BUILD_TESTS=ON \ .. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7154) [C++] Build error when building tests but not with snappy
[ https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7154: - Description: Since the docker-compose PR landed, I am having build errors like: {code:java} [361/376] Linking CXX executable debug/arrow-python-test FAILED: debug/arrow-python-test : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++ -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2 -g -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -rdynamic src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o -o debug/arrow-python-test -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread -ldl -lutil -lrt -ldl /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && : /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link) /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link) /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::system::detail::generic_category_ncx()' /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::filesystem::path::operator/=(boost::filesystem::path const&)' collect2: error: ld returned 1 exit status {code} which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found" (although this is certainly present). The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to OFF, it works fine. It also seems to be related to this specific change in the docker compose PR: {code:java} diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index c80ac3310..3b3c9eb8f 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -266,6 +266,15 @@ endif(UNIX) # Set up various options # -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) - # Currently the compression tests require at least these libraries; bz2 and - # zstd are optional. 
See ARROW-3984 - set(ARROW_WITH_BROTLI ON) - set(ARROW_WITH_LZ4 ON) - set(ARROW_WITH_SNAPPY ON) - set(ARROW_WITH_ZLIB ON) -endif() - if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION) set(ARROW_JSON ON) endif() {code} If I add that back, the build works.
With only `set(ARROW_WITH_BROTLI ON)`, it still fails.
With only `set(ARROW_WITH_LZ4 ON)`, it also fails, but with an error about liblz4 instead of libboost (even though liblz4 is also actually present).
With only `set(ARROW_WITH_SNAPPY ON)`, it works.
With only `set(ARROW_WITH_ZLIB ON)`, it also fails, but with an error about libz.so.1 not found.
With both `set(ARROW_WITH_SNAPPY ON)` and `set(ARROW_WITH_ZLIB ON)`, it also works.
So it seems that the absence of snappy causes the others to fail. In the recommended build settings in the development docs ([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]), the compression libraries are enabled. But I was still building without them (stemming from the time they were enabled by default). So I was using: {code} cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \ -DCMAKE_INSTALL_LIBDIR=lib \ -DARROW_PARQUET=ON \ -DARROW_PYTHON=ON \ -DARROW_BUILD_TESTS=ON \ .. {code} was: Since the docker-compose PR landed, I am having build errors like: {code:java} [361/376] Linking CXX executable debug/arrow-python-test FAILED: debug/arrow-python-test : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++ -Wno-noexcept-type -fvisibility-inlines-hidde
[jira] [Commented] (ARROW-7154) [C++] Build error when building tests but not with snappy
[ https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973461#comment-16973461 ] Joris Van den Bossche commented on ARROW-7154: -- After creating a new conda env from scratch (which now has boost 1.70 instead of the 1.68 in my old env; not sure if that is relevant), the problem went away. So it might be OK to close this issue. > [C++] Build error when building tests but not with snappy > - > > Key: ARROW-7154 > URL: https://issues.apache.org/jira/browse/ARROW-7154 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > > Since the docker-compose PR landed, I am having build errors like: > {code:java} > [361/376] Linking CXX executable debug/arrow-python-test > FAILED: debug/arrow-python-test > : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache > /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++ > -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 > -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong > -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0 > -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror > -msse4.2 -g -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro > -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -rdynamic > src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o -o > debug/arrow-python-test > -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib > debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 > debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 > /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread > -lpthread -ldl -lutil -lrt -ldl > /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a > /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so > jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt > /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && : > /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, > not found (try using -rpath or -rpath-link) > /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not > found (try using -rpath or -rpath-link) > /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > debug/libarrow.so.100.0.0: undefined reference to > `boost::system::detail::generic_category_ncx()' > /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > debug/libarrow.so.100.0.0: undefined reference to > `boost::filesystem::path::operator/=(boost::filesystem::path const&)' > collect2: error: ld returned 1 exit status > {code} > which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed > by debug/libarrow.so.100.0.0, not found" (although this is certainly present). > The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set > to OFF, it works fine. 
> It also seems to be related to this specific change in the docker compose PR: > {code:java} > diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt > index c80ac3310..3b3c9eb8f 100644 > --- a/cpp/CMakeLists.txt > +++ b/cpp/CMakeLists.txt > @@ -266,6 +266,15 @@ endif(UNIX) > # Set up various options > # > -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) > - # Currently the compression tests require at least these libraries; bz2 and > - # zstd are optional. See ARROW-3984 > - set(ARROW_WITH_BROTLI ON) > - set(ARROW_WITH_LZ4 ON) > - set(ARROW_WITH_SNAPPY ON) > - set(ARROW_WITH_ZLIB ON) > -endif() > - > if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION) >set(ARROW_JSON ON) > endif() > {code} > If I add that back, the build works. > With only `set(ARROW_WITH_BROTLI ON)`, it still fails > With only `set(ARROW_WITH_LZ4 ON)`, it also fails but with an error about > liblz4 instead of libboost (but also liblz4 is actually present) > With only `set(ARROW_WITH_SNAPPY ON)`, it works > With only `set(ARROW_WITH_ZLIB ON)`, it also fails but with an error about > libz.so.1 not found > With both `set(ARROW_WITH_SNAPPY ON)` and `set(ARROW_WITH_ZLIB ON)`, it also > works. So it seems that the absence of snappy causes others to fail. > In the recommended build settings in the developme
[jira] [Created] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions
Joris Van den Bossche created ARROW-7167: Summary: [CI][Python] Add nightly tests for older pandas versions to Github Actions Key: ARROW-7167 URL: https://issues.apache.org/jira/browse/ARROW-7167 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions
[ https://issues.apache.org/jira/browse/ARROW-7167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-7167: Assignee: Joris Van den Bossche > [CI][Python] Add nightly tests for older pandas versions to Github Actions > -- > > Key: ARROW-7167 > URL: https://issues.apache.org/jira/browse/ARROW-7167 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7168: - Summary: [Python] pa.array() doesn't respect provided dictionary type with all NaNs (was: pa.array() doesn't respect provided dictionary type with all NaNs) > [Python] pa.array() doesn't respect provided dictionary type with all NaNs > -- > > Key: ARROW-7168 > URL: https://issues.apache.org/jira/browse/ARROW-7168 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 >Reporter: Thomas Buhrmann >Priority: Major > > This might be related to ARROW-6548 and others dealing with all NaN columns. > When creating a dictionary array, even when fully specifying the desired > type, this type is not respected when the data contains only NaNs: > {code:python} > # This may look a little artificial but easily occurs when processing > categorial data in batches and a particular batch containing only NaNs > ser = pd.Series([None, None]).astype('object').astype('category') > typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), > ordered=False) > pa.array(ser, type=typ).type > {code} > results in > {noformat} > >> DictionaryType(dictionary) > {noformat} > which means that one cannot e.g. serialize batches of categoricals if the > possibility of all-NaN batches exists, even when trying to enforce that each > batch has the same schema (because the schema is not respected). > I understand that inferring the type in this case would be difficult, but I'd > imagine that a fully specified type should be respected in this case? > In the meantime, is there a workaround to manually create a dictionary array > of the desired type containing only NaNs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect specified dictionary type
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7168: - Summary: [Python] pa.array() doesn't respect specified dictionary type (was: [Python] pa.array() doesn't respect provided dictionary type with all NaNs) > [Python] pa.array() doesn't respect specified dictionary type > - > > Key: ARROW-7168 > URL: https://issues.apache.org/jira/browse/ARROW-7168 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 >Reporter: Thomas Buhrmann >Priority: Major > > This might be related to ARROW-6548 and others dealing with all NaN columns. > When creating a dictionary array, even when fully specifying the desired > type, this type is not respected when the data contains only NaNs: > {code:python} > # This may look a little artificial but easily occurs when processing > categorial data in batches and a particular batch containing only NaNs > ser = pd.Series([None, None]).astype('object').astype('category') > typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), > ordered=False) > pa.array(ser, type=typ).type > {code} > results in > {noformat} > >> DictionaryType(dictionary) > {noformat} > which means that one cannot e.g. serialize batches of categoricals if the > possibility of all-NaN batches exists, even when trying to enforce that each > batch has the same schema (because the schema is not respected). > I understand that inferring the type in this case would be difficult, but I'd > imagine that a fully specified type should be respected in this case? > In the meantime, is there a workaround to manually create a dictionary array > of the desired type containing only NaNs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974511#comment-16974511 ] Joris Van den Bossche commented on ARROW-7168: -- [~buhrmann] thanks for the report. When passing a type like that, I agree it should be honoured. Some other observations: also when it's not all-NaN, the specified type gets ignored: {code} In [19]: cat = pd.Categorical(['a', 'b']) In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), ordered=False) In [21]: pa.array(cat, type=typ) Out[21]: -- dictionary: [ "a", "b" ] -- indices: [ 0, 1 ] In [22]: pa.array(cat, type=typ).type Out[22]: DictionaryType(dictionary) {code} So I suppose it's a more general problem, not specifically related to this all-NaN case (it only appears for you in this case, as otherwise the specified type and the type from the data will probably match). In the example I show above, we should probably raise an error if the specified type is not compatible (string vs int categories). > [Python] pa.array() doesn't respect provided dictionary type with all NaNs > -- > > Key: ARROW-7168 > URL: https://issues.apache.org/jira/browse/ARROW-7168 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 >Reporter: Thomas Buhrmann >Priority: Major > > This might be related to ARROW-6548 and others dealing with all NaN columns. > When creating a dictionary array, even when fully specifying the desired > type, this type is not respected when the data contains only NaNs: > {code:python} > # This may look a little artificial but easily occurs when processing > categorial data in batches and a particular batch containing only NaNs > ser = pd.Series([None, None]).astype('object').astype('category') > typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), > ordered=False) > pa.array(ser, type=typ).type > {code} > results in > {noformat} > >> DictionaryType(dictionary) > {noformat} > which means that one cannot e.g. serialize batches of categoricals if the > possibility of all-NaN batches exists, even when trying to enforce that each > batch has the same schema (because the schema is not respected). > I understand that inferring the type in this case would be difficult, but I'd > imagine that a fully specified type should be respected in this case? > In the meantime, is there a workaround to manually create a dictionary array > of the desired type containing only NaNs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
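For the workaround question at the end of the quoted description, the array can be constructed explicitly instead of relying on {{pa.array()}} inference; one possible sketch:
{code:python}
import pyarrow as pa

typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
# Build all-null indices and an empty string dictionary by hand,
# so the resulting type matches the desired one exactly.
indices = pa.array([None, None], type=pa.int8())
dictionary = pa.array([], type=pa.string())
arr = pa.DictionaryArray.from_arrays(indices, dictionary)
print(arr.type.equals(typ))  # True
{code}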
[jira] [Commented] (ARROW-6926) [Python] Support __sizeof__ protocol for Python objects
[ https://issues.apache.org/jira/browse/ARROW-6926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977581#comment-16977581 ] Joris Van den Bossche commented on ARROW-6926: -- I started with implementing the {{nbytes}} attribute last week (ARROW-3444, which is merged now), with the idea of afterwards looking at {{sizeof}}. The main question is whether we just want to return what {{nbytes}} does (the number of bytes in the buffers), which is what the dask approximation does, or whether we also want to include the size of the Cython + C++ object. {{sys.getsizeof}} works out of the box for the Cython object (but it ignores the relevant buffers): {code} In [38]: a = pa.array([1, 2]) In [39]: import sys In [40]: sys.getsizeof(a) Out[40]: 96 {code} but when overriding {{\_\_sizeof\_\_}} in Array, I am not sure how to get to this number so I can add the nbytes of the buffers to it. > [Python] Support __sizeof__ protocol for Python objects > --- > > Key: ARROW-6926 > URL: https://issues.apache.org/jira/browse/ARROW-6926 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Matthew Rocklin >Priority: Minor > Fix For: 1.0.0 > > > It would be helpful if PyArrow objects implemented the `__sizeof__` protocol > to give other libraries hints about how much data they have allocated. This > helps systems like Dask, which have to make judgements about whether or not > something is cheap to move or taking up a large amount of space. -- This message was sent by Atlassian Jira (v8.3.4#803005)
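For reference, a rough approximation of the combined number from the caller's side (not an actual {{\_\_sizeof\_\_}} implementation):
{code:python}
import sys

import pyarrow as pa

arr = pa.array([1, 2, 3])
# Object overhead as reported today, plus the bytes held in the Arrow buffers.
approx_size = sys.getsizeof(arr) + arr.nbytes
print(approx_size)
{code}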
[jira] [Assigned] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas
[ https://issues.apache.org/jira/browse/ARROW-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-7209: Assignee: Joris Van den Bossche > [Python] tests with pandas master are failing now __from_arrow__ support > landed in pandas > - > > Key: ARROW-7209 > URL: https://issues.apache.org/jira/browse/ARROW-7209 > Project: Apache Arrow > Issue Type: Test > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > > I implemented pandas <-> arrow roundtrip for pandas' integer+string dtype in > https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our > tests were assuming this did not yet work in pandas, and thus need to be > updated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas
Joris Van den Bossche created ARROW-7209: Summary: [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas Key: ARROW-7209 URL: https://issues.apache.org/jira/browse/ARROW-7209 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche I implemented pandas <-> arrow roundtrip for pandas' integer+string dtype in https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our tests were assuming this did not yet work in pandas, and thus need to be updated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
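A sketch of the kind of roundtrip this refers to (assuming a pandas version with {{\_\_from_arrow\_\_}} support, i.e. pandas master at the time):
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})
table = pa.Table.from_pandas(df)
# With __from_arrow__ available, to_pandas() reconstructs the nullable
# Int64 extension dtype from the stored pandas metadata.
result = table.to_pandas()
print(result.dtypes)
{code}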
[jira] [Updated] (ARROW-7208) [Python] Passing directory to ParquetFile class gives confusing error message
[ https://issues.apache.org/jira/browse/ARROW-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7208: - Summary: [Python] Passing directory to ParquetFile class gives confusing error message (was: Arrow using ParquetFile class) > [Python] Passing directory to ParquetFile class gives confusing error message > - > > Key: ARROW-7208 > URL: https://issues.apache.org/jira/browse/ARROW-7208 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 >Reporter: Roelant Stegmann >Priority: Major > > Somehow have the same errors. We are working with pyarrow 0.15.1, trying to > access a folder of `parquet` files generated with Amazon Athena. > ```python > table2 = pq.read_table('C:/Data/test-parquet') > ``` > works fine in contrast to > ```python > parquet_file = pq.ParquetFile('C:/Data/test-parquet') > # parquet_file.read_row_group(0) > ``` > which raises > `ArrowIOError: Failed to open local file 'C:/Data/test-parquet', error: > Access is denied.` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7208) Arrow using ParquetFile class
[ https://issues.apache.org/jira/browse/ARROW-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978177#comment-16978177 ] Joris Van den Bossche commented on ARROW-7208: -- The {{ParquetFile}} object expects a single file, not a directory of files ({{read_table}} can handle both). If you want to use the object interface for a directory of files, you need to use {{pq.ParquetDataset}}. A better error message would be useful though. > Arrow using ParquetFile class > - > > Key: ARROW-7208 > URL: https://issues.apache.org/jira/browse/ARROW-7208 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 >Reporter: Roelant Stegmann >Priority: Major > > Somehow have the same errors. We are working with pyarrow 0.15.1, trying to > access a folder of `parquet` files generated with Amazon Athena. > ```python > table2 = pq.read_table('C:/Data/test-parquet') > ``` > works fine in contrast to > ```python > parquet_file = pq.ParquetFile('C:/Data/test-parquet') > # parquet_file.read_row_group(0) > ``` > which raises > `ArrowIOError: Failed to open local file 'C:/Data/test-parquet', error: > Access is denied.` -- This message was sent by Atlassian Jira (v8.3.4#803005)
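To make the distinction from the comment above concrete, a short sketch (the file name inside the directory is hypothetical):
{code:python}
import pyarrow.parquet as pq

# ParquetFile expects a single file ...
pf = pq.ParquetFile('C:/Data/test-parquet/part-00000.parquet')
first_group = pf.read_row_group(0)

# ... while ParquetDataset handles a directory of files.
dataset = pq.ParquetDataset('C:/Data/test-parquet')
table = dataset.read()
{code}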
[jira] [Updated] (ARROW-7214) [Python] unpickling a pyarrow table with dictionary fields crashes
[ https://issues.apache.org/jira/browse/ARROW-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7214: - Fix Version/s: 1.0.0 > [Python] unpickling a pyarrow table with dictionary fields crashes > -- > > Key: ARROW-7214 > URL: https://issues.apache.org/jira/browse/ARROW-7214 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0, 0.14.1, 0.15.0, 0.15.1 >Reporter: Yevgeni Litvin >Priority: Major > Fix For: 1.0.0 > > > The following code crashes on this check: > {code:java} > F1120 07:51:37.523720 12432 array.cc:773] Check failed: (data->dictionary) > != (nullptr) > {code} > > {code:java} > import cPickle as pickle > import pandas as pd > import pyarrow as pa > df = pd.DataFrame([{"cat": "a", "val":1},{"cat": "b", "val":2} ]) > df["cat"] = df["cat"].astype('category')index_table = > pa.Table.from_pandas(df, preserve_index=False) > with open('/tmp/zz.pickle', 'wb') as f: > pickle.dump(index_table, f, protocol=2) > with open('/tmp/zz.pickle', 'rb') as f: >index_table = pickle.load(f) > {code} > > Used Python2 with the following environment: > {code:java} > Package Version > --- --- > enum34 1.1.6 > futures 3.3.0 > numpy 1.16.5 > pandas 0.24.2 > pip 19.3.1 > pyarrow 0.14.1 (0.14.0 and up suffer from this issue) > python-dateutil 2.8.1 > pytz2019.3 > setuptools 41.6.0 > six 1.13.0 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7214) [Python] unpickling a pyarrow table with dictionary fields crashes
[ https://issues.apache.org/jira/browse/ARROW-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978178#comment-16978178 ] Joris Van den Bossche commented on ARROW-7214: -- [~selitvin] Thanks for the report! I can confirm this crash with the latest arrow. > [Python] unpickling a pyarrow table with dictionary fields crashes > -- > > Key: ARROW-7214 > URL: https://issues.apache.org/jira/browse/ARROW-7214 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0, 0.14.1, 0.15.0, 0.15.1 >Reporter: Yevgeni Litvin >Priority: Major > Fix For: 1.0.0 > > > The following code crashes on this check: > {code:java} > F1120 07:51:37.523720 12432 array.cc:773] Check failed: (data->dictionary) > != (nullptr) > {code} > > {code:java} > import cPickle as pickle > import pandas as pd > import pyarrow as pa > df = pd.DataFrame([{"cat": "a", "val":1},{"cat": "b", "val":2} ]) > df["cat"] = df["cat"].astype('category') > index_table = pa.Table.from_pandas(df, preserve_index=False) > with open('/tmp/zz.pickle', 'wb') as f: > pickle.dump(index_table, f, protocol=2) > with open('/tmp/zz.pickle', 'rb') as f: > index_table = pickle.load(f) > {code} > > Used Python 2 with the following environment: > {code:java} > Package Version > --- --- > enum34 1.1.6 > futures 3.3.0 > numpy 1.16.5 > pandas 0.24.2 > pip 19.3.1 > pyarrow 0.14.1 (0.14.0 and up suffer from this issue) > python-dateutil 2.8.1 > pytz 2019.3 > setuptools 41.6.0 > six 1.13.0 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7214) [Python] unpickling a pyarrow table with dictionary fields crashes
[ https://issues.apache.org/jira/browse/ARROW-7214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-7214: Assignee: Joris Van den Bossche > [Python] unpickling a pyarrow table with dictionary fields crashes > -- > > Key: ARROW-7214 > URL: https://issues.apache.org/jira/browse/ARROW-7214 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0, 0.14.1, 0.15.0, 0.15.1 >Reporter: Yevgeni Litvin >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 1.0.0 > > > The following code crashes on this check: > {code:java} > F1120 07:51:37.523720 12432 array.cc:773] Check failed: (data->dictionary) > != (nullptr) > {code} > > {code:java} > import cPickle as pickle > import pandas as pd > import pyarrow as pa > df = pd.DataFrame([{"cat": "a", "val":1},{"cat": "b", "val":2} ]) > df["cat"] = df["cat"].astype('category') > index_table = pa.Table.from_pandas(df, preserve_index=False) > with open('/tmp/zz.pickle', 'wb') as f: > pickle.dump(index_table, f, protocol=2) > with open('/tmp/zz.pickle', 'rb') as f: > index_table = pickle.load(f) > {code} > > Used Python 2 with the following environment: > {code:java} > Package Version > --- --- > enum34 1.1.6 > futures 3.3.0 > numpy 1.16.5 > pandas 0.24.2 > pip 19.3.1 > pyarrow 0.14.1 (0.14.0 and up suffer from this issue) > python-dateutil 2.8.1 > pytz 2019.3 > setuptools 41.6.0 > six 1.13.0 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7208) [Python] Passing directory to ParquetFile class gives confusing error message
[ https://issues.apache.org/jira/browse/ARROW-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978256#comment-16978256 ] Joris Van den Bossche commented on ARROW-7208: -- Looking at the ParquetDataset docs (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html), it's indeed not clear how to read only a part of it. A ParquetDataset contains several "ParquetDatasetPiece"s, accessible through the {{pieces}} attribute, and each piece can be read individually (see the sketch below). But this part of the API is not really documented. If you only want to read a single file of the full directory, you can also create a {{ParquetFile}}, specifying the full file path instead of only the directory. > [Python] Passing directory to ParquetFile class gives confusing error message > - > > Key: ARROW-7208 > URL: https://issues.apache.org/jira/browse/ARROW-7208 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 >Reporter: Roelant Stegmann >Priority: Major > > We somehow have the same errors. We are working with pyarrow 0.15.1, trying to > access a folder of `parquet` files generated with Amazon Athena. > ```python > table2 = pq.read_table('C:/Data/test-parquet') > ``` > works fine in contrast to > ```python > parquet_file = pq.ParquetFile('C:/Data/test-parquet') > # parquet_file.read_row_group(0) > ``` > which raises > `ArrowIOError: Failed to open local file 'C:/Data/test-parquet', error: > Access is denied.` -- This message was sent by Atlassian Jira (v8.3.4#803005)
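A short sketch of that (largely undocumented) {{pieces}} API, assuming a local filesystem and a hypothetical directory of parquet files:
{code:python}
import pyarrow.parquet as pq

dataset = pq.ParquetDataset('C:/Data/test-parquet')
piece = dataset.pieces[0]   # one ParquetDatasetPiece per file
table = piece.read()        # read only that single piece
{code}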
[jira] [Updated] (ARROW-7217) [CI] Docker compose / github actions ignores PYTHON env
[ https://issues.apache.org/jira/browse/ARROW-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7217: - Summary: [CI] Docker compose / github actions ignores PYTHON env (was: Docker compose / github actions ignores PYTHON env) > [CI] Docker compose / github actions ignores PYTHON env > --- > > Key: ARROW-7217 > URL: https://issues.apache.org/jira/browse/ARROW-7217 > Project: Apache Arrow > Issue Type: Test > Components: CI >Reporter: Joris Van den Bossche >Priority: Major > > The "AMD64 Conda Python 2.7" build is actually using Python 3.6. > This Python 3.6 version is hardcoded in the conda-python.dockerfile: > https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24 > > and I am not fully sure whether the ENV variable overrides that. > cc [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7217) Docker compose / github actions ignores PYTHON env
Joris Van den Bossche created ARROW-7217: Summary: Docker compose / github actions ignores PYTHON env Key: ARROW-7217 URL: https://issues.apache.org/jira/browse/ARROW-7217 Project: Apache Arrow Issue Type: Test Components: CI Reporter: Joris Van den Bossche The "AMD64 Conda Python 2.7" build is actually using Python 3.6. This Python 3.6 version is hardcoded in the conda-python.dockerfile: https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24 and I am not fully sure whether the ENV variable overrides that. cc [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7217) [CI] Docker compose / github actions ignores PYTHON env
[ https://issues.apache.org/jira/browse/ARROW-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978302#comment-16978302 ] Joris Van den Bossche commented on ARROW-7217: -- Ah, I see: there is a PYTHON_VERSION variable in the dockerfile, but the github action workflow sets PYTHON. > [CI] Docker compose / github actions ignores PYTHON env > --- > > Key: ARROW-7217 > URL: https://issues.apache.org/jira/browse/ARROW-7217 > Project: Apache Arrow > Issue Type: Test > Components: CI >Reporter: Joris Van den Bossche >Priority: Major > > The "AMD64 Conda Python 2.7" build is actually using Python 3.6. > This Python 3.6 version is hardcoded in the conda-python.dockerfile: > https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24 > > and I am not fully sure whether the ENV variable overrides that. > cc [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7218) [Python] Conversion from boolean numpy scalars not working
Joris Van den Bossche created ARROW-7218: Summary: [Python] Conversion from boolean numpy scalars not working Key: ARROW-7218 URL: https://issues.apache.org/jira/browse/ARROW-7218 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche In general, we are fine to accept a list of numpy scalars: {code} In [12]: type(list(np.array([1, 2]))[0]) Out[12]: numpy.int64 In [13]: pa.array(list(np.array([1, 2]))) Out[13]: [ 1, 2 ] {code} But for booleans, this doesn't work: {code} In [14]: pa.array(list(np.array([True, False]))) --- ArrowInvalid Traceback (most recent call last) in > 1 pa.array(list(np.array([True, False]))) ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array() ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array() ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array() ArrowInvalid: Could not convert True with type numpy.bool_: tried to convert to boolean {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
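For comparison, a few conversion paths that do work (a minimal sketch of the behaviour described above):
{code:python}
import numpy as np
import pyarrow as pa

# Passing the numpy array itself works
pa.array(np.array([True, False]))

# Plain Python bools work too
pa.array([True, False])

# Workaround for a list of numpy bool scalars: cast to Python bool first
pa.array([bool(x) for x in list(np.array([True, False]))])
{code}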
[jira] [Commented] (ARROW-7219) [CI][Python] Install pickle5 in the conda-python docker image for python version 3.6
[ https://issues.apache.org/jira/browse/ARROW-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978489#comment-16978489 ] Joris Van den Bossche commented on ARROW-7219: -- There are other optional dependencies for python that would be nice to include somewhere as well (s3fs, fastparquet): https://github.com/apache/arrow/pull/5562#issuecomment-553782658 > [CI][Python] Install pickle5 in the conda-python docker image for python > version 3.6 > > > Key: ARROW-7219 > URL: https://issues.apache.org/jira/browse/ARROW-7219 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Fix For: 1.0.0 > > > See conversation > https://github.com/apache/arrow/pull/5873#discussion_r348510729 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2
Joris Van den Bossche created ARROW-7220: Summary: [CI] Docker compose (github actions) Mac Python 3 build is using Python 2 Key: ARROW-7220 URL: https://issues.apache.org/jira/browse/ARROW-7220 Project: Apache Arrow Issue Type: Test Reporter: Joris Van den Bossche The "AMD64 MacOS 10.15 Python 3" build is also running in python 2. Possibly related to how brew is installing python 2 / 3, or because it is using the system python, ... (not familiar with mac) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2
[ https://issues.apache.org/jira/browse/ARROW-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7220: - Component/s: CI > [CI] Docker compose (github actions) Mac Python 3 build is using Python 2 > - > > Key: ARROW-7220 > URL: https://issues.apache.org/jira/browse/ARROW-7220 > Project: Apache Arrow > Issue Type: Test > Components: CI >Reporter: Joris Van den Bossche >Priority: Major > > The "AMD64 MacOS 10.15 Python 3" build is also running in python 2. > Possibly related to how brew is installing python 2 / 3, or because it is > using the system python, ... (not familiar with mac) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6926) [Python] Support __sizeof__ protocol for Python objects
[ https://issues.apache.org/jira/browse/ARROW-6926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978546#comment-16978546 ] Joris Van den Bossche commented on ARROW-6926: -- Ah, thanks. But it seems Cython is still adding a bit more: {code} In [21]: a = pa.array([1]*10) In [22]: sys.getsizeof(a) Out[22]: 96 In [23]: object.__sizeof__(a) Out[23]: 72 {code} (not sure how much we care about those small numbers; in reality users will mainly care about big arrays, where {{nbytes}} dominates the result) > [Python] Support __sizeof__ protocol for Python objects > --- > > Key: ARROW-6926 > URL: https://issues.apache.org/jira/browse/ARROW-6926 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Matthew Rocklin >Priority: Minor > Fix For: 1.0.0 > > > It would be helpful if PyArrow objects implemented the `__sizeof__` protocol > to give other libraries hints about how much data they have allocated. This > helps systems like Dask, which have to make judgements about whether or not > something is cheap to move or is taking up a large amount of space. -- This message was sent by Atlassian Jira (v8.3.4#803005)
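For context, a hedged sketch of what the protocol usually looks like: a hypothetical wrapper class (not pyarrow's actual implementation) combining the Python object overhead with the bytes held by the Arrow buffers. Note that {{sys.getsizeof}} typically reports slightly more than {{__sizeof__}} because it adds garbage-collector overhead, which is consistent with the 96 vs 72 gap above.
{code:python}
import pyarrow as pa

class SizedArray:
    """Hypothetical wrapper illustrating the __sizeof__ protocol."""

    def __init__(self, array: pa.Array):
        self.array = array

    def __sizeof__(self):
        # Python object overhead + bytes referenced by the Arrow buffers
        return object.__sizeof__(self) + self.array.nbytes
{code}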
[jira] [Commented] (ARROW-6926) [Python] Support __sizeof__ protocol for Python objects
[ https://issues.apache.org/jira/browse/ARROW-6926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978558#comment-16978558 ] Joris Van den Bossche commented on ARROW-6926: -- OK, thanks! > [Python] Support __sizeof__ protocol for Python objects > --- > > Key: ARROW-6926 > URL: https://issues.apache.org/jira/browse/ARROW-6926 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Matthew Rocklin >Priority: Minor > Fix For: 1.0.0 > > > It would be helpful if PyArrow objects implemented the `__sizeof__` protocol > to give other libraries hints about how much data they have allocated. This > helps systems like Dask, which have to make judgements about whether or not > something is cheap to move or taking up a large amount of space. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7222) [Python] Wipe any existing generated Python API documentation when updating website
[ https://issues.apache.org/jira/browse/ARROW-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979100#comment-16979100 ] Joris Van den Bossche commented on ARROW-7222: -- It could also be an option to keep older versions in a /docs/version/xx/ ? (although that's maybe a bit unnecessary overhead for now) > [Python] Wipe any existing generated Python API documentation when updating > website > --- > > Key: ARROW-7222 > URL: https://issues.apache.org/jira/browse/ARROW-7222 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Removed APIs are persisting in Google searches, e.g. > https://arrow.apache.org/docs/python/generated/pyarrow.Column.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7222) [Python] Wipe any existing generated Python API documentation when updating website
[ https://issues.apache.org/jira/browse/ARROW-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979135#comment-16979135 ] Joris Van den Bossche commented on ARROW-7222: -- It's indeed a different problem (and solving it now will require explicit action), but the solution to prevent it happening again in the future might be related. E.g. in pandas, we put the docs for each version in a /version/xx/ directory, and /stable/ is then a symlink to the latest version (which needs to be updated at release time). That way, you never overwrite the existing docs with a new set of files, potentially leaving older ones behind (that said, ensuring the old ones are deleted when overwriting the docs should not be hard either, of course) > [Python] Wipe any existing generated Python API documentation when updating > website > --- > > Key: ARROW-7222 > URL: https://issues.apache.org/jira/browse/ARROW-7222 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Removed APIs are persisting in Google searches, e.g. > https://arrow.apache.org/docs/python/generated/pyarrow.Column.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979326#comment-16979326 ] Joris Van den Bossche commented on ARROW-1644: -- [~RinkeHoekstra] that looks unrelated (the json reader is mostly independent from the parquet IO). Can you open a separate JIRA ticket? > [C++][Parquet] Read and write nested Parquet data with a mix of struct and > list nesting levels > -- > > Key: ARROW-1644 > URL: https://issues.apache.org/jira/browse/ARROW-1644 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: DB Tsai >Assignee: Micah Kornfield >Priority: Major > Labels: parquet, pull-request-available > Fix For: 1.0.0 > > > We have many nested parquet files generated from Apache Spark for ranking > problems, and we would like to load them in python for other programs to > consume. > The schema looks like > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- show_title_id: integer (nullable = true) > |||-- duration: double (nullable = true) > {code} > And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got > the following error. > {code:python} > Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) > [GCC 7.2.0] on linux > Type "help", "copyright", "credits" or "license" for more information. > >>> import numpy as np > >>> import pandas as pd > >>> import pyarrow as pa > >>> import pyarrow.parquet as pq > >>> table2 = pq.read_table('part-0') > Traceback (most recent call last): > File "", line 1, in > File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", > line 823, in read_table > use_pandas_metadata=use_pandas_metadata) > File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", > line 119, in read > nthreads=nthreads) > File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all > File "error.pxi", line 85, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported. > {code} > I somehow get the impression that after > https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be > able to load the nested parquet in pyarrow. > Any insight about this? > Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7226) [JSON][Python] Json loader fails on example in documentation.
[ https://issues.apache.org/jira/browse/ARROW-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979375#comment-16979375 ] Joris Van den Bossche commented on ARROW-7226: -- This may not be adequately documented, but the json reader currently _only_ supports line-delimited json; that is why the documentation shows the example using that format (see the sketch below). > [JSON][Python] Json loader fails on example in documentation. > - > > Key: ARROW-7226 > URL: https://issues.apache.org/jira/browse/ARROW-7226 > Project: Apache Arrow > Issue Type: Bug >Reporter: Rinke Hoekstra >Priority: Major > > I was just trying this with the example found in the pyarrow docs at > [http://arrow.apache.org/docs/python/json.html] > The documented example does not work. Is this related to this issue, or is it > another matter? > It says to load the following JSON file: > {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}} > {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}} > I fixed this to make it valid JSON (it is valid [JSON > Lines|http://jsonlines.org/], but that's another issue): > [{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}, > {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}] > Then reading the JSON from a file called `my_data.json`: > {{from pyarrow import json}} > {{table = json.read_json("my_data.json")}} > Gives the following error: > {code:java} > --- > ArrowInvalid Traceback (most recent call last) > in () > 1 from pyarrow import json > > 2 table = json.read_json('test.json') > ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx > in pyarrow._json.read_json() > ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() > ArrowInvalid: JSON parse error: A column changed from object to array > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
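For reference, a minimal sketch of the supported line-delimited (JSON Lines) format; the file name is hypothetical:
{code:python}
from pyarrow import json

# One JSON object per line, no surrounding array or trailing commas
with open("my_data.jsonl", "w") as f:
    f.write('{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}\n')
    f.write('{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}\n')

table = json.read_json("my_data.jsonl")
{code}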
[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type
Joris Van den Bossche created ARROW-7261: Summary: [Python] Python support for fixed size list type Key: ARROW-7261 URL: https://issues.apache.org/jira/browse/ARROW-7261 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is not yet exposed in Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
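For reference, a sketch of the kind of binding this asks for, assuming the API as it was later exposed (where {{pa.list_}} also accepts a fixed list size):
{code:python}
import pyarrow as pa

# Fixed size list type: every list holds exactly 2 int64 values
typ = pa.list_(pa.int64(), 2)
arr = pa.array([[1, 2], [3, 4]], type=typ)
{code}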
[jira] [Updated] (ARROW-7268) [Rust] Propagate `custom_metadata` field from IPC message
[ https://issues.apache.org/jira/browse/ARROW-7268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7268: - Summary: [Rust] Propagate `custom_metadata` field from IPC message (was: Propagate `custom_metadata` field from IPC message) > [Rust] Propagate `custom_metadata` field from IPC message > - > > Key: ARROW-7268 > URL: https://issues.apache.org/jira/browse/ARROW-7268 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Martin Grund >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Right now, the custom metadata field in the Schema IPC message is not > propagated from the IPC message to the internal data type. To be closer to > parity compared to the other implementations it would be good to add the > necessary logic to serialize and deserialize. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7266) [Python] dictionary_encode() of a slice gives wrong result
[ https://issues.apache.org/jira/browse/ARROW-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7266: - Summary: [Python] dictionary_encode() of a slice gives wrong result (was: dictionary_encode() of a slice gives wrong result) > [Python] dictionary_encode() of a slice gives wrong result > -- > > Key: ARROW-7266 > URL: https://issues.apache.org/jira/browse/ARROW-7266 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 > Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4 >Reporter: Adam Hooper >Priority: Major > > Steps to reproduce: > {code:python} > import pyarrow as pa > arr = pa.array(["a", "b", "b", "b"])[1:] > arr.dictionary_encode() > {code} > Expected results: > {code} > -- dictionary: > [ > "b" > ] > -- indices: > [ > 0, > 0, > 0 > ] > {code} > Actual results: > {code} > -- dictionary: > [ > "b", > "" > ] > -- indices: > [ > 0, > 0, > 1 > ] > {code} > I don't know a workaround. Converting to pylist and back is too slow. Is > there a way to copy the slice to a new offset-0 StringArray that I could then > dictionary-encode? Otherwise, I'm considering building buffers by hand -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7266) [Python] dictionary_encode() of a slice gives wrong result
[ https://issues.apache.org/jira/browse/ARROW-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983548#comment-16983548 ] Joris Van den Bossche commented on ARROW-7266: -- [~adamhooper] Thanks for the report! This seems to be specific to the string type, as I don't see a similar bug for the integer type: {code} In [7]: a = pa.array(['a', 'b', 'c', 'b']) In [9]: a[1:].dictionary_encode() Out[9]: -- dictionary: [ "c", "b", "" ] -- indices: [ 0, 1, 2 ] In [10]: a = pa.array([1, 2, 3, 2]) In [12]: a[1:].dictionary_encode() Out[12]: -- dictionary: [ 2, 3 ] -- indices: [ 0, 1, 0 ] {code} > Is there a way to copy the slice to a new offset-0 StringArray that I could > then dictionary-encode? At least in the current pyarrow API, I don't think such functionality is exposed (apart from getting buffers, slicing/copying, and recreating an array); a possible workaround sketch is below. > [Python] dictionary_encode() of a slice gives wrong result > -- > > Key: ARROW-7266 > URL: https://issues.apache.org/jira/browse/ARROW-7266 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 > Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4 >Reporter: Adam Hooper >Priority: Major > > Steps to reproduce: > {code:python} > import pyarrow as pa > arr = pa.array(["a", "b", "b", "b"])[1:] > arr.dictionary_encode() > {code} > Expected results: > {code} > -- dictionary: > [ > "b" > ] > -- indices: > [ > 0, > 0, > 0 > ] > {code} > Actual results: > {code} > -- dictionary: > [ > "b", > "" > ] > -- indices: > [ > 0, > 0, > 1 > ] > {code} > I don't know a workaround. Converting to pylist and back is too slow. Is > there a way to copy the slice to a new offset-0 StringArray that I could then > dictionary-encode? Otherwise, I'm considering building buffers by hand -- This message was sent by Atlassian Jira (v8.3.4#803005)
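A hedged workaround sketch, not an official API: round-tripping the slice through IPC should materialize an offset-0 copy, assuming the IPC writer rebases sliced buffers (untested against 0.15.1):
{code:python}
import pyarrow as pa

arr = pa.array(["a", "b", "b", "b"])[1:]

# Serialize and deserialize a record batch to rebase the slice to offset 0
batch = pa.RecordBatch.from_arrays([arr], names=["col"])
rebased = pa.ipc.read_record_batch(batch.serialize(), batch.schema).column(0)
rebased.dictionary_encode()
{code}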
[jira] [Commented] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983565#comment-16983565 ] Joris Van den Bossche commented on ARROW-6876: -- [~axelg] would you be able to share a reproducible example? (e.g. the data, or code that creates a dummy dataset with the same characteristics that shows the problem; a sketch of such a reproducer is below) > [Python] Reading parquet file with many columns becomes slow for 0.15.0 > --- > > Key: ARROW-6876 > URL: https://issues.apache.org/jira/browse/ARROW-6876 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: python3.7 >Reporter: Bob >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.15.1 > > Attachments: image-2019-10-14-18-10-42-850.png, > image-2019-10-14-18-12-07-652.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Hi, > > I just noticed that reading a parquet file becomes really slow after I > upgraded to 0.15.0 when using pandas. > > Example: > *With 0.14.1* > In [4]: %timeit df = pd.read_parquet(path) > 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > *With 0.15.0* > In [5]: %timeit df = pd.read_parquet(path) > 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > > The file is about 15MB in size. I am testing on the same machine using the > same version of python and pandas. > > Have you received similar complaints? What could be the issue here? > > Thanks a lot. > > > Edit1: > Some profiling I did: > 0.14.1: > !image-2019-10-14-18-12-07-652.png! > > 0.15.0: > !image-2019-10-14-18-10-42-850.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
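A sketch of such a dummy reproducer (the shape is made up; adjust it to mimic the reporter's ~15MB, many-column file):
{code:python}
import numpy as np
import pandas as pd

# Dummy dataset: few rows, very many columns
df = pd.DataFrame(np.random.randn(100, 10000),
                  columns=[f"c{i}" for i in range(10000)])
df.to_parquet("many_cols.parquet")

# Time this (e.g. with %timeit in IPython) on both pyarrow versions
pd.read_parquet("many_cols.parquet")
{code}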
[jira] [Commented] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983593#comment-16983593 ] Joris Van den Bossche commented on ARROW-6876: -- Ah, sorry, I missed the "With the reproducer above:" in your message. I see a similar difference locally; it's indeed not the speed-up that [~wesm] reported on the PR: https://github.com/apache/arrow/pull/5653#issuecomment-541901845 (this might depend on the machine / number of cores?) > [Python] Reading parquet file with many columns becomes slow for 0.15.0 > --- > > Key: ARROW-6876 > URL: https://issues.apache.org/jira/browse/ARROW-6876 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: python3.7 >Reporter: Bob >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.15.1 > > Attachments: image-2019-10-14-18-10-42-850.png, > image-2019-10-14-18-12-07-652.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Hi, > > I just noticed that reading a parquet file becomes really slow after I > upgraded to 0.15.0 when using pandas. > > Example: > *With 0.14.1* > In [4]: %timeit df = pd.read_parquet(path) > 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > *With 0.15.0* > In [5]: %timeit df = pd.read_parquet(path) > 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > > The file is about 15MB in size. I am testing on the same machine using the > same version of python and pandas. > > Have you received similar complaints? What could be the issue here? > > Thanks a lot. > > > Edit1: > Some profiling I did: > 0.14.1: > !image-2019-10-14-18-12-07-652.png! > > 0.15.0: > !image-2019-10-14-18-10-42-850.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)