[jira] [Created] (ARROW-15318) [C++][Python] Regression reading partition keys of large batches.
A. Coady created ARROW-15318:

Summary: [C++][Python] Regression reading partition keys of large batches.
Key: ARROW-15318
URL: https://issues.apache.org/jira/browse/ARROW-15318
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 7.0.0
Reporter: A. Coady

In a partitioned dataset with chunks larger than the default 1Gi batch size, reading _only_ the partition keys hangs and consumes unbounded memory. The bug first appeared in nightly build `7.0.0.dev468`.

{code:python}
In [1]: import pyarrow as pa, pyarrow.parquet as pq, numpy as np

In [2]: pa.__version__
Out[2]: '7.0.0.dev468'

In [3]: table = pa.table({'key': pa.repeat(0, 2 ** 20 + 1), 'value': np.arange(2 ** 20 + 1)})

In [4]: pq.write_to_dataset(table[:2 ** 20], 'one', partition_cols=['key'])

In [5]: pq.write_to_dataset(table[:2 ** 20 + 1], 'two', partition_cols=['key'])

In [6]: pq.read_table('one', columns=['key'])['key'].num_chunks
Out[6]: 1

In [7]: pq.read_table('two', columns=['key', 'value'])['key'].num_chunks
Out[7]: 2

In [8]: pq.read_table('two', columns=['key'])['key'].num_chunks
zsh: killed     ipython  # hangs; killed
{code}
[jira] [Created] (ARROW-15317) [R] Expose API to create Dataset from Fragments
Will Jones created ARROW-15317:

Summary: [R] Expose API to create Dataset from Fragments
Key: ARROW-15317
URL: https://issues.apache.org/jira/browse/ARROW-15317
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 6.0.1
Reporter: Will Jones

Third-party packages may define dataset factories for table formats like Delta Lake and Apache Iceberg. These formats store metadata such as the schema, file lists, and file-level statistics on the side, and can construct a dataset without needing a discovery process. Python exposes enough API to do this successfully, as demonstrated by [a Delta Lake dataset reader here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].

I propose adding the following to the R API (see the Python sketch after this list):

* Expose {{Fragment}} as an R6 object
* Add the {{MakeFragment}} method to the various file format objects. It's key that {{partition_expression}} is included as an argument. ([See the Python equivalent here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
* Add a dataset constructor that takes a list of {{Fragments}}
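For reference, a minimal sketch of the equivalent flow using the existing pyarrow.dataset API; the file path, schema, and partition value below are illustrative only, not from the report:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

# Metadata a table format (e.g. Delta Lake) would supply out of band:
schema = pa.schema([("value", pa.int64()), ("part", pa.string())])
files = ["data/part=a/file1.parquet"]  # illustrative file list

fmt = ds.ParquetFileFormat()
fragments = [
    fmt.make_fragment(f, filesystem=fs.LocalFileSystem(),
                      partition_expression=(ds.field("part") == "a"))
    for f in files
]

# Construct the dataset directly from the fragments, skipping discovery.
dataset = ds.FileSystemDataset(fragments, schema, fmt)
{code}

An R equivalent would mirror this flow: build fragments from a file format object (passing the partition expression) and hand the list to a dataset constructor.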
[jira] [Created] (ARROW-15316) [R] Make a one-function pointer function
Jonathan Keane created ARROW-15316:

Summary: [R] Make a one-function pointer function
Key: ARROW-15316
URL: https://issues.apache.org/jira/browse/ARROW-15316
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Jonathan Keane
Assignee: Dragoș Moldovan-Grünfeld

In the ARROW-15173 [PR|https://github.com/apache/arrow/pull/12062/files] we added backwards compatibility to pointer passing between R and Python, where we use `external_pointer_addr_double()` with old pyarrow versions. We could turn a number of blocks like:

{code}
if (pyarrow_version() >= pyarrow_version_pointers_changed) {
  x$`_export_to_c`(schema_ptr)
} else {
  x$`_export_to_c`(external_pointer_addr_double(schema_ptr))
}
{code}

into

{code}
x$`_export_to_c`(backwards_compatible_pointer(schema_ptr))
{code}

with {{backwards_compatible_pointer}} encapsulating the if/else.
[jira] [Created] (ARROW-15315) FlightSqlProducer#doAction always throws INVALID_ARGUMENT
Vinicius Fraga created ARROW-15315:

Summary: FlightSqlProducer#doAction always throws INVALID_ARGUMENT
Key: ARROW-15315
URL: https://issues.apache.org/jira/browse/ARROW-15315
Project: Apache Arrow
Issue Type: Bug
Components: FlightRPC, Java
Affects Versions: 7.0.0
Reporter: Vinicius Fraga
Fix For: 7.0.0

Because of a missing return/else block, an exception is always thrown, even when the requested Action exists.
[jira] [Created] (ARROW-15314) Add missing metadata on Arrow schemas returned by Flight SQL
Jose Almeida created ARROW-15314:

Summary: Add missing metadata on Arrow schemas returned by Flight SQL
Key: ARROW-15314
URL: https://issues.apache.org/jira/browse/ARROW-15314
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Java
Reporter: Jose Almeida

This adds auxiliary classes, FlightSqlColumnMetadata (Java) and ColumnMetadata (C++), meant to read and write known metadata for Arrow schema fields, such as:

* CATALOG_NAME
* SCHEMA_NAME
* TABLE_NAME
* PRECISION
* SCALE
* IS_AUTO_INCREMENT
* IS_CASE_SENSITIVE
* IS_READ_ONLY
* IS_SEARCHABLE
[jira] [Created] (ARROW-15313) Add typeInfo functionality to flight-sql
Jose Almeida created ARROW-15313:

Summary: Add typeInfo functionality to flight-sql
Key: ARROW-15313
URL: https://issues.apache.org/jira/browse/ARROW-15313
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Java
Reporter: Jose Almeida

This issue covers adding new functionality to Flight SQL: the typeInfo command, which retrieves information about the data types supported by the source.
[jira] [Created] (ARROW-15312) [R] filtering a dataset with is.na() misses some rows
Pierre Gramme created ARROW-15312:

Summary: [R] filtering a dataset with is.na() misses some rows
Key: ARROW-15312
URL: https://issues.apache.org/jira/browse/ARROW-15312
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 6.0.1
Environment: R 4.1.2 on Windows, arrow 6.0.1, dplyr 1.0.7
Reporter: Pierre Gramme

Hi! I just found an issue when querying an Arrow dataset with dplyr and filtering on is.na(...). It seems linked to columns containing only one distinct value plus some NAs. Can you also reproduce the following?

{code}
library(arrow)
library(dplyr)

ds_path = "test-arrow-na"
df = tibble(x = 1:3, y = c(0L, 0L, NA_integer_), z = c(0L, 1L, NA_integer_))
df %>% arrow::write_dataset(ds_path)

# OK: collect then filter returns row 3, as expected
arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))

# ERROR: filter then collect (on y) returns a tibble with no rows
arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()

# OK: filter then collect (on z) returns row 3, as expected
arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
{code}

Thanks
Pierre
[jira] [Created] (ARROW-15311) [C++][Python] Opening a partitioned dataset with schema and filter
Alenka Frim created ARROW-15311:

Summary: [C++][Python] Opening a partitioned dataset with schema and filter
Key: ARROW-15311
URL: https://issues.apache.org/jira/browse/ARROW-15311
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation
Reporter: Alenka Frim

Add a note to the docs: if both a partitioning and a schema are specified when opening a dataset, and the partitioning field names are not included in the data files, the schema must include the partitioning field names (for directory or hive partitioning) in case filtering will be done on them.

Example:

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Define the data
table = pa.table({'one': [-1, np.nan, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

# Write to a partitioned dataset
# The files will include columns "two" and "three"
pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['one'])

# Reading the partitioned dataset with a schema that does not include
# the partitioned names will error
schema = pa.schema([("three", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

# And will not if done like so:
schema = pa.schema([("three", "double"), ("one", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)
{code}
[jira] [Created] (ARROW-15310) [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?
Joris Van den Bossche created ARROW-15310:

Summary: [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?
Key: ARROW-15310
URL: https://issues.apache.org/jira/browse/ARROW-15310
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Joris Van den Bossche

When you have a hive-style partitioned dataset, it's relatively easy with our current {{dataset(..)}} API to mess up the inferred partitioning and get confusing results. For example, if you specify the partitioning field names with {{partitioning=[...]}} (which is not needed for hive style, since those are inferred), we actually assume you want directory partitioning. This DirectoryPartitioning will then parse the hive-style file paths and take the full "key=value" as the data values for the field. Doing a filter can then result in a confusing empty result (because "value" doesn't match "key=value").

I am wondering if we can't relatively cheaply detect this case, and e.g. give an informative warning about it to the user. Basically what happens is this:

{code:python}
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
{code}

If the parsed value is a string that contains a "=" (and in this case also contains the field name), that is, I think, a clear sign that (in the large majority of cases) the user is doing something wrong. I am not fully sure where and at what stage the check could be done, though; doing it for every path in the dataset might be too costly.

Illustrative code example:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pathlib

## constructing a small dataset with 1 hive-style partitioning level
basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)
(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")
table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")
{code}

Reading as is (not specifying a partitioning, so defaulting to no partitioning) will at least give an error about a missing field:

{code:python}
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64
{code}

But specifying the partitioning field name (which currently gets (silently) interpreted as directory partitioning) gives a confusing empty result:

{code:python}
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string
a: []
b: []
part: []
{code}
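As a rough illustration of the kind of check being suggested (the helper name and warning text here are hypothetical, purely for illustration, not a proposed API):

{code:python}
import warnings
import pyarrow as pa
import pyarrow.dataset as ds

def warn_if_hive_like(partitioning, segment):
    """Hypothetical helper: warn when a directory-partitioning path
    segment still looks like a hive-style "key=value" pair."""
    key, sep, _ = segment.partition("=")
    if sep and key in partitioning.schema.names:
        warnings.warn(
            f"Path segment {segment!r} looks hive-style; "
            "did you mean partitioning='hive'?"
        )
    return partitioning.parse(segment)

part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
warn_if_hive_like(part, "part=a")  # emits the warning
{code}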
[jira] [Created] (ARROW-15309) [TESTS] Add a testing coverage tool
Benson Muite created ARROW-15309:

Summary: [TESTS] Add a testing coverage tool
Key: ARROW-15309
URL: https://issues.apache.org/jira/browse/ARROW-15309
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, Developer Tools
Reporter: Benson Muite

It would be good to estimate what fraction of the code the tests cover. Using a tool such as [CodeCov|https://about.codecov.io] or [Coveralls|https://coveralls.io/] may be helpful.
[jira] [Created] (ARROW-15308) [TESTS] Add a testing coverage tool
Benson Muite created ARROW-15308:

Summary: [TESTS] Add a testing coverage tool
Key: ARROW-15308
URL: https://issues.apache.org/jira/browse/ARROW-15308
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, Developer Tools
Reporter: Benson Muite

It would be good to estimate what fraction of the code the tests cover. Using a tool such as [CodeCov|https://about.codecov.io] or [Coveralls|https://coveralls.io/] may be helpful.
[jira] [Created] (ARROW-15307) [C++][Dataset] Provide more context in error message if cast fails during scanning
Joris Van den Bossche created ARROW-15307:

Summary: [C++][Dataset] Provide more context in error message if cast fails during scanning
Key: ARROW-15307
URL: https://issues.apache.org/jira/browse/ARROW-15307
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche

If you have a partitioned dataset, and one of the files has a column with a mismatching type that cannot be safely cast to the dataset schema's type for that column, you get (as expected) an error about this cast.

Small illustrative example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pathlib

## constructing a small dataset with two files
basedir = pathlib.Path(".") / "dataset_test_mismatched_schema"
basedir.mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "data1.parquet")
table2 = pa.table({'a': [1.5, 2.0, 3.0], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "data2.parquet")

## reading the dataset
dataset = ds.dataset(basedir)
# by default infer dataset schema from first file
dataset.schema
# actually reading gives expected error
dataset.to_table()
{code}

gives

{code:python}
>>> dataset.schema
a: int64
b: int64

>>> dataset.to_table()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
     22 dataset.schema
     23 # actually reading gives expected error
---> 24 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Float value 1.5 was truncated converting to int64
../src/arrow/compute/kernels/scalar_cast_numeric.cc:177  CheckFloatToIntTruncation(batch[0], *out)
../src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, &out)
../src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
../src/arrow/compute/function.cc:248  executor->Execute(implicitly_cast_args, &listener)
../src/arrow/compute/exec/expression.cc:444  compute::Cast(column, field->type(), compute::CastOptions::Safe())
../src/arrow/dataset/scanner.cc:816  compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
{code}

So the actual error message (without the extra C++ context) is only *"ArrowInvalid: Float value 1.5 was truncated converting to int64"*. This message only says something about the two types and the first value that cannot be cast, but if you have a large dataset with many fragments and/or many columns, it can be hard to know 1) for which column this is failing and 2) for which fragment it is failing.

So it would be nice to add some extra context to the error message. The cast itself of course doesn't know it, but when doing the cast in the scanner code we at least know e.g. the physical schema and the dataset schema, so we could append or prepend the error message with something like "Casting from schema1 to schema2 failed with ...".

cc [~alenkaf]
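In the meantime, a small workaround sketch (not from the report; it uses only existing pyarrow.dataset APIs) that scans fragment by fragment so the failure can be attributed to a file:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("dataset_test_mismatched_schema")

# Scan each fragment separately; the fragment whose physical schema
# cannot be safely cast to the dataset schema will raise, and its
# file path identifies the offending file.
for fragment in dataset.get_fragments():
    try:
        fragment.to_table(schema=dataset.schema)
    except pa.ArrowInvalid as exc:
        print(f"{fragment.path}: {exc}")
{code}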