[jira] [Created] (ARROW-15318) [C++][Python] Regression reading partition keys of large batches.

2022-01-12 Thread A. Coady (Jira)
A. Coady created ARROW-15318:


 Summary: [C++][Python] Regression reading partition keys of large 
batches.
 Key: ARROW-15318
 URL: https://issues.apache.org/jira/browse/ARROW-15318
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 7.0.0
Reporter: A. Coady


In a partitioned dataset with chunks larger than the default 1Gi batch size, reading _only_ the partition keys hangs and consumes unbounded memory. The bug first appeared in nightly build `7.0.0.dev468`.

{code:python}
In [1]: import pyarrow as pa, pyarrow.parquet as pq, numpy as np

In [2]: pa.__version__
Out[2]: '7.0.0.dev468'

In [3]: table = pa.table({'key': pa.repeat(0, 2 ** 20 + 1), 'value': np.arange(2 ** 20 + 1)})

In [4]: pq.write_to_dataset(table[:2 ** 20], 'one', partition_cols=['key'])

In [5]: pq.write_to_dataset(table[:2 ** 20 + 1], 'two', partition_cols=['key'])

In [6]: pq.read_table('one', columns=['key'])['key'].num_chunks
Out[6]: 1

In [7]: pq.read_table('two', columns=['key', 'value'])['key'].num_chunks
Out[7]: 2

In [8]: pq.read_table('two', columns=['key'])['key'].num_chunks
zsh: killed ipython  # hangs; killed
{code}






[jira] [Created] (ARROW-15317) [R] Expose API to create Dataset from Fragments

2022-01-12 Thread Will Jones (Jira)
Will Jones created ARROW-15317:
--

 Summary: [R] Expose API to create Dataset from Fragments
 Key: ARROW-15317
 URL: https://issues.apache.org/jira/browse/ARROW-15317
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 6.0.1
Reporter: Will Jones


Third-party packages may define dataset factories for table formats like Delta Lake and Apache Iceberg. These formats store metadata such as the schema, file lists, and file-level statistics on the side, and so can construct a dataset without needing a discovery process. Python exposes enough of an API to do this successfully; see [a Delta Lake dataset reader here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].

I propose adding the following to the R API:

 * Expose {{Fragment}} as an R6 object
 * Add the {{MakeFragment}} method to the various file format objects. It's key that {{partition_expression}} is included as an argument. ([See the Python equivalent here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
 * Add a dataset constructor that takes a list of {{Fragment}}s (a pyarrow sketch of the existing Python equivalent follows below)
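
For reference, a minimal pyarrow sketch of the existing Python equivalent being mirrored here (the path, schema, and partition expression below are illustrative, not taken from a real table format):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

# Illustrative inputs; a real table format would get these from its own metadata.
fs = LocalFileSystem()
schema = pa.schema([("value", pa.int64()), ("part", pa.string())])
fmt = ds.ParquetFileFormat()

# Build a fragment directly, attaching its partition expression,
# instead of running filesystem discovery.
fragment = fmt.make_fragment(
    "/tmp/my_table/part=a/data.parquet",
    filesystem=fs,
    partition_expression=(ds.field("part") == "a"),
)

# Construct the dataset from the list of fragments.
dataset = ds.FileSystemDataset([fragment], schema=schema, format=fmt, filesystem=fs)
{code}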





[jira] [Created] (ARROW-15316) [R] Make a one-function pointer function

2022-01-12 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15316:
--

 Summary: [R] Make a one-function pointer function
 Key: ARROW-15316
 URL: https://issues.apache.org/jira/browse/ARROW-15316
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Jonathan Keane
Assignee: Dragoș Moldovan-Grünfeld


In the ARROW-15173 [PR|https://github.com/apache/arrow/pull/12062/files] we added backwards compatibility for pointers between R and Python, using `external_pointer_addr_double()` with old Python versions. We could turn a number of blocks like:

{code}
if (pyarrow_version() >= pyarrow_version_pointers_changed) {
  x$`_export_to_c`(schema_ptr)
} else {
  x$`_export_to_c`(external_pointer_addr_double(schema_ptr))
}
{code}

to 

{code}
x$`_export_to_c`(backwards_compatible_pointer(schema_ptr))
{code}

with {{backwards_compatible_pointer}} encapsulating the if/else.





[jira] [Created] (ARROW-15315) FlightSqlProducer#doAction always throws INVALID_ARGUMENT

2022-01-12 Thread Vinicius Fraga (Jira)
Vinicius Fraga created ARROW-15315:
--

 Summary: FlightSqlProducer#doAction always throws INVALID_ARGUMENT
 Key: ARROW-15315
 URL: https://issues.apache.org/jira/browse/ARROW-15315
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Java
Affects Versions: 7.0.0
Reporter: Vinicius Fraga
 Fix For: 7.0.0


Because a return (or else block) is missing, an exception is always thrown, even when the Action exists.
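
For illustration, a minimal sketch of the control-flow pattern being described (written in Python for brevity; the real code is Java in FlightSqlProducer, and the names below are made up):

{code:python}
def do_action(action_type):
    if action_type == "CancelQuery":
        result = "cancelled"
        # BUG: a missing `return result` (or an else around the line below)
        # means execution falls through and the error is raised anyway.
    raise ValueError("INVALID_ARGUMENT: unrecognized action: " + action_type)

do_action("CancelQuery")  # raises even though the action exists
{code}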





[jira] [Created] (ARROW-15314) Add missing metadata on Arrow schemas returned by Flight SQL

2022-01-12 Thread Jose Almeida (Jira)
Jose Almeida created ARROW-15314:


 Summary: Add missing metadata on Arrow schemas returned by Flight 
SQL
 Key: ARROW-15314
 URL: https://issues.apache.org/jira/browse/ARROW-15314
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Java
Reporter: Jose Almeida


This adds auxiliary classes FlightSqlColumnMetadata (Java) and ColumnMetadata (C++) meant to read and write known metadata for Arrow schema fields (see the small sketch after the list), such as:
 * CATALOG_NAME
 * SCHEMA_NAME
 * TABLE_NAME
 * PRECISION
 * SCALE
 * IS_AUTO_INCREMENT
 * IS_CASE_SENSITIVE
 * IS_READ_ONLY
 * IS_SEARCHABLE
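
For illustration, a minimal pyarrow sketch of attaching such key/value metadata to a schema field (the key strings below are placeholders, not necessarily the exact constants the new classes would define):

{code:python}
import pyarrow as pa

# Field-level metadata travels as string key/value pairs on the Arrow schema.
field = pa.field(
    "amount",
    pa.decimal128(10, 2),
    metadata={
        "TABLE_NAME": "orders",
        "PRECISION": "10",
        "SCALE": "2",
        "IS_READ_ONLY": "true",
    },
)
schema = pa.schema([field])
print(schema.field("amount").metadata)  # keys/values round-trip as bytes
{code}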





[jira] [Created] (ARROW-15313) Add typeInfo functionality to flight-sql

2022-01-12 Thread Jose Almeida (Jira)
Jose Almeida created ARROW-15313:


 Summary: Add typeInfo functionality to flight-sql
 Key: ARROW-15313
 URL: https://issues.apache.org/jira/browse/ARROW-15313
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Java
Reporter: Jose Almeida


This issue adds a new piece of functionality to Flight SQL: the typeInfo command, which retrieves information about the data types supported by the source.





[jira] [Created] (ARROW-15312) [R] filtering a dataset with is.na() misses some rows

2022-01-12 Thread Pierre Gramme (Jira)
Pierre Gramme created ARROW-15312:
-

 Summary: [R] filtering a dataset with is.na() misses some rows
 Key: ARROW-15312
 URL: https://issues.apache.org/jira/browse/ARROW-15312
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 6.0.1
 Environment: R 4.1.2 on Windows
arrow 6.0.1
dplyr 1.0.7
Reporter: Pierre Gramme


Hi!

I just found an issue when querying an Arrow dataset with dplyr, filtering on is.na(...).

It seems linked to columns containing only one distinct value and some NAs.

Can you also reproduce the following?
{code}
library(arrow)
library(dplyr)

ds_path = "test-arrow-na"
df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))

df %>% arrow::write_dataset(ds_path)

# OK: Collect then filter: returns row 3, as expected
arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))

# ERROR: Filter then collect (on y) returns a tibble with no row
arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()

# OK: Filter then collect (on z) returns row 3, as expected
arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
{code}
 

Thanks

Pierre





[jira] [Created] (ARROW-15311) [C++][Python] Opening a partitioned dataset with schema and filter

2022-01-12 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-15311:
---

 Summary: [C++][Python] Opening a partitioned dataset with schema 
and filter
 Key: ARROW-15311
 URL: https://issues.apache.org/jira/browse/ARROW-15311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Alenka Frim


Add a note to the docs that if both partitioning and a schema are specified when opening a dataset, and the partition field names are not included in the data files, then the schema needs to include the partition field names (for directory or hive partitioning) in case filtering on them will be done.

Example:

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Define the data
table = pa.table({'one': [-1, np.nan, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

# Write to partitioned dataset
# The files will include columns "two" and "three"
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one'])

# Reading the partitioned dataset with a schema that does not include the
# partition names will error

schema = pa.schema([("three", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

# But will not error if the schema includes the partition field:
schema = pa.schema([("three", "double"), ("one", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

{code}





[jira] [Created] (ARROW-15310) [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?

2022-01-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15310:
-

 Summary: [C++][Python][Dataset] Detect (and warn?) when 
DirectoryPartitioning is parsing an actually hive-style file path?
 Key: ARROW-15310
 URL: https://issues.apache.org/jira/browse/ARROW-15310
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche


When you have a hive-style partitioned dataset, with our current 
{{dataset(..)}} API, it's relatively easy to mess up the inferred partitioning 
and get confusing results. 

For example, if you specify the partitioning field names with 
{{partitioning=[...]}} (which is not needed for hive style since those are 
inferred), we actually assume you want directory partitioning. This 
DirectoryPartitioning will then parse the hive-style file paths and take the 
full "key=value" as the data values for the field.  
And then, doing a filter can result in a confusing empty result (because 
"value" doesn't match "key=value").

I am wondering if we can't relatively cheaply detect this case, and e.g. give an informative warning about it to the user.

Basically what happens is this:

{code:python}
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
# returns an expression equivalent to (part == "part=a")
{code}

If the parsed value is a string that contains a "=" (and in this case also 
contains the field name), that is I think a clear sign that (in the large 
majority of cases) the user is doing something wrong.

I am not fully sure where and at what stage the check could be done though. 
Doing it for every path in the dataset might be too costly.
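
For illustration, a rough Python-level sketch of the kind of heuristic meant here (the real check would live in the C++ partitioning/dataset code; the helper name below is hypothetical):

{code:python}
import warnings

def warn_if_hive_like(field_name, parsed_value):
    # Hypothetical helper: if DirectoryPartitioning parsed a path segment whose
    # value itself looks like "key=value", the user most likely wanted
    # hive-style partitioning instead of directory partitioning.
    if isinstance(parsed_value, str) and "=" in parsed_value:
        warnings.warn(
            f"Partition field {field_name!r} got value {parsed_value!r}, which "
            "contains '='; did you mean partitioning='hive'?"
        )

warn_if_hive_like("part", "part=a")  # emits the warning
{code}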




Illustrative code example:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

## constructing a small dataset with 1 hive-style partitioning level

basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)

(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")

table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")
{code}

Reading as is (not specifying a partitioning, so default to no partitioning) 
will at least give an error about a missing field:

{code:python}
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64
{code}

But specifying the partitioning field name (which currently gets (silently) 
interpreted as directory partitioning) gives a confusing empty result:

{code:python}
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string

a: []
b: []
part: []
{code}





[jira] [Created] (ARROW-15309) [TESTS] Add a testing coverage tool

2022-01-12 Thread Benson Muite (Jira)
Benson Muite created ARROW-15309:


 Summary: [TESTS] Add a testing coverage tool
 Key: ARROW-15309
 URL: https://issues.apache.org/jira/browse/ARROW-15309
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Reporter: Benson Muite


It would be good to estimate what fraction of the code the tests cover. Using a tool such as [CodeCov|https://about.codecov.io] or [Coveralls|https://coveralls.io/] may be helpful.





[jira] [Created] (ARROW-15308) [TESTS] Add a testing coverage tool

2022-01-12 Thread Benson Muite (Jira)
Benson Muite created ARROW-15308:


 Summary: [TESTS] Add a testing coverage tool
 Key: ARROW-15308
 URL: https://issues.apache.org/jira/browse/ARROW-15308
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Reporter: Benson Muite


It would be good to estimate what fraction of the code the tests cover. Using a tool such as [CodeCov|https://about.codecov.io] or [Coveralls|https://coveralls.io/] may be helpful.





[jira] [Created] (ARROW-15307) [C++][Dataset] Provide more context in error message if cast fails during scanning

2022-01-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15307:
-

 Summary: [C++][Dataset] Provide more context in error message if 
cast fails during scanning
 Key: ARROW-15307
 URL: https://issues.apache.org/jira/browse/ARROW-15307
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


If you have a partitioned dataset, and one of the files has a column with a mismatching type that cannot be safely cast to the dataset schema's type for that column, you (as expected) get an error about this cast.

Small illustrative example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

## constructing a small dataset with two files

basedir = pathlib.Path(".") / "dataset_test_mismatched_schema"
basedir.mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "data1.parquet")

table2 = pa.table({'a': [1.5, 2.0, 3.0], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "data2.parquet")

## reading the dataset

dataset = ds.dataset(basedir)
# by default infer dataset schema from first file
dataset.schema
# actually reading gives expected error
dataset.to_table()
{code}

gives

{code:python}
>>> dataset.schema
a: int64
b: int64
>>> dataset.to_table()
---
ArrowInvalid  Traceback (most recent call last)
 in 
 22 dataset.schema
 23 # actually reading gives expected error
---> 24 dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset.Dataset.to_table()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset.Scanner.to_table()

~/scipy/repos/arrow/python/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Float value 1.5 was truncated converting to int64

../src/arrow/compute/kernels/scalar_cast_numeric.cc:177  
CheckFloatToIntTruncation(batch[0], *out)
../src/arrow/compute/exec.cc:700  kernel_->exec(kernel_ctx_, batch, &out)
../src/arrow/compute/exec.cc:641  ExecuteBatch(batch, listener)
../src/arrow/compute/function.cc:248  executor->Execute(implicitly_cast_args, 
&listener)
../src/arrow/compute/exec/expression.cc:444  compute::Cast(column, 
field->type(), compute::CastOptions::Safe())
../src/arrow/dataset/scanner.cc:816  
compute::MakeExecBatch(*scan_options->dataset_schema, 
partial.record_batch.value)
{code}

So the actual error message (without the extra C++ context) is only *"ArrowInvalid: Float value 1.5 was truncated converting to int64"*.

This error message only says something about the two types and the first value that cannot be cast; if you have a large dataset with many fragments and/or many columns, it can be hard to know 1) for which column this is failing and 2) for which fragment it is failing.

So it would be nice to add some extra context to the error message.
The cast itself of course doesn't know this context, but when doing the cast in the scanner code we at least know e.g. the physical schema and the dataset schema, so we could append or prepend the error message with something like "Casting from schema1 to schema2 failed with ...".

cc [~alenkaf]


