Joris Van den Bossche created ARROW-15310:
---------------------------------------------
             Summary: [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path?
                 Key: ARROW-15310
                 URL: https://issues.apache.org/jira/browse/ARROW-15310
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Joris Van den Bossche


When you have a hive-style partitioned dataset, it is relatively easy with the current {{dataset(..)}} API to mess up the inferred partitioning and get confusing results. For example, if you specify the partitioning field names with {{partitioning=[...]}} (which is not needed for hive style, since those are inferred), we actually assume you want directory partitioning. This DirectoryPartitioning will then parse the hive-style file paths and take the full "key=value" string as the data value for the field. A subsequent filter can then produce a confusing empty result (because "value" doesn't match "key=value").

I am wondering whether we can relatively cheaply detect this case and, e.g., give an informative warning to the user. Basically, what happens is this:

{code:python}
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
<pyarrow.dataset.Expression (part == "part=a")>
{code}

If the parsed value is a string that contains a "=" (and, in this case, also contains the field name), that is I think a clear sign that (in the large majority of cases) the user is doing something wrong. I am not fully sure where and at what stage the check could be done, though: doing it for every path in the dataset might be too costly.
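A minimal sketch of such a heuristic in plain Python (not the actual C++ implementation; the function name and warning text here are purely illustrative):

{code:python}
import warnings


def looks_like_hive_segment(field_name, parsed_value):
    """Heuristic sketch: a directory-partitioned value of the form
    "<field_name>=..." suggests the path is actually hive-style partitioned."""
    if "=" in parsed_value and parsed_value.split("=", 1)[0] == field_name:
        warnings.warn(
            f"Value {parsed_value!r} parsed for field {field_name!r} looks like "
            "a hive-style 'key=value' segment; did you mean partitioning='hive'?"
        )
        return True
    return False
{code}

To keep the cost low, such a check could be applied only to the first discovered path (or a small sample of paths) rather than to every file in the dataset.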
----

Illustrative code example:

{code:python}
import pathlib

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# construct a small dataset with one hive-style partitioning level
basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)
(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")
table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")
{code}

Reading it as is (not specifying a partitioning, so defaulting to no partitioning) at least gives an error about a missing field:

{code:python}
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64
{code}

But specifying the partitioning field name (which currently gets silently interpreted as directory partitioning) gives a confusing empty result:

{code:python}
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string
----
a: []
b: []
part: []
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
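For contrast, a sketch of the intended usage (same directory layout as above, rebuilt here in a temporary directory so the snippet is self-contained): passing {{partitioning="hive"}} lets the partition keys be inferred from the "key=value" path segments, and the filter matches as expected:

{code:python}
import pathlib
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# rebuild the same hive-style layout in a temporary directory
tmp = tempfile.mkdtemp()
basedir = pathlib.Path(tmp) / "dataset_hive"
for part, values in [("a", [1, 2, 3]), ("b", [4, 5, 6])]:
    (basedir / f"part={part}").mkdir(parents=True)
    pq.write_table(pa.table({'a': values, 'b': [1, 2, 3]}),
                   basedir / f"part={part}" / "data.parquet")

# explicit hive partitioning: "part" is inferred as a partition field
dataset = ds.dataset(basedir, partitioning="hive")
result = dataset.to_table(filter=ds.field("part") == "a")
{code}

{{result}} now contains the three rows from the "part=a" directory, with the partition column materialized as the value "a" (not "part=a").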