Coming back to this older thread, specifically on the topic of "duplicated" information stored both as a partition field and as an actual column in the data:
On Fri, 25 Sep 2020 at 14:43, Robin Kåveland Hansen <kaavel...@gmail.com> wrote:
> Hi,
>
> Just thought I'd chime in on this point:
>
> > - In your case, the partitioning has the same name as one of the actual
> > columns in the data files. I am not sure this corner case of duplicate
> > fields is tested very well, or how the filtering will work?
>
> I _think_ this is the default behaviour for pyspark for writes. E.g. the
> column is both in the data files as well as in the partition.
>
> I think this might actually make sense, though, since putting the
> partition column in the schema means you'll know what type it should be
> when you read it back from disk (at least for data files that support
> schemas).

Thanks for this feedback! I wasn't aware that this is something pyspark can do (for example, I know that Dask does not include the partition column in the actual data). But then we need to ensure we handle this correctly.

I did a few experiments to check the support (I don't know whether we explicitly ensured such support when implementing the datasets), and I observe the following behaviour when a partition field duplicates an actual data column (a standalone sketch reproducing this is appended below):

* The schema of the dataset doesn't include the column twice; it uses the schema of the parquet file (including parquet metadata such as the field_id).
* When reading, it returns the values as they are in the physical parquet files.
* When filtering, it uses the partition fields (i.e. the information in the file paths) and doesn't do any additional check / filter using the physical data in the column (so if your partition field and column are not in sync, this can give wrong results).
* When the partition field's inferred type doesn't match the file's schema for the partition column, you get an appropriate error (the one exception is types that are "compatible", like int32 and int64: we should actually support that case, but right now it errors as well).

I _think_ this behaviour is correct / as expected, but feedback on that is certainly welcome. The actual code, with output, of the small experiment can be seen in this notebook:
https://nbviewer.jupyter.org/gist/jorisvandenbossche/9382de2eb96db5db2ef801f63a359082

It would probably be good to add some explicit tests to ensure we support this use case properly (I opened https://issues.apache.org/jira/browse/ARROW-10347 for this).

Joris
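
A minimal standalone sketch of the setup described in the thread, with a made-up path, column names, and values (the notebook linked above has the real experiment). Since pyarrow's own partitioned writer drops the partition column from the files, this mimics pyspark's layout by writing the hive-style directories by hand:

    # Hypothetical sketch: write hive-style directories manually so that
    # the "year" column ends up both in the file paths *and* in the
    # parquet files themselves, like pyspark does by default.
    import pathlib

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    base = pathlib.Path("/tmp/dup_partition_dataset")
    for year in (2019, 2020):
        part_dir = base / f"year={year}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # int32 matches the type that hive partitioning inference guesses
        # from the "year=..." directory names, side-stepping the
        # type-mismatch error from the last bullet above.
        table = pa.table({
            "year": pa.array([year, year], type=pa.int32()),
            "value": pa.array([1.0, 2.0]),
        })
        pq.write_table(table, str(part_dir / "part-0.parquet"))

    dataset = ds.dataset(str(base), format="parquet", partitioning="hive")
    # The duplicated field appears only once in the dataset schema, with
    # the type taken from the parquet files (first bullet above):
    print(dataset.schema)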
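
And a sketch of the pitfall from the third bullet, continuing from the directory created above (again with made-up values, and assuming the behaviour observed in the experiment): a file is placed under year=2020/ whose physical "year" column actually says 2019, and the filter only prunes on the path.

    # Continuing the sketch above: an out-of-sync file under year=2020/
    # whose physical "year" column contains 2019.
    import pathlib

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    base = pathlib.Path("/tmp/dup_partition_dataset")  # created above
    out_of_sync = pa.table({
        "year": pa.array([2019, 2019], type=pa.int32()),
        "value": pa.array([3.0, 4.0]),
    })
    pq.write_table(out_of_sync, str(base / "year=2020" / "part-1.parquet"))

    dataset = ds.dataset(str(base), format="parquet", partitioning="hive")
    result = dataset.to_table(filter=ds.field("year") == 2020)
    # Per the observations above, the filter is evaluated against the
    # partition field from the path, so the out-of-sync rows are returned
    # as well, and the "year" values in the result are the physical ones
    # (2019) rather than the value encoded in the path.
    print(result.to_pydict())

(If the files had stored "year" as int64 instead of int32, that would hit the compatible-types error described in the last bullet.)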