Coming back to this older thread, specifically on the topic of "duplicated" information stored both as a partition field and as an actual column in the data:
On Fri, 25 Sep 2020 at 14:43, Robin Kåveland Hansen <kaavel...@gmail.com> wrote:
> Hi,
>
> Just thought I'd chime in on this point:
>
> > - In your case, the partitioning has the same name as one of the actual
> > columns in the data files. I am not sure this corner case of duplicate
> > fields is tested very well, or how the filtering will work?
>
> I _think_ this is the default behaviour for pyspark for writes. E.g. the
> column is both in the data files as well as in the partition.
>
> I think this might actually make sense, though, since putting the
> partition column in the schema means you'll know what type it should be
> when you read it back from disk (at least for data files that support
> schemas).

Thanks for this feedback! I wasn't aware that this is something pyspark can do (for example, I know that Dask does not include the partition column in the actual data). But then we need to ensure we handle this correctly.

I did a few experiments to check the support (I don't know whether we explicitly ensured such support when implementing the datasets), and I observe the following behaviour when a partition field duplicates an actual data column (a standalone sketch reproducing this is appended below):

* The schema of the dataset doesn't include the column twice; it uses the schema of the parquet file (including parquet metadata such as the field_id).
* When reading, it returns the values as they are in the physical parquet files.
* When filtering, it uses the partition fields (i.e. the information in the file paths) and doesn't do any additional check / filter using the physical data in the column (so if your partition field and column are not in sync, this can give wrong results).
* When the partition field's inferred type doesn't match the file's schema for the partition column, you get an appropriate error (the one exception is types that are "compatible", like int32 and int64: we should actually support that case, but right now it errors as well).

I _think_ this behaviour is correct / as expected, but feedback on that is certainly welcome. The actual code, with output, of the small experiment can be seen in this notebook:
https://nbviewer.jupyter.org/gist/jorisvandenbossche/9382de2eb96db5db2ef801f63a359082

It would probably be good to add some explicit tests to ensure we support this use case properly (I opened https://issues.apache.org/jira/browse/ARROW-10347 for this).

Joris
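
A minimal standalone sketch of the setup described in the thread, with a made-up path, column names, and values (the notebook linked above has the real experiment). Since pyarrow's own partitioned writer drops the partition column from the files, this mimics pyspark's layout by writing the hive-style directories by hand:

    # Hypothetical sketch: write hive-style directories manually so that
    # the "year" column ends up both in the file paths *and* in the
    # parquet files themselves, like pyspark does by default.
    import pathlib

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    base = pathlib.Path("/tmp/dup_partition_dataset")
    for year in (2019, 2020):
        part_dir = base / f"year={year}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # int32 matches the type that hive partitioning inference guesses
        # from the "year=..." directory names, side-stepping the
        # type-mismatch error from the last bullet above.
        table = pa.table({
            "year": pa.array([year, year], type=pa.int32()),
            "value": pa.array([1.0, 2.0]),
        })
        pq.write_table(table, str(part_dir / "part-0.parquet"))

    dataset = ds.dataset(str(base), format="parquet", partitioning="hive")
    # The duplicated field appears only once in the dataset schema, with
    # the type taken from the parquet files (first bullet above):
    print(dataset.schema)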
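
And a sketch of the pitfall from the third bullet, continuing from the directory created above (again with made-up values, and assuming the behaviour observed in the experiment): a file is placed under year=2020/ whose physical "year" column actually says 2019, and the filter only prunes on the path.

    # Continuing the sketch above: an out-of-sync file under year=2020/
    # whose physical "year" column contains 2019.
    import pathlib

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    base = pathlib.Path("/tmp/dup_partition_dataset")  # created above
    out_of_sync = pa.table({
        "year": pa.array([2019, 2019], type=pa.int32()),
        "value": pa.array([3.0, 4.0]),
    })
    pq.write_table(out_of_sync, str(base / "year=2020" / "part-1.parquet"))

    dataset = ds.dataset(str(base), format="parquet", partitioning="hive")
    result = dataset.to_table(filter=ds.field("year") == 2020)
    # Per the observations above, the filter is evaluated against the
    # partition field from the path, so the out-of-sync rows are returned
    # as well, and the "year" values in the result are the physical ones
    # (2019) rather than the value encoded in the path.
    print(result.to_pydict())

(If the files had stored "year" as int64 instead of int32, that would hit the compatible-types error described in the last bullet.)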