Hi,

Just thought I'd chime in on this point:

> - In your case, the partitioning has the same name as one of the actual
columns in the data files. I am not sure this corner case of duplicate
fields is tested very well, or how the filtering will work?

I _think_ this is the default behaviour for pyspark writes, i.e. the
column ends up both in the data files and in the partition path.

I think this might actually make sense, though: putting the partition
column in the schema means you know what type it should be when you read
it back from disk (at least for file formats that carry a schema).
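To illustrate the type point with a minimal sketch (plain Python, no
pyspark; `partition_path` is a hypothetical helper, not a Spark API):
hive-style partition paths encode values as strings in directory names,
so a reader that only sees the path has to guess the type, whereas a
column stored in the data file keeps its schema type.

```python
def partition_path(base, col, value):
    # Hive-style layout: the partition value is rendered as a string
    # in the directory name, e.g. /data/events/year=2024
    return f"{base}/{col}={value}"

path = partition_path("/data/events", "year", 2024)

# Recovering the value from the path gives back a string:
recovered = path.rsplit("=", 1)[1]
assert recovered == "2024"   # it is a string now
assert recovered != 2024     # the integer type was lost in the path
```

So if the column also lives in (say) a parquet file, the file's schema
settles the type question that the path alone cannot.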

-- 
Kind regards,
Robin Kåveland
