Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

Etsuro Fujita Mon, 06 Aug 2012 22:57:39 -0700

> From: Robert Haas [mailto:[email protected]]


> On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <[email protected]> wrote:
> > Robert Haas <[email protected]> writes:
> >> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
> >> <[email protected]> wrote:
> >>> I think file_fdw is useful for managing log files such as PG CSV logs.
> >>> Since often, such files are sorted by timestamp, I think the patch can
> >>> improve the performance of log analysis, though I have to admit my
> >>> demonstration was not realistic.
> >
> >> Hmm, I guess I could buy that as a plausible use case.
> >
> > In the particular case of PG log files, I'd bet good money against them
> > being *exactly* sorted by timestamp.  Clock skew between backends, or
> > varying amounts of time to construct and send messages, will result in
> > small inconsistencies.  This would generally not matter, until the
> > planner relied on the claim of sortedness for something like a mergejoin
> > ... and then it would matter a lot.
> 
> Hmm, true.
> 
> > In general I'm quite suspicious of the idea of believing that externally
> > supplied data is sorted in exactly the way that PG thinks it should
> > sort.  If we implement this you can bet that people will screw up, for
> > instance by using the wrong locale/collation to sort text data.
> 
> I think that optimizations like this are going to be essential for
> things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
> semantic issues, we're just not going to be able to get around it.
> There will even be people who want SELECT * FROM ft ORDER BY 1 to
> order by the remote side's notion of ordering rather than ours,
> despite the fact that the remote side has some insane-by-PG-standards
> definition of ordering.  People are going to find ways to do that kind
> of thing whether we condone it or not, so we might as well start
> thinking now about how we're going to live with it.  But that doesn't
> answer the question of whether or not we ought to support it for
> file_fdw in particular, which seems like a more arguable point.

For file_fdw, I am inclined to simply have it (1) verify at execution time,
i.e., during the (first) scan of a data file, that the key column is sorted in
the specified way, doing so only when pathkeys are set, and (2) abort the
transaction if it detects that the data file is not sorted.
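
A minimal sketch of the kind of check described above, not the patch itself.
It assumes the key column has already been parsed into an integer (for
example, a timestamp in microseconds) and that the declared order is
ascending. The names SortCheckState, sort_check_init, sort_check_row and
first_unsorted_row are hypothetical and do not exist in file_fdw. Here a
violation is only reported to the caller; in a real FDW the scan code would
presumably raise it with ereport(ERROR, ...), which aborts the transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-scan state; nothing here is part of file_fdw. */
typedef struct SortCheckState
{
    bool        check_enabled;  /* true only when the plan relied on pathkeys */
    bool        have_prev;      /* false until the first row has been seen */
    long long   prev_key;       /* key value of the previous row */
} SortCheckState;

/* Start a new scan; the check is active only if pathkeys were set. */
void
sort_check_init(SortCheckState *state, bool has_pathkeys)
{
    assert(state != NULL);
    state->check_enabled = has_pathkeys;
    state->have_prev = false;
    state->prev_key = 0;
}

/*
 * Feed the key of the next row read from the file.  Returns false if the
 * row breaks the declared ascending order, true otherwise.  Equal keys are
 * accepted.  A real FDW would raise an error instead of returning false.
 */
bool
sort_check_row(SortCheckState *state, long long key)
{
    assert(state != NULL);
    if (!state->check_enabled)
        return true;            /* planner did not rely on sortedness */
    if (state->have_prev && key < state->prev_key)
        return false;           /* out-of-order row detected */
    state->prev_key = key;
    state->have_prev = true;
    return true;
}

/*
 * Scan an array of keys standing in for the rows of a data file.  Returns
 * the zero-based index of the first out-of-order row, or -1 if none.
 */
ptrdiff_t
first_unsorted_row(const long long *keys, size_t nkeys, bool has_pathkeys)
{
    SortCheckState state;
    size_t      i;

    sort_check_init(&state, has_pathkeys);
    for (i = 0; i < nkeys; i++)
    {
        if (!sort_check_row(&state, keys[i]))
            return (ptrdiff_t) i;
    }
    return -1;
}
```

The check costs one comparison per row and keeps only the previous key, so it
adds no extra pass over the file. The small inconsistencies Tom mentions
would be caught at the first offending row, before a mergejoin could consume
misordered input. Comparing real text keys would additionally have to use the
same collation the planner assumed, which this integer sketch does not cover.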

Thanks,

Best regards,
Etsuro Fujita


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
