Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
On Tue, Aug 7, 2012 at 2:02 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I think that optimizations like this are going to be essential for things like pgsql_fdw (or other_rdms_fdw). Despite the thorny semantic issues, we're just not going to be able to get around it. There will even be people who want SELECT * FROM ft ORDER BY 1 to order by the remote side's notion of ordering rather than ours, despite the fact that the remote side has some insane-by-PG-standards definition of ordering. People are going to find ways to do that kind of thing whether we condone it or not, so we might as well start thinking now about how we're going to live with it. But that doesn't answer the question of whether or not we ought to support it for file_fdw in particular, which seems like a more arguable point. For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key column is sorted in the specified way at the execution phase ie, at the (first) scan of a data file, only when pathkeys are set, and (2) to abort the transaction if it detects the data file is not sorted. That seems like an even worse idea. People who want to access data repeatedly should load it into tables. Mind you, if you want to publish a version of file_fdw on PGXN that does this, that's fine with me. But I don't think it belongs in core, at least not without a lot more evidence that there is a real demand for this than we have so far. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I think file_fdw is useful for managing log files such as PG CSV logs. Since often, such files are sorted by timestamp, I think the patch can improve the performance of log analysis, though I have to admit my demonstration was not realistic. Hmm, I guess I could buy that as a plausible use case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
Robert Haas robertmh...@gmail.com writes: On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I think file_fdw is useful for managing log files such as PG CSV logs. Since often, such files are sorted by timestamp, I think the patch can improve the performance of log analysis, though I have to admit my demonstration was not realistic. Hmm, I guess I could buy that as a plausible use case. In the particular case of PG log files, I'd bet good money against them being *exactly* sorted by timestamp. Clock skew between backends, or varying amounts of time to construct and send messages, will result in small inconsistencies. This would generally not matter, until the planner relied on the claim of sortedness for something like a mergejoin ... and then it would matter a lot. In general I'm quite suspicious of the idea of believing that externally supplied data is sorted in exactly the way that PG thinks it should sort. If we implement this you can bet that people will screw up, for instance by using the wrong locale/collation to sort text data. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I think file_fdw is useful for managing log files such as PG CSV logs. Since often, such files are sorted by timestamp, I think the patch can improve the performance of log analysis, though I have to admit my demonstration was not realistic. Hmm, I guess I could buy that as a plausible use case. In the particular case of PG log files, I'd bet good money against them being *exactly* sorted by timestamp. Clock skew between backends, or varying amounts of time to construct and send messages, will result in small inconsistencies. This would generally not matter, until the planner relied on the claim of sortedness for something like a mergejoin ... and then it would matter a lot. Hmm, true. In general I'm quite suspicious of the idea of believing that externally supplied data is sorted in exactly the way that PG thinks it should sort. If we implement this you can bet that people will screw up, for instance by using the wrong locale/collation to sort text data. I think that optimizations like this are going to be essential for things like pgsql_fdw (or other_rdms_fdw). Despite the thorny semantic issues, we're just not going to be able to get around it. There will even be people who want SELECT * FROM ft ORDER BY 1 to order by the remote side's notion of ordering rather than ours, despite the fact that the remote side has some insane-by-PG-standards definition of ordering. People are going to find ways to do that kind of thing whether we condone it or not, so we might as well start thinking now about how we're going to live with it. But that doesn't answer the question of whether or not we ought to support it for file_fdw in particular, which seems like a more arguable point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
From: Robert Haas [mailto:robertmh...@gmail.com] On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I think file_fdw is useful for managing log files such as PG CSV logs. Since often, such files are sorted by timestamp, I think the patch can improve the performance of log analysis, though I have to admit my demonstration was not realistic. Hmm, I guess I could buy that as a plausible use case. In the particular case of PG log files, I'd bet good money against them being *exactly* sorted by timestamp. Clock skew between backends, or varying amounts of time to construct and send messages, will result in small inconsistencies. This would generally not matter, until the planner relied on the claim of sortedness for something like a mergejoin ... and then it would matter a lot. Hmm, true. In general I'm quite suspicious of the idea of believing that externally supplied data is sorted in exactly the way that PG thinks it should sort. If we implement this you can bet that people will screw up, for instance by using the wrong locale/collation to sort text data. I think that optimizations like this are going to be essential for things like pgsql_fdw (or other_rdms_fdw). Despite the thorny semantic issues, we're just not going to be able to get around it. There will even be people who want SELECT * FROM ft ORDER BY 1 to order by the remote side's notion of ordering rather than ours, despite the fact that the remote side has some insane-by-PG-standards definition of ordering. People are going to find ways to do that kind of thing whether we condone it or not, so we might as well start thinking now about how we're going to live with it. But that doesn't answer the question of whether or not we ought to support it for file_fdw in particular, which seems like a more arguable point. For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key column is sorted in the specified way at the execution phase ie, at the (first) scan of a data file, only when pathkeys are set, and (2) to abort the transaction if it detects the data file is not sorted. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
Hi Robert, From: Robert Haas [mailto:robertmh...@gmail.com] On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c: /* * If data file was sorted, and we knew it somehow, we could insert * appropriate pathkeys into the ForeignPath node to tell the planner * that. */ To do this, I would like to propose new generic options for a file_fdw foreign table to specify the sortedness of a data file. While it is best to allow to specify the sortedness on multiple columns, the current interface for the generic options dose not seems to be suitable for doing it. As a compromise, I would like to propose single-column sortedness options and insert appropriate pathkeys into the ForeignPath node based on these information: I am not sure it is a good idea to complicate file_fdw with frammishes of marginal utility. I guess I tend to view things like file_fdw as a mechanism for getting the data into the database, not necessarily something that you actually want to keep your data in permanently and run complex queries against. I think file_fdw is useful for managing log files such as PG CSV logs. Since often, such files are sorted by timestamp, I think the patch can improve the performance of log analysis, though I have to admit my demonstration was not realistic. It seems like that's the direction we're headed in here - statistics, indexing, etc. I am all in favor of having some kind of pluggable storage engine as an alternative to our heap, but I'm not sure a flat-file is a good choice. As you pointed out, I would like to allow indexing to be done for CSV foreign tables, but that is another problem. The submitted patch or the above comment is not something toward indexing, so to say, an optimization of the current file_fdw module. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c: /* * If data file was sorted, and we knew it somehow, we could insert * appropriate pathkeys into the ForeignPath node to tell the planner * that. */ To do this, I would like to propose new generic options for a file_fdw foreign table to specify the sortedness of a data file. While it is best to allow to specify the sortedness on multiple columns, the current interface for the generic options dose not seems to be suitable for doing it. As a compromise, I would like to propose single-column sortedness options and insert appropriate pathkeys into the ForeignPath node based on these information: I am not sure it is a good idea to complicate file_fdw with frammishes of marginal utility. I guess I tend to view things like file_fdw as a mechanism for getting the data into the database, not necessarily something that you actually want to keep your data in permanently and run complex queries against. It seems like that's the direction we're headed in here - statistics, indexing, etc. I am all in favor of having some kind of pluggable storage engine as an alternative to our heap, but I'm not sure a flat-file is a good choice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c: /* * If data file was sorted, and we knew it somehow, we could insert * appropriate pathkeys into the ForeignPath node to tell the planner * that. */ To do this, I would like to propose new generic options for a file_fdw foreign table to specify the sortedness of a data file. While it is best to allow to specify the sortedness on multiple columns, the current interface for the generic options dose not seems to be suitable for doing it. As a compromise, I would like to propose single-column sortedness options and insert appropriate pathkeys into the ForeignPath node based on these information: sorted: Boolean option to specify whether data file is sorted by a column. key: Specifies the name of a column by which data file is sorted. Required when the above option is set to true. direction: Specifies the sort order: asc or desc. The default is asc. nulls: Specifies that nulls sort before or after non-nulls: first or last. first is the default when direction option is set desc. When direction option is not set desc, last is the default. Attached is a WIP patch implementing this feature. I would like to demonstrate the usefulness of the patch. Experimental results are shown below. Here, data in /home/pgsql/relation.csv is sorted by aid in ascending order. postgres=# CREATE EXTENSION file_fdw; CREATE EXTENSION postgres=# CREATE SERVER fs FOREIGN DATA WRAPPER file_fdw; CREATE SERVER postgres=# CREATE FOREIGN TABLE ft (aid INTEGER, bid INTEGER) SERVER fs OPTIONS (filename '/home/pgsql/relation.csv', format 'csv', delimiter ','); CREATE FOREIGN TABLE postgres=# ANALYZE ft; ANALYZE postgres=# SELECT count(*) FROM ft; count -- 5000 (1 row) postgres=# SELECT count(DISTINCT aid) FROM ft; count - 100 (1 row) postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid; QUERY PLAN Unique (cost=14187375.19..14437375.19 rows=1092929 width=4) (actual time=48952 .602..62788.934 rows=100 loops=1) - Sort (cost=14187375.19..14312375.19 rows=5000 width=4) (actual time= 48952.601..56635.448 rows=5000 loops=1) Sort Key: aid Sort Method: external sort Disk: 684272kB - Foreign Scan on ft (cost=0.00..5059137.00 rows=5000 width=4) ( actual time=0.073..18324.062 rows=5000 loops=1) Foreign File: /home/pgsql/relation.csv Foreign File Size: 48500 Total runtime: 63019.868 ms (8 rows) postgres=# ALTER FOREIGN TABLE ft OPTIONS ( ADD sorted 'true', key 'aid' ); ALTER FOREIGN TABLE postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid; QUERY PLAN -- Unique (cost=0.00..5184137.00 rows=1092929 width=4) (actual time=0.074..23124. 195 rows=100 loops=1) - Foreign Scan on ft (cost=0.00..5059137.00 rows=5000 width=4) (actual time=0.070..17633.821 rows=5000 loops=1) Foreign File: /home/pgsql/relation.csv Foreign File Size: 48500 Total runtime: 23213.909 ms (5 rows) Any comments and suggestions are welcomed. Thanks, Best regards, Etsuro Fujita file-fdw-pathkeys-0802.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I would like to propose single-column sortedness options and insert appropriate pathkeys into the ForeignPath node based on these information: sorted: Boolean option to specify whether data file is sorted by a column. key: Specifies the name of a column by which data file is sorted. Required when the above option is set to true. direction: Specifies the sort order: asc or desc. The default is asc. nulls: Specifies that nulls sort before or after non-nulls: first or last. first is the default when direction option is set desc. When direction option is not set desc, last is the default. For character-based columns, don't we need to know the collation? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning
Hi Kevin, I would like to propose single-column sortedness options and insert appropriate pathkeys into the ForeignPath node based on these information: For character-based columns, don't we need to know the collation? I forgot to add the collate option. I'll add it at the next version of the patch. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers