Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-07 Thread Robert Haas
On Tue, Aug 7, 2012 at 2:02 AM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 I think that optimizations like this are going to be essential for
 things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
 semantic issues, we're just not going to be able to get around it.
 There will even be people who want SELECT * FROM ft ORDER BY 1 to
 order by the remote side's notion of ordering rather than ours,
 despite the fact that the remote side has some insane-by-PG-standards
 definition of ordering.  People are going to find ways to do that kind
 of thing whether we condone it or not, so we might as well start
 thinking now about how we're going to live with it.  But that doesn't
 answer the question of whether or not we ought to support it for
 file_fdw in particular, which seems like a more arguable point.

 For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the 
 key
 column is sorted in the specified way at the execution phase ie, at the 
 (first)
 scan of a data file, only when pathkeys are set, and (2) to abort the
 transaction if it detects the data file is not sorted.

That seems like an even worse idea.  People who want to access data
repeatedly should load it into tables.

Mind you, if you want to publish a version of file_fdw on PGXN that
does this, that's fine with me.  But I don't think it belongs in core,
at least not without a lot more evidence that there is a real demand
for this than we have so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-06 Thread Robert Haas
On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 I think file_fdw is useful for managing log files such as PG CSV logs.  Since
 often, such files are sorted by timestamp, I think the patch can improve the
 performance of log analysis, though I have to admit my demonstration was not
 realistic.

Hmm, I guess I could buy that as a plausible use case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp wrote:
 I think file_fdw is useful for managing log files such as PG CSV logs.  Since
 often, such files are sorted by timestamp, I think the patch can improve the
 performance of log analysis, though I have to admit my demonstration was not
 realistic.

 Hmm, I guess I could buy that as a plausible use case.

In the particular case of PG log files, I'd bet good money against them
being *exactly* sorted by timestamp.  Clock skew between backends, or
varying amounts of time to construct and send messages, will result in
small inconsistencies.  This would generally not matter, until the
planner relied on the claim of sortedness for something like a mergejoin
... and then it would matter a lot.

In general I'm quite suspicious of the idea of believing that externally
supplied data is sorted in exactly the way that PG thinks it should
sort.  If we implement this you can bet that people will screw up, for
instance by using the wrong locale/collation to sort text data.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-06 Thread Robert Haas
On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp wrote:
 I think file_fdw is useful for managing log files such as PG CSV logs.  
 Since
 often, such files are sorted by timestamp, I think the patch can improve the
 performance of log analysis, though I have to admit my demonstration was not
 realistic.

 Hmm, I guess I could buy that as a plausible use case.

 In the particular case of PG log files, I'd bet good money against them
 being *exactly* sorted by timestamp.  Clock skew between backends, or
 varying amounts of time to construct and send messages, will result in
 small inconsistencies.  This would generally not matter, until the
 planner relied on the claim of sortedness for something like a mergejoin
 ... and then it would matter a lot.

Hmm, true.

 In general I'm quite suspicious of the idea of believing that externally
 supplied data is sorted in exactly the way that PG thinks it should
 sort.  If we implement this you can bet that people will screw up, for
 instance by using the wrong locale/collation to sort text data.

I think that optimizations like this are going to be essential for
things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
semantic issues, we're just not going to be able to get around it.
There will even be people who want SELECT * FROM ft ORDER BY 1 to
order by the remote side's notion of ordering rather than ours,
despite the fact that the remote side has some insane-by-PG-standards
definition of ordering.  People are going to find ways to do that kind
of thing whether we condone it or not, so we might as well start
thinking now about how we're going to live with it.  But that doesn't
answer the question of whether or not we ought to support it for
file_fdw in particular, which seems like a more arguable point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-06 Thread Etsuro Fujita
 From: Robert Haas [mailto:robertmh...@gmail.com]

 On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
  fujita.ets...@lab.ntt.co.jp wrote:
  I think file_fdw is useful for managing log files such as PG CSV logs.
Since
  often, such files are sorted by timestamp, I think the patch can improve
 the
  performance of log analysis, though I have to admit my demonstration was
 not
  realistic.
 
  Hmm, I guess I could buy that as a plausible use case.
 
  In the particular case of PG log files, I'd bet good money against them
  being *exactly* sorted by timestamp.  Clock skew between backends, or
  varying amounts of time to construct and send messages, will result in
  small inconsistencies.  This would generally not matter, until the
  planner relied on the claim of sortedness for something like a mergejoin
  ... and then it would matter a lot.
 
 Hmm, true.
 
  In general I'm quite suspicious of the idea of believing that externally
  supplied data is sorted in exactly the way that PG thinks it should
  sort.  If we implement this you can bet that people will screw up, for
  instance by using the wrong locale/collation to sort text data.
 
 I think that optimizations like this are going to be essential for
 things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
 semantic issues, we're just not going to be able to get around it.
 There will even be people who want SELECT * FROM ft ORDER BY 1 to
 order by the remote side's notion of ordering rather than ours,
 despite the fact that the remote side has some insane-by-PG-standards
 definition of ordering.  People are going to find ways to do that kind
 of thing whether we condone it or not, so we might as well start
 thinking now about how we're going to live with it.  But that doesn't
 answer the question of whether or not we ought to support it for
 file_fdw in particular, which seems like a more arguable point.

For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key
column is sorted in the specified way at the execution phase ie, at the (first)
scan of a data file, only when pathkeys are set, and (2) to abort the
transaction if it detects the data file is not sorted.

Thanks,

Best regards,
Etsuro Fujita


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-05 Thread Etsuro Fujita
Hi Robert,

 From: Robert Haas [mailto:robertmh...@gmail.com]

 On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp wrote:
  The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:
 
  /*
   * If data file was sorted, and we knew it somehow, we could insert
   * appropriate pathkeys into the ForeignPath node to tell the planner
   * that.
   */
 
  To do this, I would like to propose new generic options for a file_fdw
foreign
  table to specify the sortedness of a data file.  While it is best to allow
 to
  specify the sortedness on multiple columns, the current interface for the
  generic options dose not seems to be suitable for doing it.  As a
compromise,
 I
  would like to propose single-column sortedness options and insert
appropriate
  pathkeys into the ForeignPath node based on these information:
 
 I am not sure it is a good idea to complicate file_fdw with frammishes
 of marginal utility.  I guess I tend to view things like file_fdw as a
 mechanism for getting the data into the database, not necessarily
 something that you actually want to keep your data in permanently and
 run complex queries against.

I think file_fdw is useful for managing log files such as PG CSV logs.  Since
often, such files are sorted by timestamp, I think the patch can improve the
performance of log analysis, though I have to admit my demonstration was not
realistic.

 It seems like that's the direction we're
 headed in here - statistics, indexing, etc.  I am all in favor of
 having some kind of pluggable storage engine as an alternative to our
 heap, but I'm not sure a flat-file is a good choice.

As you pointed out, I would like to allow indexing to be done for CSV foreign
tables, but that is another problem.  The submitted patch or the above comment
is not something toward indexing, so to say, an optimization of the current
file_fdw module.

Thanks,

Best regards,
Etsuro Fujita


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-03 Thread Robert Haas
On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:

 /*
  * If data file was sorted, and we knew it somehow, we could insert
  * appropriate pathkeys into the ForeignPath node to tell the planner
  * that.
  */

 To do this, I would like to propose new generic options for a file_fdw foreign
 table to specify the sortedness of a data file.  While it is best to allow to
 specify the sortedness on multiple columns, the current interface for the
 generic options dose not seems to be suitable for doing it.  As a compromise, 
 I
 would like to propose single-column sortedness options and insert appropriate
 pathkeys into the ForeignPath node based on these information:

I am not sure it is a good idea to complicate file_fdw with frammishes
of marginal utility.  I guess I tend to view things like file_fdw as a
mechanism for getting the data into the database, not necessarily
something that you actually want to keep your data in permanently and
run complex queries against.  It seems like that's the direction we're
headed in here - statistics, indexing, etc.  I am all in favor of
having some kind of pluggable storage engine as an alternative to our
heap, but I'm not sure a flat-file is a good choice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-02 Thread Etsuro Fujita
The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:

/*
 * If data file was sorted, and we knew it somehow, we could insert
 * appropriate pathkeys into the ForeignPath node to tell the planner
 * that.
 */

To do this, I would like to propose new generic options for a file_fdw foreign
table to specify the sortedness of a data file.  While it is best to allow to
specify the sortedness on multiple columns, the current interface for the
generic options dose not seems to be suitable for doing it.  As a compromise, I
would like to propose single-column sortedness options and insert appropriate
pathkeys into the ForeignPath node based on these information:

 sorted: Boolean option to specify whether data file is sorted by a column.
 key: Specifies the name of a column by which data file is sorted.  Required
when the above option is set to true.
 direction: Specifies the sort order: asc or desc.  The default is asc.
 nulls: Specifies that nulls sort before or after non-nulls: first or last.
first is the default when direction option is set desc.  When direction option
is not set desc, last is the default.

Attached is a WIP patch implementing this feature.  I would like to demonstrate
the usefulness of the patch.  Experimental results are shown below.  Here, data
in /home/pgsql/relation.csv is sorted by aid in ascending order.

postgres=# CREATE EXTENSION file_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER fs FOREIGN DATA WRAPPER file_fdw;
CREATE SERVER
postgres=# CREATE FOREIGN TABLE ft (aid INTEGER, bid INTEGER) SERVER fs OPTIONS
(filename '/home/pgsql/relation.csv', format 'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE ft;
ANALYZE
postgres=# SELECT count(*) FROM ft;
  count
--
 5000
(1 row)

postgres=# SELECT count(DISTINCT aid) FROM ft;
  count
-
 100
(1 row)

postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid;
 QUERY PLAN



 Unique  (cost=14187375.19..14437375.19 rows=1092929 width=4) (actual time=48952
.602..62788.934 rows=100 loops=1)
   -  Sort  (cost=14187375.19..14312375.19 rows=5000 width=4) (actual time=
48952.601..56635.448 rows=5000 loops=1)
 Sort Key: aid
 Sort Method: external sort  Disk: 684272kB
 -  Foreign Scan on ft  (cost=0.00..5059137.00 rows=5000 width=4) (
actual time=0.073..18324.062 rows=5000 loops=1)
   Foreign File: /home/pgsql/relation.csv
   Foreign File Size: 48500
 Total runtime: 63019.868 ms
(8 rows)

postgres=# ALTER FOREIGN TABLE ft OPTIONS ( ADD  sorted 'true', key 'aid' );
ALTER FOREIGN TABLE
postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid;
  QUERY PLAN


--
 Unique  (cost=0.00..5184137.00 rows=1092929 width=4) (actual time=0.074..23124.
195 rows=100 loops=1)
   -  Foreign Scan on ft  (cost=0.00..5059137.00 rows=5000 width=4) (actual
 time=0.070..17633.821 rows=5000 loops=1)
 Foreign File: /home/pgsql/relation.csv
 Foreign File Size: 48500
 Total runtime: 23213.909 ms
(5 rows)

Any comments and suggestions are welcomed.

Thanks,

Best regards,
Etsuro Fujita


file-fdw-pathkeys-0802.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-02 Thread Kevin Grittner
Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote:
 
 I would like to propose single-column sortedness options and
 insert appropriate pathkeys into the ForeignPath node based on
 these information:
 
  sorted: Boolean option to specify whether data file is sorted by
  a column.
  key: Specifies the name of a column by which data file is sorted.
  Required when the above option is set to true.
  direction: Specifies the sort order: asc or desc.  The default is
  asc.
  nulls: Specifies that nulls sort before or after non-nulls: first
  or last.  first is the default when direction option is set
  desc.  When direction option is not set desc, last is the
  default.
 
For character-based columns, don't we need to know the collation?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Use sortedness of CSV foreign tables for query planning

2012-08-02 Thread Etsuro Fujita
Hi Kevin,

  I would like to propose single-column sortedness options and
  insert appropriate pathkeys into the ForeignPath node based on
  these information:

 For character-based columns, don't we need to know the collation?

I forgot to add the collate option.  I'll add it at the next version of the
patch.

Thanks,

Best regards,
Etsuro Fujita


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers