Does the psql executable have any ability to do a "fetch many", using a 
server-side named cursor, when returning results? It appears to retrieve the 
entire result set into local memory before printing anything to standard out.
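(For reference, the manual equivalent of what I'm hoping psql can do 
automatically would be something along these lines, run inside a transaction; 
the cursor name and batch size are only illustrative:)

    begin;
    declare temp_cursor cursor for
        select [thirteen columns] from go_prediction_view_mouse gpv
        order by combined_score desc nulls last;
    fetch forward 10000 from temp_cursor;
    -- ...repeat the FETCH until it returns no rows...
    close temp_cursor;
    commit;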

Specifically, I've tried running the following command on my desktop, which 
returns about 70 million lines:
    echo "select [thirteen columns, all floating point and boolean types] from 
go_prediction_view_mouse gpv order by combined_score desc nulls last" | psql -U 
[username] -h [remote host] [database] > mouse_predictions.tab

The command's memory usage grows steadily until it reaches about 12 GB, at 
which point it stalls and there is significant disk activity (paging to disk, 
I assume). If I close a few large programs to free RAM, its memory usage 
immediately grows to fill the reclaimed space. After about 50 minutes, I 
killed it.

If I instead run the Python program below, which performs the same query but 
uses fetchmany() calls on a named server-side cursor to retrieve a few hundred 
thousand tuples at a time, the process stays under about 20 MB of memory 
throughout and finishes in about 35 minutes.

I know that the query used here could have been a COPY statement, which I 
assume would be better behaved, but I'm more concerned about the case in which 
the query is more complex.
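(Untested, but I believe the COPY version would look something like the 
following, since \copy accepts a query as of 8.2:)

    psql -U [username] -h [remote host] [database] -c \
        "\copy (select [thirteen columns] from go_prediction_view_mouse gpv \
        order by combined_score desc nulls last) to 'mouse_predictions.tab'"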

The local machine (OS X 10.6.2) is running Postgres 8.4.2; the server (Ubuntu 
4.x) is running Postgres 8.3.5.


------ Python source to use a named server-side cursor to dump a large number 
of rows to a file ------

import psycopg2
import psycopg2.extensions

#---Modify this section---#
species = 'mouse'
query = ('select go_term_ref, gene_remote_id, function_verified_exactly, '
         'function_verified_with_parent_go, function_verified_with_child_go, '
         'combined_score, obozinski_score, lee_score, mostafavi_score, '
         'guan_score, kim_score, joshi_score, tasan_score, tasan_revised_score, '
         'qi_score, leone_score '
         'from go_prediction_view_' + species + ' gpv '
         'order by combined_score desc nulls last')
outputFilePath = [*output file path*]
connectionString = [*connection string*]
queryBufferSize = 10000

def processRow(row):
    # Not important -- but it must return a sequence of strings for the
    # '\t'.join() below; this stand-in is a placeholder, not the real logic.
    return [str(value) for value in row]
#---End modify this section----#


#---Everything below should be generic---#
conn = psycopg2.connect(connectionString)
cur = conn.cursor('temp_dump_cursor')
cur.arraysize = queryBufferSize

def resultIter(cursor):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(queryBufferSize)
        if not results:
            break
        for result in results:
            yield result

with open(outputFilePath, 'w') as outputFile:
    print 'Starting ' + query + '...'
    cur.execute(query)
    i = 0
    for row in resultIter(cur):
        i += 1
        row = processRow(row)
        outputFile.write('\t'.join(row) + '\n')
        
        if i % queryBufferSize == 0:
            print 'Wrote ' + str(i) + ' lines.'

print 'finished. Total of ' + str(i) + ' lines.'
cur.close()
conn.close()