We've run into a problem several times where our customers run a large but
not unreasonable report returning ~100k rows. Most of the RAM is spent after
.query(), during .getresult(), converting many C arrays into Python objects.
Here's a recent example:
2017-05-31 17:01:22 starting up (PID=29109)
2017-05-31 17:01:24 forked to PID 30320
2017-05-31 17:01:24 PID 30320 running -- report
2017-05-31 17:01:58 finished running QUERY (PID=30320)
resource.struct_rusage(ru_utime=0.730888, ru_stime=0.897863, ru_maxrss=658708,
...
finished running report; rows=304207; resource use:
resource.struct_rusage(ru_utime=9.2036, ru_stime=4.015389, ru_maxrss=2179872,...
2017-05-31 17:03:06 PID 30320 resource use:
resource.struct_rusage(ru_utime=33.039977, ru_stime=4.33634,
ru_maxrss=2202432,...
at .query(): used 658MB
at .getresult(): used 2.1GB; 9s usr, 4s sys
following our processing loop around .getresult(): 2.1GB; 33s usr, 4s sys
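The snapshots above come from resource.getrusage(). A minimal sketch of taking
such phase-by-phase measurements (the phase labels are illustrative; on Linux,
ru_maxrss is reported in kilobytes):

```python
import resource

def snapshot(label):
    """Print peak RSS and CPU time consumed by the current process so far."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    print("%s: maxrss=%dkB usr=%.2fs sys=%.2fs"
          % (label, ru.ru_maxrss, ru.ru_utime, ru.ru_stime))
    return ru

after_query = snapshot("after .query()")
# ... call .getresult() and run the processing loop here ...
after_result = snapshot("after .getresult() and processing")
```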
I wondered if it was possible to expose a Python iterator to conserve RAM
relative to getresult(), and found #48:
http://trac.pygresql.org:8000/pgtracker/ticket/48
I think PyGreSQL should implement an iterator for DB() (or some member
function), even if it doesn't use a cursor behind the scenes.
While I was looking at the implementation, I found that a "query" object is a
subset of a "source" object (whose name is confusing to me, but it appears to
be intended as a cursor object?)
I did "#define queryObject sourceObject" and it worked fine :)
In my tests, converting one row at a time from C structures to Python objects
saves a good amount of RAM:
pryzbyj@pryzbyj:~/src/pygres$ tail /tmp/tp[yz]
==> /tmp/tpy <==
import pg
for a in pg.DB().query('SELECT generate_series(1,9999999)'): print a
==> /tmp/tpz <==
import pg
for a in pg.DB().query('SELECT generate_series(1,9999999)').getresult(): print a
pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpy|wc
Command being timed: "python /tmp/tpy"
User time (seconds): 29.65
System time (seconds): 0.65
Percent of CPU this job got: 85%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:35.30
Maximum resident set size (kbytes): 328540
Minor (reclaiming a frame) page faults: 67480
Voluntary context switches: 17306
Involuntary context switches: 7028
9999999 9999999 108888885
pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpz|wc
Command being timed: "python /tmp/tpz"
User time (seconds): 22.78
System time (seconds): 1.85
Percent of CPU this job got: 83%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.40
Maximum resident set size (kbytes): 1281112
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 287766
Voluntary context switches: 20765
Involuntary context switches: 4565
9999999 9999999 108888885
Of course, the 328MB of RAM used by the first was almost entirely from
.query(), while the 1.2GB used by the second was the same ~328MB from .query()
plus another ~900MB from .getresult()!
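The same effect can be reproduced in pure Python, with no database involved:
materializing every converted row up front (getresult()-style) keeps all the
objects alive at once, while a generator (iterator-style) only ever holds the
current one. A toy illustration:

```python
def convert_all(raw_rows):
    # getresult()-style: convert every row before returning anything,
    # so every converted object is alive simultaneously
    return [int(r) for r in raw_rows]

def convert_lazily(raw_rows):
    # iterator-style: convert one row per next() call,
    # letting the caller discard each row before the next is built
    for r in raw_rows:
        yield int(r)

raw = [str(i) for i in range(10)]
eager = convert_all(raw)          # all 10 ints exist at once
total = sum(convert_lazily(raw))  # rows converted and discarded one at a time
```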
One complication is that we currently need to be able to iterate over the
results multiple times. I implemented resetIter() for this, though I'm not
sure whether that's conventional (?). It seems like a good idea, since the
result set is already loaded into libpq at that point.
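In Python terms, the intended behaviour is roughly the following (a sketch
only: RowIterator and its reset() are stand-ins for the patched query object
and the proposed resetIter(), not existing pg API):

```python
class RowIterator(object):
    """Iterate over an already-fetched result set, with rewind support;
    rewinding is cheap because the rows stay loaded in libpq anyway."""

    def __init__(self, rows):
        self._rows = rows
        self._current = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._current >= len(self._rows):
            raise StopIteration
        row = self._rows[self._current]
        self._current += 1
        return row

    next = __next__  # Python 2 spelling, matching the era of the patch

    def reset(self):
        # analogous to the proposed resetIter()
        self._current = 0

it = RowIterator([(1,), (2,), (3,)])
first_pass = list(it)
it.reset()
second_pass = list(it)
```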
On the other hand, it'd be a shame to introduce a new interface which ends up
needing to change when PyGreSQL supports server-side cursors.
Patch attached.
Justin
disclaimer: I am not a Python programmer
Index: docs/contents/changelog.rst
===================================================================
--- docs/contents/changelog.rst (revision 901)
+++ docs/contents/changelog.rst (working copy)
@@ -57,7 +57,7 @@ Version 5.0 (2016-03-20)
- The classic interface got two new methods get_as_list() and get_as_dict()
returning a database table as a Python list or dict. The amount of data
returned can be controlled with various parameters.
- - A method upsert() has been added to the DB wrapper class that utilitses
+ - A method upsert() has been added to the DB wrapper class that utilizes
the "upsert" feature that is new in PostgreSQL 9.5. The new method nicely
complements the existing get/insert/update/delete() methods.
- When using insert/update/upsert(), you can now pass PostgreSQL arrays as
@@ -346,7 +346,7 @@ Major improvements and clean-up in classic pg modu
- Fixes to quoting function
- Add checks for valid database connection to methods
- Improved namespace support, handle `search_path` correctly
-- Removed old dust and unnessesary imports, added docstrings
+- Removed old dust and unnecessary imports, added docstrings
- Internal sql statements as one-liners, smoothed out ugly code
Version 3.6.2 (2005-02-23)
Index: pgmodule.c
===================================================================
--- pgmodule.c (revision 901)
+++ pgmodule.c (working copy)
@@ -170,14 +170,8 @@ typedef struct
} noticeObject;
#define is_noticeObject(v) (PyType(v) == &noticeType)
-typedef struct
-{
- PyObject_HEAD
- connObject *pgcnx; /* parent connection object */
- PGresult *result; /* result content */
- int encoding; /* client encoding */
-} queryObject;
-#define is_queryObject(v) (PyType(v) == &queryType)
+/* XXX: should just use a source object? */
+#define queryObject sourceObject
#ifdef LARGE_OBJECTS
typedef struct
@@ -2369,6 +2363,8 @@ connQuery(connObject *self, PyObject *args)
/* stores result and returns object */
Py_XINCREF(self);
npgobj->pgcnx = self;
+ npgobj->current_row = 0;
+ npgobj->max_row = PQntuples(result);
npgobj->result = result;
npgobj->encoding = encoding;
return (PyObject *) npgobj;
@@ -4598,6 +4594,84 @@ queryFieldNumber(queryObject *self, PyObject *args
return PyInt_FromLong(num);
}
+PyObject* queryGetIter(queryObject *self)
+{
+ Py_INCREF(self);
+ return (PyObject*)self;
+}
+
+/* resets the iterator to the first row of the result set */
+static char queryResetIter__doc__[] =
+"resetIter() -- reset iterator to beginning of returned result set";
+
+static PyObject *
+queryResetIter(queryObject *self)
+{
+    self->current_row = 0;
+    Py_INCREF(Py_None);
+    return Py_None;
+}
+
+static PyObject *
+queryNext(queryObject *self)
+{
+    PyObject *res = NULL;
+    int j, *col_types;
+    int n = PQnfields(self->result);
+
+    if (self->current_row >= self->max_row) {
+        PyErr_SetNone(PyExc_StopIteration);
+        return NULL;
+    }
+    if (!(res = PyTuple_New(n))) return NULL;
+    if (!(col_types = get_col_types(self->result, n))) {
+        Py_DECREF(res);
+        return NULL;
+    }
+
+    for (j = 0; j < n; ++j)
+    {
+        PyObject *val;
+        char *s;
+        int type, encoding;
+
+        if (PQgetisnull(self->result, self->current_row, j))
+        {
+            Py_INCREF(Py_None);
+            PyTuple_SET_ITEM(res, j, Py_None);
+            continue;
+        }
+        s = PQgetvalue(self->result, self->current_row, j);
+        type = col_types[j];
+        encoding = self->encoding;
+
+        if (type & PYGRES_ARRAY)
+            val = cast_array(s,
+                PQgetlength(self->result, self->current_row, j),
+                encoding, type, NULL, 0);
+        else if (type == PYGRES_BYTEA)
+            val = cast_bytea_text(s);
+        else if (type == PYGRES_OTHER)
+            val = cast_other(s,
+                PQgetlength(self->result, self->current_row, j),
+                encoding, PQftype(self->result, j),
+                self->pgcnx->cast_hook);
+        else if (type & PYGRES_TEXT)
+            val = cast_sized_text(s,
+                PQgetlength(self->result, self->current_row, j),
+                encoding, type);
+        else
+            val = cast_unsized_simple(s, type);
+
+        if (!val)
+        {
+            Py_DECREF(res);
+            res = NULL;
+            goto exit;
+        }
+
+        PyTuple_SET_ITEM(res, j, val);
+    }
+
+exit:
+    PyMem_Free(col_types);
+    if (res) ++self->current_row;
+    return res;
+}
+
/* retrieves last result */
static char queryGetResult__doc__[] =
"getresult() -- Get the result of a query\n\n"
@@ -4918,6 +4992,8 @@ static PyTypeObject noticeType = {
/* query object methods */
static struct PyMethodDef queryMethods[] = {
+ {"resetIter", (PyCFunction) queryResetIter, METH_NOARGS,
+ queryResetIter__doc__},
{"getresult", (PyCFunction) queryGetResult, METH_NOARGS,
queryGetResult__doc__},
{"dictresult", (PyCFunction) queryDictResult, METH_NOARGS,
@@ -4957,17 +5033,18 @@ static PyTypeObject queryType = {
PyObject_GenericGetAttr,	/* tp_getattro */
0,				/* tp_setattro */
0,				/* tp_as_buffer */
-	Py_TPFLAGS_DEFAULT,	/* tp_flags */
+	Py_TPFLAGS_DEFAULT|Py_TPFLAGS_HAVE_ITER, /* tp_flags */
0,				/* tp_doc */
0,				/* tp_traverse */
0,				/* tp_clear */
0,				/* tp_richcompare */
0,				/* tp_weaklistoffset */
-	0,			/* tp_iter */
-	0,			/* tp_iternext */
-	queryMethods,		/* tp_methods */
+	(getiterfunc) queryGetIter,	/* tp_iter */
+	(iternextfunc) queryNext,	/* tp_iternext */
+	queryMethods,		/* tp_methods */
};
+
/* --------------------------------------------------------------------- */
/* MODULE FUNCTIONS */
_______________________________________________
PyGreSQL mailing list
[email protected]
https://mail.vex.net/mailman/listinfo.cgi/pygresql