We've run into a problem several times where our customers run a large but
not unreasonable report returning ~100k rows. Most of the RAM is spent after
.query(), during .getresult(), converting many C arrays into Python objects.
Here's a recent example:
2017-05-31 17:01:22 starting up (PID=29109)
2017-05-31 17:01:24 forked to PID 30320
2017-05-31 17:01:24 PID 30320 running -- report
2017-05-31 17:01:58 finished running QUERY (PID=30320)
resource.struct_rusage(ru_utime=0.730888, ru_stime=0.897863, ru_maxrss=658708,
...
finished running report; rows=304207; resource use:
resource.struct_rusage(ru_utime=9.2036, ru_stime=4.015389, ru_maxrss=2179872,...
2017-05-31 17:03:06 PID 30320 resource use:
resource.struct_rusage(ru_utime=33.039977, ru_stime=4.33634,
ru_maxrss=2202432,...
at .query(): used 658MB
at .getresult(): used 2.1GB; 9s usr, 4s sys
following our processing loop around .getresult(): 2.1GB; 33s usr, 4s sys
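The snapshots above come from resource.getrusage(). A minimal sketch of taking
such phase-by-phase measurements (the phase labels are illustrative; on Linux,
ru_maxrss is reported in kilobytes):

```python
import resource

def snapshot(label):
    """Print peak RSS and CPU time consumed by the current process so far."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    print("%s: maxrss=%dkB usr=%.2fs sys=%.2fs"
          % (label, ru.ru_maxrss, ru.ru_utime, ru.ru_stime))
    return ru

after_query = snapshot("after .query()")
# ... call .getresult() and run the processing loop here ...
after_result = snapshot("after .getresult() and processing")
```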
I wondered if it was possible to expose a Python iterator to conserve RAM
relative to getresult(), and found #48:
http://trac.pygresql.org:8000/pgtracker/ticket/48
I think PyGreSQL should implement an iterator for DB() (or some member
function), even if it doesn't use a cursor behind the scenes.
While I was looking at the implementation, I found that a "query" object is a
subset of a "source" object (whose name is confusing to me, but it appears to
be intended as a cursor object?)
I did "#define queryObject sourceObject" and it worked fine :)
In my tests, converting one row at a time from C structures to Python objects
saves a good amount of RAM:
pryzbyj@pryzbyj:~/src/pygres$ tail /tmp/tp[yz]
==> /tmp/tpy <==
import pg
for a in pg.DB().query('SELECT generate_series(1,9999999)'): print a
==> /tmp/tpz <==
import pg
for a in pg.DB().query('SELECT generate_series(1,9999999)').getresult(): print a
pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpy|wc
Command being timed: "python /tmp/tpy"
User time (seconds): 29.65
System time (seconds): 0.65
Percent of CPU this job got: 85%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:35.30
Maximum resident set size (kbytes): 328540
Minor (reclaiming a frame) page faults: 67480
Voluntary context switches: 17306
Involuntary context switches: 7028
9999999 9999999 108888885
pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpz|wc
Command being timed: "python /tmp/tpz"
User time (seconds): 22.78
System time (seconds): 1.85
Percent of CPU this job got: 83%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.40
Maximum resident set size (kbytes): 1281112
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 287766
Voluntary context switches: 20765
Involuntary context switches: 4565
9999999 9999999 108888885
Of course, the 328MB of RAM used by the first was almost entirely from
.query(), while the 1.2GB used by the second was the same ~328MB from .query()
plus another ~900MB from .getresult()!
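The same effect can be reproduced in pure Python, with no database involved:
materializing every converted row up front (getresult()-style) keeps all the
objects alive at once, while a generator (iterator-style) only ever holds the
current one. A toy illustration:

```python
def convert_all(raw_rows):
    # getresult()-style: convert every row before returning anything,
    # so every converted object is alive simultaneously
    return [int(r) for r in raw_rows]

def convert_lazily(raw_rows):
    # iterator-style: convert one row per next() call,
    # letting the caller discard each row before the next is built
    for r in raw_rows:
        yield int(r)

raw = [str(i) for i in range(10)]
eager = convert_all(raw)          # all 10 ints exist at once
total = sum(convert_lazily(raw))  # rows converted and discarded one at a time
```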
One complication is that we currently need to be able to iterate over the
results multiple times. I implemented resetIter() for this, though I'm not
sure whether that's conventional (?). It seems like a good idea, since the
result set is already loaded into libpq at that point.
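In Python terms, the intended behaviour is roughly the following (a sketch
only: RowIterator and its reset() are stand-ins for the patched query object
and the proposed resetIter(), not existing pg API):

```python
class RowIterator(object):
    """Iterate over an already-fetched result set, with rewind support;
    rewinding is cheap because the rows stay loaded in libpq anyway."""

    def __init__(self, rows):
        self._rows = rows
        self._current = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._current >= len(self._rows):
            raise StopIteration
        row = self._rows[self._current]
        self._current += 1
        return row

    next = __next__  # Python 2 spelling, matching the era of the patch

    def reset(self):
        # analogous to the proposed resetIter()
        self._current = 0

it = RowIterator([(1,), (2,), (3,)])
first_pass = list(it)
it.reset()
second_pass = list(it)
```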
On the other hand, it'd be a shame to introduce a new interface which ends up
needing to change when PyGreSQL supports server-side cursors.
Patch attached.
Justin
disclaimer: I am not a Python programmer
Index: docs/contents/changelog.rst
===================================================================
--- docs/contents/changelog.rst (revision 901)
+++ docs/contents/changelog.rst (working copy)
@@ -57,7 +57,7 @@ Version 5.0 (2016-03-20)
- The classic interface got two new methods get_as_list() and get_as_dict()
returning a database table as a Python list or dict. The amount of data
returned can be controlled with various parameters.
- - A method upsert() has been added to the DB wrapper class that utilitses
+ - A method upsert() has been added to the DB wrapper class that utilizes
the "upsert" feature that is new in PostgreSQL 9.5. The new method nicely
complements the existing get/insert/update/delete() methods.
- When using insert/update/upsert(), you can now pass PostgreSQL arrays as
@@ -346,7 +346,7 @@ Major improvements and clean-up in classic pg modu
- Fixes to quoting function
- Add checks for valid database connection to methods
- Improved namespace support, handle `search_path` correctly
-- Removed old dust and unnessesary imports, added docstrings
+- Removed old dust and unnecessary imports, added docstrings
- Internal sql statements as one-liners, smoothed out ugly code
Version 3.6.2 (2005-02-23)
Index: pgmodule.c
===================================================================
--- pgmodule.c (revision 901)
+++ pgmodule.c (working copy)
@@ -170,14 +170,8 @@ typedef struct
} noticeObject;
#define is_noticeObject(v) (PyType(v) == &noticeType)
-typedef struct
-{
- PyObject_HEAD
- connObject *pgcnx; /* parent connection object */
- PGresult *result; /* result content */
- int encoding; /* client encoding */
-} queryObject;
-#define is_queryObject(v) (PyType(v) == &queryType)
+/* XXX: should just use a source object? */
+#define queryObject sourceObject
#ifdef LARGE_OBJECTS
typedef struct
@@ -2369,6 +2363,8 @@ connQuery(connObject *self, PyObject *args)
/* stores result and returns object */
Py_XINCREF(self);
npgobj->pgcnx = self;
+ npgobj->current_row = 0;
+ npgobj->max_row = PQntuples(result);
npgobj->result = result;
npgobj->encoding = encoding;
return (PyObject *) npgobj;
@@ -4598,6 +4594,84 @@ queryFieldNumber(queryObject *self, PyObject *args
return PyInt_FromLong(num);
}
+PyObject* queryGetIter(queryObject *self)
+{
+ Py_INCREF(self);
+ return (PyObject*)self;
+}
+
+/* resets the iterator to the first row of the result set */
+static char queryResetIter__doc__[] =
+"resetIter() -- reset iterator to beginning of returned result set";
+
+static PyObject *
+queryResetIter(queryObject *self)
+{
+    self->current_row = 0;
+    Py_INCREF(Py_None);
+    return Py_None;
+}
+
+static PyObject *
+queryNext(queryObject *self)
+{
+    PyObject *res = NULL;
+    int j, *col_types;
+    int n = PQnfields(self->result);
+
+    if (self->current_row >= self->max_row) {
+        PyErr_SetNone(PyExc_StopIteration);
+        return NULL;
+    }
+    if (!(res = PyTuple_New(n))) return NULL;
+    if (!(col_types = get_col_types(self->result, n))) {
+        Py_DECREF(res);
+        return NULL;
+    }
+
+    for (j = 0; j < n; ++j)
+    {
+        PyObject *val;
+        char *s;
+        int type, encoding;
+
+        if (PQgetisnull(self->result, self->current_row, j))
+        {
+            Py_INCREF(Py_None);
+            PyTuple_SET_ITEM(res, j, Py_None);
+            continue;
+        }
+        s = PQgetvalue(self->result, self->current_row, j);
+        type = col_types[j];
+        encoding = self->encoding;
+
+        if (type & PYGRES_ARRAY)
+            val = cast_array(s,
+                PQgetlength(self->result, self->current_row, j),
+                encoding, type, NULL, 0);
+        else if (type == PYGRES_BYTEA)
+            val = cast_bytea_text(s);
+        else if (type == PYGRES_OTHER)
+            val = cast_other(s,
+                PQgetlength(self->result, self->current_row, j),
+                encoding, PQftype(self->result, j),
+                self->pgcnx->cast_hook);
+        else if (type & PYGRES_TEXT)
+            val = cast_sized_text(s,
+                PQgetlength(self->result, self->current_row, j),
+                encoding, type);
+        else
+            val = cast_unsized_simple(s, type);
+
+        if (!val)
+        {
+            Py_DECREF(res);
+            res = NULL;
+            goto exit;
+        }
+
+        PyTuple_SET_ITEM(res, j, val);
+    }
+
+exit:
+    PyMem_Free(col_types);
+    if (res) ++self->current_row;
+    return res;
+}
+
/* retrieves last result */
static char queryGetResult__doc__[] =
"getresult() -- Get the result of a query\n\n"
@@ -4918,6 +4992,8 @@ static PyTypeObject noticeType = {
/* query object methods */
static struct PyMethodDef queryMethods[] = {
+ {"resetIter", (PyCFunction) queryResetIter, METH_NOARGS,
+ queryResetIter__doc__},
{"getresult", (PyCFunction) queryGetResult, METH_NOARGS,
queryGetResult__doc__},
{"dictresult", (PyCFunction) queryDictResult, METH_NOARGS,
@@ -4957,17 +5033,18 @@ static PyTypeObject queryType = {
PyObject_GenericGetAttr,	/* tp_getattro */
0,				/* tp_setattro */
0,				/* tp_as_buffer */
-	Py_TPFLAGS_DEFAULT,	/* tp_flags */
+	Py_TPFLAGS_DEFAULT|Py_TPFLAGS_HAVE_ITER, /* tp_flags */
0,				/* tp_doc */
0,				/* tp_traverse */
0,				/* tp_clear */
0,				/* tp_richcompare */
0,				/* tp_weaklistoffset */
-	0,			/* tp_iter */
-	0,			/* tp_iternext */
-	queryMethods,		/* tp_methods */
+	(getiterfunc) queryGetIter,	/* tp_iter */
+	(iternextfunc) queryNext,	/* tp_iternext */
+	queryMethods,		/* tp_methods */
};
+
/* --------------------------------------------------------------------- */
/* MODULE FUNCTIONS */
_______________________________________________
PyGreSQL mailing list
[email protected]
https://mail.vex.net/mailman/listinfo.cgi/pygresql