I'm hoping for some feedback on this - if the interface seems okay and the implementation otherwise fine, I can clean it up some by pulling out the code duplicated between queryGet/Dict/Result and queryNext().
On Mon, Jun 05, 2017 at 11:28:29AM -0500, Justin Pryzby wrote:
> We've run into the problem several times where our customers run a large but
> not unreasonable report returning 100k rows.  Most of the RAM is spent after
> .query(), and during .getresult(), converting many C arrays to Python objects.
>
> Here's a recent example:
>
> 2017-05-31 17:01:22 starting up (PID=29109)
> 2017-05-31 17:01:24 forked to PID 30320
> 2017-05-31 17:01:24 PID 30320 running -- report
> 2017-05-31 17:01:58 finished running QUERY (PID=30320)
> resource.struct_rusage(ru_utime=0.730888, ru_stime=0.897863, ru_maxrss=658708, ...
> finished running report; rows=304207; resource use:
> resource.struct_rusage(ru_utime=9.2036, ru_stime=4.015389, ru_maxrss=2179872, ...
> 2017-05-31 17:03:06 PID 30320 resource use:
> resource.struct_rusage(ru_utime=33.039977, ru_stime=4.33634, ru_maxrss=2202432, ...
>
> at .query(): used 658MB
> at .getresult(): used 2.1GB; 4 sec usr, 9 sec sys
> following our processing loop around .getresult(): used 2.1GB; 33 sec usr, 9 sec sys
>
> I wondered if it was possible to expose a Python iterator to conserve RAM
> relative to getresult(), and found #48:
> http://trac.pygresql.org:8000/pgtracker/ticket/48
>
> I think pygres should implement an iterator for DB() (or some member function)
> even if it doesn't use a cursor behind the scenes.
>
> While I was looking at the implementation, I found that a "query" object
> is a subset of a "source" object (the name of which is confusing to me, but it
> appears to be intended to be a cursor object?)
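To illustrate the idea before getting into the C: the eager-vs-lazy difference can be modeled in pure Python. This is only a sketch; convert_row is a hypothetical stand-in for the C-level cast of one libpq row, not actual PyGreSQL API:

```python
# Sketch only: models eager (.getresult()) vs lazy (iterator) row conversion.
# convert_row is a hypothetical stand-in for the C-level cast of one libpq row.

def convert_row(raw):
    return tuple(int(v) for v in raw)

def eager_rows(raw_rows):
    # .getresult() style: every row becomes a Python object up front,
    # so peak memory holds all converted rows at once
    return [convert_row(r) for r in raw_rows]

def lazy_rows(raw_rows):
    # iterator style: only one converted row is alive per next() call
    for r in raw_rows:
        yield convert_row(r)

raw = [("1",), ("2",), ("3",)]
assert eager_rows(raw) == [(1,), (2,), (3,)]
assert list(lazy_rows(raw)) == [(1,), (2,), (3,)]
```

Both produce the same rows; only the peak number of live Python objects differs, which is where the maxrss difference below comes from.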
>
> I did "#define queryObject sourceObject" and it worked fine :)
>
> In my tests, converting one row at a time from C structures to Python saves a
> good amount of RAM:
>
> pryzbyj@pryzbyj:~/src/pygres$ tail /tmp/tp[yz]
> ==> /tmp/tpy <==
> import pg
> for a in pg.DB().query('SELECT generate_series(1,9999999)'): print a
>
> ==> /tmp/tpz <==
> import pg
> for a in pg.DB().query('SELECT generate_series(1,9999999)').getresult(): print a
>
> pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpy | wc
>         Command being timed: "python /tmp/tpy"
>         User time (seconds): 29.65
>         System time (seconds): 0.65
>         Percent of CPU this job got: 85%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:35.30
>         Maximum resident set size (kbytes): 328540
>         Minor (reclaiming a frame) page faults: 67480
>         Voluntary context switches: 17306
>         Involuntary context switches: 7028
> 9999999 9999999 108888885
> pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpz | wc
>         Command being timed: "python /tmp/tpz"
>         User time (seconds): 22.78
>         System time (seconds): 1.85
>         Percent of CPU this job got: 83%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.40
>         Maximum resident set size (kbytes): 1281112
>         Major (requiring I/O) page faults: 0
>         Minor (reclaiming a frame) page faults: 287766
>         Voluntary context switches: 20765
>         Involuntary context switches: 4565
> 9999999 9999999 108888885
>
> Of course the 328MB RAM used by the first was almost entirely from .query(), and
> the 1.2GB RAM used by the second was the same ~328MB from .query() plus another
> ~900MB from .getresult()!
>
> One thing is that we'd currently have to be able to iterate over the results
> multiple times.  I implemented resetIter() for this, but I'm not sure if it's
> normal (?).  It seems like a good idea, since the result set is already loaded
> into libpq at that point.
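To pin down the resetIter() semantics I have in mind, here's a pure-Python model (FakeQuery is hypothetical; it just mirrors the current_row/max_row fields the patch adds to the C object):

```python
# Pure-Python model of the patched query object's iterator state.
# FakeQuery is hypothetical; current_row/max_row mirror the fields
# the patch adds in connQuery().
class FakeQuery:
    def __init__(self, rows):
        self._rows = rows            # stands in for the PGresult kept by libpq
        self.current_row = 0
        self.max_row = len(rows)

    def __iter__(self):              # tp_iter: the query object is its own iterator
        return self

    def __next__(self):              # tp_iternext: convert and return one row
        if self.current_row >= self.max_row:
            raise StopIteration
        row = self._rows[self.current_row]
        self.current_row += 1
        return row
    next = __next__                  # Python 2 spelling

    def resetIter(self):
        # rewind to the start; cheap, since the rows are already in libpq
        self.current_row = 0

q = FakeQuery([(1,), (2,)])
assert list(q) == [(1,), (2,)]
assert list(q) == []                 # exhausted, like any iterator
q.resetIter()
assert list(q) == [(1,), (2,)]       # resetIter() permits a second pass
```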
>
> On the other hand, it'd be a shame to introduce a new interface which ends up
> needing to change when pygres supports server-side cursors.
>
> Patch attached.
>
> Justin
> disclaimer - I am not a Python programmer
>
> Index: docs/contents/changelog.rst
> ===================================================================
> --- docs/contents/changelog.rst	(revision 901)
> +++ docs/contents/changelog.rst	(working copy)
> @@ -57,7 +57,7 @@ Version 5.0 (2016-03-20)
>    - The classic interface got two new methods get_as_list() and get_as_dict()
>      returning a database table as a Python list or dict. The amount of data
>      returned can be controlled with various parameters.
> -  - A method upsert() has been added to the DB wrapper class that utilitses
> +  - A method upsert() has been added to the DB wrapper class that utilizes
>      the "upsert" feature that is new in PostgreSQL 9.5. The new method nicely
>      complements the existing get/insert/update/delete() methods.
>    - When using insert/update/upsert(), you can now pass PostgreSQL arrays as
> @@ -346,7 +346,7 @@ Major improvements and clean-up in classic pg modu
>   - Fixes to quoting function
>   - Add checks for valid database connection to methods
>   - Improved namespace support, handle `search_path` correctly
> -- Removed old dust and unnessesary imports, added docstrings
> +- Removed old dust and unnecessary imports, added docstrings
>   - Internal sql statements as one-liners, smoothed out ugly code
>
>  Version 3.6.2 (2005-02-23)
> Index: pgmodule.c
> ===================================================================
> --- pgmodule.c	(revision 901)
> +++ pgmodule.c	(working copy)
> @@ -170,14 +170,8 @@ typedef struct
>  } noticeObject;
>  #define is_noticeObject(v) (PyType(v) == &noticeType)
>
> -typedef struct
> -{
> -	PyObject_HEAD
> -	connObject	*pgcnx;		/* parent connection object */
> -	PGresult	*result;	/* result content */
> -	int			encoding;	/* client encoding */
> -} queryObject;
> -#define is_queryObject(v) (PyType(v) == &queryType)
> +// XXX: should just use a source obj ??
> +#define queryObject sourceObject
>
>  #ifdef LARGE_OBJECTS
>  typedef struct
> @@ -2369,6 +2363,8 @@ connQuery(connObject *self, PyObject *args)
>  	/* stores result and returns object */
>  	Py_XINCREF(self);
>  	npgobj->pgcnx = self;
> +	npgobj->current_row = 0;
> +	npgobj->max_row = PQntuples(result);
>  	npgobj->result = result;
>  	npgobj->encoding = encoding;
>  	return (PyObject *) npgobj;
> @@ -4598,6 +4594,84 @@ queryFieldNumber(queryObject *self, PyObject *args
>  	return PyInt_FromLong(num);
>  }
>
> +PyObject* queryGetIter(queryObject *self)
> +{
> +	Py_INCREF(self);
> +	return (PyObject*)self;
> +}
> +
> +/* resets the iterator to the beginning of the result set */
> +static char queryResetIter__doc__[] =
> +"resetIter() -- reset iterator to beginning of returned result set";
> +
> +static PyObject *
> +queryResetIter(queryObject *self)
> +{
> +	// TODO: check "self" ?
> +	self->current_row = 0;
> +	Py_INCREF(Py_None);
> +	return Py_None;
> +}
> +
> +static PyObject *
> +queryNext(queryObject *self, PyObject *noargs)
> +{
> +	PyObject   *res = NULL;
> +	int			j, *col_types;
> +	int			n = PQnfields(self->result);
> +
> +	if (self->current_row >= self->max_row) {
> +		PyErr_SetNone(PyExc_StopIteration);
> +		return NULL;
> +	}
> +	if (!(res = PyTuple_New(n))) return NULL;
> +	if (!(col_types = get_col_types(self->result, n))) return NULL;
> +
> +	for (j = 0; j < n; ++j)
> +	{
> +		PyObject   *val;
> +		if (PQgetisnull(self->result, self->current_row, j))
> +		{
> +			Py_INCREF(Py_None);
> +			PyTuple_SET_ITEM(res, j, Py_None);
> +			continue;
> +		}
> +		char	   *s = PQgetvalue(self->result, self->current_row, j);
> +		int			type = col_types[j];
> +		int			encoding = self->encoding;
> +
> +		if (type & PYGRES_ARRAY)
> +			val = cast_array(s, PQgetlength(self->result, self->current_row, j),
> +				encoding, type, NULL, 0);
> +		else if (type == PYGRES_BYTEA)
> +			val = cast_bytea_text(s);
> +		else if (type == PYGRES_OTHER)
> +			val = cast_other(s,
> +				PQgetlength(self->result, self->current_row, j), encoding,
> +				PQftype(self->result, j), self->pgcnx->cast_hook);
> +		else if (type & PYGRES_TEXT)
> +			val = cast_sized_text(s, PQgetlength(self->result, self->current_row, j),
> +				encoding, type);
> +		else
> +			val = cast_unsized_simple(s, type);
> +
> +		if (!val)
> +		{
> +			Py_DECREF(res);
> +			res = NULL;
> +			goto exit;
> +		}
> +
> +		PyTuple_SET_ITEM(res, j, val);
> +	}
> +
> +exit:
> +	PyMem_Free(col_types);
> +	if (res) ++self->current_row;
> +	return res;
> +}
> +
>  /* retrieves last result */
>  static char queryGetResult__doc__[] =
>  "getresult() -- Get the result of a query\n\n"
> @@ -4918,6 +4992,8 @@ static PyTypeObject noticeType = {
>
>  /* query object methods */
>  static struct PyMethodDef queryMethods[] = {
> +	{"resetIter", (PyCFunction) queryResetIter, METH_NOARGS,
> +			queryResetIter__doc__},
>  	{"getresult", (PyCFunction) queryGetResult, METH_NOARGS,
>  			queryGetResult__doc__},
>  	{"dictresult", (PyCFunction) queryDictResult, METH_NOARGS,
> @@ -4957,17 +5033,18 @@ static PyTypeObject queryType = {
>  	PyObject_GenericGetAttr,	/* tp_getattro */
>  	0,							/* tp_setattro */
>  	0,							/* tp_as_buffer */
> -	Py_TPFLAGS_DEFAULT,			/* tp_flags */
> +	Py_TPFLAGS_DEFAULT|Py_TPFLAGS_HAVE_ITER,	/* tp_flags */
>  	0,							/* tp_doc */
>  	0,							/* tp_traverse */
>  	0,							/* tp_clear */
>  	0,							/* tp_richcompare */
>  	0,							/* tp_weaklistoffset */
> -	0,							/* tp_iter */
> -	0,							/* tp_iternext */
> -	queryMethods,				/* tp_methods */
> +	(getiterfunc)queryGetIter,	/* tp_iter */
> +	(iternextfunc)queryNext,	/* tp_iternext */
> +	queryMethods,				/* tp_methods */
>  };
>
> +
>  /* --------------------------------------------------------------------- */
>
>  /* MODULE FUNCTIONS */

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581
_______________________________________________
PyGreSQL mailing list
[email protected]
https://mail.vex.net/mailman/listinfo.cgi/pygresql
