I'm hoping for some feedback on this - if the interface seems okay and the implementation otherwise fine, I can clean it up some by pulling out the code duplicated between queryGet/Dict/Result and queryNext().
On Mon, Jun 05, 2017 at 11:28:29AM -0500, Justin Pryzby wrote:
> We've run into the problem several times where our customers run a large but
> not unreasonable report returning 100k rows.  Most of the RAM is spent after
> .query(), and during .getresult(), converting many C arrays to Python objects.
>
> Here's a recent example:
>
> 2017-05-31 17:01:22 starting up (PID=29109)
> 2017-05-31 17:01:24 forked to PID 30320
> 2017-05-31 17:01:24 PID 30320 running -- report
> 2017-05-31 17:01:58 finished running QUERY (PID=30320)
> resource.struct_rusage(ru_utime=0.730888, ru_stime=0.897863, ru_maxrss=658708, ...
> finished running report; rows=304207; resource use:
> resource.struct_rusage(ru_utime=9.2036, ru_stime=4.015389, ru_maxrss=2179872, ...
> 2017-05-31 17:03:06 PID 30320 resource use:
> resource.struct_rusage(ru_utime=33.039977, ru_stime=4.33634, ru_maxrss=2202432, ...
>
> at .query(): used 658MB
> at .getresult(): used 2.1GB; 4 sec usr, 9 sec sys
> following our processing loop around .getresult(): used 2.1GB; 33 sec usr, 9 sec sys
>
> I wondered if it was possible to expose a Python iterator to conserve RAM
> relative to getresult(), and found #48:
> http://trac.pygresql.org:8000/pgtracker/ticket/48
>
> I think pygres should implement an iterator for DB() (or some member function)
> even if it doesn't use a cursor behind the scenes.
>
> While I was looking at the implementation, I found that a "query" object
> is a subset of a "source" object (the name of which is confusing to me, but it
> appears to be intended to be a cursor object?)
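To illustrate the idea before getting into the C: the eager-vs-lazy difference can be modeled in pure Python. This is only a sketch; convert_row is a hypothetical stand-in for the C-level cast of one libpq row, not actual PyGreSQL API:

```python
# Sketch only: models eager (.getresult()) vs lazy (iterator) row conversion.
# convert_row is a hypothetical stand-in for the C-level cast of one libpq row.

def convert_row(raw):
    return tuple(int(v) for v in raw)

def eager_rows(raw_rows):
    # .getresult() style: every row becomes a Python object up front,
    # so peak memory holds all converted rows at once
    return [convert_row(r) for r in raw_rows]

def lazy_rows(raw_rows):
    # iterator style: only one converted row is alive per next() call
    for r in raw_rows:
        yield convert_row(r)

raw = [("1",), ("2",), ("3",)]
assert eager_rows(raw) == [(1,), (2,), (3,)]
assert list(lazy_rows(raw)) == [(1,), (2,), (3,)]
```

Both produce the same rows; only the peak number of live Python objects differs, which is where the maxrss difference below comes from.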
>
> I did "#define queryObject sourceObject" and it worked fine :)
>
> In my tests, converting one row at a time from C structures to Python saves a
> good amount of RAM:
>
> pryzbyj@pryzbyj:~/src/pygres$ tail /tmp/tp[yz]
> ==> /tmp/tpy <==
> import pg
> for a in pg.DB().query('SELECT generate_series(1,9999999)'): print a
>
> ==> /tmp/tpz <==
> import pg
> for a in pg.DB().query('SELECT generate_series(1,9999999)').getresult(): print a
>
> pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpy | wc
>         Command being timed: "python /tmp/tpy"
>         User time (seconds): 29.65
>         System time (seconds): 0.65
>         Percent of CPU this job got: 85%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:35.30
>         Maximum resident set size (kbytes): 328540
>         Minor (reclaiming a frame) page faults: 67480
>         Voluntary context switches: 17306
>         Involuntary context switches: 7028
> 9999999 9999999 108888885
> pryzbyj@pryzbyj:~/src/pygres$ PYTHONPATH=build/lib.linux-x86_64-2.7 command time -v python /tmp/tpz | wc
>         Command being timed: "python /tmp/tpz"
>         User time (seconds): 22.78
>         System time (seconds): 1.85
>         Percent of CPU this job got: 83%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.40
>         Maximum resident set size (kbytes): 1281112
>         Major (requiring I/O) page faults: 0
>         Minor (reclaiming a frame) page faults: 287766
>         Voluntary context switches: 20765
>         Involuntary context switches: 4565
> 9999999 9999999 108888885
>
> Of course the 328MB RAM used by the first was almost entirely from .query(), and
> the 1.2GB RAM used by the second was the same ~328MB from .query() plus another
> ~900MB from .getresult()!
>
> One thing is that we'd currently have to be able to iterate over the results
> multiple times.  I implemented resetIter() for this, but I'm not sure if it's
> normal (?).  It seems like a good idea, since the result set is already loaded
> into libpq at that point.
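To pin down the resetIter() semantics I have in mind, here's a pure-Python model (FakeQuery is hypothetical; it just mirrors the current_row/max_row fields the patch adds to the C object):

```python
# Pure-Python model of the patched query object's iterator state.
# FakeQuery is hypothetical; current_row/max_row mirror the fields
# the patch adds in connQuery().
class FakeQuery:
    def __init__(self, rows):
        self._rows = rows            # stands in for the PGresult kept by libpq
        self.current_row = 0
        self.max_row = len(rows)

    def __iter__(self):              # tp_iter: the query object is its own iterator
        return self

    def __next__(self):              # tp_iternext: convert and return one row
        if self.current_row >= self.max_row:
            raise StopIteration
        row = self._rows[self.current_row]
        self.current_row += 1
        return row
    next = __next__                  # Python 2 spelling

    def resetIter(self):
        # rewind to the start; cheap, since the rows are already in libpq
        self.current_row = 0

q = FakeQuery([(1,), (2,)])
assert list(q) == [(1,), (2,)]
assert list(q) == []                 # exhausted, like any iterator
q.resetIter()
assert list(q) == [(1,), (2,)]       # resetIter() permits a second pass
```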
>
> On the other hand, it'd be a shame to introduce a new interface which ends up
> needing to change when pygres supports server-side cursors.
>
> Patch attached.
>
> Justin
> disclaimer - I am not a Python programmer
>
> Index: docs/contents/changelog.rst
> ===================================================================
> --- docs/contents/changelog.rst	(revision 901)
> +++ docs/contents/changelog.rst	(working copy)
> @@ -57,7 +57,7 @@ Version 5.0 (2016-03-20)
>    - The classic interface got two new methods get_as_list() and get_as_dict()
>      returning a database table as a Python list or dict. The amount of data
>      returned can be controlled with various parameters.
> -  - A method upsert() has been added to the DB wrapper class that utilitses
> +  - A method upsert() has been added to the DB wrapper class that utilizes
>      the "upsert" feature that is new in PostgreSQL 9.5. The new method nicely
>      complements the existing get/insert/update/delete() methods.
>    - When using insert/update/upsert(), you can now pass PostgreSQL arrays as
> @@ -346,7 +346,7 @@ Major improvements and clean-up in classic pg modu
>   - Fixes to quoting function
>   - Add checks for valid database connection to methods
>   - Improved namespace support, handle `search_path` correctly
> -- Removed old dust and unnessesary imports, added docstrings
> +- Removed old dust and unnecessary imports, added docstrings
>   - Internal sql statements as one-liners, smoothed out ugly code
>
>  Version 3.6.2 (2005-02-23)
> Index: pgmodule.c
> ===================================================================
> --- pgmodule.c	(revision 901)
> +++ pgmodule.c	(working copy)
> @@ -170,14 +170,8 @@ typedef struct
>  } noticeObject;
>  #define is_noticeObject(v) (PyType(v) == &noticeType)
>
> -typedef struct
> -{
> -	PyObject_HEAD
> -	connObject	*pgcnx;		/* parent connection object */
> -	PGresult	*result;	/* result content */
> -	int			encoding;	/* client encoding */
> -} queryObject;
> -#define is_queryObject(v) (PyType(v) == &queryType)
> +// XXX: should just use a source obj ??
> +#define queryObject sourceObject
>
>  #ifdef LARGE_OBJECTS
>  typedef struct
> @@ -2369,6 +2363,8 @@ connQuery(connObject *self, PyObject *args)
>  	/* stores result and returns object */
>  	Py_XINCREF(self);
>  	npgobj->pgcnx = self;
> +	npgobj->current_row = 0;
> +	npgobj->max_row = PQntuples(result);
>  	npgobj->result = result;
>  	npgobj->encoding = encoding;
>  	return (PyObject *) npgobj;
> @@ -4598,6 +4594,84 @@ queryFieldNumber(queryObject *self, PyObject *args
>  	return PyInt_FromLong(num);
>  }
>
> +PyObject* queryGetIter(queryObject *self)
> +{
> +	Py_INCREF(self);
> +	return (PyObject*)self;
> +}
> +
> +/* resets the iterator to the beginning of the result set */
> +static char queryResetIter__doc__[] =
> +"resetIter() -- reset iterator to beginning of returned result set";
> +
> +static PyObject *
> +queryResetIter(queryObject *self)
> +{
> +	// TODO: check "self" ?
> +	self->current_row = 0;
> +	Py_INCREF(Py_None);
> +	return Py_None;
> +}
> +
> +static PyObject *
> +queryNext(queryObject *self, PyObject *noargs)
> +{
> +	PyObject   *res = NULL;
> +	int			j, *col_types;
> +	int			n = PQnfields(self->result);
> +
> +	if (self->current_row >= self->max_row) {
> +		PyErr_SetNone(PyExc_StopIteration);
> +		return NULL;
> +	}
> +	if (!(res = PyTuple_New(n))) return NULL;
> +	if (!(col_types = get_col_types(self->result, n))) return NULL;
> +
> +	for (j = 0; j < n; ++j)
> +	{
> +		PyObject   *val;
> +		if (PQgetisnull(self->result, self->current_row, j))
> +		{
> +			Py_INCREF(Py_None);
> +			PyTuple_SET_ITEM(res, j, Py_None);
> +			continue;
> +		}
> +		char	   *s = PQgetvalue(self->result, self->current_row, j);
> +		int			type = col_types[j];
> +		int			encoding = self->encoding;
> +
> +		if (type & PYGRES_ARRAY)
> +			val = cast_array(s, PQgetlength(self->result, self->current_row, j),
> +				encoding, type, NULL, 0);
> +		else if (type == PYGRES_BYTEA)
> +			val = cast_bytea_text(s);
> +		else if (type == PYGRES_OTHER)
> +			val = cast_other(s,
> +				PQgetlength(self->result, self->current_row, j), encoding,
> +				PQftype(self->result, j), self->pgcnx->cast_hook);
> +		else if (type & PYGRES_TEXT)
> +			val = cast_sized_text(s, PQgetlength(self->result, self->current_row, j),
> +				encoding, type);
> +		else
> +			val = cast_unsized_simple(s, type);
> +
> +		if (!val)
> +		{
> +			Py_DECREF(res);
> +			res = NULL;
> +			goto exit;
> +		}
> +
> +		PyTuple_SET_ITEM(res, j, val);
> +	}
> +
> +exit:
> +	PyMem_Free(col_types);
> +	if (res) ++self->current_row;
> +	return res;
> +}
> +
>  /* retrieves last result */
>  static char queryGetResult__doc__[] =
>  "getresult() -- Get the result of a query\n\n"
> @@ -4918,6 +4992,8 @@ static PyTypeObject noticeType = {
>
>  /* query object methods */
>  static struct PyMethodDef queryMethods[] = {
> +	{"resetIter", (PyCFunction) queryResetIter, METH_NOARGS,
> +			queryResetIter__doc__},
>  	{"getresult", (PyCFunction) queryGetResult, METH_NOARGS,
>  			queryGetResult__doc__},
>  	{"dictresult", (PyCFunction) queryDictResult, METH_NOARGS,
> @@ -4957,17 +5033,18 @@ static PyTypeObject queryType = {
>  	PyObject_GenericGetAttr,	/* tp_getattro */
>  	0,							/* tp_setattro */
>  	0,							/* tp_as_buffer */
> -	Py_TPFLAGS_DEFAULT,			/* tp_flags */
> +	Py_TPFLAGS_DEFAULT|Py_TPFLAGS_HAVE_ITER,	/* tp_flags */
>  	0,							/* tp_doc */
>  	0,							/* tp_traverse */
>  	0,							/* tp_clear */
>  	0,							/* tp_richcompare */
>  	0,							/* tp_weaklistoffset */
> -	0,							/* tp_iter */
> -	0,							/* tp_iternext */
> -	queryMethods,				/* tp_methods */
> +	(getiterfunc)queryGetIter,	/* tp_iter */
> +	(iternextfunc)queryNext,	/* tp_iternext */
> +	queryMethods,				/* tp_methods */
>  };
>
> +
>  /* --------------------------------------------------------------------- */
>
>  /* MODULE FUNCTIONS */

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581
_______________________________________________
PyGreSQL mailing list
[email protected]
https://mail.vex.net/mailman/listinfo.cgi/pygresql
