Re: [Cython] String interning and Python 3

Lisandro Dalcin Sat, 10 May 2008 14:07:00 -0700

Stefan, if (4) is doable without much effort, then I believe it is
the right way.


As you said, one of the points of string interning is saving memory.
But there is another very important benefit. Look at this function
from the Python 2.6 sources, in stringobject.c:

int
_PyString_Eq(PyObject *o1, PyObject *o2)
{
        PyStringObject *a = (PyStringObject*) o1;
        PyStringObject *b = (PyStringObject*) o2;
        return Py_SIZE(a) == Py_SIZE(b)
          && *a->ob_sval == *b->ob_sval
          && memcmp(a->ob_sval, b->ob_sval, Py_SIZE(a)) == 0;
}


As you can see, the line with '*a->ob_sval == *b->ob_sval' compares only
the first character, which gives a cheap early exit for strings that
differ. The real gain from interning comes one step earlier: the dict
lookup first compares the key pointers, so if o1 and o2 are the same
(interned) string object, the memcmp call is avoided entirely. This is
very important for making dictionary lookups fast when the keys are
strings.
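
To illustrate the point about dictionary lookups, here is a minimal sketch
in plain C that needs no Python headers. The names toy_str, toy_str_eq and
toy_key_match are made up for illustration and are not CPython API. The
identity check is written as a separate step in front of the content
comparison, which is roughly how I understand the string-key dict lookup to
behave.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PyStringObject: a length plus character data. */
typedef struct {
    size_t size;
    const char *sval;
} toy_str;

/* Same shape as _PyString_Eq: compare length, then first byte, then memcmp. */
static int toy_str_eq(const toy_str *a, const toy_str *b)
{
    return a->size == b->size
        && *a->sval == *b->sval
        && memcmp(a->sval, b->sval, a->size) == 0;
}

/* Key match as a dict lookup might do it: identity first, content second. */
static int toy_key_match(const toy_str *stored, const toy_str *key)
{
    if (stored == key)
        return 1;   /* same (interned) object: no memcmp needed */
    return toy_str_eq(stored, key);
}
```

With interned keys the first branch of toy_key_match is the common case, so
the cost of a successful match no longer depends on the string length.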



On 5/10/08, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> Hi,
>
>  I'm wondering how to continue the support for this feature given the fact
>  that identifiers are Unicode strings in Py3. We currently only intern byte
>  strings that look like Python identifiers, so in Py3, they simply no longer
>  look like identifiers, as they are not Unicode strings.
>
>  I can see four ways how to deal with this:
>
>  1) drop string interning completely
>
>  2) disable string interning in Py3 and use normally created byte strings
>  instead
>
>  3) keep separate sets of identifier-like byte strings and unicode strings in
>  the compiler and write them into the C file. Then, depending on the Python
>  version, either intern the byte strings or the unicode strings, and create
>  the other set as un-interned strings.
>
>  4) keep the information if a string should be interned for all strings we
>  deal with (bytes and unicode), remove the intern tab and merge it with the
>  general string tab by adding an additional field "intern". Then
>  __Pyx_InitStrings() would create the strings differently depending on the
>  compile time Python version, i.e., it would intern Unicode identifiers in
>  Py3 and byte string identifiers in Py2, and create everything else as
>  normal strings.
>
>  Personally, I favour 4) - although I could live with 1) - but since I'm not
>  quite sure what the original intention of string interning was (saving
>  memory?), I'd like to hear other opinions first.
>
>  Stefan
>  _______________________________________________
>  Cython-dev mailing list
>  [email protected]
>  http://codespeak.net/mailman/listinfo/cython-dev
>
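
For what it is worth, the decision described in option 4 could be sketched
roughly like this. It is only an illustration under my own assumptions: the
struct layout, the names toy_string_tab_entry and toy_decide, and the
py_major parameter are hypothetical and not the actual Cython string table.
Real generated code would presumably branch on PY_MAJOR_VERSION at compile
time instead of taking the version as an argument; it is a parameter here so
the sketch runs without Python headers.

```c
#include <stddef.h>

/* Hypothetical merged string table entry; field names are illustrative. */
typedef struct {
    const char *s;     /* string data */
    size_t n;          /* length in bytes */
    char is_unicode;   /* 1 = unicode literal, 0 = byte string */
    char intern;       /* 1 = looks like an identifier, candidate for interning */
} toy_string_tab_entry;

enum toy_action { TOY_CREATE_PLAIN, TOY_CREATE_INTERNED };

/* Decide how an entry should be created for a given Python major version. */
static enum toy_action toy_decide(const toy_string_tab_entry *e, int py_major)
{
    if (!e->intern)
        return TOY_CREATE_PLAIN;
    if (py_major >= 3)   /* Py3: identifiers are unicode strings */
        return e->is_unicode ? TOY_CREATE_INTERNED : TOY_CREATE_PLAIN;
    /* Py2: identifiers are byte strings */
    return e->is_unicode ? TOY_CREATE_PLAIN : TOY_CREATE_INTERNED;
}
```

The point of the single table is that the init function needs only one loop,
with the per-entry flag and the Python version selecting the creation path.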


-- 
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
