On May 10, 2008, at 2:06 PM, Lisandro Dalcin wrote:
> Stefan, if (4) is doable without much effort, then I believe this is
> the right way.
>
> As you said, one of the points of string interning is saving memory.
> But there is also another very important benefit. Look at this from
> the Python 2.6 sources in stringobject.c:
>
> int
> _PyString_Eq(PyObject *o1, PyObject *o2)
> {
>     PyStringObject *a = (PyStringObject*) o1;
>     PyStringObject *b = (PyStringObject*) o2;
>     return Py_SIZE(a) == Py_SIZE(b)
>         && *a->ob_sval == *b->ob_sval
>         && memcmp(a->ob_sval, b->ob_sval, Py_SIZE(a)) == 0;
> }
>
>
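> Note that the line with '*a->ob_sval == *b->ob_sval' only compares the
> first character of each string, so it is a cheap early rejection before
> memcmp rather than a pointer comparison. The real fast path comes from
> interning: if o1 and o2 are the same (interned) string object, then a
> pointer comparison (such as the 'ep->me_key == key' test in
> lookdict_string() in dictobject.c) avoids the memcmp call entirely. And
> this is very, very important for making dictionary lookups faster when
> the keys are strings.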
I want to second this: we want to keep interned strings if at all
possible, for the above reason. Making everything a dictionary lookup
is one of the things that makes Python so dynamic, but it means that the
speed of dictionary lookups is extremely important (in fact, I would say
it is one of the primary bottlenecks of Python). Interning also has the
advantage that the lookup strings don't need to be re-allocated each
time they're needed.
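
To make the benefit concrete, here is a minimal sketch of an equality
check with an identity fast path in front of the byte-wise comparison.
This is not CPython's code: the sketch_str type and sketch_str_eq()
function are made up for illustration, and the struct is only a
simplified stand-in for a string object.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string object; NOT CPython's layout. */
typedef struct {
    size_t size;        /* length in bytes */
    const char *data;   /* character data */
} sketch_str;

/* Returns 1 if a and b hold equal strings, 0 otherwise. */
static int sketch_str_eq(const sketch_str *a, const sketch_str *b)
{
    /* Identity fast path: two references to the same (interned)
       object are equal without looking at any character data. */
    if (a == b)
        return 1;
    /* Cheap rejections: different length or different first byte. */
    if (a->size != b->size)
        return 0;
    if (a->size > 0 && a->data[0] != b->data[0])
        return 0;
    /* Slow path: full byte-wise comparison. */
    return memcmp(a->data, b->data, a->size) == 0;
}
```

If both keys of a lookup are interned, the first test succeeds and the
memcmp call is never reached. Two distinct objects with equal contents
still compare equal, but only after the slow path.
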
> On 5/10/08, Stefan Behnel <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I'm wondering how to continue the support for this feature given the
>> fact that identifiers are Unicode strings in Py3. We currently only
>> intern byte strings that look like Python identifiers, so in Py3, they
>> simply no longer look like identifiers, as they are not Unicode strings.
>>
>> I can see four ways to deal with this:
>>
>> 1) drop string interning completely
>>
>> 2) disable string interning in Py3 and use normally created byte strings
>> instead
>>
>> 3) keep separate sets of identifier-like byte strings and unicode strings
>> in the compiler and write them into the C file. Then, depending on the
>> Python version, either intern the byte strings or the unicode strings,
>> and create the other set as un-interned strings.
>>
>> 4) keep the information whether a string should be interned for all
>> strings we deal with (bytes and unicode), remove the intern tab and merge
>> it with the general string tab by adding an additional field "intern".
>> Then __Pyx_InitStrings() would create the strings differently depending
>> on the compile time Python version, i.e., it would intern Unicode
>> identifiers in Py3 and byte string identifiers in Py2, and create
>> everything else as normal strings.
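
A rough sketch of how the decision in option 4 could look, assuming a
merged table with an "intern" flag per entry. All names here
(sketch_tab_entry, sketch_choose, sketch_init_strings) are hypothetical
and not the existing Cython utility code. To keep the sketch
self-contained it only records which kind of creation would be chosen
instead of calling the Python C API, and it takes the major version as
a parameter where real code would test PY_MAJOR_VERSION at compile time.

```c
#include <stddef.h>

typedef enum { KIND_BYTES, KIND_UNICODE } sketch_kind;
typedef enum { CREATE_PLAIN, CREATE_INTERNED } sketch_action;

/* One entry of the merged string table (hypothetical layout). */
typedef struct {
    const char *s;      /* string data; NULL terminates the table */
    size_t n;           /* length in bytes */
    sketch_kind kind;   /* byte string or unicode string */
    int intern;         /* the additional "intern" field from option 4 */
} sketch_tab_entry;

/* Decide how one entry should be created for a given major version. */
static sketch_action sketch_choose(const sketch_tab_entry *e, int py_major)
{
    if (!e->intern)
        return CREATE_PLAIN;
    if (py_major >= 3)  /* Py3: identifiers are unicode strings */
        return e->kind == KIND_UNICODE ? CREATE_INTERNED : CREATE_PLAIN;
    /* Py2: identifiers are byte strings */
    return e->kind == KIND_BYTES ? CREATE_INTERNED : CREATE_PLAIN;
}

/* Walk the table, store the chosen action per entry in out[],
   and return the number of entries processed. */
static size_t sketch_init_strings(const sketch_tab_entry *tab,
                                  sketch_action *out, int py_major)
{
    size_t i;
    for (i = 0; tab[i].s != NULL; i++)
        out[i] = sketch_choose(&tab[i], py_major);
    return i;
}
```

With a single table, the compiler only has to emit the flag; which set
of strings actually gets interned is decided when the C file is built.
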
>>
>> Personally, I favour 4) - although I could live with 1) - but since I'm
>> not quite sure what the original intention of string interning was
>> (saving memory?), I'd like to hear other opinions first.
>>
>> Stefan
>> _______________________________________________
>> Cython-dev mailing list
>> [email protected]
>> http://codespeak.net/mailman/listinfo/cython-dev
>>
>
>
> --
> Lisandro Dalcín
> ---------------
> Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
> Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
> Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
> PTLC - Güemes 3450, (3000) Santa Fe, Argentina
> Tel/Fax: +54-(0)342-451.1594