On May 10, 2008, at 2:06 PM, Lisandro Dalcin wrote:
> Stefan, if (4) is doable without much effort, then I believe this is
> the right way.
>
> As you said, one of the points of string interning is saving memory.
> But there is also another very important benefit. Look at this, from
> the Python 2.6 sources in stringobject.c:
>
> int
> _PyString_Eq(PyObject *o1, PyObject *o2)
> {
>     PyStringObject *a = (PyStringObject*) o1;
>     PyStringObject *b = (PyStringObject*) o2;
>     return Py_SIZE(a) == Py_SIZE(b)
>         && *a->ob_sval == *b->ob_sval
>         && memcmp(a->ob_sval, b->ob_sval, Py_SIZE(a)) == 0;
> }
>
>
> As you can see, the line with '*a->ob_sval == *b->ob_sval' compares the
> first bytes of the two strings, a cheap check that rejects most unequal
> strings before the memcmp call. Interning goes one step further: if o1
> and o2 are the same (interned) string, a pointer comparison avoids the
> memcmp call altogether. This is very, very important in dictionary
> lookups, where it makes the operation faster when the keys are strings.

I want to second this: we want to keep interned strings if at all
possible, for the above reason. Making everything a dictionary lookup
is one of the things that makes Python so dynamic, but it also means
that the speed of dictionary lookups is extremely important (in fact,
I would say it is one of the primary bottlenecks of Python). Interning
also has the advantage that the lookup strings don't need to be
re-allocated each time they're needed.
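
To illustrate the lookup argument, here is a minimal sketch in plain C of
why interning helps. It is not CPython's code: the fixed-size pool and the
names intern_str() and key_eq() are made up for this example (CPython keeps
its interned strings in a dict). Once both keys come from the pool, equal
keys are the same pointer, so the comparison never reaches the byte-wise
fallback.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size intern pool, for illustration only. */
#define POOL_MAX 64
static const char *pool[POOL_MAX];
static size_t pool_len = 0;

/* Return the canonical copy of s, adding it to the pool on first use.
   Returns NULL if the pool is full or the allocation fails. */
const char *intern_str(const char *s)
{
    for (size_t i = 0; i < pool_len; i++) {
        if (strcmp(pool[i], s) == 0)
            return pool[i];
    }
    if (pool_len == POOL_MAX)
        return NULL;

    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, n);
    pool[pool_len++] = copy;
    return copy;
}

/* Key comparison with an identity check first and a byte-wise
   comparison only as the fallback. */
int key_eq(const char *a, const char *b)
{
    if (a == b)
        return 1;                 /* fast path: same interned string */
    return strcmp(a, b) == 0;     /* slow path: compare the contents */
}
```

Two separately allocated buffers holding "name" intern to one pointer, so
key_eq() returns on its first line for them; un-interned keys still compare
correctly, just through strcmp().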

> On 5/10/08, Stefan Behnel <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>>  I'm wondering how to continue the support for this feature given the
>>  fact that identifiers are Unicode strings in Py3. We currently only
>>  intern byte strings that look like Python identifiers, so in Py3, they
>>  simply no longer look like identifiers, as they are not Unicode strings.
>>
>>  I can see four ways to deal with this:
>>
>>  1) drop string interning completely
>>
>>  2) disable string interning in Py3 and use normally created byte
>>  strings instead
>>
>>  3) keep separate sets of identifier-like byte strings and unicode
>>  strings in the compiler and write them into the C file. Then,
>>  depending on the Python version, either intern the byte strings or
>>  the unicode strings, and create the other set as un-interned strings.
>>
>>  4) keep the information if a string should be interned for all
>>  strings we deal with (bytes and unicode), remove the intern tab and
>>  merge it with the general string tab by adding an additional field
>>  "intern". Then __Pyx_InitStrings() would create the strings
>>  differently depending on the compile time Python version, i.e., it
>>  would intern Unicode identifiers in Py3 and byte string identifiers
>>  in Py2, and create everything else as normal strings.
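
A rough sketch of how the merged table in 4) might look, in plain C so it
compiles without Python.h. Everything here is an assumption made for
illustration: the StringTabEntry layout, the CreateMode enum, and the
functions should_intern() and init_strings() are hypothetical, not the
actual Cython string tab or __Pyx_InitStrings(). PY_MAJOR_VERSION is the
real macro from Python.h, but it is only given a stand-in value below.

```c
#include <stddef.h>

#ifndef PY_MAJOR_VERSION
#define PY_MAJOR_VERSION 3   /* assumption: normally provided by Python.h */
#endif

typedef enum { CREATE_PLAIN, CREATE_INTERNED } CreateMode;

/* Hypothetical layout of one entry in the merged string table. */
typedef struct {
    const char *s;      /* string data emitted by the compiler */
    int is_unicode;     /* nonzero = unicode string, 0 = byte string */
    int intern;         /* nonzero = identifier-like, may be interned */
    CreateMode mode;    /* filled in by init_strings() */
} StringTabEntry;

/* Intern only the string type that identifiers have in the targeted
   Python version: unicode in Py3, byte strings in Py2. */
int should_intern(const StringTabEntry *e)
{
    int identifiers_are_unicode = (PY_MAJOR_VERSION >= 3);
    return e->intern && ((e->is_unicode != 0) == identifiers_are_unicode);
}

/* Walk the table up to the NULL sentinel and choose how each string is
   created. Returns the number of entries marked for interning. */
size_t init_strings(StringTabEntry *tab)
{
    size_t interned = 0;
    for (StringTabEntry *e = tab; e->s != NULL; e++) {
        if (should_intern(e)) {
            e->mode = CREATE_INTERNED;   /* real code: interning constructor */
            interned++;
        } else {
            e->mode = CREATE_PLAIN;      /* real code: normal constructor */
        }
    }
    return interned;
}
```

In real generated code the two branches would call into the C-API instead
of setting a flag, for instance PyUnicode_InternFromString() in Py3 or
PyString_InternFromString() in Py2 for the interned case.
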
>>
>>  Personally, I favour 4) - although I could live with 1) - but since
>>  I'm not quite sure what the original intention of string interning
>>  was (saving memory?), I'd like to hear other opinions first.
>>
>>  Stefan
>>  _______________________________________________
>>  Cython-dev mailing list
>>  [email protected]
>>  http://codespeak.net/mailman/listinfo/cython-dev
>>
>
>
> -- 
> Lisandro Dalcín
> ---------------
> Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
> Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
> Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
> PTLC - Güemes 3450, (3000) Santa Fe, Argentina
> Tel/Fax: +54-(0)342-451.1594

