Vitja Makarov, 06.12.2010 10:56:
> 2010/12/6 Stefan Behnel:
>> Vitja Makarov, 06.12.2010 09:56:
>>> 2010/12/6 Stefan Behnel:
>>>> Vitja Makarov, 05.12.2010 20:12:
>>>>> About temps:
>>>>> now temps are allocated in closure using declare_var() and it works fine.
>>>>
>>>> I think that could seriously slow down things inside of the generator,
>>>> though. It means that it needs to jump through the closure indirection for
>>>> every little value that it keeps around temporarily, including lots of
>>>> intermediate C results. Also, I would expect that a lot more temps are used
>>>> inside of the generator body than what actually needs to be kept alive
>>>> during yields, so a single yield inside of a lengthy generator body could
>>>> have a tremendous impact on the overall runtime and closure size.
>>>>
>>>> If you only store away the live temp values before a yield, you don't even
>>>> need to do any reference counting for them. It's just a quick bunch of C
>>>> assignments on yield and resume. That's about the smallest impact you can 
>>>> get.
>>>>
>>>> BTW, if you change the object struct of the generator closure anyway, maybe
>>>> a dynamically sized PyVarObject would be a way to incorporate the temps?
>>>
>>> I don't understand why do you think about performance problems?
>>>
>>> It seems to me that if temps are stored in tuple (or another kind of
>>> object) you have to manully save/restore them, and this code will
>>> really slow down generator.
>>
>> But it only has to be done when suspending/resuming, not every time a temp
>> is used. Given that there is quite some overhead involved in executing a
>> yield/resume cycle, it doesn't matter if we add a couple of fast C
>> assignments to that.
>>
>>> On the other hand when all temps are declared in closure you just have
>>> to say "closure->pyx_temp_v1"
>>>
>>> Do you think that closure->    will be much slower then stack operations?
>>
>> Yes, *much* slower. Temps are not meant to live on the stack. Most of the
>> time, they will just live in a CPU register, especially on x86_64 and other
>> architectures with plenty of registers. When you put them into the closure,
>> the C compiler must assume that it's an external operation when accessing
>> them and can't know that the next time it reads the variable it still has
>> the same value that it just wrote. That's kills tons of possible C
>> optimisations.
>
> Hmm. It is not really necessery to check each for value each time,
> it's not defined as volatile.

My knowledge of C is admittedly limited. Why should the C compiler be 
allowed to assume that a value stored somewhere in a foreign memory 
location doesn't change its value between separate store and read operations?


> Compiler will try to put it in registers and write back when reqired.

The difference is, when it's not in the closure, the function body (or 
current C scope) owns it, so the C compiler doesn't have to generate code 
to write it anywhere in the first place. Most temps are really just there 
for convenience when generating the C code, not because there really is 
something to store away. The C compiler is expected to drop almost all of them.


> Btw it's better to try both ways and choose the fastest one.

I know, predicting performance is hard at best, but in this case, it's 
clear enough to me that letting all temps live in the closure suffers from 
two serious drawbacks: space *and* time.


> How should save/restore code look like?
>
> PyObject *all_temps[NUMBER_OF_PYOBJECT_TEMPS];
>
> // save
> memcpy(closure->all_temps, all_temps, sizeof(all_temps));?
>
> Or:
>
> for (int i=0; i<  NUMBER_OF_PYOBJECT_TEMPS; i++) {
>     if (all_temps[i]) {
>       // this will create tuple (n, all_temps[i]) and add it to
> closure->temps_tuple
>       add_temp_to_closure(closure, i, all_temps[i]);
>     }
> }
>
> How should it handle native C types?

The CEP has some comments on this.

I think it would be best to generate the straight assignment code from 
within of the YieldExprNode:

     closure->place_for_int_temp[0] = __pyx_t_2
     closure->place_for_int_temp[1] = __pyx_t_5
     closure->place_for_object_temp[0] = __pyx_t_9
     return result
     resume:
     __pyx_t_2 = closure->place_for_int_temp[0]
     ...

For starters, I wouldn't mind if the closure fields were all "size_t" and 
the assignments just did an ugly cast on the RHS. Might be easier than 
properly allocating separate arrays of correctly sized fields.

BTW, I just noticed that this won't work with exception handling yet. In 
try-except and try-finally statements, we keep the current exception state 
in block local variables, those would have to get stored away as well. 
However, the problem is that this can be in nested block scopes, so the 
outer variables may not even be available to the C code. Another thing to 
deal with once this works...

Stefan
_______________________________________________
Cython-dev mailing list
Cython-dev@codespeak.net
http://codespeak.net/mailman/listinfo/cython-dev

Reply via email to