Apparently ptlsim isn't set up for the AMD K8 unless one actually
replaces certain files in the stable distribution with their AMD K8
versions (it is not enough to just add these files to the distribution
- they need to be renamed). So it's no wonder I didn't get 11 cycles.

Of course replacing the relevant files causes ptlsim to not build.

But that's ok because some guy in Australia has a year-old patch
for that (which is not on the website)....

... which applies, but doesn't work.

:-(

Bill.

2008/12/4 Bill Hart <[EMAIL PROTECTED]>:
> 2008/12/4 David Harvey <[EMAIL PROTECTED]>:
>>
>>
>> On Dec 4, 2008, at 10:59 AM, Bill Hart wrote:
>>
>>> A quick update on ptlsim.
>>>
>>> Someone on the ptlsim list pointed out to me that when priming the
>>> cache, one needs to do this *during* the simulator run, not
>>> immediately before, as ptlsim also simulates the caches and has no way
>>> of accessing actual processor caches (or apparently of tracking what
>>> should be in them).
>>>
>>> This sped things up considerably. My 11 cycle per limb code now only
>>> takes 16.4 cycles per limb on ptlsim.
>>
>> I assume you mean 11 cycles per loop, not per limb.
>
> Yes, sorry.
>
>>
>>> A guy who works for AMD's open source initiative also mentions that he
>>> has found ptlsim to be off by a bit. His group has a version of ptlsim
>>> called ptlsim/asf which is supposedly better at scheduling accurately.
>>> However a quick look at their website suggests this is designed to
>>> mimic a K10 not a K8.
>>
>> Well, maybe with sufficient effort, you can get ptlsim down to 10
>> cycles/loop, and then you'll be done ---- who cares what happens on
>> the actual chip :-P
>>
>
> I think I will leave that to the experts. At present hopefully the
> addmul_1 example provides them with something to look into.
>
>> BTW something else to watch out for: sometimes if the input buffers
>> are badly aligned, the loop can run more slowly. On the K8 this is
>> related to L1 bank conflicts (see AMD manuals), i.e. if two addresses
>> are congruent mod 64 (but unequal) then the chip can access only one
>> per cycle, not two as usual. So for addmul_1 there are 8 possible
>> offsets mod 64 (one per 8-byte limb) to consider for each of the two
>> input buffers, so 64 combinations altogether.
>> Ideally the loop should have the same performance for all alignments.
>> I wonder if ptlsim knows about this sort of thing.
>>
>
> It is supposed to know about all such things. It models the cache
> hardware in a "cycle accurate" way. Of course cycle accurate is
> defined by a long paper that someone wrote. :-)
>
> Bill.
>
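
On the alignment point quoted above, here is a minimal sketch of how the
64 cases could be enumerated. It assumes 8-byte limbs and takes the
"congruent mod 64 but unequal" rule exactly as stated in the message; the
function names and base addresses are made up for illustration. It only
compares the starting addresses of the two buffers and is not a timing
model of the K8 or of ptlsim.

```python
LINE = 64  # bytes; the modulus from the "congruent mod 64" rule above
LIMB = 8   # bytes per limb (assumes a 64-bit build)


def may_conflict(addr_a, addr_b):
    """True if the two addresses are unequal but congruent mod 64."""
    return addr_a != addr_b and addr_a % LINE == addr_b % LINE


def alignment_cases():
    """All (src_offset, dst_offset) pairs of limb-aligned offsets mod 64."""
    offsets = range(0, LINE, LIMB)
    return [(s, d) for s in offsets for d in offsets]


def conflicting_cases(src_base=0x10000, dst_base=0x20000):
    """Offset pairs whose starting addresses satisfy the conflict rule.

    The base addresses are hypothetical and chosen to be multiples of 64.
    """
    return [(s, d) for s, d in alignment_cases()
            if may_conflict(src_base + s, dst_base + d)]


if __name__ == "__main__":
    print(len(alignment_cases()), "alignment combinations")
    for s, d in conflicting_cases():
        print("src offset %2d, dst offset %2d: same residue mod 64" % (s, d))
```

In a real test one would time the loop once per offset pair (for example
by offsetting the source and destination pointers inside over-allocated
buffers) and check that all 64 timings agree.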
