Hi Jared,
Thanks for your very useful debugging tips.

I have tracked down that CPU 0 is getting stalled. The reason for this 
is: CPU 0 issues a Store Request which misses in the L1. The L1 sends a 
Write Request to L2. Then somehow the request is lost completely. It 
never shows up in the L2, and so CPU 0 never gets a response. 
Ultimately, the watchdog expires and causes an assert.

However, while CPU 0 is stalled, CPU 1 is making requests and getting 
replies from the memory system.

Another thing I noted is that this CPU 0-getting-stuck-problem occurs 
only if there is OS activity in my test application. What I am trying to 
say is:
        In my *original* test application, there is a 'for' loop in each 
thread, and in the for loop, there is a 'printf' statement (these 
printfs cause the information to show up in the yellow "system 
console"). CPU 0 gets stuck for this application.
        However, for experiment's sake, I removed the printfs inside the for 
loop. With this application the simulation completes much, much faster, 
and it runs to completion (i.e. simulation is terminated by flexus).


Have you seen such behavior before? Any ideas as to what may be the 
problem and how I can go about finding and fixing it?

Thanks in advance for your help
Regards
- Mrinal


Jared C. Smolens wrote:
> Hi Mrinal,
> 
> Yes, messages like this do need to be on the snoop channel to prevent 
> races with downgrades, invalidates, and evictions.
> 
> Watchdog timeouts occur for two general reasons: deadlock or absurdly 
> long request-reply latencies.  
> 
> Deadlocks are most likely due to resource leaks or protocol bugs.  I 
> doubt you have introduced a protocol bug with your change, but it's 
> relatively easy to forget to free a resource.  If I had to guess, I'd 
> look at MAF entries.  If allocated, these must be freed along all return 
> paths from the cache, even if you don't send an ack.  See below for 
> debugging techniques.
> 
> The latter is unlikely, but easy to test: simply extend the timeout and 
> see if things eventually pick up again.  If so, you might want to look 
> for long queueing delays.  20K cycles in a CMP is a reasonable upper bound 
> here.
> 
> My general workflow for solving deadlocks is to identify the first 
> processor that has stalled and figure out what physical address request 
> it is waiting on (you may have to compile with a higher debugging mode, 
> such as -trace or -verb to determine this).  I then re-run the job with 
> trace-level debugging to determine what happened to that message.   Tip: 
> once you know the address, you can significantly speed this up by setting 
> the -$CACHE_NAME:trace_address option to the DECIMAL value of the 
> physical address on each level of cache.    
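
A small helper for the hex-to-decimal step in the tip above, since debug
output usually shows addresses in hexadecimal. This is only a sketch:
hexAddressToDecimal is a hypothetical name, not part of Flexus, and the
address in the usage note is made up.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not Flexus code): convert a hexadecimal physical
// address, with or without a leading "0x", to a decimal string suitable
// for an option that expects the DECIMAL value.
std::string hexAddressToDecimal(const std::string& hex) {
    // Base 16 parsing accepts an optional "0x"/"0X" prefix.
    std::uint64_t value = std::stoull(hex, nullptr, 16);
    return std::to_string(value);
}
```

For example, hexAddressToDecimal("0x1000") gives "4096", which is the form
to pass to trace_address.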
> 
> Cheers,
> 
> Jared
> 
> Excerpts From Mrinal Nath <[email protected]>:
>  [Simflex] A question related to sen: Mrinal Nath <[email protected]>
>> Hi,
>> I am trying to make the L1 caches write-through. I have added a "write 
>> through request" which is sent by the L1 to the L2 when a write-hit 
>> occurs in the L1. All I have done is used "enqueueMessage" to send the 
>> write through message on the BackSideOut_Snoop port of the L1.
>>
>> The L2 does *not* send back any acknowledgment of the write through 
>> (since I have made the L2 inclusive).
>>
>> I am using the snoop channel to send this message (if I use the request 
>> channel, then I get some races between the write-through message and 
>> invalidate-acks or downgrade-acks). Using the snoop channel avoids the 
>> races between the write-through message and the acks.
>>
>> However, I have a big problem now: the watchdog timer expires because 
>> one of my CPUs does not make any forward progress. I have tried tracking 
>> the problem, but it is very difficult to track, since the timer expires 
>> after a long time.
>>
>> If I use the request channel, I don't get such a timer related problem 
>> (I only have the races mentioned above).
>>
>> Can anyone provide some insight about why adding a simple message on the 
>> snoop channel from L1 to L2 is causing the processor to lose forward 
>> progress?
>>
>> Have I missed out some important/special requirement related to handling 
>> snoop messages going from L1 to L2?
>>
>> Any insight will be greatly appreciated.
>> Thanks,
>> - Mrinal
> 
> 
> Jared Smolens ----------- Electrical and Computer Engineering
> www.rabidpenguin.org ------------- Carnegie Mellon University
> jsmolens AT ece.cmu.edu ------ HH A-313 ------ Pittsburgh, PA
> 
> _______________________________________________
> SimFlex mailing list
> [email protected]
> https://sos.ece.cmu.edu/mailman/listinfo/simflex
> SimFlex web page: http://www.ece.cmu.edu/~simflex
From zebchuk at eecg.toronto.edu  Wed Oct 18 09:59:16 2006
From: zebchuk at eecg.toronto.edu (Jason Zebchuk)
List-Post: [email protected]
Date: Wed Oct 18 09:59:27 2006
Subject: [Simflex] StatUniqueCounter
Message-ID: <[email protected]>

Hi guys,

I'm trying to count the number of unique memory blocks accessed during a 
simulation.  I'm only really interested in the number of different 
blocks accessed, so I don't need a histogram or anything like that.

Following my intuition, I thought that Stat::StatUniqueCounter might do 
exactly what I want.

Some of my results aren't quite making sense, so I thought I'd ask you 
exactly what StatUniqueCounter does.  I tried looking at the code to 
figure it out, but, well, that just leaves me with a headache.


Thanks,

Jason
From jsmolens+ at ece.cmu.edu  Wed Oct 18 11:12:35 2006
From: jsmolens+ at ece.cmu.edu (Jared C. Smolens)
List-Post: [email protected]
Date: Wed Oct 18 11:12:40 2006
Subject: [Simflex] A question related to sending adding messages on the
Message-ID: <[email protected]>


Hi Mrinal,

The write request is getting queued at the L2, but it cannot be processed.

I'd suggest three things:

First, inspect all of the input queues and cache controller resources 
(e.g., the MAF) to see if anything has filled up and is preventing other 
requests.   A DBG_print() which displays this for the L2 on each cycle is 
useful here.  If you have a banked cache, it may also help to simplify 
the problem by switching to a single bank and reproducing the deadlock 
there. 

Second, look at doNewRequests() and make sure that write request is 
actually being considered for processing.  This is the first place where 
messages are inspected in the L2.

Finally, look for prior requests to that address and make sure they have 
completed.  The shared L2 is a serialization point, so a prior request 
for the same address can stall future requests.
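
To illustrate the first and last points, here is a toy model of how a leaked
MAF entry can block a later request to the same address. It is only a sketch
and not Flexus code: ToyController, tryAcceptHead, complete, dump, and
leakDemo are hypothetical names, and the real doNewRequests() logic is more
involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <set>
#include <string>

// Hypothetical toy model (not Flexus code): one input queue plus a small
// MAF that tracks addresses with an in-flight request.
struct ToyController {
    std::size_t mafCapacity;
    std::set<std::uint64_t> maf;           // addresses currently in flight
    std::deque<std::uint64_t> inputQueue;  // requests waiting for service

    // Try to accept the request at the head of the queue this cycle.
    bool tryAcceptHead() {
        if (inputQueue.empty()) return false;
        std::uint64_t addr = inputQueue.front();
        if (maf.count(addr) != 0) return false;       // serialized behind a prior request
        if (maf.size() >= mafCapacity) return false;  // no free MAF entry
        maf.insert(addr);
        inputQueue.pop_front();
        return true;
    }

    // Must run on every completion path, whether or not an ack is sent.
    void complete(std::uint64_t addr) { maf.erase(addr); }

    // Occupancy summary of the kind a per-cycle debug print might show.
    std::string dump() const {
        return "queue=" + std::to_string(inputQueue.size()) +
               " maf=" + std::to_string(maf.size()) + "/" +
               std::to_string(mafCapacity);
    }
};

// A write-through is accepted but its MAF entry is never freed, so a later
// request to the same address waits forever.
std::string leakDemo() {
    ToyController l2{4, {}, {}};
    l2.inputQueue.push_back(0x1000);      // write-through, no ack expected
    bool accepted = l2.tryAcceptHead();   // MAF entry allocated here
    assert(accepted);
    // Bug being illustrated: complete(0x1000) is never called.
    l2.inputQueue.push_back(0x1000);      // later store miss, same address
    for (int cycle = 0; cycle < 100; ++cycle) l2.tryAcceptHead();
    return l2.dump();                     // the second request is still queued
}
```

In this model the occupancy dump stays at one queued request and one MAF
entry on every cycle, which is the kind of pattern a per-cycle print of the
L2 state would expose.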

Printf()'s cause I/O and require relatively expensive OS activity to 
complete.  They can significantly extend the runtime of short 
microbenchmarks.  They'll also hit more varied code which can trigger 
latent bugs.  There is nothing in the memory system that depends on user 
vs. system memory requests, so I wouldn't read anything special into your 
observation.

- Jared



Jared Smolens ----------- Electrical and Computer Engineering
www.rabidpenguin.org ------------- Carnegie Mellon University
jsmolens AT ece.cmu.edu ------ HH A-313 ------ Pittsburgh, PA
