Thank you so much for the kind and detailed explanations!

Just to clarify: I can use the APU config (apu_se.py) and swap in an
O3 CPU, and I would still have the detailed GPU model and the disconnected
Ruby model that synchronizes between the CPU and GPU at the system-level
directory -- is that correct?

Last question: when using the APU config to simulate HeteroSync, which,
for example, has a sleep-mutex primitive that invokes
__builtin_amdgcn_s_sleep() (see the sketch below), is there any OS
involvement? If so, would SE mode's emulation of those syscalls inherently
sacrifice fidelity in a way that could arguably lead to inaccurate
evaluations of heterogeneous coherence implementations? Or are there any
other sources of insufficient fidelity that might be important in this
regard?
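
For reference, here is roughly the pattern I have in mind -- a minimal
sketch in the style of HeteroSync's sleep mutex, not its actual source:

    #include <hip/hip_runtime.h>

    // Sketch of a sleep-mutex acquire loop.  The builtin emits the s_sleep
    // instruction inline: a hardware back-off, with no OS or syscall
    // involvement in the wait loop itself.
    __device__ void sleepMutexLock(unsigned int *lock) {
        // Try to grab the lock; back off in hardware while it is held.
        while (atomicCAS(lock, 0u, 1u) != 0u) {
            __builtin_amdgcn_s_sleep(2);  // sleep a fixed number of cycles
        }
    }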


On Fri, Jun 30, 2023 at 7:40 PM Matt Sinclair <mattdsinclair.w...@gmail.com>
wrote:

> Just to follow up on 4 and 5:
>
> 4.  The synchronization should happen at the directory level here, since
> this is the first level of the memory system where both the CPU and GPU are
> connected.  However, if the programmer sets the GLC bit (which should
> perform the atomic at the GPU's LLC), I have not tested whether Ruby has
> the functionality to send the appropriate invalidations to allow this.  I
> suspect it would work as is, but I would have to check ...
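>
> To make the two cases concrete, here is a HIP-style sketch (mine, untested;
> the scope macros are standard HIP, but how they map onto the GLC/SLC bits
> depends on the toolchain):
>
>     #include <hip/hip_runtime.h>
>
>     // Illustrative only: the atomic's scope decides where it can be
>     // performed.  A system-scope atomic must be visible to the CPU, so it
>     // goes out to the directory; an agent-scope atomic can complete at the
>     // GPU LLC -- the "GLC bit" case above.
>     __device__ void incSystemScope(unsigned int *p) {
>         __hip_atomic_fetch_add(p, 1u, __ATOMIC_RELAXED,
>                                __HIP_MEMORY_SCOPE_SYSTEM);
>     }
>     __device__ void incAgentScope(unsigned int *p) {
>         __hip_atomic_fetch_add(p, 1u, __ATOMIC_RELAXED,
>                                __HIP_MEMORY_SCOPE_AGENT);
>     }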
>
> 5.  Yeah, for the reasons Matt P already stated, O3 is not currently
> supported in GPUFS.  So GPUSE would be a better option here.  Yes, you can
> use the apu_se.py script as the base script for running GPUSE experiments.
> There are a number of examples on gem5-resources for how to get started
> with this (including HeteroSync), but I normally recommend starting with
> square if you haven't used the GPU model before:
> https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/.
> In terms of support for synchronization at different levels of the memory
> hierarchy, by default the GPU VIPER coherence protocol assumes that all
> synchronization happens at the system level (at the directory, in the
> current implementation).  However, one of my students will be pushing
> updates (hopefully today) that allow non-system level support (e.g., the
> GPU LLC "GLC" level as mentioned above).  It sounds like you want to change
> the cache hierarchy and coherence protocol to add another level of cache
> (the L3) before the directory and after the CPU/GPU LLCs?  If so, you would
> need to change the current Ruby support to add this additional level and
> the appropriate transitions to do so.  However, if you instead meant that
> you are thinking of the directory level as synchronizing between the CPU
> and GPU, then you could use the support as is without any changes (I think).
>
> Hope this helps,
> Matt S.
>
> On Fri, Jun 30, 2023 at 12:05 PM Poremba, Matthew via gem5-users <
> gem5-users@gem5.org> wrote:
>
>>
>> Hi,
>>
>> No worries about the questions! I will try to answer them all, so this
>> will be a long email 😊:
>>
>> The disconnected (or disjoint) Ruby network is essentially the same as
>> the APU Ruby network used in SE mode.  That is, it combines two Ruby
>> protocols into one protocol (MOESI_AMD_Base and GPU_VIPER).  The two
>> sides are disjoint because there are no paths / network links between the
>> GPU and CPU side, simulating a discrete GPU.  The protocols work together
>> because they use the same network messages / virtual channels to the
>> directory; this also means you cannot simply drop in another CPU protocol
>> and have it work.
>>
>> Atomic CPU support is **very** recent – as in this week.  It is on the
>> review board right now, and I believe it might be part of the gem5 v23.0
>> release.  However, the reason Atomic and KVM CPUs are required is that
>> they use the atomic_noncaching memory mode and basically bypass the CPU
>> caches.  The timing CPUs (Timing and O3) try to generate routes to the
>> GPU side, which causes deadlocks.  I have not had time to look into this
>> further, but that is the status.
>>
>> | are the GPU applications run on KVM?
>>
>> The CPU portion of GPU applications runs on KVM.  The GPU is simulated
>> in timing mode, so the compute units, caches, memory, etc. are all
>> simulated with events.  For an application that simply launches GPU
>> kernels, the CPU is just waiting for the kernels to finish.
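>>
>> For example, the typical host-side pattern is just launch-and-wait (a
>> generic HIP sketch, not from any particular benchmark):
>>
>>     #include <hip/hip_runtime.h>
>>
>>     __global__ void busyKernel(int *out) { out[threadIdx.x] = threadIdx.x; }
>>
>>     int main() {
>>         int *buf;
>>         hipMalloc(&buf, 64 * sizeof(int));
>>         // From the launch onward, the simulated GPU runs in timing mode...
>>         hipLaunchKernelGGL(busyKernel, dim3(1), dim3(64), 0, 0, buf);
>>         // ...while the (KVM) CPU simply blocks here until it finishes.
>>         hipDeviceSynchronize();
>>         hipFree(buf);
>>         return 0;
>>     }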
>>
>> For your other questions:
>>
>> 1.  Unfortunately no, it is not that easy.  The issue with timing CPUs is
>> still an outstanding bug – we focused on the atomic CPU recently as a way
>> to let users who aren't able to use KVM still use the GPU model.
>>
>> 2.  KVM exits whenever there is a memory request outside of its VM range.
>> The PCI address range is outside the VM range, so for example when the CPU
>> writes to PCI space it will trigger an event for the GPU. The only Ruby
>> involvement here is that Ruby will send all requests outside of its memory
>> range to the IO bus (KVM or not).
>>
>> 3.  The MMIO trace is only used to load the GPU driver and is not used by
>> applications.  It basically contains reasonable register values for
>> anything that is not modeled in gem5, so that we do not need to model
>> them (e.g., graphics, power management, video encode/decode, etc.).  This
>> is not required for compute-only GPU variants, but that is a different
>> topic.
>>
>> 4.  I’m not familiar enough with this particular application to answer
>> this question.
>>
>> 5.  I think you will need to use SE mode to do what you are trying to
>> do.  Full system mode uses the real GPU driver, ROCm stack, etc., which
>> currently do not support any APU-like devices.  SE mode is able to do
>> this by making use of an emulated driver.
>>
>>
>> -Matt
>>
>> *From:* Anoop Mysore via gem5-users <gem5-users@gem5.org>
>> *Sent:* Friday, June 30, 2023 8:43 AM
>> *To:* The gem5 Users mailing list <gem5-users@gem5.org>
>> *Cc:* Anoop Mysore <mysan...@gmail.com>
>> *Subject:* [gem5-users] Re: Replacing CPU model in GPU-FS
>>
>> It appears the host part of GPU applications is indeed executed on KVM,
>> from:
>> https://www.gem5.org/assets/files/workshop-isca-2023/slides/improving-gem5s-gpufs-support.pdf
>>
>> A few more questions:
>>
>> 1. I notice it isn't mentioned whether O3 CPU models are supported --
>> would enabling one be as easy as changing the `cpu_type` in the config
>> file and running? I intend to run with the latest O3 CPU config I have
>> (an Intel CPU).
>> 2. The Ruby network that's used -- is it intercepting (perhaps just MMIO)
>> memory operations from the KVM CPU? Could you briefly describe how Ruby
>> works with both KVM and the GPU (or point me to any document)?
>> 3. The GPU MMIO trace we pass during simulator invocation -- what exactly
>> is this? If it's a trace of the kernel driver/CPU's MMIO calls into the
>> GPU, how is it portable across different programs within a benchmark
>> suite -- HeteroSync, for example?
>> 4. In HeteroSync, there's fine-grain synchronization between the CPU and
>> GPU in many apps (see the sketch below for the kind of pattern I mean). If
>> I use vega10_kvm.py, which has a discrete GPU with a KVM CPU, where do the
>> synchronizations happen?
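>>
>> For concreteness, this is the kind of pattern I mean (an illustrative
>> sketch, not HeteroSync's actual source), where the GPU spins on a flag
>> the CPU later writes:
>>
>>     #include <hip/hip_runtime.h>
>>
>>     // GPU side: spin until the CPU's store becomes visible.
>>     __global__ void waitForCpu(volatile unsigned int *flag) {
>>         while (*flag == 0) { }  // serviced through the GPU cache hierarchy
>>     }
>>
>>     int main() {
>>         unsigned int *flag;
>>         // Fine-grained, coherent host memory visible to CPU and GPU.
>>         hipHostMalloc(&flag, sizeof(*flag), hipHostMallocCoherent);
>>         *flag = 0;
>>         hipLaunchKernelGGL(waitForCpu, dim3(1), dim3(1), 0, 0, flag);
>>         *flag = 1;               // CPU-side store the GPU is waiting on
>>         hipDeviceSynchronize();  // returns once the kernel has seen it
>>         hipHostFree(flag);
>>         return 0;
>>     }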
>>
>> 5. If I want to move to an integrated GPU model with an O3 CPU (the only
>> requirement is the shared LLC) -- are there any resources that can help me?
>> I do see a bootcamp that uses apu_se.py -- can this be utilized, at least
>> partially, to support a full-system O3 CPU + integrated GPU? Are there any
>> modifications that need to be made to support synchronization in the L3?
>>
>> Please excuse the jumbled questions -- I am still in the process of
>> gaining clarity.
>>
>> On Fri, Jun 30, 2023 at 12:10 PM Anoop Mysore <mysan...@gmail.com> wrote:
>>
>> According to the GPU-FS blog
>> <https://www.gem5.org/2023/02/13/moving-to-full-system-gpu.html>,
>>
>>     "*Currently KVM and X86 are required to run full system. Atomic and
>> Timing CPUs are not yet compatible with the disconnected Ruby network
>> required for GPUFS and is a work in progress*."
>>
>> My understanding is that KVM is used to boot Ubuntu; so, are the GPU
>> applications run on KVM? Also, what does "disconnected" Ruby network mean
>> there?
>>
>> If so, is there any work in progress that I can build on, or any
>> (noob-friendly) documentation of what needs to be done to extend the
>> support to the Atomic/O3 CPUs?
>>
>> For a project I'm working on, I need complete visibility into the CPU+GPU
>> cache hierarchy, plus perhaps a few more custom probes; could you comment
>> on whether going with KVM in the meantime would be restrictive, given that
>> it leverages the host for the virtualized HW?
>>
>> Please let me know if I have got any of this wrong or if there are other
>> details you think would be useful.
>>
>>
>
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org
