On 12/17/14 09:28, Alexander Graf wrote:
>
>> Am 17.12.2014 um 08:13 schrieb Laszlo Ersek <ler...@redhat.com>:
>>
>>> On 12/16/14 21:41, Peter Maydell wrote:
>>>> On 16 December 2014 at 19:00, Laszlo Ersek <ler...@redhat.com> wrote:
>>>> The root of this question is what each of
>>>>
>>>>   enum device_endian {
>>>>       DEVICE_NATIVE_ENDIAN,
>>>>       DEVICE_BIG_ENDIAN,
>>>>       DEVICE_LITTLE_ENDIAN,
>>>>   };
>>>>
>>>> means.
>>>
>>> An opening remark: endianness is a horribly confusing topic, and
>>> support for more than one endianness is even worse. I may have made
>>> some inadvertent errors in this reply; if you think so, please
>>> let me know and I'll have another stab at it.
>>>
>>> That said: the device_endian options indicate what a device's
>>> internal idea of its endianness is. This is most relevant if
>>> a device accepts accesses at wider than byte width.
>>> (For instance, if you can read a 32-bit value from
>>> address offset 0 and also an 8-bit value from offset 3, then
>>> how do those values line up? If you read a 32-bit value, then
>>> which way round is it compared to what the device's io read/write
>>> functions return?)
>>>
>>> NATIVE_ENDIAN means "same order as the CPU's main data bus's
>>> natural representation". (Note that this is not necessarily
>>> the same as "the endianness the CPU currently has"; on ARM
>>> you can flip the CPU between LE and BE at runtime, which
>>> is basically inserting a byte-swizzling step between data
>>> accesses and the CPU's data bus, which is always LE for ARMv7+.)
>>>
>>> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
>>> (because the guest vcpu reads it directly with the h/w cpu).
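[The "how do the values line up" question above can be made concrete with a small C sketch. This is an illustration only, not QEMU code; the function names are invented for the example. It shows the two ways a 32-bit read can be composed from a device that stores bytes at offsets 0..3, depending on the declared device endianness.]

```c
#include <stdint.h>
#include <stddef.h>

/* The "device": an array of byte-wide registers. */
static uint8_t device_read_byte(const uint8_t *regs, size_t offset)
{
    return regs[offset];
}

/* DEVICE_BIG_ENDIAN view: the byte at the lowest offset is the most
 * significant byte of the wide value. */
static uint32_t compose32_big_endian(const uint8_t *regs)
{
    uint32_t val = 0;
    for (size_t i = 0; i < 4; i++) {
        val |= (uint32_t)device_read_byte(regs, i) << (8 * (3 - i));
    }
    return val;
}

/* DEVICE_LITTLE_ENDIAN view: the byte at the lowest offset is the least
 * significant byte of the wide value. */
static uint32_t compose32_little_endian(const uint8_t *regs)
{
    uint32_t val = 0;
    for (size_t i = 0; i < 4; i++) {
        val |= (uint32_t)device_read_byte(regs, i) << (8 * i);
    }
    return val;
}
```

[For register bytes {0x11, 0x22, 0x33, 0x44}, the BE composition yields the numeric value 0x11223344 and the LE composition 0x44332211; the byte at offset 3 lines up with a different part of the 32-bit value in each case.]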
>>>
>>>> Consider the following call tree, which implements the splitting /
>>>> combining of an MMIO read:
>>>>
>>>>   memory_region_dispatch_read()             [memory.c]
>>>>     memory_region_dispatch_read1()
>>>>       access_with_adjusted_size()
>>>>         memory_region_big_endian()
>>>>         for each byte in the wide access:
>>>>           memory_region_read_accessor()
>>>>             fw_cfg_data_mem_read()          [hw/nvram/fw_cfg.c]
>>>>               fw_cfg_read()
>>>>     adjust_endianness()
>>>>       memory_region_wrong_endianness()
>>>>       bswapXX()
>>>>
>>>> The function access_with_adjusted_size() always iterates over the MMIO
>>>> address range in incrementing address order. However, it can calculate
>>>> the shift count for memory_region_read_accessor() in two ways.
>>>>
>>>> When memory_region_big_endian() returns "true", the shift count
>>>> decreases as the MMIO address increases.
>>>>
>>>> When memory_region_big_endian() returns "false", the shift count
>>>> increases as the MMIO address increases.
>>>
>>> Yep, because this is how the device has told us that it thinks
>>> of accesses as being put together.
>>>
>>> The column in your table "host value" is the 16-bit value read from
>>> the device, ie what we have decided (based on device_endian) that
>>> it would have returned us if it had supported a 16-bit read directly
>>> itself. BE devices compose 16-bit values with the high byte first,
>>> LE devices with the low byte first, and native-endian devices
>>> in the same order as guest endianness.
>>>
>>>> In memory_region_read_accessor(), the shift count is used for a logical
>>>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>>>> values and hides (ie. accounts for) host endianness.
>>>>
>>>> Let's consider
>>>> - an array of two bytes, [0xaa, 0xbb],
>>>> - a uint16_t access made from the guest,
>>>> - and all twelve possible cases.
>>>>
>>>> In the table below, the "host", "guest" and "device_endian" columns are
>>>> input variables. The columns to the right are calculated / derived
>>>> values.
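[The shift-count behaviour described above can be sketched in a few lines of C. This is a simplified model of what access_with_adjusted_size() does per byte, not the actual QEMU function; dispatch_read() and array_read() are invented names for the illustration.]

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint8_t (*read_accessor)(void *opaque, unsigned addr);

/* Model: iterate the MMIO range in incrementing address order; only the
 * per-byte shift depends on memory_region_big_endian(). */
static uint64_t dispatch_read(void *opaque, read_accessor read,
                              unsigned addr, unsigned size,
                              bool mr_big_endian)
{
    uint64_t val = 0;
    for (unsigned i = 0; i < size; i++) {
        unsigned shift = mr_big_endian
                         ? (size - 1 - i) * 8   /* decreasing shift */
                         : i * 8;               /* increasing shift */
        val |= (uint64_t)read(opaque, addr + i) << shift;
    }
    return val;
}

/* Byte-wide accessor backed by a plain array, standing in for the
 * device's read function. */
static uint8_t array_read(void *opaque, unsigned addr)
{
    return ((uint8_t *)opaque)[addr];
}
```

[With the two device bytes [0xaa, 0xbb], the big-endian iteration composes the numeric value 0xaabb and the little-endian iteration 0xbbaa, matching the "host value" column of the table below.]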
>>>> Each derived column is computed from the input columns and the
>>>> derived columns to its left.
>>>>
>>>> After memory_region_dispatch_read1() constructs the value that is
>>>> visible in the "host value" column, it is passed to adjust_endianness().
>>>> If memory_region_wrong_endianness() returns "true", then the host value
>>>> is byte-swapped. The result is then passed to the guest.
>>>>
>>>>  #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
>>>> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>>>>  1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>>>  2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>>>>  3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>>>
>>>>  4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>>>  5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>>>  6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>>>>
>>>>  7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>>>  8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>>>>  9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>>>
>>>> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>>> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>>> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb
>>>
>>> The column you have labelled 'guest repr' here is the returned data
>>> from io_mem_read, whose API contract is "write the data from the
>>> device into this host C uint16_t (or whatever) such that it is the
>>> value returned by the device read as a native host value". It's
>>> not related to the guest order at all.
>>>
>>> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
>>> which passes it a local variable "val". So now "val" has the
>>> "guest repr" column's bytes in it, and (as a host C variable) the
>>> value:
>>>    1,2,3    : 0xbbaa
>>>    4,5,6    : 0xaabb
>>>    7,8,9    : 0xbbaa
>>>    10,11,12 : 0xaabb
>>>
>>> As you can see, this depends on the "guest endianness" (which is
>>> kind of the endianness of the bus): a BE guest 16-bit access to
>>> this device would return the 16-bit value 0xaabb, and an LE guest
>>> 0xbbaa, and we have exactly those values in our host C variable.
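[The two boolean columns of the table can be modelled with a few lines of C. This is a sketch, not QEMU's code (the real functions are compile-time #ifdef switches on the target's endianness); the names region_is_big_endian() and region_wrong_endianness() are invented for the illustration.]

```c
#include <stdbool.h>

typedef enum {
    DEVICE_NATIVE_ENDIAN,
    DEVICE_BIG_ENDIAN,
    DEVICE_LITTLE_ENDIAN,
} device_endian;

/* Model of memory_region_big_endian(): the region is composed
 * big-endian if the device says BE, or if it is native-endian and the
 * guest (target) is big-endian. */
static bool region_is_big_endian(device_endian e, bool guest_big_endian)
{
    if (e == DEVICE_NATIVE_ENDIAN) {
        return guest_big_endian;
    }
    return e == DEVICE_BIG_ENDIAN;
}

/* Model of memory_region_wrong_endianness(): a byte swap is inserted
 * when the device's declared order disagrees with the guest's order;
 * native-endian regions never need one. */
static bool region_wrong_endianness(device_endian e, bool guest_big_endian)
{
    if (e == DEVICE_NATIVE_ENDIAN) {
        return false;
    }
    return (e == DEVICE_BIG_ENDIAN) != guest_big_endian;
}
```

[For example, row 2 of the table (guest LE, device BE) gives big_endian = 1 and wrong_endianness = 1, and row 6 (guest BE, device LE) gives big_endian = 0 and wrong_endianness = 1, matching the corresponding columns above.]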
>>> If this is TCG, then we'll copy that 16-bit host value into the
>>> CPUState struct field corresponding to the destination guest
>>> register as-is. (TCG CPUState fields are always in host-C-order.)
>>>
>>> However, to pass them up to KVM we have to put them into a
>>> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
>>> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
>>> in target CPU endianness"], so now buf has the bytes:
>>>    1,2,3    : [0xaa, 0xbb]
>>>    4,5,6    : [0xaa, 0xbb]
>>>    7,8,9    : [0xaa, 0xbb]
>>>    10,11,12 : [0xaa, 0xbb]
>>>
>>> ...which is the same for every case.
>>>
>>> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
>>> are "the value as it would appear if the VCPU performed a load or store
>>> of the appropriate width directly to the byte array". Which is what we
>>> want -- your device has two bytes in order 0xaa, 0xbb, and we did
>>> a 16-bit load in the guest CPU, so we should get the same answer as if
>>> we did a 16-bit load of RAM containing 0xaa, 0xbb. That will be
>>> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
>>>
>>> I think the fact that all of these things come out to the same
>>> set of bytes in the mmio.data buffer is the indication that all
>>> QEMU's byte swapping is correct.
>>>
>>>> Looking at the two rightmost columns, we must conclude:
>>>>
>>>> (a) The "device_endian" field has absolutely no significance wrt. what
>>>>     the guest sees. In each triplet of cases, when we go from
>>>>     DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>>>     the guest sees the exact same value.
>>>>
>>>>     I don't understand this result (it makes me doubt device_endian
>>>>     makes any sense). What did I do wrong?
>>>
>>> I think it's because you defined your device as only supporting
>>> byte accesses that you didn't see any difference between the
>>> various device_endian settings. If a device presents itself as
>>> just an array of bytes then it doesn't have an endianness, really.
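[The "same bytes in every case" observation can be checked with a small sketch of target-endian stores. These helpers are modelled on QEMU's stw_be_p()/stw_le_p() but written out here for illustration: they write a 16-bit value into a byte buffer in an explicit order, independent of host endianness.]

```c
#include <stdint.h>

/* Store a 16-bit value big-endian-first into a byte buffer, as a
 * big-endian target's stw_p() would. */
static void store16_be(uint8_t *buf, uint16_t val)
{
    buf[0] = (uint8_t)(val >> 8);
    buf[1] = (uint8_t)(val & 0xff);
}

/* Store a 16-bit value little-endian-first into a byte buffer, as a
 * little-endian target's stw_p() would. */
static void store16_le(uint8_t *buf, uint16_t val)
{
    buf[0] = (uint8_t)(val & 0xff);
    buf[1] = (uint8_t)(val >> 8);
}
```

[A BE guest's read value 0xaabb stored with store16_be(), and an LE guest's read value 0xbbaa stored with store16_le(), both leave the bytes [0xaa, 0xbb] in the buffer, which is exactly the convergence in the mmio.data buffer described above.]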
>>>
>>> As Paolo says, if you make your device support wider accesses
>>> directly and build up the value yourself, then you'll see that
>>> setting the device_endian to LE vs BE vs native does change
>>> the values you see in the guest (and that you'll need to set it
>>> to LE and interpret the words in the guest as LE to get the
>>> right behaviour).
>>>
>>>> This means that this interface is *value preserving*, not
>>>> representation preserving. In other words: when host and guest
>>>> endiannesses are identical, the *array* is transferred okay (the
>>>> guest array matches the initial host array [0xaa, 0xbb]). When guest
>>>> and host differ in endianness, the guest will see an incorrect
>>>> *array*.
>>>
>>> Think of a device which supports only byte accesses as being
>>> like a lump of RAM. A big-endian guest CPU which reads 32 bits
>>> at a time will get different values in registers to an LE guest
>>> which does the same.
>>>
>>> *However*, if both CPUs just do "read 32 bits; write 32 bits to
>>> RAM" (ie a kind of memcpy but with the source being the mmio
>>> register rather than some other bit of RAM) then you should get
>>> a bytewise-correct copy of the data in RAM.
>>>
>>> So I *think* that would be a viable approach: have your QEMU
>>> device code as it is now, and just make sure that if the guest
>>> is doing wider-than-byte accesses it does them as
>>>
>>>    do {
>>>        load word from mmio register;
>>>        store word to RAM;
>>>    } while (count);
>>>
>>> and doesn't try to interpret the byte order of the values while
>>> they're in the CPU register in the middle of the loop.
>>>
>>> Paolo's suggestion would work too, if you prefer that.
>>>
>>>> I apologize for wasting everyone's time, but I think both results are
>>>> very non-intuitive.
>>>
>>> Everything around endianness handling is non-intuitive --
>>> it's the nature of the problem, I'm afraid.
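[The do/while loop sketched above can be written out as C. This is an illustration of the guest-side idiom, not guest firmware code: the word lives briefly in a CPU register (the local variable) and is stored back without being interpreted, so the copy is bytewise-correct regardless of guest endianness.]

```c
#include <stdint.h>
#include <string.h>

/* Copy nwords 32-bit words from an mmio-register-like byte region to
 * RAM, without interpreting the byte order of the in-flight word. */
static void copy_words(uint8_t *ram, const uint8_t *mmio, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        uint32_t word;
        memcpy(&word, mmio + 4 * i, 4);  /* "load word from mmio register" */
        memcpy(ram + 4 * i, &word, 4);   /* "store word to RAM" */
    }
}
```

[The numeric value of "word" differs between BE and LE hosts/guests, but since it is never inspected, the bytes that land in RAM always match the source bytes.]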
>>> (Some of this is
>>> also QEMU's fault for not having good documentation of the
>>> semantics of each of the layers involved in memory accesses.
>>> I have on my todo list the vague idea of trying to write these
>>> all up as a blog post.)
>>
>> Thanks for taking the time to write this up. My analysis must have
>> missed at least two important things, then:
>> - the device's read/write function needs to consider address & size, and
>>   return values that match host byte order. fw_cfg doesn't conform ATM.
>> - there's one more layer outside the call tree that I checked that can
>>   perform endianness conversion.
>>
>> I'll try to implement Paolo's suggestion (ie. support wide accesses in
>> fw_cfg internally, not relying on splitting/combining by memory.c).
>
> Awesome :). Please define it as device little endian while you go, as well.
> That should give us fewer headaches if we want to support wide access on ppc.
Definitely; that was actually the first part of Paolo's suggestion. :)

Thanks!
Laszlo