Re: [PATCH v2 0/9] Add cache coloring API and arm64 support

'Marco Solieri' via Jailhouse Wed, 29 Jul 2020 08:47:28 -0700

On Wed, Jul 29, 2020 at 09:53:27AM +0200, Jan Kiszka wrote:
> On 29.07.20 00:41, Marco Solieri wrote:
> > On Tue, Jul 28, 2020 at 01:30:37PM +0200, Jan Kiszka wrote:
> > > On 28.07.20 13:09, Marco Solieri wrote:
> > > > On Tue, Jul 28, 2020 at 11:26:45AM +0200, Jan Kiszka wrote:
> > > > > On 28.07.20 11:15, Marco Solieri wrote:
> > > > > > On Mon, Jul 27, 2020 at 11:39:48PM +0200, Jan Kiszka wrote:
> > > > > > > On 27.07.20 23:13, Marco Solieri wrote:
> > > > > > > > If we understand correctly your
> > > > > > > > implementation, you are mapping the entire memory region and 
> > > > > > > > then
> > > > > > > > copying blocks of the binary image using what you called 
> > > > > > > > "colored
> > > > > > > > offset" function.  That was very similar to our first 
> > > > > > > > (unpublished)
> > > > > > > > attempt to implement "colored loading", and we soon discovered 
> > > > > > > > it is not
> > > > > > > > very efficient.  Loading time grows considerably and could 
> > > > > > > > impede
> > > > > > > > scaling up to larger images (e.g. an Ubuntu-like rootfs).  We 
> > > > > > > > think that
> > > > > > > > a better alternative is closer to what we proposed in the last 
> > > > > > > > patchset,
> > > > > > > > that is:
> > > > > > > > - create a colored mapping,
> > > > > > > > - perform a "virtually" contiguous copy of the image.
> > > > > > > > 
> > > > > > > > We understand that this logic has to be moved outside the 
> > > > > > > > hypervisor, so
> > > > > > > > we would like to move it to the driver, exploiting Linux 
> > > > > > > > virtual addres
> > > > > > > > space to create the colored mapping.  What do you think about 
> > > > > > > > it?
> > > > > > > 
> > > > > > > Did you examine what was causing this slowdown? It seems highly
> > > > > > > counter-intuitive to me, given that we are only copying from 
> > > > > > > memory to
> > > > > > > memory in 4K chunks, just using different virtual addresses - 
> > > > > > > that' all.
> > > > > > 
> > > > > > I think you are right, indeed.  Doublechecking our old (and slow)
> > > > > > prototype, we discovered we were actually not just copying at each
> > > > > > iteration step, but we were also been ioremapping.  That was most
> > > > > > probably the cause of the slowdown.
> > > > > > 
> > > > > > That leaves only a design point in favour of the contiguous virtual
> > > > > > mapping.  It makes the implementation more robust and elegant, 
> > > > > > since it
> > > > > > enables the copy operation to be independent from coloring and thus
> > > > > > reusable.  This is enough for us to favour this approach.
> > > > > > 
> > > > > 
> > > > > To my understanding, the choice is between:
> > > > > 
> > > > > for_each_colored_chunk
> > > > >       copy_chunk
> > > > > 
> > > > > and
> > > > > 
> > > > > for_each_colored_chunk
> > > > >       remap_chunk
> > > > > copy_whole image
> > > > > drop_mapping
> > > > > 
> > > > > As the first option can be reused for uncolored images as well, I do 
> > > > > not
> > > > > really see the value of option 2. Prove me wrong by code ;).
> > > > 
> > > > I see.  I would easily agree with you if we assume to use the
> > > > `jailhouse_get_colored_offs` (or something similar), but we rather not
> > > > do so.
> > > > 
> > > > Instead, the `next_colored` function should be favored as the
> > > > fundamental coloring algorithm implementation, because it is closer to
> > > > the hardware meaning of coloring, in the sense that it preserves the
> > > > notion of bits in the address, instead of abstracting it away behind the
> > > > notion of offsets in the memory space.
> > > 
> > > I disagree here. First, because this abstraction is a benefit - provided I
> > > didn't miss a case.
> > 
> > Abstraction is good only until something meaningful for you is hidden.
> > Then, it becomes obfuscation.  Coloring is a concept that cannot be
> > separated from its bitwise nature just because of some additional
> > hypothesis (contiguous color assignment or contiguous color bits) that
> > restricts the generality without bringing value.
> > 
> > 
> > > And second, because the algorithm avoids the loop for the calculation
> > > and simplifies the loop for the virtual memory copying or mapping (the
> > > latter is not implemented yet, so just a claim of me so far).
> > 
> > The loop for calculation is useful when you have to deal with non
> > contiguous color assignments, but it can be removed, of course.  I guess
> > that also the other kinds can be removed, but this looks orthogonal to
> > the chosen approach (next_colored or offset).


We explored some optimisation in this sense.  They all boil down to
either memoise the sequence of possibly discontinuous colour
assignments, or reduce the loop iteration length by iterating over the
colour bitmasks, instead of their content.  In all cases the
implementation is heavier and less readable/maintainable.


> > > > As a by-product, the implementation is also more flexible and generic,
> > > > since it could be applied also to different kind of coloring, e.g. to
> > > > bank coloring.  In these cases, nasty placement of useful bits could
> > > > make an "offset-oriented" implementation hard to read (and also very
> > > > difficult to write :-P).  E.g., consider a case where only 12 and 14
> > > > need to be used for a 4-colors platform configuration.
> > > 
> > > If you can point out concrete platforms/SoCs with such properties, it 
> > > would
> > > help finding the best solution. I dislike design decisions that are based 
> > > on
> > > speculation what could come. Usually, they take costs for "extensibility",
> > > and the outcome will still have to be adjusted when the real use case 
> > > comes
> > > along.
> > 
> > I agree with your position against unfounded extensibility arguments,
> > but it does not apply here.
> > 
> > - I am not speculating about unreal possibilities, I was referring to
> >    real hardware. The first result I found is Intel i7-860, where:
> >    - cache coloring is possible on address bits: 12-18;
> >    - bank coloring on: 13-15,21-22.
> >    A discussion about bank/cache coloring on such platform is in
> > 
> >      Lei Liu, Z. Cui, Mingjie Xing, Y. Bao, M. Chen and Chengyong Wu, "A
> >      software memory partition approach for eliminating bank-level
> >      interference in multicore systems," 2012 21st International
> >      Conference on Parallel Architectures and Compilation Techniques
> >      (PACT), Minneapolis, MN, 2012, pp. 367-375.
> > 
> 
> Bank coloring has also been explored in PALLOC in the past
> (https://github.com/heechul/palloc).

Yeah, I know it very well.  There is also another and more recent
discussion about multi-dimensional colouring in ''Coordinated Bank and
Cache Coloring for Temporal Protection of Memory Accesses'
https://ieeexplore.ieee.org/document/6755286 .

> I think Nicholas (CC'ed) and his fellows did a prototype with
> Jailhouse as well.

Interesting.  Nicholas, is the prototype available somewhere online?

> Would be good to collect the requirements from today's CPUs for that
> (i7-860 is probably predating the needs of Jailhouse).

I said it was the first counterexample I found, not the best :-)


> > - Nor the next_colored proposal brings additional costs, since compared
> >    to the offset implementation
> >    - it is simpler and more elementary;
> >    - it has close or equal performance;
> >    - a tested implementation is ready in v2.
> 
> I agree that, if we still face non-contiguous bit ranges, straightforward
> calculations have to be replaced with a bitmap search algorithm like you
> did. Maybe then using ffsl.

We weren't aware of the builtin availability of these functions. Nice!
(I hope it is portable enough.)


> > > How did FreeBSB - or what BSD was it? - model coloring?
> > 
> > FreeBSD supports only a simple form of cache coloring -- I would not
> > consider it as a reference, especially about extensibility beyond cache
> > coloring.
> 
> Good to know.
> 
> > > Any other reference that may have worked on more that one board?
> > 
> > We tested the proposed solution on Xilinx ZU9 and ZU7, Nvidia TX2, and
> > NXP i.MX8 QM.  We have tested a very similar implementation logic also
> > in Xen on Arm v8 (Xilinx ZU9) and x86-64 (Intel i5-5xxx).
> > 
> > Do you have any point in favour of the offset-based alternative
> > implementation?
> 
> As said above: If we need to handle non-contiguous bitmaps, ie. multiple
> jailhouse_cache regions per cell, your approach is needed. Then we should
> improve the implementation (make it arch-agnostic, move to
> jailhouse/coloring.h, use [__builtin_]ffsl).

We are close to finishing a v3 revision, where we have followed this
lead and improved the next_colored code.  Beside the contiguous virtual
mapping, and the revision of the region size semantics, all other
comments have been/are being addressed.  It should be ready in the next
days.

-- 
Marco Solieri, Ph.D.
Research Fellow

High-Performance Real-Time Lab
Università degli Studi di Modena e Reggio Emilia
Ufficio 1.35 - Edificio Matematica - 213/b, via Campi - 41125 Modena
Tel: +39-059-205-55-10  -- OpenPGP: 0x75822E7E

-- 
You received this message because you are subscribed to the Google Groups 
"Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to jailhouse-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/jailhouse-dev/20200729154605.3igcz5ni4ddukbze%40carbon.xt3.it.

signature.asc
Description: PGP signature

Re: [PATCH v2 0/9] Add cache coloring API and arm64 support

Reply via email to