On Mon, Nov 7, 2011 at 12:03 PM, Anthony Liguori <anth...@codemonkey.ws> wrote:
> On 11/07/2011 11:52 AM, Sasha Levin wrote:
>>
>> Hi Anthony,
>>
>> Thank you for your comments!
>>
>> On Mon, 2011-11-07 at 11:37 -0600, Anthony Liguori wrote:
>>>
>>> On 11/06/2011 02:40 PM, Sasha Levin wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm planning on doing a small fork of the KVM tool to turn it into a
>>>> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
>>>>
>>>> The idea was discussed briefly a couple of months ago, but never got off
>>>> the ground - which is a shame IMO.
>>>>
>>>> It's easy to explain the problem: If an attacker finds a security hole
>>>> in any of the devices which are exposed to the guest, the attacker would
>>>> be able to either crash the guest, or possibly run code on the host
>>>> itself.
>>>>
>>>> The solution is also simple to explain: Split the devices into different
>>>> processes and use seccomp to sandbox each device into the exact set of
>>>> resources it needs to operate, nothing more and nothing less.
>>>>
>>>> Since I'll be basing it on the KVM tool, which doesn't really emulate
>>>> that many legacy devices, I'll focus first on the virtio family for the
>>>> sake of simplicity (and covering 90% of the options).
>>>>
>>>> This is my basic overview of how I'm planning on implementing the
>>>> initial POC:
>>>>
>>>> 1. First I'll focus on the simple virtio-rng device, it's simple enough
>>>> to allow us to focus on the aspects which are important for the POC
>>>> while still covering most bases (i.e. sandbox to single file
>>>> - /dev/urandom and such).
>>>>
>>>> 2. Do it on a one process per device concept, where for each device
>>>> (notice - not device *type*) requested, a new process which handles it
>>>> will be spawned.
>>>>
>>>> 3. That process will be limited exactly to the resources it needs to
>>>> operate, for example - if we run a virtio-blk device, it would be able
>>>> to access only the image file which it should be using.
>>>>
>>>> 4. Connection between hypervisor and devices will be based on unix
>>>> sockets, this should allow for better separation compared to other
>>>> approaches such as shared memory.
>>>>
>>>> 5. While performance is an aspect, complete isolation is more important.
>>>> Security is primary, performance is secondary.
>>>>
>>>> 6. Share as much code as possible with current implementation of virtio
>>>> devices, make it possible to run virtio devices either like it's being
>>>> done now, or by spawning them as separate processes - the amount of
>>>> specific code for the separate process case should be minimal.
>>>>
>>>>
>>>> That's all I have for now, comments are *very* welcome.
>>>
>>> I thought about this a bit and have some ideas that may or may not help.
>>>
>>> 1) If you add device save/load support, then it's something you can
>>> potentially use to give yourself quite a bit of flexibility in changing
>>> the sandbox.  At any point in run time, you can save the device model's
>>> state in the sandbox, destroy the sandbox, and then build a new sandbox
>>> and restore the device to its former state.
>>>
>>> This might turn out to be very useful in supporting things like device
>>> hotplug and/or memory hot plug.
>>>
>>> 2) I think it's largely possible to implement all device emulation
>>> without doing any dynamic memory allocation.  Since memory allocation
>>> DoS is something you have to deal with anyway, I suspect most device
>>> emulation already uses a fixed amount of memory per device.  This can
>>> potentially dramatically simplify things.
>>>
>>> 3) I think virtio can/should be used as a generic "backend to frontend"
>>> transport between the device model and the tool.
>>
>> virtio requires server and client to have shared memory, so if we
>> already go with shared memory we can just let the device manage the
>> actual virtio driver directly, no?
>
> Let's say you're implementing an IDE device model in the sandbox.  You can
> try to implement the block layer in the sandbox but I think that quickly
> will become too difficult.
>
> You can do as Avi suggested and do all DMA accesses from the IDE device
> model as RPCs, or you can map guest memory as shared memory and utilize (1)
> in order to change that mapping as you need to.
>
> At some point, you end up with a struct iovec and an offset that you want to
> read/write to the virtual disk.  You need a way to send that to the
> "frontend" that will then handle that as a raw/qcow2 request.
>
> Well, virtio is great at doing exactly that :-)   So if you increase your
> shared memory to have a little bit extra to stick another vring, you can use
> that for device model -> front end communication without paying an extra
> memcpy.
>
> For notifications, the easiest thing to do is set up an "event channel"
> bitmap and use a single eventfd to multiplex that event channel bitmap.
> This is pretty much how Xen works btw.  A single interrupt is reserved and
> a bitmap is used to dispatch the actual events.
>
> So the sandbox loop would look like:
>
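> int main(void)
> {
>     int i;
>
>     setup_devices();
>
>     for (;;) {
>         /* Block until the single eventfd signals a pending event. */
>         read_from_event_channel(main_channel);
>         for (i = 0; i < nr_vrings; i++)
>             check_vring_notification(i);
>     }
> }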
>
> One vring would be used for dispatching PIO/MMIO.  The remaining vrings
> could be used for anything really.
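
For illustration only, the event-channel scheme described above could be
sketched roughly as follows. This is a minimal sketch under assumptions, not
code from the KVM tool or from Xen: the names struct event_channel,
event_channel_notify(), event_channel_wait() and record_channel() are
hypothetical, the pending bitmap sits in an ordinary struct rather than in
memory shared between two processes, and the atomics are GCC/Clang builtins.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

#define NR_CHANNELS 64

/* Hypothetical layout: one pending bit per channel, one eventfd as doorbell.
 * In a real split-process setup the bitmap would live in shared memory. */
struct event_channel {
    int efd;            /* the single eventfd that multiplexes all channels */
    uint64_t pending;   /* bit i set => channel i has an event to service */
};

int event_channel_init(struct event_channel *ec)
{
    ec->pending = 0;
    ec->efd = eventfd(0, 0);
    return ec->efd < 0 ? -1 : 0;
}

/* Sender: mark the channel pending first, then kick the eventfd. */
int event_channel_notify(struct event_channel *ec, unsigned int chan)
{
    uint64_t one = 1;

    assert(chan < NR_CHANNELS);
    __atomic_fetch_or(&ec->pending, UINT64_C(1) << chan, __ATOMIC_SEQ_CST);
    return write(ec->efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Receiver: block on the eventfd, grab and clear the bitmap, then call the
 * handler once per pending channel. Returns the number of channels
 * dispatched (possibly 0 on a spurious wakeup) or -1 on read failure. */
int event_channel_wait(struct event_channel *ec,
                       void (*handler)(unsigned int chan, void *opaque),
                       void *opaque)
{
    uint64_t count, bits;
    int dispatched = 0;

    if (read(ec->efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;

    bits = __atomic_exchange_n(&ec->pending, 0, __ATOMIC_SEQ_CST);
    while (bits) {
        unsigned int chan = (unsigned int)__builtin_ctzll(bits);

        bits &= bits - 1;   /* clear the lowest set bit */
        handler(chan, opaque);
        dispatched++;
    }
    return dispatched;
}

/* Example handler: records which channels fired into a caller's bitmask. */
void record_channel(unsigned int chan, void *opaque)
{
    *(uint64_t *)opaque |= UINT64_C(1) << chan;
}
```

In this sketch several notifications may collapse into one wakeup, since the
eventfd only says "look at the bitmap"; the bitmap, not the eventfd counter,
carries which channels need service.
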
>
> Like I mentioned elsewhere, just think of the sandbox as an extension
> of the guest's firmware.  The purpose of the sandbox is to reduce a very
> complicated, legacy device model into a very simple and easy to audit,
> purely virtio based model.
>
>>
>> Also, things like interrupts would also require some sort of a different
>> IPC, which would complicate things a bit.
>>
>>
>>> 4) Lack of select() is really challenging.  I understand why it's not
>>> there since it can technically be emulated but it seems like a no-risk
>>> syscall to whitelist and it would make programming in a sandbox so much
>>> easier.  Maybe Andrea has some comments here?  I might be missing
>>> something here.
>>
>> There are several of these which would be nice to have, and if we can
>> get seccomp filters we have good flexibility with which APIs we allow
>> for each device.
>
> Yeah, filters are nice but I fear that you lose some of the PR benefits of
> sandboxing.  Once the first application that claims to use sandboxing
> whitelists a syscall it shouldn't, you'll start getting slashdot articles
> about "Linux sandbox broken, Linux security hopelessly broken".  Then
> what's the point of all of this?

Approaching the limit: since no security code/infrastructure is
perfect, then what's the point of all of this? :)

When I've spoken about seccomp_filter, I've tried to avoid the word
'sandbox' as that comes with more baggage than just creating a means
of reducing the kernel's attack surface.  Ideally, seccomp_filter just
fills the void between read/write/sigreturn/exit and
all-the-system-calls: Don't want select? ok. Want epoll? ok... It
does mean that developers will have to determine the tradeoffs
themselves (or with some general guidance).  But, I expect there'd be
quite a few more consumers of seccomp if it was possible to not need
to emulate select() behavior or if, for example, brk() was allowed.
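
To make the "fill the void" idea concrete, here is a minimal sketch of a
per-process syscall whitelist. It is written against the seccomp BPF
interface as it exists in later mainline kernels (prctl() with
SECCOMP_MODE_FILTER), which is an assumption relative to the patches under
discussion in this thread. build_whitelist() and install_whitelist() are
hypothetical helper names, and which syscalls go on the list is exactly the
tradeoff each developer would have to decide.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#define MAX_FILTER_LEN 64

/* Build a BPF program that kills the caller unless the syscall ABI matches
 * `arch` and the syscall number is in allowed[]. Returns the number of
 * instructions written, or 0 if `max` is too small. (Hypothetical helper.) */
size_t build_whitelist(struct sock_filter *insns, size_t max,
                       unsigned int arch,
                       const unsigned int *allowed, size_t n)
{
    size_t i, len = 0;

    assert(insns && allowed);
    if (max < 2 * n + 5)
        return 0;

    /* Refuse any ABI other than the expected one. */
    insns[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        offsetof(struct seccomp_data, arch));
    insns[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                        arch, 1, 0);
    insns[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_KILL);

    /* Load the syscall number and compare it against each allowed entry. */
    insns[len++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        offsetof(struct seccomp_data, nr));
    for (i = 0; i < n; i++) {
        /* Match: fall through to ALLOW. No match: skip over it. */
        insns[len++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                            allowed[i], 0, 1);
        insns[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                            SECCOMP_RET_ALLOW);
    }

    /* Anything not on the list is fatal. */
    insns[len++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_KILL);
    return len;
}

/* Install the filter on the calling thread; it cannot be removed later. */
int install_whitelist(struct sock_filter *insns, size_t len)
{
    struct sock_fprog prog = {
        .len = (unsigned short)len,
        .filter = insns,
    };

    /* Needed so an unprivileged process is allowed to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A device process might pass something like { SYS_read, SYS_write, SYS_exit,
SYS_rt_sigreturn, SYS_brk } plus whatever polling call it wants, together
with the AUDIT_ARCH_* value for its own ABI, and call install_whitelist()
once its setup is done.
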

cheers!
will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
