Re: [PATCH] memory: Prevent array overflow in l1 page buffer pool.

2016-09-19 Thread Player, Timmons


On 9/19/16, 7:42 PM, "osv-dev@googlegroups.com on behalf of Nadav Har'El"
 wrote:

>diff --git a/core/mempool.cc b/core/mempool.cc
>index 50c938f..fad046d 100644
>--- a/core/mempool.cc
>+++ b/core/mempool.cc
>@@ -1099,9 +1099,12 @@ struct l1 {
> }
> static void* alloc_page_local();
> static bool free_page_local(void* v);
>-void* pop() { return _pages[--nr]; }
>-void push(void* page) { _pages[nr++] = page; }
>-void* top() { return _pages[nr - 1]; }
>+void* pop() {
>+void *page = nullptr;
>
>
>
>
>Nitpick (not important compared to the bigger comments below):
>
>
>
>I think that setting this to nullptr is redundant, and confusing (it led me
>to think there is a reason why it should be set).
>
>If you anyway leave this default value, you could do
>
>_pages.pop(page);
>
>return page;
>
>instead of the ?: expression below. Maybe that's what you intended to do?

Eh.  Just a habit of mine to always initialize variables.

>
>
>
>+return (_pages.pop(page) ? page : nullptr);
>+}
>+bool push(void *page) { return _pages.push(page); }
>+unsigned size() { return _pages.size(); }
> void wake_thread() { _fill_thread.wake(); }
> static void fill_thread();
> static void refill();
>@@ -1110,11 +1113,12 @@ struct l1 {
> static constexpr size_t max = 512;
> static constexpr size_t watermark_lo = max * 1 / 4;
> static constexpr size_t watermark_hi = max * 3 / 4;
>-size_t nr = 0;
>
> private:
> sched::thread _fill_thread;
>-void* _pages[max];
>+
>+typedef ring_spsc<void*, max> page_ring;
>
>
>+page_ring _pages;
>
>
>
>
>I don't understand why you needed to change that array to a ring_spsc.

Yeah, I didn’t really need to.  The first thing I did when debugging this
was to switch the array to a ring to absolutely prevent any possible
overflow when handling pages.  Then, I worked backward from the assertion
failures.  Once I couldn’t reproduce the issue, I just kind of left
everything as-is.  I’ll clean it up and send a revised patch.

>@@ -1259,12 +1268,17 @@ void l1::refill()
> {
> SCOPE_LOCK(preempt_lock);
> auto& pbuf = get_l1();
>-while (pbuf.nr + page_batch::nr_pages < pbuf.max / 2) {
>-auto* pb = global_l2.alloc_page_batch(pbuf);
>-if (pb) {
>+auto* pb = global_l2.alloc_page_batch(pbuf);
>
>
>
>So alloc_page_batch dropped the preempt_lock momentarily (it did us the
>courtesy of preventing migration of the thread to another CPU, but I
>believe this is actually not necessary - we could have re-fetched get_l1()
>after the alloc_page_batch and not avoided migration).
>
>
>Wasn't the code here the real bug, and your fix here the real fix? I
>mean, won't this fix, of testing the room on the array again after
>calling alloc_page_batch(), work correctly without the rest of the
>changes to move to a ring_spsc?

Yes.

>
>
>By the way, looking at the (what I thought was unnecessary) migration
>lock in alloc_page_batch(), I am starting to think that passing pbuf
>to this function is unnecessary: it seems pbuf is only used there for
>a test that looks completely unnecessary to me, now that you anyway do
>this test, more correctly, in the caller.
>But I guess we can consider this later.

Ok.  I’ll drop the argument from alloc_page_batch here.

>
>+if (pb) {
>+// Another thread might have filled the ring while we waited for
>+// the page batch.  Make sure there is enough room to add the pages
>+// we just acquired, otherwise return them.
>
>
>Note that the original code didn't intend to fill the entire array, but
>rather only half; but maybe it doesn't hurt to use the entire batch if we
>already got it. So I think what you did is fine.
>
>
>
>+if (pbuf.size() + page_batch::nr_pages <= pbuf.max) {
> for (auto& page : pb->pages) {
>-pbuf.push(page);
>+assert(pbuf.push(page));
>

Cool, that was my thinking as well.

>
>
>I twinge when I see an assert with side-effects (as you know, other
>implementations like KASSERT, for example, compile out the entire assert
>on release builds). But this is actually fine with OSv's assert(). Still,
>it might be better style to rewrite this as two statements?
>
>

Noted.

>
> }
>+} else {
>+global_l2.free_page_batch(pb);
> }
> }
> }
>@@ -1273,10 +1287,12 @@ void l1::unfill()
> {
> SCOPE_LOCK(preempt_lock);
> auto& pbuf = get_l1();
>-while (pbuf.nr > page_batch::nr_pages + pbuf.max / 2) {
>-auto* pb = static_cast<page_batch*>(pbuf.top());
>-for (size_t i = 0 ; i < page_batch::nr_pages; i++) {
>-pb->pages[i] = pbuf.pop();
>+// Don't bother unfilling unless we are actually above our target size.
>+if (pbuf.size() > pbuf.max / 2) {
>
>
>
>
>Here you changed the right hand side of the inequality a bit. Why?
>
>

Actually, I don’t remember.  Probably no good reason.

As I said, I’ll send an updated patch.

Re: [PATCH] epoll: Bump the file reference count while polling.

2016-09-19 Thread Nadav Har'El
On Fri, Sep 9, 2016 at 9:26 PM, Timmons C. Player <
timmons.pla...@spirent.com> wrote:

> There is a potential lock order inversion issue here.  Closing
> the file from the network side requires taking the file lock
> after taking the socket lock.  Conversely, polling a socket from
> epoll takes the file lock and then takes the socket lock.
>
> Bumping the reference count here ensures that the networking
> side won't attempt to close the socket while we're polling, and
> hence won't need to take the file lock.
>


I still need to think about this issue, but something seems strange to me:

Commit 38e361ff64c2c4328d8a90a48e329063305a379d suggests that:

The normal lock order is the socket lock, then file::f_lock, but
socket_file::epoll_add() inverts that order.

So I wonder why polling a socket from epoll does (according to what you
said) the opposite of what is described here as the "normal lock order",
and why this wasn't noticed in the above commit?


>
> Signed-off-by: Timmons C. Player 
> ---
>  core/epoll.cc | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/core/epoll.cc b/core/epoll.cc
> index f425c4a..9d7b890 100644
> --- a/core/epoll.cc
> +++ b/core/epoll.cc
> @@ -187,7 +187,9 @@ public:
>  epoll_event& evt = found->second;
>  int active = 0;
>  if (evt.events) {
> +fhold(key._file);
>  active = key._file->poll(events_epoll_
> to_poll(evt.events));
> +fdrop(key._file);
>  }
>  active = events_poll_to_epoll(active);
>  if (!active || (evt.events & EPOLLET)) {
> --
> 2.7.4
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



Re: [PATCH] memory: Prevent array overflow in l1 page buffer pool.

2016-09-19 Thread Nadav Har'El
Thanks for this patch. Looks serious, and I wonder if it explains
https://github.com/cloudius-systems/osv/issues/755

Some comments and questions below.

On Fri, Sep 9, 2016 at 8:12 PM, Timmons C. Player <
timmons.pla...@spirent.com> wrote:

> Under high levels of memory page churn, the array of pages in the
> l1 page buffer pool can overflow and overwrite the data in adjacent
> memory pages.  This change replaces the array with a fixed sized ring.
>
> It also removes the looping behavior found in the refill and unfill
> functions.  Those functions are called by multiple threads and now
> only do the minimum amount of work necessary to allow the callers to
> continue.  Previously, the looping behavior in the refill thread caused
> the array overflow due to a race on the array size variable.
>
> The pool fill thread is now solely responsible for maintaining
> the target pool size.
>
> Signed-off-by: Timmons C. Player 
> ---
>  core/mempool.cc | 67 ++
> ++-
>  1 file changed, 38 insertions(+), 29 deletions(-)
>
> diff --git a/core/mempool.cc b/core/mempool.cc
> index 50c938f..fad046d 100644
> --- a/core/mempool.cc
> +++ b/core/mempool.cc
> @@ -1099,9 +1099,12 @@ struct l1 {
>  }
>  static void* alloc_page_local();
>  static bool free_page_local(void* v);
> -void* pop() { return _pages[--nr]; }
> -void push(void* page) { _pages[nr++] = page; }
> -void* top() { return _pages[nr - 1]; }
> +void* pop() {
> +void *page = nullptr;
>

Nitpick (not important compared to the bigger comments below):

I think that setting this to nullptr is redundant, and confusing (it led me to
think there is a reason why it should be set).
If you anyway leave this default value, you could do
_pages.pop(page);
return page;
instead of the ?: expression below. Maybe that's what you intended to do?


> +return (_pages.pop(page) ? page : nullptr);
> +}
> +bool push(void *page) { return _pages.push(page); }
> +unsigned size() { return _pages.size(); }
>  void wake_thread() { _fill_thread.wake(); }
>  static void fill_thread();
>  static void refill();
> @@ -1110,11 +1113,12 @@ struct l1 {
>  static constexpr size_t max = 512;
>  static constexpr size_t watermark_lo = max * 1 / 4;
>  static constexpr size_t watermark_hi = max * 3 / 4;
> -size_t nr = 0;
>
>  private:
>  sched::thread _fill_thread;
> -void* _pages[max];
> +
> +typedef ring_spsc<void*, max> page_ring;
> +page_ring _pages;
>

I don't understand why you needed to change that array to a ring_spsc.

First, about the boundary checking - sure, the push() and pop()
implementations look scary - not checking their boundaries (we should at
least have had an assert there...), but the code which calls push(), for
example, is l1::free_page_local, and that would not call push() if it had
already reached the maximum size.
Second, the FIFO vs. LIFO order - I assume that's not why you switched to
the ring.

So the most dramatic change is that you changed the non-atomic array to a
ring_spsc, which allows two different threads to read and write into this
buffer concurrently. But unless I'm misunderstanding something, the
intention of this code was that the so-called L1 page pool (per-cpu
pages-to-be-allocated pool) would only be handled by threads running on a
single CPU, and these threads use preemption-disable (preempt_lock) as a
way to ensure that the different consumers (callers of
l1::alloc_page_local()) and different producers (callers of
l1::free_page_local()) will never run concurrently.

So unless I'm misunderstanding something, changing to ring_spsc is not
necessary here. We might have a different bug - somewhere where we do not
hold the preempt_lock long enough while modifying the buffer.

Looking forward in the code a bit, I suspect the DROP_LOCK in
alloc_page_batch(). But I'll continue reading your patch now...


>  };
>
>  struct page_batch {
> @@ -1156,7 +1160,7 @@ public:
>  page_batch* pb;
>  while (!(pb = try_alloc_page_batch()) &&
> // Check again since someone else might change pbuf.nr when we sleep
> -(pbuf.nr + page_batch::nr_pages < pbuf.max / 2)) {
> +(pbuf.size() + page_batch::nr_pages < pbuf.max / 2)) {
>  WITH_LOCK(migration_lock) {
>  DROP_LOCK(preempt_lock) {
>  refill();
> @@ -1243,14 +1247,19 @@ void l1::fill_thread()
>  for (;;) {
>  sched::thread::wait_until([&] {
>  WITH_LOCK(preempt_lock) {
> -return pbuf.nr < pbuf.watermark_lo || pbuf.nr > pbuf.watermark_hi;
> +auto nr = pbuf.size();
> +return nr < pbuf.watermark_lo || nr > pbuf.watermark_hi;
>  }
>  });
> -if (pbuf.nr < pbuf.watermark_lo) {
>

Note (not very relevant to your patch): The fact that the original code
used pbuf.nr without the preempt_lock 

Re: [PATCH] tcp_input: net channel state fix

2016-09-19 Thread Player, Timmons


On 9/19/16, 5:23 PM, "Nadav Har'El"  wrote:

>
>So what worries me is that if this netchannel hasn't been torn down for a
>reason - or is being torn down in parallel - maybe it's not a good idea to
>tear it down here. Or maybe there is a case where we "leak" netchannels
>(never tear them down) and here we're avoiding the leak in just one case
>but leaving it in other places?
>
>
>Again, my worries likely come from not understanding this code enough,
>sorry :-(
>
>

Yeah, no worries.  For reference, in my particular use case both local and
remote connections are being torn down at nearly the same time.

Timmons





Spirent Communications e-mail confidentiality.

This e-mail contains confidential and / or privileged information belonging to 
Spirent Communications plc, its affiliates and / or subsidiaries. If you are 
not the intended recipient, you are hereby notified that any disclosure, 
copying, distribution and / or the taking of any action based upon reliance on 
the contents of this transmission is strictly forbidden. If you have received 
this message in error please notify the sender by return e-mail and delete it 
from your system.

Spirent Communications plc
Northwood Park, Gatwick Road, Crawley, West Sussex, RH10 9XN, United Kingdom.
Tel No. +44 (0) 1293 767676
Fax No. +44 (0) 1293 767677

Registered in England Number 470893
Registered at Northwood Park, Gatwick Road, Crawley, West Sussex, RH10 9XN, 
United Kingdom.

Or if within the US,

Spirent Communications,
27349 Agoura Road, Calabasas, CA, 91301, USA.
Tel No. 1-818-676- 2300



Re: [PATCH] tcp_input: net channel state fix

2016-09-19 Thread Nadav Har'El
On Mon, Sep 19, 2016 at 4:37 PM, Player, Timmons  wrote:

> Well, KASSERT only actually fails in debug builds, so you’d never see a
> problem with a release build
> (since NDEBUG is defined in conf/release.mk).
>

So this patch looks like it is doing reasonable things to me, and I don't
see a reason not to apply it, but I don't know if we're not just plastering
over some more serious bug which causes this receiving-on-a-closed-connection
case.

I can only think of some very unlikely situations where reordered
retransmissions cause some duplicate packets to be received for a connection
after the FIN-ACK was already received, but I wonder if this is really what
happens in your tests (or in https://github.com/cloudius-systems/osv/issues/726)
or whether we're seeing evidence for a real bug in netchannels somewhere.

Avi, perhaps you can also look into this patch?



> Well, the only reason you end up in this state is that the net channel
> should have been torn down already, but hasn’t (though it could be in
> the process of doing so).  Additionally, since the network driver calls
> directly into this function, you could potentially have a burst of
> packets destined for this no-longer-valid net channel.  So, it made
> sense to just close the channel here instead of allowing more packets
> to take this path into the stack.
>

So what worries me is that if this netchannel hasn't been torn down for a
reason - or is being torn down in parallel - maybe it's not a good idea to
tear it down here. Or maybe there is a case where we "leak" netchannels
(never tear them down) and here we're avoiding the leak in just one case
but leaving it in other places?

Again, my worries likely come from not understanding this code enough,
sorry :-(


>
> Handing the packet off to netisr_dispatch allows the stack to generate a
> reset.
>

Sorry, I wasn't familiar with this "netisr_dispatch()" function.
Apparently ether_input() calls it, and ifp->if_input is ether_input().
So I guess it is indeed the right function to call.



Re: Current State of C++ Unikernels

2016-09-19 Thread Ian Eyberg
Interesting - thanks for the notes!

I wouldn't read too much into the comparison slide - it was more of an
overview slide. I think one of the areas where the unikernel ecosystem at
large is severely lacking is education. Many of the people I talk to,
depending on the communities they belong to, will only recognize one
unikernel and not even know there are 10+ others - and that's among the
people who don't lump unikernels in with the container ecosystem, or who
even recognize the term. Someone at the meetup asked what it would take
to get unikernel adoption going strong, and my simple answer was
education - the technology is not the problem (speaking to a room full
of engineers).

The size slides were referring to disk image size. That was just a
simple hello-world tutorial I found on the website. To be honest, I have
not worked much with OSv yet compared to some others, but it has
definitely sparked my interest and I'm going to spend some more time on it.
Re: education - for instance, most of last year I was unaware that it
supported more than the JVM ecosystem.

I don't think the base image size is a huge limiting factor as long as
it's not outrageous. That is to say - deploys should be relatively quick
and not take 10-15 minutes to upload a gig+ sized image. Other than that,
the image size is not that big of a deal to me, as most applications will
use more 'expensive' resources, in terms of allocation, such as RAM. The
disk size arguments make much more sense when you start talking about the
more use-case-specific implementations out there that service things like
NFV - not necessarily web applications.

On Sun, Sep 18, 2016 at 9:08 AM, Nadav Har'El  wrote:
>
> On Fri, Sep 16, 2016 at 3:28 AM, Ian Eyberg  wrote:
>
>> Hi All:
>>
>>   I gave a presentation last night at the ACCU meetup on the 'Current
>> State of C++ Unikernels'. Dor happened to be around and mentioned that this
>> list might be interested in seeing the slides I was presenting so here they
>> are:
>> https://speakerdeck.com/eyberg/the-current-state-of-c-plus-plus-unikernels -
>> I don't know how many people showed up but I'm guessing there was somewhere
>> around 40-50. Also - if you're in the Bay Area we have regular unikernel
>> meetups and definitely are looking for more speakers -
>> http://www.meetup.com/San-Francisco-Unikernels/ .
>>
>> - Ian
>>
>
> Thanks. Very interesting.
>
> It seems that one of the places that OSv is outdone by the other unikernels
> you surveyed is in its size.
>
> This is partly due to OSv's current approach of always including the
> entire kitchen sink - the entire Posix, Linux, etc., file system,
> networking, etc. - even if the application doesn't really need them. Unlike
> Linux we don't have a huge amount of irrelevant stuff (like floppy disk
> drivers), but some applications may nonetheless find whole chunks of OSv to
> be irrelevant for them, and may want a smaller OSv without them.
>
> For example, we currently support building OSv without a permanent
> filesystem (only a ram drive) so it compiles without ZFS in the kernel, and
> without ZFS-related tools, which lowers both the image size and memory use.
> If you build the silly game "rogue" without a permanent filesystem,
> ""scripts/build image=rogue fs=ramfs", the resulting image size is only 6.6
> MB.
>
> In your slide you listed OSv's "size" as 29 MB. I don't know if this refers
> to the image size (which, as I said above, can be much lower) or to the
> memory size. OSv's memory size is also "excessive" (by Super Mario
> standards) because of some silly compromises we took, such as:
>
>  1. We have a silly lower limit on the memory we use because of reasons
> explained in https://github.com/cloudius-systems/osv/issues/195. This lower
> limit can be made configurable - currently it is set to about 25 MB.
>
>  2. All the kernel code is loaded into memory, so the more non-useful stuff
> we compile in, the more memory we need.
>
>  3. Moreover, some areas of the code use too much memory - for example the
> ZFS system starting a bunch of threads (see
> https://github.com/cloudius-systems/osv/issues/247) and also a generous
> cache, some code uses generous amounts of buffers, etc.
>
>  4. pthread stacks are allocated in advance: see
> https://github.com/cloudius-systems/osv/issues/143
>
> If anyone is interested to work on these issues, or other directions of
> making OSv smaller, I'd be happy :-)



Re: [PATCH] tcp_input: net channel state fix

2016-09-19 Thread Player, Timmons
I think I can answer a few of these.  See answers inlined below…

Timmons

On 9/19/16, 8:53 AM, "Nadav Har'El"  wrote:

>I'm having a hard time understanding this patch. Maybe Avi (CCed)
>remembers some of the code involved (which he wrote) and might have
>better comments.
>
>Some questions inline below:
>
>
>--
>Nadav Har'El
>
>n...@scylladb.com
>
>
>
>
>
>
>
>On Fri, Sep 9, 2016 at 9:08 PM, Timmons C. Player
> wrote:
>
>Check the TCP state of net_channel directed packets to ensure that
>tcp_do_segment can process them.  If not, hand the packet off to
>netisr_dispatch and teardown the net channel.  This prevents triggering
>assertion failures in tcp_do_segment.
>
>Also update the lock acquisition check at the beginning of tcp_do_segment
>to match the assertion check below it.
>
>Signed-off-by: Timmons C. Player 
>---
> bsd/sys/netinet/tcp_input.cc | 13 -
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
>diff --git a/bsd/sys/netinet/tcp_input.cc b/bsd/sys/netinet/tcp_input.cc
>index 642cfeb..c780d46 100644
>--- a/bsd/sys/netinet/tcp_input.cc
>+++ b/bsd/sys/netinet/tcp_input.cc
>@@ -1048,7 +1048,9 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th,
>struct socket *so,
> * if we fail, drop the packet.  FIXME: invert the lock order so
>we don't
> * have to drop packets.
> */
>-   if (tp->get_state() != TCPS_ESTABLISHED && ti_locked ==
>TI_UNLOCKED) {
>+   if ((tp->get_state() != TCPS_ESTABLISHED
>+   || (thflags & (TH_SYN | TH_FIN | TH_RST) != 0))
>+  && ti_locked == TI_UNLOCKED) {
>
>
>
>
>This change makes sense to me, as even for an established connection, a
>FIN or RST might require that global lock (I don't see how we can have a
>SYN for an established connection).
>
>
>
>
>However I don't understand: if this is what this patch does, how did the
>connection closing ever work before? Wouldn't it fail when it noticed the
>lock was not taken?
>
>
>(note to self: the code modified here came from commit
>128af65368a421e28bea16c303aee68b8b34de24)
>
>

Well, KASSERT only actually fails in debug builds, so you’d never see a
problem with a release build
(since NDEBUG is defined in conf/release.mk).

>
>
>if (INP_INFO_TRY_WLOCK(&V_tcbinfo)) {
>ti_locked = TI_WLOCKED;
>} else {
>@@ -3188,6 +3190,15 @@ tcp_newreno_partial_ack(struct tcpcb *tp, struct
>tcphdr *th)
> static void
> tcp_net_channel_packet(tcpcb* tp, mbuf* m)
> {
>+   if (tp->get_state() <= TCPS_LISTEN) {
>+   // We can't hand this packet off to tcp_do_segment due to the
>+   // current connection state.  Drop the channel and handle the
>+   // packet via the slow path.
>+   tcp_teardown_net_channel(tp);
>+   netisr_dispatch(NETISR_ETHER, m);
>+   return;
>+   }
>
>
>
>
>I am guessing your problem wasn't with the listening state but rather the
>"closed" state?
>
>If the socket is closed, why do we call this netisr_dispatch() function,
>what would it do?
>
>Also why do we need to tcp_teardown_net_channel() here? Doesn't it get
>called elsewhere already?


Yes, the primary problem occurs in the closed state, but tcp_do_segment
asserts that the TCP state is > TCPS_LISTEN, so it seemed like a
reasonable check for the net channel, too.

Well, the only reason you end up in this state is that the net channel
should have been torn down already, but hasn’t (though it could be in the
process of doing so).  Additionally, since the network driver calls
directly into this function, you could potentially have a burst of packets
destined for this no-longer-valid net channel.  So, it made sense to just
close the channel here instead of allowing more packets to take this path
into the stack.

Handing the packet off to netisr_dispatch allows the stack to generate a
reset.

>
>
>Maybe this code is correct, but I simply am not familiar enough with it
>to tell... Avi?
>
>
>
>
>+
>log_packet_handling(m, NETISR_ETHER);
>caddr_t start = m->m_hdr.mh_data;
>auto h = start;
>--
>2.7.4
>
>
>
>
>
>
>






Re: [PATCH] tcp_input: net channel state fix

2016-09-19 Thread Nadav Har'El
I'm having a hard time understanding this patch. Maybe Avi (CCed) remembers
some of the code involved (which he wrote) and might have better comments.
Some questions inline below:


--
Nadav Har'El
n...@scylladb.com

On Fri, Sep 9, 2016 at 9:08 PM, Timmons C. Player <
timmons.pla...@spirent.com> wrote:

> Check the TCP state of net_channel directed packets to ensure that
> tcp_do_segment can process them.  If not, hand the packet off to
> netisr_dispatch and teardown the net channel.  This prevents triggering
> assertion failures in tcp_do_segment.
>
> Also update the lock acquisition check at the beginning of tcp_do_segment
> to match the assertion check below it.
>
> Signed-off-by: Timmons C. Player 
> ---
>  bsd/sys/netinet/tcp_input.cc | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/bsd/sys/netinet/tcp_input.cc b/bsd/sys/netinet/tcp_input.cc
> index 642cfeb..c780d46 100644
> --- a/bsd/sys/netinet/tcp_input.cc
> +++ b/bsd/sys/netinet/tcp_input.cc
> @@ -1048,7 +1048,9 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th,
> struct socket *so,
>  * if we fail, drop the packet.  FIXME: invert the lock order so
> we don't
>  * have to drop packets.
>  */
> -   if (tp->get_state() != TCPS_ESTABLISHED && ti_locked ==
> TI_UNLOCKED) {
> +   if ((tp->get_state() != TCPS_ESTABLISHED
> +   || (thflags & (TH_SYN | TH_FIN | TH_RST) != 0))
> +  && ti_locked == TI_UNLOCKED) {
>

This change makes sense to me, as even for an established connection, a FIN
or RST might require that global lock (I don't see how we can have a SYN
for an established connection).

However I don't understand: if this is what this patch does, how did the
connection closing ever work before? Wouldn't it fail when it noticed the
lock was not taken?

(note to self: the code modified here came from commit
128af65368a421e28bea16c303aee68b8b34de24)



> if (INP_INFO_TRY_WLOCK(&V_tcbinfo)) {
> ti_locked = TI_WLOCKED;
> } else {
> @@ -3188,6 +3190,15 @@ tcp_newreno_partial_ack(struct tcpcb *tp, struct
> tcphdr *th)
>  static void
>  tcp_net_channel_packet(tcpcb* tp, mbuf* m)
>  {
> +   if (tp->get_state() <= TCPS_LISTEN) {
> +   // We can't hand this packet off to tcp_do_segment due to the
> +   // current connection state.  Drop the channel and handle the
> +   // packet via the slow path.
> +   tcp_teardown_net_channel(tp);
> +   netisr_dispatch(NETISR_ETHER, m);
> +   return;
> +   }
>

I am guessing your problem wasn't with the listening state but rather the
"closed" state?
If the socket is closed, why do we call this netisr_dispatch() function,
what would it do?
Also why do we need to tcp_teardown_net_channel() here? Doesn't it get
called elsewhere already?

Maybe this code is correct, but I simply am not familiar enough with it to
tell... Avi?


> +
> log_packet_handling(m, NETISR_ETHER);
> caddr_t start = m->m_hdr.mh_data;
> auto h = start;
> --
> 2.7.4
>
>



[PATCH] sched: only allocate sched::thread objects on the heap

2016-09-19 Thread Nadav Har'El
In issue #790, Timmons C. Player discovered something dangerous with
on-stack sched::thread objects: The stack may be mmap()ed, so the scheduler
now needs to access an application's mmap()ed memory area. Those accesses
can potentially have all sorts of problems (like page faults in the scheduler,
if it weren't for issue #143), but more concretely: The "lazy TLB flush"
code added in commit 7e38453 means that the scheduler may see old pages
for application-mmap()ed pages, so it cannot work with a sched::thread in an
mmap()ed area.

This patch prevents on-stack sched::thread construction by hiding the
constructor, and only allowing a thread to be created through the function
sched::thread::make(...), which creates the object on the heap.

This patch is long but it more-or-less contains just the following
types of changes:

1. The change to sched.hh to hide the sched::thread constructor, and
   instead offer a sched::thread::make(...) function.

2. Every class which used a "sched::thread" member was replaced by
   a std::unique_ptr member.

   One "casualty" of this change is that a class that used to hold a
   sched::thread member will now hold a sched::thread pointer, and use an
   extra allocation and indirection - even if we know for sure that the
   containing object will always be allocated on the heap (such as the
   case in pthread.cc's pthread::_thread, for example). This can be worked
   around by making the exceptional class a friend of sched::thread to be
   able to use the hidden constructor - but for now I decided the tiny
   performance benefit isn't important enough to make this exception.

3. Every place which used "new sched::thread(...)" to create the thread,
   now uses "sched::thread::make(...)" instead.

   Sadly we no longer allow new sched::thread(...), although there is
   nothing wrong with it, because disabling the constructor also disables
   the new expression.

4. One test in misc-wake.cc (which wasn't enabled by default anyway)
   was commented out because it relied on creating thread objects in
   mmap()ed areas.

One of the places modified is rcu_flush(), whose on-stack sched::thread
objects are what caused issue #790.

Fixes #790.

Signed-off-by: Nadav Har'El 
---
 drivers/virtio-net.hh         |  6 ++--
 drivers/virtio-rng.hh         |  2 +-
 drivers/vmxnet3.hh            |  4 +--
 include/osv/mempool.hh        |  2 +-
 include/osv/percpu_xmit.hh    |  2 +-
 include/osv/sched.hh          | 12
 tests/tst-bsd-synch.hh        | 12
 tests/tst-malloc.hh           |  4 +--
 tests/tst-rwlock.hh           |  8 ++---
 tests/tst-threads.hh          |  4 +--
 tests/tst-timer.hh            |  2 +-
 arch/x64/xen_intr.cc          |  2 +-
 bsd/porting/callout.cc        |  2 +-
 bsd/porting/kthread.cc        |  4 +--
 bsd/sys/net/netisr1.cc        |  2 +-
 core/async.cc                 | 12
 core/dhcp.cc                  |  2 +-
 core/mempool.cc               | 28 -
 core/pagecache.cc             |  6 ++--
 core/percpu-worker.cc         |  2 +-
 core/rcu.cc                   | 20 ++---
 core/sched.cc                 | 26
 core/trace.cc                 |  8 ++---
 drivers/acpi.cc               | 12
 drivers/ahci.cc               |  2 +-
 drivers/console-driver.cc     |  2 +-
 drivers/virtio-assign.cc      |  2 +-
 drivers/virtio-blk.cc         |  2 +-
 drivers/virtio-net.cc         |  2 +-
 drivers/virtio-rng.cc         |  6 ++--
 drivers/virtio-scsi.cc        |  2 +-
 drivers/vmw-pvscsi.cc         |  2 +-
 drivers/vmxnet3.cc            |  4 +--
 libc/pthread.cc               | 28 -
 libc/signal.cc                |  4 +--
 libc/timerfd.cc               | 10 +++
 tests/misc-free-perf.cc       |  2 +-
 tests/misc-leak.cc            |  4 +--
 tests/misc-lfring.cc          | 12
 tests/misc-loadbalance.cc     |  2 +-
 tests/misc-mutex.cc           |  2 +-
 tests/misc-sockets.cc         |  8 ++---
 tests/misc-tcp-hash-srv.cc    |  2 +-
 tests/misc-wake.cc            |  2 ++
 tests/tst-af-local.cc         | 16 +-
 tests/tst-bsd-tcp1-zrcv.cc    |  4 +--
 tests/tst-bsd-tcp1-zsnd.cc    |  4 +--
 tests/tst-bsd-tcp1-zsndrcv.cc |  4 +--
 tests/tst-bsd-tcp1.cc         |  4 +--
 tests/tst-condvar.cc          | 10 +++
 tests/tst-fpu.cc              |  6 ++--
 tests/tst-mmap.cc             |  4 +--
 tests/tst-pin.cc              | 70 +--
 tests/tst-preempt.cc          |  2 +-
 tests/tst-rcu-hashtable.cc    |  2 +-
 tests/tst-rcu-list.cc         |  2 +-
 tests/tst-threadcomplete.cc   |  4 +--
 tests/tst-vfs.cc              |  2 +-
 tests/tst-wait-for.cc         | 30 +--
 tests/tst-yield.cc            |  2 +-
 60 files changed, 232 insertions(+), 218 deletions(-)

diff --git a/drivers/virtio-net.hh b/drivers/virtio-net.hh
index cf2d6f2..ad323e6 100644
--- a/drivers/virtio-net.hh
+++ b/drivers/virtio-net.hh
@@ -302,10 +302,10 @@ private:

Re: Lazy TLB flushing

2016-09-19 Thread Gleb Natapov
On Mon, Sep 19, 2016 at 01:30:15PM +0300, Nadav Har'El wrote:
> On Mon, Sep 19, 2016 at 1:11 PM, Gleb Natapov  wrote:
> 
> > > So if I understand correctly, both your and Gleb's sentiment is there is
> > no
> > > need not to make the scheduler more forgiving for page table changes, but
> > > rather require that sched::threads objects live in a mapping which could
> > > never change - which more-or-less means in the OSv case, that
> > sched::thread
> > > objects should be allocated only on the heap.
> > >
> > As most other things this is trade-off. Allowing sched::threads to live
> > in a stack is not worth removing the optimization for, but may be something
> > else will provide so much benefit that allowing non app threads accessing
> > mmaped area will be worth supporting. Optimization may still avoid
> > sending IPI to idle cpus.
> >
> 
> So was that the main benefit of that lazy TLB patch - idle CPUs, not
> "system threads" in general?
> 
Both. If we allow running multiple apps in separate namespaces one day, we
can flush only the cpus that run threads of the same application (this is
what Linux does, as you probably guess).

> 
> >
> > > So a patch which will enforce sched::thread to be allocated only on the
> > > heap is the right way to go here? (I sent such a patch already, but I'll
> > > try to send a slightly less ugly one. In any case it will be a long
> > patch).
> > >
> > I think so. BTW can this be fixed by forcing tlb flush when sending new
> > thread to another cpu somehow?
> >
> 
> This is more-or-less what Timmons' patch did: before the scheduler on the
> target CPU wants to inspect the incoming threads, it checks if it needs to
> flush
> the TLB.
> 
> What bothered me with that patch was what I wrote in my previous email -
> that
> the check is done using that lazy_tlb_flush flag - but the existing code
> zeros this
> flag as soon as it decides to send an IPI, so at that point the target
> scheduler will
> no longer know that it needs to flush the TLB. So I think for this to be
> correct we
> need to leave the flag set until we actually flushed the TLB (so the target
> CPU, not
> the source CPU, will zero this flag). That's also easy to change, I think.
> 
It is. It will cause superfluous flushes, though.

> So now I need to decide whether to "fix" the support for on-stack thread
> objects,
> or just enforce sched::thread to always be dynamically allocated.
> 
I think the latter is a very reasonable restriction.

> On-stack threads (or more generally, sched::thread as a member, rather than
> a
> pointer - even if not on the stack) are a strange thing for other reasons
> as well -
> e.g., their detach() support is incorrect and there is no way to NOT wait
> for them.
> So perhaps it makes sense to get rid of them for other reasons as well.
> 
> 
> >
> > > By the way, it is obvious that a sched::thread cannot live in a paged-out
> > > mapping, as the scheduler cannot sleep to wait for it. But the issue here
> > > is more subtle: stack memory is guaranteed to be always paged in (for
> > > reasons explained here https://github.com/cloudius-
> > systems/osv/issues/143)
> > > - and nontheless, it is still not fine to put a sched::thread there
> > because
> > > of the TLB flushing issue.
> >
> > --
> > Gleb.
> >

--
Gleb.

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Lazy TLB flushing

2016-09-19 Thread Nadav Har'El
On Mon, Sep 19, 2016 at 1:11 PM, Gleb Natapov  wrote:

> > So if I understand correctly, both your and Gleb's sentiment is there is
> no
> > need not to make the scheduler more forgiving for page table changes, but
> > rather require that sched::threads objects live in a mapping which could
> > never change - which more-or-less means in the OSv case, that
> sched::thread
> > objects should be allocated only on the heap.
> >
> As most other things this is trade-off. Allowing sched::threads to live
> in a stack is not worth removing the optimization for, but may be something
> else will provide so much benefit that allowing non app threads accessing
> mmaped area will be worth supporting. Optimization may still avoid
> sending IPI to idle cpus.
>

So was that the main benefit of that lazy TLB patch - idle CPUs, not
"system threads" in general?


>
> > So a patch which will enforce sched::thread to be allocated only on the
> > heap is the right way to go here? (I sent such a patch already, but I'll
> > try to send a slightly less ugly one. In any case it will be a long
> patch).
> >
> I think so. BTW can this be fixed by forcing tlb flush when sending new
> thread to another cpu somehow?
>

This is more-or-less what Timmons' patch did: before the scheduler on the
target CPU wants to inspect the incoming threads, it checks if it needs
to flush the TLB.

What bothered me with that patch was what I wrote in my previous email -
that the check is done using that lazy_tlb_flush flag - but the existing
code zeros this flag as soon as it decides to send an IPI, so at that
point the target scheduler will no longer know that it needs to flush the
TLB. So I think for this to be correct we need to leave the flag set
until we actually flushed the TLB (so the target CPU, not the source CPU,
will zero this flag). That's also easy to change, I think.

So now I need to decide whether to "fix" the support for on-stack thread
objects, or just enforce sched::thread to always be dynamically allocated.

On-stack threads (or more generally, a sched::thread as a member rather
than a pointer - even if not on the stack) are a strange thing for other
reasons as well - e.g., their detach() support is incorrect and there is
no way to NOT wait for them. So perhaps it makes sense to get rid of them
for other reasons as well.
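The flag protocol discussed here - source CPU sets the flag and sends the IPI, and only the target CPU clears it after it has actually flushed - can be sketched as below. All names are invented for illustration; the real OSv code keeps this state per-cpu and the flush is a real TLB flush rather than a comment:

```cpp
#include <atomic>

struct percpu_state {
    std::atomic<bool> lazy_tlb_flush_pending{false};
};

// Source CPU: publish the flush request before notifying the target.
void request_remote_flush(percpu_state& target) {
    target.lazy_tlb_flush_pending.store(true, std::memory_order_release);
    // send_ipi(target);  // delivery may be delayed - the flag survives it
}

// Target CPU: called from the scheduler (and from the IPI handler);
// whichever runs first performs the flush, and only then clears the flag.
// Returns true if a flush was actually performed.
bool maybe_flush_local_tlb(percpu_state& self) {
    if (self.lazy_tlb_flush_pending.exchange(false,
                                             std::memory_order_acq_rel)) {
        // flush_local_tlb();  // the actual flush would happen here
        return true;
    }
    return false;
}
```

Because the flag stays set until the target's own exchange() clears it, a delayed IPI no longer loses the "flush needed" information; the cost is the superfluous flush mentioned above when both the scheduler and the IPI handler race to clear it.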


>
> > By the way, it is obvious that a sched::thread cannot live in a paged-out
> > mapping, as the scheduler cannot sleep to wait for it. But the issue here
> > is more subtle: stack memory is guaranteed to be always paged in (for
> > reasons explained here https://github.com/cloudius-
> systems/osv/issues/143)
> > - and nontheless, it is still not fine to put a sched::thread there
> because
> > of the TLB flushing issue.
>
> --
> Gleb.
>



Re: Lazy TLB flushing

2016-09-19 Thread Gleb Natapov
On Mon, Sep 19, 2016 at 12:30:53PM +0300, Nadav Har'El wrote:
> On Mon, Sep 19, 2016 at 11:55 AM, Avi Kivity  wrote:
> 
> >
> > On 09/19/2016 11:49 AM, Nadav Har'El wrote:
> >
> >
> > On Mon, Sep 19, 2016 at 10:52 AM, Gleb Natapov  wrote:
> >
> >>
> >> > 2. I'm afraid the scheduler thing might only be the tip of the iceberg
> >> of
> >> > problems caused by this lazy TLB thing. Could we have, for example (and
> >> > this is just a hypothetical example) one thread doing a write() to disk
> >> of
> >> > some data from an mmap'ed area, and this data is supposed to be read by
> >> a
> >> > ZFS thread which runs on a different CPU - and because it is labeled a
> >> > "system thread", it won't do a TLB flush before reading the mmap'ed
> >> area?
> >> write() copies data into ZFS ARC.
> >>
> >
> >> > Why are we confident that "system threads" never need to read user's
> >> > mmap'ed data?
> >> >
> >> If they do this is a bug as you discovered. They may access mmaped
> >> memory, but they should do so through their own mappings.
> >
> >
> > So basically, "system threads" (and the scheduler) are not allowed to read
> > any user memory, because any user memory may be mmap'ed.
> > Whatever user memory is needed in a system thread, must first be *copied*
> > in the original user thread...
> >
> >
> > Or, the mapping must be pinned (and paged in) for the duration of the
> > access.
> >
> 
> 
> So if I understand correctly, both your and Gleb's sentiment is there is no
> need not to make the scheduler more forgiving for page table changes, but
> rather require that sched::threads objects live in a mapping which could
> never change - which more-or-less means in the OSv case, that sched::thread
> objects should be allocated only on the heap.
> 
As with most other things, this is a trade-off. Allowing sched::thread
objects to live on a stack is not worth removing the optimization for, but
maybe something else will provide so much benefit that allowing non-app
threads to access mmap'ed areas will be worth supporting. The optimization
may still avoid sending IPIs to idle cpus.

> So a patch which will enforce sched::thread to be allocated only on the
> heap is the right way to go here? (I sent such a patch already, but I'll
> try to send a slightly less ugly one. In any case it will be a long patch).
> 
I think so. BTW, can this be fixed by somehow forcing a TLB flush when
sending a new thread to another cpu?

> By the way, it is obvious that a sched::thread cannot live in a paged-out
> mapping, as the scheduler cannot sleep to wait for it. But the issue here
> is more subtle: stack memory is guaranteed to be always paged in (for
> reasons explained here https://github.com/cloudius-systems/osv/issues/143)
> - and nontheless, it is still not fine to put a sched::thread there because
> of the TLB flushing issue.

--
Gleb.



Re: Lazy TLB flushing

2016-09-19 Thread Nadav Har'El
On Mon, Sep 19, 2016 at 11:55 AM, Avi Kivity  wrote:

>
> On 09/19/2016 11:49 AM, Nadav Har'El wrote:
>
>
> On Mon, Sep 19, 2016 at 10:52 AM, Gleb Natapov  wrote:
>
>>
>> > 2. I'm afraid the scheduler thing might only be the tip of the iceberg
>> of
>> > problems caused by this lazy TLB thing. Could we have, for example (and
>> > this is just a hypothetical example) one thread doing a write() to disk
>> of
>> > some data from an mmap'ed area, and this data is supposed to be read by
>> a
>> > ZFS thread which runs on a different CPU - and because it is labeled a
>> > "system thread", it won't do a TLB flush before reading the mmap'ed
>> area?
>> write() copies data into ZFS ARC.
>>
>
>> > Why are we confident that "system threads" never need to read user's
>> > mmap'ed data?
>> >
>> If they do this is a bug as you discovered. They may access mmaped
>> memory, but they should do so through their own mappings.
>
>
> So basically, "system threads" (and the scheduler) are not allowed to read
> any user memory, because any user memory may be mmap'ed.
> Whatever user memory is needed in a system thread, must first be *copied*
> in the original user thread...
>
>
> Or, the mapping must be pinned (and paged in) for the duration of the
> access.
>


So if I understand correctly, both your and Gleb's sentiment is that there
is no need to make the scheduler more forgiving of page table changes, but
rather to require that sched::thread objects live in a mapping which can
never change - which more-or-less means, in the OSv case, that sched::thread
objects should be allocated only on the heap.

So a patch which will enforce sched::thread to be allocated only on the
heap is the right way to go here? (I sent such a patch already, but I'll
try to send a slightly less ugly one. In any case it will be a long patch).

By the way, it is obvious that a sched::thread cannot live in a paged-out
mapping, as the scheduler cannot sleep to wait for it. But the issue here
is more subtle: stack memory is guaranteed to be always paged in (for
reasons explained here https://github.com/cloudius-systems/osv/issues/143)
- and nonetheless, it is still not fine to put a sched::thread there because
of the TLB flushing issue.



Re: Lazy TLB flushing

2016-09-19 Thread Gleb Natapov
On Mon, Sep 19, 2016 at 11:49:41AM +0300, Nadav Har'El wrote:
> On Mon, Sep 19, 2016 at 10:52 AM, Gleb Natapov  wrote:
> 
> >
> > > 2. I'm afraid the scheduler thing might only be the tip of the iceberg of
> > > problems caused by this lazy TLB thing. Could we have, for example (and
> > > this is just a hypothetical example) one thread doing a write() to disk
> > of
> > > some data from an mmap'ed area, and this data is supposed to be read by a
> > > ZFS thread which runs on a different CPU - and because it is labeled a
> > > "system thread", it won't do a TLB flush before reading the mmap'ed area?
> > write() copies data into ZFS ARC.
> >
> 
> > > Why are we confident that "system threads" never need to read user's
> > > mmap'ed data?
> > >
> > If they do this is a bug as you discovered. They may access mmaped
> > memory, but they should do so through their own mappings.
> 
> 
> So basically, "system threads" (and the scheduler) are not allowed to read
> any user memory, because any user memory may be mmap'ed.
> Whatever user memory is needed in a system thread, must first be *copied*
> in the original user thread...
> 
No, it may be accessed through the global kernel mapping.

> In most cases we already need to do this (such is the network API which
> copies, and such is the filesystem), but I'm worried if we really checked
> all the cases, and moreover worried that future developers will not be
> aware of this restriction.
This is the same restriction that exists in the Linux kernel.

> 
> Forcing a copy of user data was very natural in OSs which have a
> userspace/kernel separation, but in OSv it's not really natural. For
> example, imagine that in the future we implement zero-copy AIO - won't it
> read/write directly into user memory, which may be mmap'ed? Couldn't that
> read/write happen in a different CPU?
> 
OSs with userspace/kernel separation implement zero-copy AIO without much
problem. The memory is all there, accessible through the kernel mapping.

> 
> 
> > This is the
> > design, not something that has to be this way.
> >
> 
> The design of what? Of this specific optimization?
> 
The design of the kernel that allows this optimization to exist.

> 
> >
> > > 3.  This commit 7e38453 starts flush_tlb_all() with setting the
> > > lazy_flush_tlb flag to true, but resets it back to false when it decides
> > to
> > > send an IPI. If the other CPU is right now in the scheduler we can have
> > the
> > > code leave the flag at false (if the out-going thread was an app thread)
> > > and send an IPI which will be delayed - so the scheduler has no way of
> > > knowing it needs to do a TLB flush before accessing the sched::thread.
> > > Couldn't we live the flag at true *in addition* to the IPI? The IPI
> > handler
> > > could then zero it (if not already zero)?
> > >
> > If IPI can be delayed why the same bug cannot happen without
> > lazy_flush_tlb optimization at all? Thread A mmaps its stack, sends
> > flush IPI which is delayed, allocates B's thread struct on the stack,
> > cpu 1 tries to access it -> boom.
> >
> 
> This is a good point.
I am not sure this can happen, though. How will cpu 1 know that B exists
before getting the IPI?

> 
> Which brings me back again to the point that if we want to correctly
> support on-stack threads (that's separate question - I already have patches
> forbidding on-stack thread objects if we want to), we probably need to use
> this *flag* variable - not the IPI - as a signal to the scheduler that it
> needs to flush the TLB.
> So we need to clear this flag only once the target CPU actually flushed the
> TLB - not as the code currently does, as soon as the source CPU sent it an
> IPI. So I'm wondering if there was an important reason why the current code
> needs to zero the flag as soon as we decide to send an IPI ?
> 
No, this was to avoid a redundant local TLB flush on the next reschedule,
but I do not think it is a good idea to reuse this flag to avoid an IPI
race which I am not sure can happen at all.

--
Gleb.



Re: Lazy TLB flushing

2016-09-19 Thread Avi Kivity



On 09/19/2016 11:49 AM, Nadav Har'El wrote:


On Mon, Sep 19, 2016 at 10:52 AM, Gleb Natapov > wrote:



> 2. I'm afraid the scheduler thing might only be the tip of the
iceberg of
> problems caused by this lazy TLB thing. Could we have, for
example (and
> this is just a hypothetical example) one thread doing a write()
to disk of
> some data from an mmap'ed area, and this data is supposed to be
read by a
> ZFS thread which runs on a different CPU - and because it is
labeled a
> "system thread", it won't do a TLB flush before reading the
mmap'ed area?
write() copies data into ZFS ARC.


> Why are we confident that "system threads" never need to read user's
> mmap'ed data?
>
If they do this is a bug as you discovered. They may access mmaped
memory, but they should do so through their own mappings.


So basically, "system threads" (and the scheduler) are not allowed to 
read any user memory, because any user memory may be mmap'ed.
Whatever user memory is needed in a system thread, must first be 
*copied* in the original user thread...


Or, the mapping must be pinned (and paged in) for the duration of the 
access.




In most cases we already need to do this (such is the network API 
which copies, and such is the filesystem), but I'm worried if we 
really checked all the cases, and moreover worried that future 
developers will not be aware of this restriction.


Forcing a copy of user data was very natural in OSs which have a 
userspace/kernel separation, but in OSv it's not really natural. For 
example, imagine that in the future we implement zero-copy AIO - won't 
it read/write directly into user memory, which may be mmap'ed? 
Couldn't that read/write happen in a different CPU?


Zero-copy happens without copying.  You pin the physical page (usually 
pinning the mapping too in the process), and give the physical address 
to the device to perform DMA.




This is the
design, not something that has to be this way.


The design of what? Of this specific optimization?


> 3.  This commit 7e38453 starts flush_tlb_all() with setting the
> lazy_flush_tlb flag to true, but resets it back to false when it
decides to
> send an IPI. If the other CPU is right now in the scheduler we
can have the
> code leave the flag at false (if the out-going thread was an app
thread)
> and send an IPI which will be delayed - so the scheduler has no
way of
> knowing it needs to do a TLB flush before accessing the
sched::thread.
> Couldn't we live the flag at true *in addition* to the IPI? The
IPI handler
> could then zero it (if not already zero)?
>
If IPI can be delayed why the same bug cannot happen without
lazy_flush_tlb optimization at all? Thread A mmaps its stack, sends
flush IPI which is delayed, allocates B's thread struct on the stack,
cpu 1 tries to access it -> boom.


This is a good point.

Which brings me back again to the point that if we want to correctly 
support on-stack threads (that's separate question - I already have 
patches forbidding on-stack thread objects if we want to), we probably 
need to use this *flag* variable - not the IPI - as a signal to the 
scheduler that it needs to flush the TLB.
So we need to clear this flag only once the target CPU actually 
flushed the TLB - not as the code currently does, as soon as the 
source CPU sent it an IPI. So I'm wondering if there was an important 
reason why the current code needs to zero the flag as soon as we 
decide to send an IPI ?


Nadav.




Re: Lazy TLB flushing

2016-09-19 Thread Nadav Har'El
On Mon, Sep 19, 2016 at 10:52 AM, Gleb Natapov  wrote:

>
> > 2. I'm afraid the scheduler thing might only be the tip of the iceberg of
> > problems caused by this lazy TLB thing. Could we have, for example (and
> > this is just a hypothetical example) one thread doing a write() to disk
> of
> > some data from an mmap'ed area, and this data is supposed to be read by a
> > ZFS thread which runs on a different CPU - and because it is labeled a
> > "system thread", it won't do a TLB flush before reading the mmap'ed area?
> write() copies data into ZFS ARC.
>

> > Why are we confident that "system threads" never need to read user's
> > mmap'ed data?
> >
> If they do this is a bug as you discovered. They may access mmaped
> memory, but they should do so through their own mappings.


So basically, "system threads" (and the scheduler) are not allowed to read
any user memory, because any user memory may be mmap'ed.
Whatever user memory is needed in a system thread, must first be *copied*
in the original user thread...

In most cases we already need to do this (such is the network API which
copies, and such is the filesystem), but I'm worried if we really checked
all the cases, and moreover worried that future developers will not be
aware of this restriction.

Forcing a copy of user data was very natural in OSs which have a
userspace/kernel separation, but in OSv it's not really natural. For
example, imagine that in the future we implement zero-copy AIO - won't it
read/write directly into user memory, which may be mmap'ed? Couldn't that
read/write happen in a different CPU?



> This is the
> design, not something that has to be this way.
>

The design of what? Of this specific optimization?


>
> > 3.  This commit 7e38453 starts flush_tlb_all() with setting the
> > lazy_flush_tlb flag to true, but resets it back to false when it decides
> to
> > send an IPI. If the other CPU is right now in the scheduler we can have
> the
> > code leave the flag at false (if the out-going thread was an app thread)
> > and send an IPI which will be delayed - so the scheduler has no way of
> > knowing it needs to do a TLB flush before accessing the sched::thread.
> > Couldn't we live the flag at true *in addition* to the IPI? The IPI
> handler
> > could then zero it (if not already zero)?
> >
> If IPI can be delayed why the same bug cannot happen without
> lazy_flush_tlb optimization at all? Thread A mmaps its stack, sends
> flush IPI which is delayed, allocates B's thread struct on the stack,
> cpu 1 tries to access it -> boom.
>

This is a good point.

Which brings me back again to the point that if we want to correctly
support on-stack threads (that's a separate question - I already have
patches forbidding on-stack thread objects if we want to), we probably
need to use this *flag* variable - not the IPI - as a signal to the
scheduler that it needs to flush the TLB.
So we need to clear this flag only once the target CPU actually flushed
the TLB - not, as the code currently does, as soon as the source CPU sent
it an IPI. So I'm wondering if there was an important reason why the
current code needs to zero the flag as soon as we decide to send an IPI?

Nadav.



Re: Lazy TLB flushing

2016-09-19 Thread Gleb Natapov
On Sun, Sep 18, 2016 at 10:37:22PM +0200, Nadav Har'El wrote:
> Hi Gleb, I have a couple of questions (CCed to the OSv mailing list) about
> your OSv commit 7e38453, maybe you remember something (or be reminded of
> something when you look at the commit).
> 
> This commit is apparently causing
> https://github.com/cloudius-systems/osv/issues/790 so now we're trying to
> figure out how to most properly fix it. The problem is that it appears that
> the scheduler on one CPU is handed sched::thread objects from another CPU.
> These thread objects might live in mmap()ed areas, but we may have delayed
> the required TLB flush on the target CPU.
> 
> So the questions I find myself asking, perhaps you remember:
> 
> 1. Do you remember if this commit was an important performance advantage
> for a workload, or just an optimistic fix?
> 
It is an optimization that exists in all OSes and is doubly important in
guest OSes, since IPIs there are much more expensive. If the workload is
mmap-heavy you will have a lot of IPIs without the optimization.

> 2. I'm afraid the scheduler thing might only be the tip of the iceberg of
> problems caused by this lazy TLB thing. Could we have, for example (and
> this is just a hypothetical example) one thread doing a write() to disk of
> some data from an mmap'ed area, and this data is supposed to be read by a
> ZFS thread which runs on a different CPU - and because it is labeled a
> "system thread", it won't do a TLB flush before reading the mmap'ed area?
write() copies data into ZFS ARC.

> Why are we confident that "system threads" never need to read user's
> mmap'ed data?
> 
If they do, this is a bug, as you discovered. They may access mmap'ed
memory, but they should do so through their own mappings. This is the
design, not something that has to be this way.

> 3.  This commit 7e38453 starts flush_tlb_all() with setting the
> lazy_flush_tlb flag to true, but resets it back to false when it decides to
> send an IPI. If the other CPU is right now in the scheduler we can have the
> code leave the flag at false (if the out-going thread was an app thread)
> and send an IPI which will be delayed - so the scheduler has no way of
> knowing it needs to do a TLB flush before accessing the sched::thread.
> Couldn't we live the flag at true *in addition* to the IPI? The IPI handler
> could then zero it (if not already zero)?
> 
If an IPI can be delayed, why can't the same bug happen without the
lazy_flush_tlb optimization at all? Thread A mmaps its stack, sends a
flush IPI which is delayed, allocates B's thread struct on the stack,
and cpu 1 tries to access it -> boom.

--
Gleb.



Re: Current State of C++ Unikernels

2016-09-19 Thread Nadav Har'El
On Mon, Sep 19, 2016 at 2:10 AM, Waldek Kozaczuk 
wrote:

> OSv already has a concept of a module, but I wonder if its core can be made
> modular as well. For example there could be something like osv.base,
> osv.io (osv.io.read, osv.io.write, etc), osv.net, etc.  I do not know
> enough if it makes sense, but even POSIX system calls can be grouped into
> modules and then pulled in like lego blocks depending on what one needs. I am
> looking at the modularity of the Java 9 runtime (http://openjdk.java.net/jeps/
> 220) which I am trying to draw a parallel to.
>
> Would it be possible? Is it a big effort?
>

Yes, it is possible to split off many parts of the OSv kernel into separate
shared objects. For example I can imagine moving the entire ZFS code into a
separate zfs.so, and loading this object will cause the ZFS support to be
added to the kernel, the ZFS partition to be mounted, and so on.

However, it will be an effort, and it's not clear what granularity even
makes sense: in your example above, you seem to have split the read() and
write() features into two different shared objects - I'm not sure that is
practical. It's also not clear to me how fine this division should be.
Should each individual function (or several related functions) be in a
separate object? Or should we just split it into several large ones?

Of course, to make any of this splitting effort worthwhile, one needs to
come up with a scenario where the current on-disk size of OSv is excessive.
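To illustrate the "loading the object adds the support" idea: a split-off shared object could activate its subsystem from a static initializer, which the dynamic loader runs when it maps the object. All names below are invented for illustration - OSv has no such registry today - and a real zfs.so would mount the partition and start threads rather than just record a name:

```cpp
#include <string>
#include <vector>

// Registry living in the core kernel (hypothetical).
std::vector<std::string>& loaded_modules() {
    static std::vector<std::string> mods;
    return mods;
}

// A static initializer runs when the dynamic loader maps the shared
// object, so merely loading e.g. zfs.so registers the subsystem.
struct module_registrar {
    explicit module_registrar(const char* name) {
        loaded_modules().push_back(name);
    }
};

// This line would live inside zfs.so:
static module_registrar zfs_registrar{"zfs"};
```

The same mechanism is how ELF constructors already work for application shared objects, so the main effort would be untangling the kernel's internal dependencies, not the loading itself.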

Nadav.
