Re: why not flow control in wl_connection_flush?

2024-03-02 Thread jleivent
On Fri, 1 Mar 2024 11:59:36 +0200
Pekka Paalanen  wrote:

> ...
> The real problem here is though, how do you architect your app or
> toolkit so that it can stop and postpone what it is doing with Wayland
> when you detect that the socket is not draining fast enough? You might
> be calling into Wayland using libraries that do not support this.
> 
> Returning to the main event loop is the natural place to check and
> postpone, but this whole issue stems from the reality that apps do not
> return to their main loop often enough or do not know how to wait with
> sending even in the main loop.

I conclude from this discussion that clients are unlikely to be
constructed so that they avoid causing problems when they send too fast.

I think I may add an option to wl_connection_flush in my copy of
libwayland so that I can turn on client waiting on flush from an env
var.  It looks like the change would be pretty small.  Unless you
think it would be worth making this an MR on its own?

If the client is single threaded, this will cause the whole client to
wait, which probably won't be a problem, considering the type of
clients that might try to be that fast.

If the client isn't single threaded, then it may cause a thread to wait
that the client doesn't expect to wait, which could be a problem for
that client, admittedly.
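
A minimal sketch of what such an env-var switch could look like,
written against a plain socket rather than libwayland's internals.
The variable name EXAMPLE_WAYLAND_FLUSH_WAIT and the function
flush_maybe_wait are made up for this illustration and are not part
of libwayland:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical variable name, chosen for this sketch only. */
#define EXAMPLE_WAIT_ENV "EXAMPLE_WAYLAND_FLUSH_WAIT"

/* Send len bytes on a socket without ever blocking inside send().
 * If the socket is full and the environment variable is set, wait in
 * poll() until the peer drains it; otherwise stop and let the caller
 * decide. Returns the number of bytes sent (possibly short), or -1
 * with errno set if nothing could be sent. */
static ssize_t flush_maybe_wait(int fd, const char *buf, size_t len)
{
    int wait_enabled = getenv(EXAMPLE_WAIT_ENV) != NULL;
    size_t sent = 0;

    assert(buf != NULL || len == 0);
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent,
                         MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n >= 0) {
            sent += (size_t)n;
            continue;
        }
        if (errno == EINTR)
            continue;
        if ((errno == EAGAIN || errno == EWOULDBLOCK) && wait_enabled) {
            /* Flow control: sleep until the socket is writable again. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        /* Not waiting (or a hard error): report what went out so far. */
        return sent > 0 ? (ssize_t)sent : -1;
    }
    return (ssize_t)sent;
}
```

With the variable unset the behaviour stays non-blocking, so only
users who opt in would see a thread wait inside the flush.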



Re: why not flow control in wl_connection_flush?

2024-02-29 Thread jleivent
On Tue, 27 Feb 2024 11:01:18 +0200
Pekka Paalanen  wrote:


> > But suppose I'm writing a client that has the possibility of
> > sending a rapid stream of msgs to the compositor that might be,
> > depending on what other clients are doing, too fast for the
> > compositor to handle, and I'd like to implement some flow control.
> > I don't want the connection to the compositor to sever or for the
> > condition to cause memory consumption without any ability for me to
> > find out about and control the situation.  Especially if I could
> > slow down that rapid stream of updates without too high a cost to
> > what my client is trying to accomplish.
> > 
> > Is there a way I could do that?  
> 
> Get the Wayland fd with wl_display_get_fd() and poll it for writable.
> If it's not writable, you're sending too fast.

That's what I was looking for!  I think... Maybe?

> 
> That's what any client should always do. Usually it would be prompted
> by wl_display_flush() erroring out with EAGAIN as your cue to start
> polling for writable. It's even documented.

But, calling wl_display_flush too often is bad for throughput, right?
Isn't it better to allow the ring buffer to determine itself when to
flush based on being full, and batch send many msgs?  Obviously
sometimes the client has nothing more to send (for a while), so
wl_display_flush then makes sense.  But in this case, it does have more
to send and wants to know if it should attempt to do so or hold back.

I could instead wait for the display fd to be writable before
attempting each msg send.  But the display fd may be writable merely
because the ring buffer hasn't tried flushing yet.  But the ring buffer
could have less than enough space for the msg I'm about to send.  And
the socket buffer could have very little space - just enough for it to
say it's writable.

Which means that sometimes polling the display fd will return writable
when an attempt to send a msg is still going to result in ring buffer
growth or client disconnect.
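
For reference, a rough sketch of the "poll it for writable" check on
a plain socket fd (in a real client the fd would come from
wl_display_get_fd()).  The helper names fd_is_writable and fill_socket
are invented here.  Note that the check only says whether the kernel
will take *some* data, which is exactly the limitation described
above:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns 1 if fd accepts more data right now, 0 if not, -1 on error.
 * With timeout_ms == 0 this is a pure check that never blocks. */
static int fd_is_writable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int rc;

    assert(fd >= 0);
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLOUT)) ? 1 : 0;
}

/* Fill a socket's send buffer with non-blocking writes until send()
 * fails (EAGAIN is the expected reason). Returns the bytes accepted. */
static size_t fill_socket(int fd)
{
    char chunk[4096];
    size_t total = 0;

    memset(chunk, 0, sizeof chunk);
    for (;;) {
        ssize_t n = send(fd, chunk, sizeof chunk, MSG_DONTWAIT);
        if (n < 0)
            break;
        total += (size_t)n;
    }
    return total;
}
```

On a fresh socketpair fd_is_writable() reports 1; after fill_socket()
and with nobody reading the other end, it should report 0 until the
peer drains some data.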

So... back to calling wl_display_flush ... sometimes.

I guess I could call wl_display_flush after what I think is about 4K
worth of msg content.  wl_display_flush returns the amount sent, so
that helps keep the extra state I need to maintain.

Is there currently a way I could get the size of contents in the output
ring buffer?
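
The "flush after about 4K" bookkeeping could look roughly like this.
It is only a sketch: struct send_budget, budget_note_message and the
flush callback are hypothetical names, and the callback merely mimics
the return convention of wl_display_flush (bytes sent, or -1 on
failure) so the logic can be tried without a compositor:

```c
#include <assert.h>
#include <stddef.h>

#define FLUSH_BUDGET 4096   /* "about 4K", as suggested above */

/* Hypothetical callback: pushes queued data out and returns the number
 * of bytes sent, or -1 on failure. */
typedef int (*flush_fn)(void *ctx);

struct send_budget {
    size_t queued;   /* bytes queued since the last successful flush */
    flush_fn flush;
    void *ctx;
};

/* Account for one message of msg_size bytes and flush once the running
 * total reaches the budget. Returns 1 if a flush happened, 0 if not,
 * -1 if the flush failed (the caller would then poll for writable). */
static int budget_note_message(struct send_budget *b, size_t msg_size)
{
    int sent;

    assert(b != NULL && b->flush != NULL);
    b->queued += msg_size;
    if (b->queued < FLUSH_BUDGET)
        return 0;
    sent = b->flush(b->ctx);
    if (sent < 0)
        return -1;
    /* Keep only what is still sitting in the buffer. */
    b->queued = (size_t)sent >= b->queued ? 0 : b->queued - (size_t)sent;
    return 1;
}

/* Stand-in flush for experiments: pretends to send *ctx bytes. */
static int example_flush(void *ctx)
{
    return *(int *)ctx;
}
```

The client still has to estimate each message's wire size itself,
which is why a way to query the ring buffer's fill level would be
nicer.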



Re: why not flow control in wl_connection_flush?

2024-02-26 Thread jleivent
On Mon, 26 Feb 2024 15:12:23 +0200
Pekka Paalanen  wrote:

...
> > What is the advantage to having the impacted clients grow their send
> > buffers while waiting?  They wait either way.  
> 
> They are not waiting if they are growing their send buffers.

I meant that they must wait for the UI to update corresponding to the
messages they are trying to send to the compositor.

This may also be about my assumption of a threading model, this time
for the client.  I assume that a client that has some important work to
do that is unrelated to updating the display will do that work in a
distinct thread from the one dedicated to sending display related msgs
to the compositor.

If that's not the case, then indeed causing the client's sending thread
to wait could impact some other computation.  Which might be bad,
depending on what that other computation is trying to do.

But suppose I'm writing a client that has the possibility of sending a
rapid stream of msgs to the compositor that might be, depending on what
other clients are doing, too fast for the compositor to handle, and I'd
like to implement some flow control.  I don't want the connection to
the compositor to sever or for the condition to cause memory
consumption without any ability for me to find out about and control
the situation.  Especially if I could slow down that rapid stream of
updates without too high a cost to what my client is trying to
accomplish.

Is there a way I could do that?


Re: why not flow control in wl_connection_flush?

2024-02-24 Thread jleivent
On Fri, 23 Feb 2024 12:12:38 +0200
Pekka Paalanen  wrote:


> I would think it to be quite difficult for a compositor to dedicate a
> whole thread for each client.

But that means it is possible that the server cannot catch up for long
periods.  And that just having a large number of otherwise friendly
clients can cause their socket buffers to fill up.  And things are
worse on systems with more cores.

What is the advantage to having the impacted clients grow their send
buffers while waiting?  They wait either way.


Re: why not flow control in wl_connection_flush?

2024-02-22 Thread jleivent
Thanks for this response.  I am considering adding unbounded buffering
to my Wayland middleware project, and wanted to consider the flow
control options first.  Walking through the reasoning here is very
helpful.  I didn't know that there was a built-in expectation that
clients would do some of their own flow control.  I was also operating
under the assumption that blocking flushes from the compositor to
one client would not have an impact on other clients (was assuming an
appropriate threading model in compositors).

The client OOM issue, though: A malicious client can do all kinds of
things to attempt a DoS, and moving towards OOM would accomplish that
as well on systems that slow down badly enough when thrashing.
A buggy client that isn't trying to do anything malicious, but is
trapped in a send loop, would be a case where causing it to wait
might be better than allowing it to move towards OOM (and thrash).

On Thu, 22 Feb 2024 11:52:28 +0200
Pekka Paalanen  wrote:

> On Wed, 21 Feb 2024 11:08:02 -0500
> jleivent  wrote:
> 
> > Not completely blocking makes sense for the compositor, but why not
> > block the client?  
> 
> Blocking in clients is indeed less of a problem, but:
> 
> - Clients do not usually have requests they *have to* send to the
>   compositor even if the compositor is not responding timely, unlike
>   input events that compositors have; a client can spam surfaces all
>   it wants, but it is just throwing work away if it does it faster than
>   the screen can update. So there is some built-in expectation that
>   clients control their sending.
> 
> - I think the Wayland design wants to give full asynchronicity for
>   clients as well, never blocking them unless they explicitly choose
>   to wait for an event. A client might have semi-real-time
>   responsibilities as well.
> 
> - A client's send buffer could be infinite. If a client chooses to
>   send requests so fast it hits OOM, it is just DoS'ing itself.
> 
> > For the compositor, wouldn't a timeout in the sendmsg make sense?  
> 
> That would cause both problems: slight blocking multiplied by number of
> (stalled) clients, and overflows. That could lead to jittery user
> experience while not eliminating the overflow problem.
> 
> 
> Thanks,
> pq
> 


Re: why not flow control in wl_connection_flush?

2024-02-21 Thread jleivent
Not completely blocking makes sense for the compositor, but why not
block the client?

For the compositor, wouldn't a timeout in the sendmsg make sense?

On Wed, 21 Feb 2024 16:39:08 +0100
Olivier Fourdan  wrote:

> Hi,
> 
> On Wed, Feb 21, 2024 at 4:21 PM jleivent  wrote:
> 
> > I've been looking into some of the issues about allowing the
> > socket's kernel buffer to run out of space, and was wondering why
> > not simply remove MSG_DONTWAIT from the sendmsg call in
> > wl_connection_flush?  That should implement flow control by having
> > the sender thread wait until the receiver has emptied the socket's
> > buffer sufficiently.
> >
> > It seems to me that using an unbounded buffer could cause memory
> > resource problems on whichever end was using that buffer.
> >
> > Was removing MSG_DONTWAIT from the sendmsg call considered and
> > abandoned for some reason?
> >  
> 
> See this thread [1] from 2012, it might give some hint on why
> MSG_DONTWAIT was added with commit b26774da [2].
> 
> HTH
> Olivier
> 
> [1]
> https://lists.freedesktop.org/archives/wayland-devel/2012-February/002394.html
> [2] https://gitlab.freedesktop.org/wayland/wayland/-/commit/b26774da



why not flow control in wl_connection_flush?

2024-02-21 Thread jleivent
I've been looking into some of the issues about allowing the socket's
kernel buffer to run out of space, and was wondering why not simply
remove MSG_DONTWAIT from the sendmsg call in wl_connection_flush?  That
should implement flow control by having the sender thread wait until
the receiver has emptied the socket's buffer sufficiently.

It seems to me that using an unbounded buffer could cause memory
resource problems on whichever end was using that buffer.

Was removing MSG_DONTWAIT from the sendmsg call considered and abandoned
for some reason?
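
To make the difference concrete, here is a small sketch of a sendmsg
call with and without MSG_DONTWAIT on an ordinary socket.  This is not
the code in wl_connection_flush; send_once and count_until_eagain are
names invented for the example:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one buffer with sendmsg(). With dontwait != 0 a full socket
 * buffer yields -1/EAGAIN immediately; with dontwait == 0 the call
 * sleeps in the kernel until the receiver has made room. */
static ssize_t send_once(int fd, const void *data, size_t len, int dontwait)
{
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return sendmsg(fd, &msg, MSG_NOSIGNAL | (dontwait ? MSG_DONTWAIT : 0));
}

/* Count how many non-blocking sends of len bytes succeed before the
 * kernel reports EAGAIN. Returns -1 if sending stopped for any other
 * reason. */
static long count_until_eagain(int fd, size_t len)
{
    char payload[1024];
    long n = 0;

    assert(len <= sizeof payload);
    memset(payload, 0, sizeof payload);
    while (send_once(fd, payload, len, 1) >= 0)
        n++;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? n : -1;
}
```

Once count_until_eagain() has filled the buffer, calling
send_once(fd, ..., 0) would put the calling thread to sleep until the
other end reads, which is the flow control being asked about.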


Re: protocol rules question: is an array arg of object ids legal?

2023-12-27 Thread jleivent
Thanks for the prompt answer!

On Wed, 27 Dec 2023 18:17:32 +
Simon Ser  wrote:

> On Wednesday, December 27th, 2023 at 19:09, jleivent
>  wrote:
> 
> > Is it legal for a protocol message to contain an array arg where the
> > contents of the array are Wayland object ids? I don't see any
> > instance of this in any current protocol descriptions I have.  
> 
> Technically nothing prevents this, but this will be pretty awkward
> since client and server will need to convert to/from IDs (plus
> wrapping/unwrapping the wl_proxy for the client) and there won't be
> any type safety. In general it's better to have a request/event
> carrying a single object which can be sent multiple times to
> accumulate a list of objects.



protocol rules question: is an array arg of object ids legal?

2023-12-27 Thread jleivent
Is it legal for a protocol message to contain an array arg where the
contents of the array are Wayland object ids?  I don't see any instance
of this in any current protocol descriptions I have.



Re: aging merge request

2023-12-24 Thread jleivent
Sorry about the typo!  It should be:

https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/339

On Sun, 24 Dec 2023 15:03:04 +
Joshua Ashton  wrote:

> This gives a 404 for me.
> 
> On December 19, 2023 8:22:21 PM GMT, jleivent 
> wrote:
> >I submitted this merge request on October 8th:
> >
> >https://gitlab.freedesktop.org/wayland/wayland/-/merge_request/339
> >
> >Is there any interest in it?
> >
> >Thanks,
> >Jon  
> 
> - Joshie 🐸✨


aging merge request

2023-12-19 Thread jleivent
I submitted this merge request on October 8th:

https://gitlab.freedesktop.org/wayland/wayland/-/merge_request/339

Is there any interest in it?

Thanks,
Jon


Re: what are the protocol rules about uniqueness of event and request names?

2023-12-08 Thread jleivent
On Fri, 8 Dec 2023 12:54:35 +0100
Sebastian Wick  wrote:

> ...
> I think a more useful thing to do would be to add this restriction (an
> interface cannot have an event and a request with the same name) to
> the documentation and to wayland-scanner.
>

Also: an event and request with the same name would probably confuse
anyone using WAYLAND_DEBUG.

But: Would changing wayland-scanner to prevent this be backward
compatible?  Can't someone somewhere already have an event/request pair
with the same name in their own private protocol extension?


Re: what are the protocol rules about uniqueness of event and request names?

2023-12-07 Thread jleivent
On Thu, 7 Dec 2023 22:06:07 +
David Edmundson  wrote:

> The generated C code would be full of conflicts. The
> MY_PROTOCOL_REQUESTEVENT_SINCE_VERSION define for a start.
> 
> I think it might compile in C, but other generators exist that might
> not and it's making life much more confusing than it needs to be. I
> would strongly avoid it.
> 
> David

To be clear, I wasn't intending it to sound like I wanted to add an
event and a request with the same name myself.  I'm writing some
middleware that sits between a Wayland compositor and some of its
clients, and I would like to know if it might encounter an interface
that has an event and a request with the same name.

I think you've answered that it's not a good idea for a protocol author
to do that, but it also sounds like it's a possibility that someone
could do it anyway because there's no direct rule against it.  So
maybe I should take the necessary precautions.

Thanks,
Jonathan
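
One possible precaution, sketched here with made-up types (struct
msg_desc and find_message are not libwayland API): key every message
lookup in the middleware on direction plus name, never on the name
alone, so a same-named request and event cannot collide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum msg_kind { MSG_REQUEST, MSG_EVENT };

struct msg_desc {
    enum msg_kind kind;
    const char *name;
    int opcode;
};

/* Find a message by direction *and* name. Returns NULL if absent. */
static const struct msg_desc *
find_message(const struct msg_desc *table, size_t count,
             enum msg_kind kind, const char *name)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (table[i].kind == kind && strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Hypothetical interface in which a request and an event share the
 * name "configure". */
static const struct msg_desc example_iface[] = {
    { MSG_REQUEST, "configure", 0 },
    { MSG_REQUEST, "destroy",   1 },
    { MSG_EVENT,   "configure", 0 },
};
```

Looking up "configure" as a request and as an event then yields two
distinct entries instead of whichever one happens to come first.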



what are the protocol rules about uniqueness of event and request names?

2023-12-07 Thread jleivent
Can a single interface have an event and a request with the same name?


new test hangs in test-compositor.c at waitid - any clues?

2023-10-06 Thread jleivent
I have a new test that is supposed to encounter an error in
the server, causing the server to abort the client and end the test.
The client is at that point in a sleep waiting to be aborted.

Instead, the test hangs (and eventually times out).

If I run it under gdb, and Ctrl-C break during the hang, I get:

(gdb) bt
#0  0x77e72ac6 in __waitid (idtype=P_PID, id=10135, infop=0x7fffdd70, options=4)
    at ../sysdeps/unix/sysv/linux/waitid.c:29
#1  0xde10 in handle_client_destroy (data=0x55567730)
    at ../tests/test-compositor.c:110
#2  0x77fa20fe in wl_event_loop_dispatch_idle (loop=0x55567440)
    at ../src/event-loop.c:969
#3  0x77fa256c in wl_event_loop_dispatch (loop=0x55567440, timeout=-1)
    at ../src/event-loop.c:1109
#4  0x77f9ea81 in wl_display_run (display=0x55567350)
    at ../src/wayland-server.c:1493
#5  0xe814 in display_run (d=0x55567300)
    at ../tests/test-compositor.c:401
#6  0xcc36 in server_needs_zombies ()
    at ../tests/display-test.c:1884
#7  0xcf80 in run_test (t=0x555666e0)
    at ../tests/test-runner.c:159
#8  0xd559 in main (argc=2, argv=0x7fffe328)
    at ../tests/test-runner.c:345

[server_needs_zombies is the name of the new test, which I'm using to
establish that the server needs zombie resources like the client
needs zombie proxies]

Using 'ps xf' I can see that the child client was not a zombie (in the
linux process sense this time, not the wayland object sense) until the
Ctrl-C in gdb, and then immediately becomes a zombie at the Ctrl-C.
Continuing in gdb allows the test to terminate with the expected error
result:

Continuing.
Client 'snz_client_loop' was killed by signal 2
Client 'snz_client_loop' failed
1 child(ren) failed

In other words, for some reason, the abort signal sent to the client was
not delivered until the server (parent process of the client) got
interrupted itself.

Has anyone else observed this inability of the test server to deliver
the abort signal to its client until it is itself interrupted?  Is
there a bug in the test-compositor.c code (or maybe even
wayland-server.c)?

As a workaround, I had the client exit instead of sleep. But in that
case the test passes even though the server encounters the expected
error.  Is there a way to configure the server such that if it
encounters an error, it terminates the test as a failure?
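
For comparison, this is the bare parent/child pattern I would expect
to work: signal a sleeping child, then block in waitid() until it is
reaped.  It is a standalone sketch, not code from test-compositor.c,
and spawn_sleeper and kill_and_reap are invented names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that sleeps in pause() until a signal ends it.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_sleeper(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();
    }
    return pid;
}

/* Send sig to pid, then block in waitid() until the child terminates.
 * Returns the signal that killed the child, or -1 if a call failed or
 * the child did not die from a signal. */
static int kill_and_reap(pid_t pid, int sig)
{
    siginfo_t info;

    assert(pid > 0);
    memset(&info, 0, sizeof info);
    if (kill(pid, sig) < 0)
        return -1;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED) < 0)
        return -1;
    return info.si_code == CLD_KILLED ? info.si_status : -1;
}
```

If waitid() in this pattern returns promptly but the test-suite
version does not, comparing which signal is sent, and to which pid,
might narrow things down.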


Re: need help writing tests for specific event orderings

2023-10-05 Thread jleivent
On Thu, 5 Oct 2023 13:28:57 +0300
Pekka Paalanen  wrote:

> ...
> If you flush the Wayland connection explicitly, you should be able to
> reliably avoid the deadlocks. Flushing is public stable API.

Thanks!

I will pattern these tests after the fd_passer display-test, since that
is constructed to resemble an actual client-server configuration and
interaction more closely than other tests.  Also, following fd_passer's
lead, they may not need any additional synchronization to force the
issues.


Re: need help writing tests for specific event orderings

2023-10-04 Thread jleivent
On Wed, 4 Oct 2023 11:26:02 +0300
Pekka Paalanen  wrote:

> ...
> For the forked clients, there is stop_display()/display_resume().
> Maybe that helps?

Maybe if I understand their usage correctly.  Is this right?: A client
would send a sequence of requests followed by a stop_display request.
Anything the client sends after that stop_display request will not be
processed in the server until the server issues a display_resume event.

> ...
> If you limit your direct marshalling to sequences that are
> theoretically allowed, doesn't that already help you prove that all
> those cases are handled correctly?

Yes, as long as everyone believes in the "theoretically allowed" part.

> ...
> But I guess your goal is to see if using the API correctly could ever
> trigger an illegal sequence?

That's the goal.  

> ...
> It's also possible to call both server and client APIs from the same
> thread/process on the same Wayland connection, but you need to be
> careful to prove it cannot deadlock. That should be much easier since
> it's all single-threaded, and you just need to make sure the fds have
> data to read and queues have messages to dispatch when you expect
> them.

I've been thinking about that.  Deadlock is the issue, though.

If my understanding of stop_display/display_resume is correct, I might
use that.

Thanks.


need help writing tests for specific event orderings

2023-10-03 Thread jleivent
I am trying to write some tests that provoke errors in libwayland, but
it doesn't seem to me like the existing test suite provides a mechanism
to create specific event orderings that are allowed but not guaranteed
by the asynchrony of the protocol.  Is that correct?  It looks to me
like the tests in the test suite that involve a client and server all
fork the client and allow it and server to run asynchronously without
a way to impose any ordering restriction, but it's hard to tell.  If
there is a mechanism to use to get specific event orderings, where is
it?

I could simulate one side (the side that doesn't encounter the error)
by directly marshalling the messages it would send into the wire to the
other side.  That might be a suitable unit test for after the error is
proven to exist in the field, but it doesn't (conclusively) prove that
the error can exist in the field because of its reliance on simulation
tactics.  That's my fallback - but is there a better way?

Thanks,
Jon


Re: Questions about object ID lifetimes

2023-09-27 Thread jleivent
On Wed, 27 Sep 2023 11:47:37 +0300
Pekka Paalanen  wrote:

> ..
> 
> You just need to tell meson where your build directory is, or cd into
> it first.
> 
> $ meson test -C build
> 
> or
> 
> $ cd build
> $ meson test
> 

Of course!


Re: Questions about object ID lifetimes

2023-09-26 Thread jleivent
On Tue, 26 Sep 2023 11:53:07 +0300
Pekka Paalanen  wrote:

> On Mon, 25 Sep 2023 12:05:30 -0400
> jleivent  wrote:
> 
> > How do I get CI/CD capability turned on?  I tried building the unit
> > tests locally, but get errors that suggest those tests need to be
> > run in CI.  Issue 540 says I need to apply for the guest role - how
> > do I do that?  
> 
> I don't recall libwayland having anything that needs to be
> specifically run in a CI environment, and if it does, it should
> automatically skip outside of CI environment. Weston does this.
> 
> What errors did you get? How did you run them?
> 
> 'meson test' is the command.

I get:

$ meson test

ERROR: No such build data file as
'/home/jil/gits/wayland-idfix/meson-private/build.dat'.

I used this to build and install it:

$ meson build/ --prefix=/home/jil/gits/wayland-idfix/install/
$ ninja -C build/ install

Since that didn't create the needed meson-private/build.dat, I thought
that might get put in by the CI somehow.

> 
> I think applying for the guest role means that you can file an issue
> on the upstream project asking for the permission. At minimum, a
> maintainer needs to know your gitlab handle.

I'll do that.




Re: Questions about object ID lifetimes

2023-09-25 Thread jleivent
On Wed, 20 Sep 2023 10:05:51 -0400
jleivent  wrote:

> ..
> Here's a very wild suggestion that would eliminate it and still
> be compatible with Wayland 1.  Add a delete_id request without
> modifying the existing protocol.

I have a delete_id request hack, enhanced zombies everywhere, an LRU
ring for zombie reuse (when there's no delete_id requests) on the
server, all with full compatibility maintained and no protocol
additions (so it's fully drop-in compatible for clients and
servers) building and running in my limited testing on my
jonleivent/wayland-idfix fork.  My README explains it in depth.  I
would like this to eventually become a pull request, but I need to do
more testing first.  Which brings me to my question:

How do I get CI/CD capability turned on?  I tried building the unit
tests locally, but get errors that suggest those tests need to be run
in CI.  Issue 540 says I need to apply for the guest role - how do I do
that?

Thanks,
Jon


Re: CI/CD privileges for wayland-idfix fork

2023-09-23 Thread jleivent
On Sat, 23 Sep 2023 09:40:20 -0400
jleivent  wrote:

> Could I have CI/CD privileges for
> https://gitlab.freedesktop.org/jonleivent/wayland-idfix
> 
> Thanks
> Jon

Also:
With respect to the caching scheme described in .gitlab-ci.yml, should
I change my FDO_DISTRIBUTION_TAG to stay out of the way?  Anything else
I need to do before CI is turned on?



CI/CD privileges for wayland-idfix fork

2023-09-23 Thread jleivent
Could I have CI/CD privileges for
https://gitlab.freedesktop.org/jonleivent/wayland-idfix

Thanks
Jon


Re: Questions about object ID lifetimes

2023-09-20 Thread jleivent
On Wed, 20 Sep 2023 10:05:51 -0400
jleivent  wrote:

> ...
> I'm considering forking libwayland and working on one or both of these
> fixes for my own use, because I don't want to implement some even
> crazier things in middleware to compensate for the server ID reuse
> problem.
> 

I keep getting "An error occurred while forking the project.  Please try
again."

Am I locked out of forking wayland?


Re: Questions about object ID lifetimes

2023-09-20 Thread jleivent
On Wed, 20 Sep 2023 11:30:19 +0300
Pekka Paalanen  wrote:

> ..
> > This might help reduce those anomalous messages and be compatible
> > with Wayland 1.  Reduce the greediness of object ID reuse by:
> > 
> > - not reusing any IDs unless at least some minimum number (256?)
> >   are free
> > 
> > - reuse the freed ones in LRU fashion  
> 
> Yeah, the free list could be a FIFO instead of a LIFO.
> 
> > There are other variations of this - the point of all being to
> > increase the time between when any ID becomes free and when it is
> > reused but without causing the ID maps to grow unreasonably large,
> > or causing their maintenance to slow down.
> > 
> > Increasing the time delay between freeing and reuse (such as with a
> > higher minimum free threshold above) would probably lead to lower
> > probability of anomalous messages. You could make this tunable
> > through an environment variable.
> > 
> > Note that the two sides don't have to agree to use this less-greedy
> > ID allocation for either side to use it - and it's really only
> > important for servers anyway.  
> 
> I'm wary of solutions that reduce the risk but do not eliminate it. If
> a protocol interface design turns out racy, it would be best to find
> that out sooner than later, and evaluate fixing it. Reproducibility
> helps analysis.
> 

Here's a very wild suggestion that would eliminate it and still
be compatible with Wayland 1.  Add a delete_id request without
modifying the existing protocol.

There are (at least) two pairs of ping/pong messages in the base
protocol: xdg_wm_base and wl_shell_surface.

From what I can tell (but I only have the wlroots code to look at),
when the client sends a pong that doesn't correspond to the most recent
ping, the server completely ignores it.  Also, the serial arg used in
pings starts low and is incremented.  Also, the servers tend to reset
the serial to 0 often.  So it never increments very high (even if it
never got reset, it's probably never going over 2^31-1).

This means it's possible to use a specially crafted pong as a
delete_id request.

The client could send a pong with the highest bit on (so it won't
accidentally match a real serial and ack a real ping) and the low bits
indicating the object ID whose deletion it is acking.

The server will, when it deletes one of its own objects, keep around at
least the type (interface) until it gets this pong.
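
As a sketch, the encoding could be as simple as the following.  The
names are hypothetical, and how a full 32-bit object ID would be
mapped into the low bits is left open here, so the encoder only
accepts values that already fit in 31 bits:

```c
#include <assert.h>
#include <stdint.h>

/* High bit marks a pong serial as a delete_id acknowledgement. */
#define DELETE_ID_FLAG UINT32_C(0x80000000)

/* Build the special pong serial from the low 31 bits identifying the
 * object whose deletion is being acked. */
static uint32_t delete_id_encode(uint32_t id_bits)
{
    assert((id_bits & DELETE_ID_FLAG) == 0);
    return DELETE_ID_FLAG | id_bits;
}

/* Does this pong serial carry a delete_id ack rather than a real
 * ping serial? Relies on real serials staying below 2^31. */
static int is_delete_id_serial(uint32_t serial)
{
    return (serial & DELETE_ID_FLAG) != 0;
}

/* Recover the low 31 bits carried by a delete_id pong. */
static uint32_t delete_id_decode(uint32_t serial)
{
    return serial & ~DELETE_ID_FLAG;
}
```

An unpatched server would see such a pong as one that matches no
outstanding ping, which (from what I can tell) it ignores.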

There are two versions of this: one is that clients using patched
libwayland libs send the pong on their own after seeing the server-side
object deletion.  Another is that a patched server sends a ping when it
wants to reuse an ID to force a matching pong of an unpatched client
(this one assumes a client won't queue a server-side object deletion
and pong the ping before processing the deletion, hence still be able
to send anomalous messages involving the deleted ID - so it's risky).

If the server is patched to wait for these delete_id pongs, but the
client is not, then at best the server could fall back to using a less
greedy reuse as with my previous suggestion.  A patched client could
signal it is patched by sending an unsolicited specially crafted pong
(serial arg = 0x) early on.

It might be nice to give users the ability to start out with an
unpatched libwayland, but if they think they are seeing clients getting
killed off due to deleted server IDs in their requests, they could
switch to using a patched "unauthorized" libwayland.  It probably
wouldn't be too hard to write a tool that parses WAYLAND_DEBUG output
to see if an issue is due to deleted server IDs.  They'd use the patched
libwayland at their own risk (but when isn't that the case?),
understanding that the "fix" is a bit of a hack.

I'm considering forking libwayland and working on one or both of these
fixes for my own use, because I don't want to implement some even
crazier things in middleware to compensate for the server ID reuse
problem.



Re: Questions about object ID lifetimes

2023-09-19 Thread jleivent
On Tue, 19 Sep 2023 10:02:55 -0400
jleivent  wrote:
> ...
> This might help reduce those anomalous messages and be compatible with
> Wayland 1.  Reduce the greediness of object ID reuse by:
> 
> - not reusing any IDs unless at least some minimum number (256?)
>   are free
> 
> - reuse the freed ones in LRU fashion

This also needs something like zombies on the server side.  At least
retain the type info for a free ID until it is reused.


Re: Questions about object ID lifetimes

2023-09-19 Thread jleivent
On Tue, 19 Sep 2023 16:26:37 +0300
Pekka Paalanen  wrote:

> ...
> > But aren't those fast frame updates done through shared fds?  Hence
> > not part of the wire protocol, and would not be impacted by
> > increasing the length of messages on the wire?  
> 
> No. They are messages sent on the wire, telling "there is a new image
> on that other fd I shared with you before, use that now", and so on.
> That is usually a handful of requests per frame.

Didn't realize that.
> 
> I would argue that "speculative" is not the right word here, it was
> never intended.

How about: there are "anomalous" messages and state changes?


> > tl;dr: protocol asynchrony leads to speculation that can result in
> > the two sides disagreeing about the correct state of the world.
> 
> We avoid that with careful protocol design in XML. There is exactly
> that kind of situation in the xdg-family of extensions and it is
> solved by sending a serial with the events and acking that serial
> when the client acts on the events.
> 
> It's a known caveat.

OK.

This might help reduce those anomalous messages and be compatible with
Wayland 1.  Reduce the greediness of object ID reuse by:

- not reusing any IDs unless at least some minimum number (256?)
  are free

- reuse the freed ones in LRU fashion

There are other variations of this - the point of all being to increase
the time between when any ID becomes free and when it is reused but
without causing the ID maps to grow unreasonably large, or causing their
maintenance to slow down.

Increasing the time delay between freeing and reuse (such as with a
higher minimum free threshold above) would probably lead to lower
probability of anomalous messages. You could make this tunable through
an environment variable.

Note that the two sides don't have to agree to use this less-greedy ID
allocation for either side to use it - and it's really only important
for servers anyway.



Re: Questions about object ID lifetimes

2023-09-18 Thread jleivent
On Mon, 18 Sep 2023 14:06:51 +0300
Pekka Paalanen  wrote:

> On Sat, 16 Sep 2023 12:18:35 -0400
> jleivent  wrote:
> 
> > The easiest fix I can think of is to go full-on half duplex.
> > Meaning that each side doesn't send a single message until it has
> > fully processed all messages sent to it in the order they arrive
> > (thankfully, sockets preserve message order, else this would be
> > much harder). Have you considered half duplex?  
> 
> Never crossed my mind at least. I can't even imagine how it could be
> implemented through a socket, because both sides must be able to
> spontaneously send a message at any time.

By taking turns.  Each side would, after queuing up a batch of
messages, add an "Over!" message (from the days of half-duplex
radio communications) to the end of that queue, and then send the whole
queue (retaining its sequence).  Neither side would send a message
until it receives the other side's "Over!" message, and until the
higher levels above libwayland have had a chance to examine all
messages prior to "Over!" in order to avoid sending an inconsistent
message or even committing to a state incompatible with later messages.
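
Roughly, each side would run a tiny state machine like the one below.
This is only a sketch of the turn-taking idea; the struct and
function names are invented and nothing here touches a socket:

```c
#include <assert.h>
#include <stddef.h>

enum turn { TURN_OURS, TURN_THEIRS };

struct half_duplex {
    enum turn turn;
    size_t queued;   /* messages batched locally, not yet on the wire */
};

/* Queue a message locally; this is always allowed. */
static void hd_queue(struct half_duplex *hd)
{
    assert(hd != NULL);
    hd->queued++;
}

/* Try to send the whole batch followed by "Over!". Returns the number
 * of messages sent (not counting "Over!"), or -1 if it is not our
 * turn, in which case the batch stays queued. */
static long hd_send_batch(struct half_duplex *hd)
{
    long sent;

    assert(hd != NULL);
    if (hd->turn != TURN_OURS)
        return -1;
    sent = (long)hd->queued;
    hd->queued = 0;
    hd->turn = TURN_THEIRS;   /* sending "Over!" passes the turn */
    return sent;
}

/* Call once the peer's "Over!" has arrived and everything before it
 * has been fully processed by the higher levels. */
static void hd_receive_over(struct half_duplex *hd)
{
    assert(hd != NULL);
    hd->turn = TURN_OURS;
}
```

A side that has nothing to say would still send an empty batch ending
in "Over!" so the other side gets its turn back.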

> 
> > Certainly, it would mean a loss
> > of some concurrency, hence a potential performance hit.  But
> > probably not that much in this case, as most of the message
> > back-and-forth in Wayland occurs at user-interaction speeds, while
> > the speed-needing stuff happens through fd sharing and similar
> > things outside the protocol. I  
> 
> That user interaction speed can be in the order of a kilohertz, for
> gaming mice, at least in one direction. In the other direction,
> surface update rate is also unlimited, games may want to push out
> frames even if only every tenth gets displayed to reduce latency.
> Also truly tearing screen updates are being developed.

But aren't those fast frame updates done through shared fds?  Hence not
part of the wire protocol, and would not be impacted by increasing the
length of messages on the wire?

> 
> > think it can be made mostly backward compatible. It would probably
> > require some "all done" interaction between libwayland and higher
> > levels on each side, but that's probably (hopefully) not too hard.
> > There may even be a way to automate the "all done" interaction to
> > make this fully backward compatible, because libwayland knows when
> > there are no more messages to be processed on the wire, and it can
> > queue-up the messages on each side before placing them on the wire.
> >  It might need to do things like re-order ping/pong messages with
> > respect to the others to make sure the pinging side (compositor)
> > doesn't declare the client dead while waiting.  But that seems
> > minor, as long as all such ping/pong pairs are opaque to the
> > remainder of the protocol, hence always commute with other
> > messages.  
> 
> If you mean adding new ping/pong stuff, that doesn't sound very nice,
> because Wayland also aims to be power efficient: if truly nothing is
> happening, let the processes sleep. Anyone could still wake up any
> time, and send a message.

Not adding.  Dealing with the already existing ping/pong pairs (and any
new ones that get added).  Or with any messages that really need to be
timely, hence can't wait for messages in front of them to be fully
processed.

That could apply to any real-time requirements, like the gaming mice
messages you mentioned above.  But doing this in general is hard unless
the messages are irrelevant to the rest of the protocol (hence commute
with everything else), like ping/pong are.

> 
> 
> On Sun, 17 Sep 2023 15:28:04 -0400
> jleivent  wrote:
> 
> > Has altering the wire format to contain all the info needed for
> > unambiguous decoding of each message entirely within libwayland
> > without needing to know the object ID -> type mapping been
> > considered?  
> 
> Not that I can recall. The wire format is ABI, libwayland is not the
> only implementation of it, so that would be Wayland 2 material.

So no changes to the wire format are possible under any circumstances
in Wayland 1?

> 
> > It would make the messages longer, but this seems like it wouldn't
> > be very bad for performance because wire message transfer is roughly
> > aligned with user interaction speeds.  
> 
> We need to be able to deal with at least a few thousand messages per
> second easily.
> 
> The overhead seems a bit bad if every message would need to carry its
> signature.

Encoding more into the message is only needed if there are no
destructor request acks (the equivalent of wl_display::delete_id, but
in the opposite 

Re: Questions about object ID lifetimes

2023-09-17 Thread jleivent
Has altering the wire format to contain all the info needed for
unambiguous decoding of each message entirely within libwayland without
needing to know the object ID -> type mapping been considered?

It would make the messages longer, but this seems like it wouldn't be
very bad for performance because wire message transfer is roughly
aligned with user interaction speeds.

Also, for any compositor/client pair, as long as they both use the same
version of libwayland, the necessary wire format change would not
result in compatibility issues.  It would for static linked cases,
or similar mismatching cases (flatpak, appimage, snap, etc. unless
the host version is mapped in instead of the packaged one somehow).
There also seem to be unused bits in the existing wire format so that
one could detect a compositor/client incompatibility at least on one
end.

I'm not suggesting that unambiguous decoding of all messages is a
sufficient fix, but it is a necessary one.  There are still speculative
computation issues that it wouldn't resolve alone.


Re: Questions about object ID lifetimes

2023-09-16 Thread jleivent
Pekka,

After thinking more about what you said, I'm no longer optimistic.

First, you are correct that my observation about opposite-side
destruction (side A-ranged ID vs. side B destructor) only works for
middleware, and then only if the compositor and clients already handle
their issues properly.

Secondly, when thinking about the case of a message that arrives after
an object has been deleted with new_ids in it, it occurs to me that this
is a special case of a greater problem due to the existence of
speculative computation as a result of the protocol's asynchrony.  Any
time there are at least two messages that don't commute with each other
(and destruction is a case of a message that never commutes with any
other message to the same object) where the two messages can be sent
from opposite sides, at least one of them has to be undone somehow.  And
that undoing has to include state changes that preceded it on its
sending side that didn't take into account the other (non-undone)
message.  This is bad.

It wouldn't be so bad if the protocol used some old-time mutexes or
database read-vs-write transactional consistency preservation
mechanisms. But those require quite a bit of input from higher levels
(above libwayland).  And there's deadlock to deal with.

The easiest fix I can think of is to go full-on half duplex.  Meaning
that each side doesn't send a single message until it has fully
processed all messages sent to it in the order they arrive (thankfully,
sockets preserve message order, else this would be much harder).
Have you considered half duplex?  Certainly, it would mean a loss
of some concurrency, hence a potential performance hit.  But probably
not that much in this case, as most of the message back-and-forth in
Wayland occurs at user-interaction speeds, while the speed-needing stuff
happens through fd sharing and similar things outside the protocol. I
think it can be made mostly backward compatible. It would probably
require some "all done" interaction between libwayland and higher
levels on each side, but that's probably (hopefully) not too hard.
There may even be a way to automate the "all done" interaction to make
this fully backward compatible, because libwayland knows when there are
no more messages to be processed on the wire, and it can queue-up the
messages on each side before placing them on the wire.  It might need
to do things like re-order ping/pong messages with respect to the
others to make sure the pinging side (compositor) doesn't declare the
client dead while waiting.  But that seems minor, as long as all such
ping/pong pairs are opaque to the remainder of the protocol, hence
always commute with other messages.

As for my own middleware project, I think I will try to detect message
decoding issues in all cases by keeping the most recent two types of
each ID, and attempting to decode both ways (most recent first).  There
are fortunately a bunch of internal consistency checks that can be done,
such as length of overall message vs. length of args vs. string length
vs. null string termination, etc.  But if the middleware gets a message
that passes these decoding consistency checks for both of those types,
then depending on what it is trying to do (as in one of my use cases,
securing a sandboxed application), it may have to cut off the client.
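
A rough sketch of that two-candidate decode, assuming made-up interface
names and a simplified signature alphabet of just 'u' (32-bit uint) and
's' (non-null string).  The string layout used here (32-bit length that
counts the terminating NUL, content padded to a 32-bit boundary) follows
the wire format; everything else is hypothetical:

```python
import struct


def body_matches(body, signature):
    """Check whether the argument bytes are consistent with a signature.

    Only 'u' and 's' are handled; null strings are not handled here.
    """
    pos = 0
    for kind in signature:
        if pos + 4 > len(body):
            return False                      # ran out of bytes
        (word,) = struct.unpack_from("=I", body, pos)
        pos += 4
        if kind == "s":
            padded = (word + 3) & ~3          # content is padded to 4 bytes
            if word == 0 or pos + padded > len(body):
                return False                  # length exceeds the message
            if body[pos + word - 1] != 0:
                return False                  # string must end in NUL
            pos += padded
    return pos == len(body)                   # no trailing bytes allowed


def plausible_types(body, opcode, candidates, signatures):
    """Return the candidate types (most recent first) that decode cleanly."""
    return [t for t in candidates
            if body_matches(body, signatures[(t, opcode)])]


# Hypothetical: ID 5 was most recently "iface_new", before that "iface_old".
signatures = {("iface_new", 0): "s", ("iface_old", 0): "uu"}
candidates = ["iface_new", "iface_old"]

clear = plausible_types(struct.pack("=II", 7, 7), 0, candidates, signatures)
ambiguous = plausible_types(struct.pack("=I", 4) + b"abc\0", 0,
                            candidates, signatures)
```

The second message passes the checks for both types, which is the case
where the middleware cannot tell which decoding is right.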




Re: Questions about object ID lifetimes

2023-09-14 Thread jleivent
On Thu, 14 Sep 2023 16:32:06 +0300
Pekka Paalanen  wrote:

> ...
> 
> congratulations, I think you may have found everything that is not
> quite right in the fundamental Wayland protocol design. :-)

Oh, you flatter me.  I'm sure there's more!

> 
> As an aside, we collect unfixable issues under
> https://gitlab.freedesktop.org/wayland/wayland/-/issues/?label_name%5B%5D=Protocol-next
> These are issues that are either impossible or very difficult or
> annoying to fix while keeping backward compatibility with both servers
> and clients.

Only 7 of them?

> 
> --
> 
> Object ID re-use is what I would call "aggressive": in the libwayland
> C implementation, the object ID last freed is the first one to be
> allocated next. There are two separate allocation ranges each with its
> own free list: server and client allocated IDs.

After I sent the initial post, I realized that the two separate
ID ranges help in the following way:

For any object ID in the allocation range of side A, a destructor
message from side B does not need acknowledgement.  This is because B
can't introduce a new object bound to that ID, only A can.  Hence, any
new_id arg for that ID is an acknowledgement of the destruction.
However, B has to be careful to ignore messages containing that ID
until it sees one with the ID as a new_id arg.  After the destructor
message from B but before a subsequent new_id for that ID from A, B
should not use the ID as arguments to other messages (and attempts to
do so can be dropped).  And this can be automated provided the
destructor tag can be relied on.
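
A small sketch of that rule as seen from side B, assuming a hypothetical
tracker class.  The only value taken from libwayland is the start of the
server ID range (WL_SERVER_ID_START, 0xff000000):

```python
SERVER_ID_START = 0xFF000000  # libwayland's WL_SERVER_ID_START


def allocator_of(obj_id):
    """Which side owns the allocation range this ID falls in."""
    return "server" if obj_id >= SERVER_ID_START else "client"


class CrossSideDestroy:
    """Side B's record of IDs it destroyed that side A allocated (sketch)."""

    def __init__(self):
        self.awaiting_new_id = set()

    def sent_destructor(self, obj_id):
        # B destroyed an object whose ID only A can rebind.
        self.awaiting_new_id.add(obj_id)

    def should_ignore(self, target_id, new_id_args=()):
        """Return True if an incoming message from A must be dropped."""
        for nid in new_id_args:
            # A new_id for the ID doubles as the destruction acknowledgement.
            self.awaiting_new_id.discard(nid)
        return target_id in self.awaiting_new_id


b_side = CrossSideDestroy()
b_side.sent_destructor(5)
stale = b_side.should_ignore(5)                      # old object: drop
rebinding = b_side.should_ignore(2, new_id_args=(5,))  # A reuses ID 5
fresh = b_side.should_ignore(5)                      # new object: deliver
```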

> 
> The C implementation also poses an additional restriction: a new ID
> can be at most the largest ever allocated ID + 1.
> 
> All this is to keep the ID map as compact as possible without a hash
> table. These details are in the implementation of the private 'struct
> wl_map' in libwayland.

Obviously, that helps middleware as well, for the same reasons.  It also
makes more automatic error detection possible.
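
To make the allocation behaviour concrete, here is a toy model of it (not
the actual 'struct wl_map' code, and the class name is invented): freed
IDs go on a stack, and a peer-supplied new ID is plausible only if it is
at most the largest ever allocated ID + 1.

```python
class LifoIdAllocator:
    """Toy model of one allocation range; not libwayland's implementation."""

    def __init__(self, base=1):
        self.next_fresh = base   # largest ever allocated ID + 1
        self.free = []           # stack: the last freed ID is reused first

    def alloc(self):
        if self.free:
            return self.free.pop()
        obj_id = self.next_fresh
        self.next_fresh += 1
        return obj_id

    def release(self, obj_id):
        self.free.append(obj_id)

    def is_plausible_new_id(self, obj_id):
        """Middleware check: a new ID beyond largest + 1 is an error."""
        return obj_id <= self.next_fresh


ids = LifoIdAllocator()
first_three = [ids.alloc() for _ in range(3)]   # 1, 2, 3
ids.release(2)
ids.release(3)
reused = [ids.alloc(), ids.alloc(), ids.alloc()]
```

The reuse order here is 3, 2, then a fresh 4, which is why a stale message
for a just-destroyed ID is so likely to collide with a new object.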

> ...
> 
> Your whole above analysis is completely correct!

I was rather hoping things would turn out less complex than they
seemed...

> 
> > However, the other cases are not as easy to identify.
> > 
> > The other cases are:
> > 1. an object created by a client request that has destructor events
> > 2. an object created by the compositor
> > 
> > It might be true that case 1 does not exist.  Is there a general
> > rule that such cases will never be considered in future
> > expansions of the Wayland protocol?  
> 
> Destructor events do exist. Tagging them as such in the XML was not
> done from the beginning though, it was added later in a
> backward-compatible manner which makes the tag more informational than
> something libwayland could automatically process. The foremost example
> is wl_callback.done event. This is only safe because it is guaranteed
> that the client cannot be sending a request on wl_callback at the same
> time the server is sending 'done' and destroying the object:
> wl_callback has no requests defined at all.

Fortunately, my point above about the advantage of the separate ID
ranges helps here.  If wl_callback is created by the client, then a
wl_callback.done event tagged as a destructor does not need
acknowledgement AND is always safe provided that messages involving the
wl_callback ID (other than its eventual reuse as a new_id arg) are
ignored above libwayland.

But again, this means the destructor tag is important and not merely
informational.

I did notice that the destructor tagging was added mostly (or
solely) to help with code generation by wayland-scanner implementations
in programming languages where destructors require some specific
syntactic notation.

But maybe destructor tagging is even better than that?  Maybe it would
allow libwayland to automate more in a more robust way AND also allow
for middleware that doesn't have to simulate all of the semantic level
interactions induced by protocol messages in order to merely keep track
of how to decode messages.

> 
> It also requires that nothing passes an existing wl_callback object as
> an argument in any request. We have been merely lucky that no-one has
> done that. It's really hard to imagine a use case where you would want
> to pass an existing wl_callback to anything.

Again, the above separate ID ranges point addresses this, I think.

> 
> Extensions may have similar objects that only deliver some one-off
> events and then "self-destruct" by the final event. All this is simply
> documented and not marked in the XML.

That's what I was hoping to avoid.  If there are object types where
object lifetime can only be understood by simulating all of the
relevant semantic content of the messages involved, then that's not
good for middleware.  Isn't it also problematic towards the goals of
libwayland, because it makes it impossible for libwayland to ensure
that messages are properly decoded without trusting that the client
and/or compositor have implemen

Questions about object ID lifetimes

2023-09-13 Thread jleivent
Forgive the long post.  Tl;dr: what are the rules of object ID lifetime
and reuse in the Wayland protocol?

I am attempting to understand the rules of object ID lifetime within
the Wayland protocol in order to construct Wayland middleware (similar
to some of the tools featured on
https://wayland.freedesktop.org/extras.html).  I could not find a
comprehensive discussion of the details online.  If one exists, I would
greatly appreciate a link!

Middleware tools that wish to decode Wayland messages sent between the
compositor and its clients need to maintain an accurate mapping between
object ID and object interface (type).  This is needed because the wire
protocol's message header includes only the target object ID and an
opcode that is relative to the object's type (the message header also
includes the message length - about which I also have questions - to be
pursued later...).  The message (request or event) and its argument
encoding can only be determined if the object ID -> type and type +
opcode -> message mappings are accurately maintained.  The type +
opcode -> message mapping is static and can be extracted offline from
the protocol XML files.
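
As an illustration of why both mappings are needed, here is a minimal
sketch of decoding a message header.  The header layout (a 32-bit object
ID, then a 32-bit word with the size in the upper 16 bits and the opcode
in the lower 16, in host byte order) is the documented wire format; the
table contents and function names are made up for the example, and the
static table covers requests only:

```python
import struct

HEADER = struct.Struct("=II")  # two 32-bit words in host byte order


def parse_header(data):
    """Split the 8-byte header into (object_id, opcode, size_in_bytes)."""
    object_id, word = HEADER.unpack_from(data, 0)
    return object_id, word & 0xFFFF, word >> 16


def decode_name(data, id_to_type, requests):
    """Resolve a request name via the two mappings described above."""
    object_id, opcode, _size = parse_header(data)
    interface = id_to_type[object_id]      # dynamic: must track ID reuse
    return requests[(interface, opcode)]   # static: from the protocol XML


# Hypothetical state: ID 3 is currently bound to a wl_surface.
id_to_type = {3: "wl_surface"}
requests = {("wl_surface", 6): "wl_surface.commit"}

raw = HEADER.pack(3, (8 << 16) | 6)        # header-only message, 8 bytes
name = decode_name(raw, id_to_type, requests)
```

If id_to_type is wrong for ID 3, the same bytes decode as a different
message with a different argument layout.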

Since object IDs can be reused, it is important for the middleware to
understand when an ID can be reused and when it cannot be to avoid
errors in the ID -> type mapping.

Because the Wayland protocol is asynchronous, any message that implies
destruction of an object should be acknowledged by the receiver before
the destructed object's ID is reused.

Fortunately, certain events and requests have been tagged as
destructors in the protocol descriptions!

Also fortunately, it appears (based on reading the wl_resource_destroy
code in wayland-server.c) that for many object IDs, specifically for
IDs of objects created by a client request (the ID appears as a new ID
arg of a request, and is thus in the client side of the ID range) and
for which the client makes a destructor request, the compositor will
always send a wl_display::delete_id event (assuming the
display_resource still exists for the client, which apparently would
only not be the case after the client connection is severed) to
acknowledge the destructor request. Any attempt to reuse that ID prior
to the wl_display::delete_id event can lead to confusion, and should be
avoided.  Reuse of the ID after the wl_display::delete_id event should
not result in any confusion.
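
A sketch of how middleware might track this case, assuming a
hypothetical IdTypeMap class.  The type is kept after the destructor
request, since events for the old object may still be in flight, and is
dropped only when wl_display::delete_id arrives:

```python
class IdTypeMap:
    """Middleware-side ID -> interface map for client-allocated IDs."""

    def __init__(self):
        self.live = {}        # id -> interface name
        self.pending = set()  # destroyed by client, delete_id not yet seen

    def on_new_id(self, obj_id, interface):
        if obj_id in self.pending:
            raise ValueError(f"id {obj_id} reused before delete_id")
        self.live[obj_id] = interface

    def on_destructor_request(self, obj_id):
        self.pending.add(obj_id)   # keep the type until acknowledged

    def on_delete_id(self, obj_id):
        self.pending.discard(obj_id)
        self.live.pop(obj_id, None)


ids = IdTypeMap()
ids.on_new_id(7, "wl_surface")
ids.on_destructor_request(7)      # client sent wl_surface.destroy
still_known = ids.live.get(7)     # type kept while the ack is outstanding
ids.on_delete_id(7)               # compositor acknowledged
ids.on_new_id(7, "wl_region")     # reuse is now unambiguous
```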

[BTW: for the purpose of this discussion, an object is "created" when
it is introduced into a protocol message for the first time via a new_id
argument.  It does not refer to the actual allocation of the object in
memory or to its initialization.]

However, the other cases are not as easy to identify.

The other cases are:
1. an object created by a client request that has destructor events
2. an object created by the compositor

It might be true that case 1 does not exist.  Is there a general rule
that such cases will never be considered in future expansions of the
Wayland protocol?

For objects created by the compositor, there are 2 subcases:

2a. objects with only destructor events
2b. objects with destructor requests

Again, it might be the case that 2b does not exist, as it is analogous
to case 1 above.  But, is there a general rule against such
future cases as well?  Combining 1 and 2b, is there a general rule that
says that only the object creator can initiate an object's destruction
(unprovoked by the other side of the protocol)?

For object IDs created by the compositor and with only destructor
events (case 2a), it may be necessary to understand the details of each
interface in question to decide when the ID can be reused, as there is
no universal destructor acknowledgement request comparable to the
wl_display::delete_id event.  A requirement to understand the details
to that level would make middleware development more difficult.  Insert
extreme sadness emoji here.

Thankfully, it seems that destructor events are themselves
acknowledgements of requests for destruction by the client (such as
wp_drm_lease_device_v1::released event destructor vs.
wp_drm_lease_device_v1::release request), or involve objects with a
very limited lifetime and usage, such as callbacks
(wp_presentation_feedback, zwp_linux_buffer_release, and
zwp_fullscreen_shell_mode_feedback_v1).  These limited lifetime/usage
objects are created with the knowledge that all messages for them are
destructor events, and that they are not involved in any other messages
(as targets or arguments).  Hence their destruction needs no further
acknowledgement because the request for destruction was implied by
their creation.  The destructor event is the acknowledgement of that
request.

Is this a general rule: that a destructor event is always the
acknowledgement of a (perhaps implied) destruction request?

So there may be two general simple rules that the middleware can follow
to maintain a proper ID -> type mapping through ID reuse cycles:

1. reuse of ID is allowed after w