[Standards] Re: Proposed XMPP Extension: Jingle Remote Control

Marvin W Tue, 21 May 2024 07:39:45 -0700

Hi Goffi,

On Tue, 2024-05-21 at 12:47 +0200, Goffi wrote:
> I know that, I've just ruled out using <message> through the server
> as it has 
> been proposed in another feedback.


Why do you rule that out? Because you don't see a purpose, when my
whole point is that I do see a purpose? Of course I can send whatever
CBOR/JSON you come up with as a base64 blob inside a <message> for my
use case, but then I wonder why not handle it in the first place.


> From a quick glance at the Wikipedia page, I see "In terms of
> transferring 
> clipboard data, "there is currently no way to transfer text outside
> the 
> Latin-1 character set".[5] A common pseudo-encoding extension solves
> the 
> problem by using UTF-8 in an extended format.[2]: § 7.7.27 ", which
> makes me 
> suspicious though.

RFB definitely is old, so these kinds of things are expected. And, while
I see that you added clipboard as a potential future extension, it
seems odd to complain that RFB has a suboptimal implementation of a
feature your proposed XEP currently doesn't have at all.

> One of the design goal of my proposal is to have something really
> simple and 
> straightforward to implement.

RFB isn't really hard to implement either. And there are a ton of
implementations out there already.


> There is no modifier flag used in the specification. There is the key
> value, and 
> the location number. From my tests, it's consistent and corresponds
> to the 
> documentation for the browsers that I've tried (Firefox and
> Chromium).

I know that your specification doesn't transfer the modifier flags,
probably assuming they are superfluous. However, if your browser client
were to naively send the key events it receives as-is, without further
checking for plausibility, things will go wrong: I tested pressing the
keys that would logically result in the events meta down, control down,
control up, meta up and here are the results on different browsers:
https://imgur.com/a/zVxDAVa

From what I understand, the state of keyup and keydown events in the
web API doesn't need to be consistent (e.g. there can be keydown
without keyup and vice-versa). Do we want the same behavior for this
protocol or something else?
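
To illustrate one possible "something else" (a minimal sketch only, not
something the proposed XEP specifies): the receiving side could track
which keys it believes are held, ignore a keyup that has no matching
keydown, and release everything still held when the session ends. The
class name, method names and event strings below are hypothetical.

```python
class KeyStateTracker:
    """Receiver-side filter that keeps key state consistent (hypothetical)."""

    def __init__(self):
        self.pressed = set()  # keys we currently consider held down

    def handle(self, kind, key):
        """Return True if the event should be forwarded to the OS."""
        if kind == "keydown":
            # A repeated keydown is forwarded and treated as auto-repeat.
            self.pressed.add(key)
            return True
        if kind == "keyup":
            if key not in self.pressed:
                # keyup without a preceding keydown: drop it.
                return False
            self.pressed.discard(key)
            return True
        raise ValueError(f"unknown event kind: {kind!r}")

    def release_all(self):
        """Return the keys that still need a synthetic keyup, then reset."""
        stuck = sorted(self.pressed)
        self.pressed.clear()
        return stuck


tracker = KeyStateTracker()
# Sequence where the browser never delivered "control up":
for kind, key in [("keydown", "Meta"), ("keydown", "Control"), ("keyup", "Meta")]:
    tracker.handle(kind, key)
stuck_keys = tracker.release_all()
```

With the sequence above, only Control is left held when the session
ends, so that is the only key that would need a synthetic release.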

> 
> > I'm not saying there aren't any cases where low-latency is
> > important,
> > where I disagree is that this is the case in all occasions. If you
> > don't have low latency feedback from the remote device, low latency
> > for
> > input is very likely not crucial.
> 
> I have the feeling that you only see this specification with the
> remote desktop 
> use case point of view. There are other use cases, and one another
> major one 
> is to use a device as input for another one in the same physical
> location: use 
> of a smartphone as ad-hoc touch pad or gamepad for instance. And if
> low 
> latency is easily achieved, I still don't see the point to have other
> mechanism because in some niche case low latency is not that annoying
> (but 
> still is, it's always annoying).

I think you misunderstood my point. Using a smartphone as a touch pad
or gamepad while playing a game on a screen next to you, is low latency
feedback (you can see the screen with low latency). An example of where
you don't need low latency is typing blindly into a remote shell,
because you won't get feedback there (except after confirming a
command, which is probably not low latency).

> 
> > 
> > Anyway, I remain not convinced that XSF is the place to specify a
> > remote control protocol from scratch (which is what sections 8 and
> > 9 of
> > the XEP are about). Mostly because I feel the XSF does not have the
> > competence for doing so (aka. we will probably do things terribly
> > wrong, due to lack of experience in the field).
> 
> Again, it is not from scratch. It's re-using existing protocols, in a
> simple, 
> working, easy-to-implement, and efficient way.

I was talking about the remote control protocol, which is what runs on
the topmost layer (inside the WebRTC data channel or whatever other
Jingle transport is used). This protocol is mostly from scratch (it's
loosely based on web API events, but then only taking an arbitrarily
picked subset of events and event properties).


> The goal here is to be sure that it will work with web clients, as
> data 
> channels are currently the only way to have direct connection with
> browsers. I 
> can reformulate to only suggest it and get rid of the SHOULD.

Which isn't an issue if web clients are not relevant for my use case.
And honestly, any kind of pointing to "you should support web clients"
sounds weird to me. It certainly is interesting that we can support web
clients, but that really shouldn't leak into unrelated specifications
(and this one totally is unrelated to web).


> WebRTC has sessions pretty much like Jingle; its ID is what you have
> in the o= 
> line of your SDP.

My point is: Either it's a Jingle session or it's not part of XMPP.
Jingle doesn't use WebRTC. It just happens that WebRTC APIs are
somewhat compatible with Jingle (because they are based on Jingle), but
from an XMPP perspective, you never have WebRTC sessions. I don't know
exactly what it means to be in the same WebRTC session, but whatever
you want here, make it more explicit, because people that don't use
WebRTC APIs should not be required to first read the WebRTC specs (or
probably implementations' source code) to figure out what you mean by
that.


> The issue is that video feed is used in this case to get the screen
> dimension. 
> Without it, we can't get touch event which use absolute position
> (while for 
> mouse, there is a relative position mode for exactly this use case).

That's a problematic design. As I said, clients might scale the video
to reduce bandwidth use. Dino also has logic to adjust the video
resolution of cameras depending on available bandwidth.

And as I understood it, for mouse it's not relative to the screen but
relative to the previous position, aka a movement vector, as reported
by touchpads.
A screen-relative position, where 0,0 is the upper left corner, 0.5,0.5
is the center of the screen and 1,1 is the lower right corner, would
work independently of the target screen resolution.
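
A minimal sketch of how a receiver might resolve such a screen-relative
position against whatever resolution it currently has. This is an
illustration only; the function name and the choice to clamp
out-of-range input are assumptions, not part of the proposed XEP.

```python
def to_pixels(nx, ny, width, height):
    """Map a normalized position (0..1 per axis) to a pixel on the screen."""
    # Clamp so slightly out-of-range input still lands on the screen.
    nx = min(max(nx, 0.0), 1.0)
    ny = min(max(ny, 0.0), 1.0)
    # 1.0 maps to the last pixel (width - 1), not one past it.
    x = round(nx * (width - 1))
    y = round(ny * (height - 1))
    return x, y


# The same normalized position resolves correctly on two resolutions.
corner_full_hd = to_pixels(1.0, 1.0, 1920, 1080)
corner_scaled = to_pixels(1.0, 1.0, 1280, 720)
```

Because the sender never needs to know the target resolution, the video
can be scaled, and the screen resolution can change mid-session, without
affecting where the input lands.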

> An alternative would be to specify screen dimension when establishing
> the 
> remote control session.

Might work, but then you also need to cover the case where the screen
resolution changes during remote control.


> No, its value is in pixels, the same as for the Web API. Its double
> because 
> pixels can be subdivided (High-DPI displays, transformations). I
> realize that, 
> besides the link to MDN, this is not explicitly stated; I'll add a
> notice in 
> future revisions.

The Web API uses double because they did weird things for HiDPI. On the
hardware layer, there are only pixels and if you click on a point on
the screen, it will always be on a pixel (at least in all OSes that I am
aware of). The HiDPI transformation in browsers abstracts away from
actual pixels, and 1px might be more or less than a physical pixel. But
why would you want to carry this abstraction through the network to a
system that shouldn't care about what browsers can do and what they
think a pixel is?


> It was just to handle the case where no device is accepted, there was
> 2 
> options:
> - reject it totally
> - say it's a simple screen share session.
> 
> I've chosen the later one. But indeed, data channel is then useless.
> Can 
> change it for the other option.

We also don't allow Jingle file transfers of no file or RTP contents
without any codecs. As this protocol is for remote control, it should
remain entirely unused for screen share only.


> - I'm not hard set on technologies, and I'm OK to get rid of CBOR is
> there is 
> consensus on it. I personally still think that it's a superior
> solution.

To me the use of CBOR here feels not well motivated, except for obscure
"better performance" reasons before having done any measurement to back
that claim. From an XMPP perspective, something in a Jingle XML stream
would be more canonical (because it reuses the stack we already have in
every XMPP client anyway) and anything diverging from that IMO should
be well reasoned.

If you're reasoning that CBOR provides a significant performance gain
over XML, then why is it not a priority to figure out how we use CBOR
instead of XML everywhere in XMPP (e.g. by creating some XML<>CBOR
translation and using that as an optional stream feature)?

> - regarding using RFB for input events only, I'll have a deeper look
> at the 
> spec and evaluate it. It may be an option it is comparable in ease of
> implementation, efficiency and flexibility to the current proposal.

I want to repeat that I haven't verified that RFB is a particularly good
fit for the purpose, I just know it's very popular.

Best,
Marvin
_______________________________________________
Standards mailing list -- standards@xmpp.org
To unsubscribe send an email to standards-le...@xmpp.org
