Re: [whatwg] Fwd: Discussing WebSRT and alternatives/improvements

Philip Jägenstedt Wed, 11 Aug 2010 05:30:39 -0700

On Wed, 11 Aug 2010 01:43:01 +0200, Silvia Pfeiffer <silviapfeiff...@gmail.com> wrote:

On Tue, Aug 10, 2010 at 7:49 PM, Philip Jägenstedt <phil...@opera.com> wrote:
On Tue, 10 Aug 2010 01:34:02 +0200, Silvia Pfeiffer <silviapfeiff...@gmail.com> wrote:

On Tue, Aug 10, 2010 at 12:04 AM, Philip Jägenstedt <phil...@opera.com> wrote:

On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer <silviapfeiff...@gmail.com> wrote:
I guess this is in support of Henri's proposal of parsing the cue using
the HTML fragment parser (same as innerHTML)? That would be easy to
implement, but how do we then mark up speakers? Using
<span class="narrator"> around each cue is very verbose. HTML isn't very
good for marking up dialog, which is quite a limitation when dealing with
subtitles...
I actually think that the mechanism is much more flexible than what we
have in WebSRT right now. If we want multiple speakers to be able to speak
in the same subtitle, then that's not possible in WebSRT. It's a little
more verbose in HTML, but not massively.

We might be able to add a special markup similar to the <[timestamp]>
markup that Hixie introduced for Karaoke. This is beyond the innerHTML
parser and I am not sure if it breaks it. But if it doesn't, then maybe we
can also introduce a <[voice]> marker to be used similarly?
An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>" and
"<00:01:30>". Without having read the HTML parsing algorithm I guess that
elements need to begin with a letter or similar. So, it's not possible to
(ab)use the HTML parser to handle inner timestamps or numerical voices,
we'd have to replace those with something else, probably more verbose.
I have checked the parse spec and
http://www.whatwg.org/specs/web-apps/current-work/#tag-open-state indeed
implies that a tag starting with a number is a parse error. Both the
timestamps and the voice markers thus seem to be problems when going with
an innerHTML parser. Is there a way to resolve this? I mean: I'd quite happily
drop the voice markers for a but I am not sure what to do
about the timestamps. We could do what I did in WMML and introduce a <t>
element with the timestamp as a @at attribute, but that is again more
verbose. We could also introduce an @at attribute in which would then
at least end up in the DOM and can be dealt with specially.
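
To make the tag-open rule discussed above concrete, here is a minimal
sketch in Python. It is not the HTML parsing algorithm, only a simplified
stand-in that assumes a "<" begins a tag solely when the next character
(after an optional "/") is an ASCII letter. It ignores comments, bogus
comments, attribute quoting and everything else the real tokenizer
handles. The name split_cue is made up for illustration.

```python
import re

# Simplified stand-in for the tag-open rule: "<" (optionally followed by "/")
# only begins a tag when the next character is an ASCII letter.
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def split_cue(text):
    """Split cue text into ("tag", s) and ("text", s) pieces."""
    pieces = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            # Everything between two tags is plain text.
            pieces.append(("text", text[pos:match.start()]))
        pieces.append(("tag", match.group()))
        pos = match.end()
    if pos < len(text):
        pieces.append(("text", text[pos:]))
    return pieces
```

Under that assumption, split_cue("<1>Hello") and split_cue("<00:01:30>la")
each come back as a single text piece, while split_cue("<philip>Hi</philip>")
yields a tag, a text piece and a closing tag.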

What should numerical voices be replaced with? Personally I'd much rather write <philip> and <silvia> to mark up a conversation between us two, as I think it'd be quite hard to keep track of the numbers if editing subtitles with many different speakers. However, going with that and using an HTML parser is quite a hack. Names like and <li> may already have special parsing rules or default CSS.

Going with HTML in the cues, we either have to drop voices and inner timestamps or invent new markup, as HTML can't express either. I don't think either of those are really good solutions, so right now I'm not convinced that reusing the innerHTML parser is a good way forward.

Think for example about the case where we had a requirement that a double
newline starts a new cue, but now we want to introduce a means where the
double newline is escaped and can be made part of a cue.
Other formats keep track of their version, such as MS Word files. It is to
be hoped that most new features can be introduced without breaking
backwards compatibility and we can write the parsing requirements such
that certain things will be ignored, but in and of itself, WebSRT doesn't
provide for this extensibility. Right now, there is for example
extensibility with the "WebSRT settings parsing" (that's the stuff behind
the timestamps) where further "setting:value" settings can be introduced.
But for example the introduction of new "cue identifiers" (that's the <>
marker at the start of a cue) would be difficult without a version string,
since anything that doesn't match the given list will just be parsed as a
cue-internal tag and thus end up as part of the cue text where plain text
parsing is used.
The bug I filed suggested allowing arbitrary voices, to simplify the
parser and to make future extensions possible. For a web format I think
this is a better approach than versioning. I haven't done a full review of
the parser, but there are probably more places where it could be more
forgiving so as to allow future tweaking.
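
As a rough illustration of what a "forgiving" parser for the
"setting:value" pairs mentioned above could look like, here is a sketch in
Python. The setting names in KNOWN_SETTINGS are hypothetical and are not
taken from the WebSRT draft, and the space-separated layout is an
assumption. The point is only that unknown or malformed tokens are skipped
rather than treated as errors, so settings added later do not break older
parsers.

```python
# Hypothetical setting names, chosen only for illustration.
KNOWN_SETTINGS = {"align", "position"}


def parse_settings(text):
    """Parse space-separated "setting:value" pairs, skipping what is unknown."""
    settings = {}
    for token in text.split():
        name, sep, value = token.partition(":")
        if not sep or not name or not value:
            continue  # malformed token: ignore it instead of failing
        if name not in KNOWN_SETTINGS:
            continue  # unknown setting: possibly from a later revision
        settings[name] = value
    return settings
```

With these assumptions, a line such as "align:start future:1 junk
position:50%" keeps only the align and position entries and silently drops
the rest.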
That's a good approach and will reduce the need for breaking
backwards-compatibility. In an xml-based format that need is 0, while with
a text format where the structure is ad-hoc, that need can never be
reduced to 0. That's what I am concerned about and that's why I think we
need a version identifier. If we end up never using/changing the version
identifier, so much the better. But I'd much rather we have it now and can
identify what specification a file adheres to than not being able to do so
later.

Perhaps I'm too influenced by HTML and its failed attempts at versioning, but I think that if you want to know which version of a spec a document is written against, you can run it through a parser for each version. This doesn't tell you the author intent, but I'm not sure that's very interesting to know. If the author thinks it's important, perhaps it can be put in a comment in the header.

On the other hand, keeping the same extension and (unregistered) MIME type
as SRT has plenty of benefits, such as immediately being able to use
existing SRT files in browsers without changing their file extension or
MIME type.
There is no harm for browsers to accept both MIME types if they are sure
they can parse old srt as well as new websrt. But these two formats are
different enough that they should be given a different extension and mime
type. I do not see a single advantage in stealing the MIME type of an
existing format for a new specification.
But there's no spec for the old SRT, the only thing one could do is parse
it with a WebSRT parser.
I can write that spec in an afternoon and register the mime type with
IANA. That really isn't a problem. People have managed to write correct
SRT files without having a spec, because it's so trivial. Creating a spec
is just a formality. For now, the wikipedia page really is sufficient.

Having a separate spec isn't really useful unless we expect people to implement it. Perhaps some new implementations would follow the spec, but browsers sure wouldn't implement two different parsers.

That would make text/srt and text/websrt synonymous, which is kind of
pointless.
No, it's only pointless if you are a browser vendor. For everyone else it
is a huge advantage to be able to choose between a guaranteed simple
format and a complex format with all the bells and whistles.
The advantages of taking text/srt is that all existing software to create
SRT can be used to create WebSRT
That's not strictly true. If they load a WebSRT file that was created by
some other software for further editing and that WebSRT file uses advanced
WebSRT functionality, the authoring software will break.

Right, especially settings appended after the timestamps are quite likely to be stripped when saving the file.

and servers that already send text/srt don't need to be updated. In either
case I think we should support only one mime type.
What's the harm in supporting two mime types but using the same parser to
parse them?

Most content will most likely be plain old SRT without voices, <ruby> or similar. People will create them using existing software with the .srt extension and serve them using the text/srt MIME type. When they later decide to add some <ruby> or similar, it will just work without changing the extension or MIME type. The net result is that text/srt and text/websrt mean exactly the same thing, making it a wasted effort.

  * there is no definition of the "canvas" dimensions that the cues are
    prepared for (width/height) and expected to work with other than
    saying it is the video dimensions - but these can change and the
    proportions should be changed with that


I'm not sure what you're saying here. Should the subtitle file be
hard-coded to a particular size? In the quite peculiar case where the same
subtitles really don't work at two different resolutions, couldn't we just
have two files? In what cases would this be needed?
Most subtitles will be created with a specific width and height in mind.
For example, the width in characters relies on the video canvas having at
least that size and the number of lines used usually refers to a lower
third of a video - where that is too small, it might cover the whole
video. So, my proposal is not to hard-code the subtitles to a particular
size, but to put the minimum width and height that are being used for the
creation of the subtitles into the file. Then, the file can be scaled
below or above this size to adjust to the actual available space.
In practice, does this mean scaling font-size by
width_actual/width_intended or similar? Personally, I prefer subtitles to
be something like 20 screen pixels regardless of video size, as that is
readable. Making them bigger hides more of the video, while making them
smaller makes them hard to read. But I guess we could let the CSS media
query min-width and similar be evaluated against the size of the
containing video element, to make it possible anyway.
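
The scaling rule asked about above can be written down in a few lines.
This is only a sketch of the width_actual/width_intended idea, in Python.
The function name and the choice to scale linearly by width alone are
assumptions for illustration, not something either proposal specifies.

```python
def scaled_font_size(base_px, intended_width, actual_width):
    """Scale a font size by width_actual / width_intended."""
    if intended_width <= 0:
        raise ValueError("intended_width must be positive")
    return base_px * actual_width / intended_width
```

For instance, text authored at 20 pixels for a 320 pixel wide video would
come out at scaled_font_size(20, 320, 1280) when the video is shown 1280
pixels wide, while the fixed-size alternative would simply keep 20 pixels
at every width.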
Have you ever tried to keep the small font size of subtitles on a 320x240
video when going full-screen? They are almost unusable at that size.
YouTube doesn't do a good job at that, incidentally, so you can go check
it out there - go full-screen and see how tiny the captions become then
step back from your screen to where you'd want to watch the video from and
notice how the captions are basically unreadable.
When you scale the font-size with the video, you do not hide more of the
video - you hide the exact same part of the video. Video and font get
larger in the same way. And that's exactly the need that we have.
Existing media players have basically two different ways of handling this.
The kind you're describing is like MPlayer, where subtitles appear to
actually be rendered on to the video frames and then scaled together with
the video. The kind I've used more is like Totem, where subtitles are
rendered in a separate layer at a fixed size in pixels, regardless of
whether or not you're watching in fullscreen. This means that word
wrapping will be different depending on screen size.
In the Totem case, does the font size increase with a change in screen size?

Oops, on closer inspection I am completely wrong, the text is actually rendered and scaled with the video, just a bit prettier than MPlayer does it. Maybe the prettiness led me to believe it was somehow different. Sigh.

My suggestion is to have them in different layers, but there is knowledge
about the intended anchoring, i.e. where is the text supposed to appear on
the video screen. Then keep that anchoring intact no matter what the video
size.
I find both MPlayer's and Totem's behavior annoying in some situations,
but personally prefer Totem most of the time.
Do you find MPlayer's behavior annoying because by rescaling already
rendered text, the text loses resolution and becomes less readable? This
is definitely not the behaviour I am after.

Scaling with the video is annoying with small videos, as the text ends up being huge in fullscreen. I assume we're going to do scaling as well as we can, so that's not an argument in either direction.

I'll have to withdraw any opinion for now, I don't know how to best deal with this.


--
Philip Jägenstedt
Core Developer
Opera Software
