On Wed, 11 Aug 2010 01:43:01 +0200, Silvia Pfeiffer <silviapfeiff...@gmail.com> wrote:

On Tue, Aug 10, 2010 at 7:49 PM, Philip Jägenstedt <phil...@opera.com> wrote:

On Tue, 10 Aug 2010 01:34:02 +0200, Silvia Pfeiffer <
silviapfeiff...@gmail.com> wrote:

 On Tue, Aug 10, 2010 at 12:04 AM, Philip Jägenstedt <phil...@opera.com> wrote:

 On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer <
silviapfeiff...@gmail.com> wrote:


I guess this is in support of Henri's proposal of parsing the cue using the
HTML fragment parser (same as innerHTML)? That would be easy to implement,
but how do we then mark up speakers? Using <span class="narrator"></span>
around each cue is very verbose. HTML isn't very good for marking up dialog,
which is quite a limitation when dealing with subtitles...


I actually think that the <span @class> mechanism is much more flexible than
what we have in WebSRT right now. If we want multiple speakers to be able to
speak in the same subtitle, then that's not possible in WebSRT. It's a
little more verbose in HTML, but not massively.

We might be able to add a special markup similar to the <[timestamp]> markup
that Hixie introduced for karaoke. This is beyond the innerHTML parser and I
am not sure if it breaks it. But if it doesn't, then maybe we can also
introduce a <[voice]> marker to be used similarly?


An HTML parser parsing <1> or <00:01:30> will produce text nodes "<1>" and "<00:01:30>". Without having read the HTML parsing algorithm I guess that element names need to begin with a letter or similar. So, it's not possible to (ab)use the HTML parser to handle inner timestamps or numerical voices. We'd
have to replace those with something else, probably more verbose.
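A quick way to see this behaviour is the sketch below. It uses Python's html.parser as a stand-in for a browser's fragment parser, which is an assumption: it is not a full implementation of the HTML parsing algorithm, but it treats "<" followed by a digit as text in the same way. The names CueProbe and probe are made up for the illustration.

```python
from html.parser import HTMLParser


class CueProbe(HTMLParser):
    """Records which start tags and which text the parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        self.text_parts.append(data)


def probe(cue_text):
    """Return (start tags seen, concatenated text) for one cue."""
    parser = CueProbe()
    parser.feed(cue_text)
    parser.close()  # flush any buffered text
    return parser.tags, "".join(parser.text_parts)


if __name__ == "__main__":
    for cue in ["<1>Hello", "<00:01:30>la la", "<philip>Hello"]:
        print(cue, "->", probe(cue))
```

Only the cue starting with a letter yields an element; the numeric markers come back as plain text.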



I have checked the parse spec and
http://www.whatwg.org/specs/web-apps/current-work/#tag-open-state indeed
implies that a tag starting with a number is a parse error. Both the
timestamps and the voice markers thus seem to be problems when going with an
innerHTML parser. Is there a way to resolve this? I mean: I'd quite happily
drop the voice markers for a <span @class> but I am not sure what to do
about the timestamps. We could do what I did in WMML and introduce a <t>
element with the timestamp as an @at attribute, but that is again more
verbose. We could also introduce an @at attribute in <span> which would then
at least end up in the DOM and can be dealt with specially.
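As a rough sketch of how such markup could be consumed, the following pulls voice, timestamp and text events out of a cue written with <span class> and a <t at> element. The exact syntax is an assumption based on the description above, Python's html.parser stands in for the innerHTML parser, and CueEvents / cue_events are hypothetical names.

```python
from html.parser import HTMLParser


class CueEvents(HTMLParser):
    """Collects voice, timestamp and text events from one cue."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "t" and "at" in attrs:
            # Assumed form of an inner timestamp: <t at="00:01:30">
            self.events.append(("time", attrs["at"]))
        elif tag == "span" and "class" in attrs:
            # The class name doubles as the speaker / voice label.
            self.events.append(("voice", attrs["class"]))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


def cue_events(cue_text):
    parser = CueEvents()
    parser.feed(cue_text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    cue = '<span class="narrator">Once <t at="00:01:30">upon a time</t></span>'
    for event in cue_events(cue):
        print(event)
```

The cost is visible in the sample cue itself: it is noticeably longer than the same text with a bare timestamp marker.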

What should numerical voices be replaced with? Personally I'd much rather write <philip> and <silvia> to mark up a conversation between us two, as I think it'd be quite hard to keep track of the numbers if editing subtitles with many different speakers. However, going with that and using an HTML parser is quite a hack. Names like <mark> and <li> may already have special parsing rules or default CSS.

Going with HTML in the cues, we either have to drop voices and inner timestamps or invent new markup, as HTML can't express either. I don't think either of those are really good solutions, so right now I'm not convinced that reusing the innerHTML parser is a good way forward.

Think for example about the case where we had a requirement that a double
newline starts a new cue, but now we want to introduce a means where the
double newline is escaped and can be made part of a cue.

Other formats keep track of their version, such as MS Word files. It is to
be hoped that most new features can be introduced without breaking backwards
compatibility and we can write the parsing requirements such that certain things will be ignored, but in and of itself, WebSRT doesn't provide for this extensibility. Right now, there is for example extensibility with the "WebSRT settings parsing" (that's the stuff behind the timestamps) where
further "setting:value" settings can be introduced. But for example the
introduction of new "cue identifiers" (that's the <> marker at the start of
a cue) would be difficult without a version string, since anything that
doesn't match the given list will just be parsed as a cue-internal tag and
thus end up as part of the cue text where plain text parsing is used.
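To make the "setting:value" extensibility concrete, here is a minimal sketch of a timing-line parser that keeps the settings it knows and skips the rest. The line layout, the setting names "align" and "line", and the function name parse_timing_line are all assumptions for illustration, not taken from the WebSRT draft.

```python
def parse_timing_line(line, known_settings=("align", "line")):
    """Split 'start --> end name:value ...' into times and settings.

    Settings whose name is not in known_settings are skipped rather than
    treated as errors, which is what leaves room for later additions.
    """
    start, _, rest = line.partition("-->")
    fields = rest.split()
    if not start.strip() or not fields:
        raise ValueError("not a timing line: %r" % line)
    end, raw_settings = fields[0], fields[1:]
    settings, ignored = {}, []
    for item in raw_settings:
        name, sep, value = item.partition(":")
        if sep and name in known_settings:
            settings[name] = value
        else:
            ignored.append(item)  # unknown today, possibly valid tomorrow
    return start.strip(), end, settings, ignored


if __name__ == "__main__":
    print(parse_timing_line(
        "00:00:01,000 --> 00:00:04,000 align:start future:thing"))
```

An older parser written this way keeps working when a new setting appears; the cue identifier case is harder because unknown markers fall through into the cue text instead of being dropped.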


The bug I filed suggested allowing arbitrary voices, to simplify the parser and to make future extensions possible. For a web format I think this is a better approach than versioning. I haven't done a full review of the parser, but there are probably more places where it could be more forgiving
so as to allow future tweaking.



That's a good approach and will reduce the need for breaking
backwards-compatibility. In an XML-based format that need is 0, while with a text format where the structure is ad-hoc, that need can never be reduced to 0. That's what I am concerned about and that's why I think we need a version
identifier. If we end up never using/changing the version identifier, so
much the better. But I'd much rather we have it now and can identify what
specification a file adheres to than not being able to do so later.

Perhaps I'm too influenced by HTML and its failed attempts at versioning, but I think that if you want to know which version of a spec a document is written against, you can run it through a parser for each version. This doesn't tell you the author intent, but I'm not sure that's very interesting to know. If the author thinks it's important, perhaps it can be put in a comment in the header.

On the other hand, keeping the same extension and (unregistered) MIME type
as SRT has plenty of benefits, such as immediately being able to use
existing SRT files in browsers without changing their file extension or
MIME type.



There is no harm for browsers to accept both MIME types if they are sure
they can parse old srt as well as new websrt. But these two formats are
different enough that they should be given a different extension and mime
type. I do not see a single advantage in stealing the MIME type of an
existing format for a new specification.


But there's no spec for the old SRT, the only thing one could do is parse
it with a WebSRT parser.


I can write that spec in an afternoon and register the MIME type with IANA. That really isn't a problem. People have managed to write correct SRT files
without having a spec, because it's so trivial. Creating a spec is just a
formality. For now, the Wikipedia page really is sufficient.

Having a separate spec isn't really useful unless we expect people to implement it. Perhaps some new implementations would follow the spec, but browsers sure wouldn't implement two different parsers.

That would make text/srt and text/websrt synonymous, which is kind of
pointless.


No, it's only pointless if you are a browser vendor. For everyone else it is a huge advantage to be able to choose between a guaranteed simple format and
a complex format with all the bells and whistles.



The advantages of taking text/srt is that all existing software to create
SRT can be used to create WebSRT


That's not strictly true. If they load a WebSRT file that was created by
some other software for further editing and that WebSRT file uses advanced
WebSRT functionality, the authoring software will break.

Right, especially settings appended after the timestamps are quite likely to be stripped when saving the file.

and servers that already send text/srt don't need to be updated. In either
case I think we should support only one mime type.


What's the harm in supporting two mime types but using the same parser to
parse them?

Most content will most likely be plain old SRT without voices, <ruby> or similar. People will create them using existing software with the .srt extension and serve them using the text/srt MIME type. When they later decide to add some <ruby> or similar, it will just work without changing the extension or MIME type. The net result is that text/srt and text/websrt mean exactly the same thing, making it a wasted effort.
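The "same parser for both types" arrangement can be sketched as a simple lookup table. Both type names come from this thread and neither is claimed to be registered; parse_websrt here is only a placeholder that splits on blank lines, not a real WebSRT parser.

```python
def parse_websrt(data):
    """Placeholder parser: split the file into cue blocks on blank lines."""
    normalized = data.replace("\r\n", "\n").strip()
    return [block for block in normalized.split("\n\n") if block.strip()]


# Both types map to the same function, so they are indistinguishable
# to whoever consumes the parsed cues.
PARSERS = {
    "text/srt": parse_websrt,
    "text/websrt": parse_websrt,
}


def parse_by_mime_type(mime_type, data):
    # Drop parameters such as "; charset=utf-8" before the lookup.
    essence = mime_type.split(";", 1)[0].strip().lower()
    try:
        parser = PARSERS[essence]
    except KeyError:
        raise ValueError("unsupported subtitle type: %s" % essence)
    return parser(data)


if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:02,000\nHi\n\n2\n00:00:03,000 --> 00:00:04,000\nBye\n"
    print(parse_by_mime_type("text/srt", sample))
```

Whether the second table entry buys anything is exactly the disagreement above: the lookup costs nothing, but it also conveys nothing.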

  * there is no definition of the "canvas" dimensions that the cues are
    prepared for (width/height) and expected to work with other than saying
    it is the video dimensions - but these can change and the proportions
    should be changed with that


 I'm not sure what you're saying here. Should the subtitle file be
hard-coded to a particular size? In the quite peculiar case where the same
subtitles really don't work at two different resolutions, couldn't we just
have two files? In what cases would this be needed?



Most subtitles will be created with a specific width and height in mind. For
example, the width in characters relies on the video canvas having at least
that size and the number of lines used usually refers to a lower third of a
video - where that is too small, it might cover the whole video. So, my
proposal is not to hard-code the subtitles to a particular size, but to put
the minimum width and height that are being used for the creation of the
subtitles into the file. Then, the file can be scaled below or above this
size to adjust to the actual available space.


In practice, does this mean scaling font-size by
width_actual/width_intended or similar? Personally, I prefer subtitles to be
something like 20 screen pixels regardless of video size, as that is
readable. Making them bigger hides more of the video, while making them smaller makes them hard to read. But I guess we could let the CSS media
query min-width and similar be evaluated against the size of the containing
video element, to make it possible anyway.
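The width_actual/width_intended idea amounts to one multiplication. The sketch below assumes the intended width is stored with the subtitle file, which is the proposal under discussion rather than anything WebSRT defines; the optional min_px floor is an added assumption to model the "at least about 20 pixels" preference.

```python
def scaled_font_size(base_px, intended_width, actual_width, min_px=None):
    """Scale a font size by actual_width / intended_width.

    base_px is the size chosen for a video intended_width pixels wide.
    min_px, if given, keeps the result from dropping below a readable size.
    """
    if intended_width <= 0:
        raise ValueError("intended_width must be positive")
    size = base_px * actual_width / intended_width
    if min_px is not None:
        size = max(size, min_px)
    return size


if __name__ == "__main__":
    # 20px text authored for a 320px-wide video, shown 1280px wide.
    print(scaled_font_size(20, 320, 1280))
    # Shrinking the video, but never going below 20px.
    print(scaled_font_size(20, 640, 320, min_px=20))
```

Because text and video grow by the same factor, the share of the picture covered by the text stays constant under pure scaling; only a floor or ceiling changes that.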




Have you ever tried to keep the small font size of subtitles on a 320x240
video when going full-screen? They are almost unusable at that size. YouTube
doesn't do a good job at that, incidentally, so you can go check it out
there - go full-screen and see how tiny the captions become, then step back
from your screen to where you'd want to watch the video from and notice how
the captions are basically unreadable.

When you scale the font-size with the video, you do not hide more of the
video - you hide the exact same part of the video. Video and font get larger
in the same way. And that's exactly the need that we have.


Existing media players have basically two different ways of handling this.
The kind you're describing is like MPlayer, where subtitles appear to
actually be rendered on to the video frames and then scaled together with
the video. The kind I've used more is like Totem, where subtitles are
rendered in a separate layer at a fixed size in pixels, regardless of
whether or not you're watching in fullscreen. This means that word wrapping
will be different depending on screen size.


In the Totem case, does the font size increase with a change in screen size?

Oops, on closer inspection I am completely wrong, the text is actually rendered and scaled with the video, just a bit prettier than MPlayer does it. Maybe the prettiness led me to believe it was somehow different. Sigh.

My suggestion is to have them in different layers, but there is knowledge
about the intended anchoring, i.e. where is the text supposed to appear on
the video screen. Then keep that anchoring intact no matter what the video
size.
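One way to read "keep the anchoring intact" is to store the anchor as a fraction of the video box and convert it to pixels only at render time. This is a sketch under that assumption; the fractional representation and the name anchor_to_pixels are illustrative and not part of any proposal in this thread.

```python
def anchor_to_pixels(anchor_x, anchor_y, video_width, video_height):
    """Map an anchor given as fractions (0.0 to 1.0) to pixel coordinates."""
    if not (0.0 <= anchor_x <= 1.0 and 0.0 <= anchor_y <= 1.0):
        raise ValueError("anchor fractions must be within 0.0 and 1.0")
    return round(anchor_x * video_width), round(anchor_y * video_height)


if __name__ == "__main__":
    # Same anchor (horizontally centred, 90% down) at two video sizes.
    for width, height in [(320, 240), (1920, 1080)]:
        print((width, height), "->", anchor_to_pixels(0.5, 0.9, width, height))
```

The text layer can then be rendered at native resolution for whatever size the video has, so nothing is rescaled after rasterisation.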



I find both MPlayer's and Totem's behavior annoying in some situations, but
personally prefer Totem most of the time.


Do you find MPlayer's behavior annoying because by rescaling already
rendered text, the text loses resolution and becomes less readable? This is
definitely not the behaviour I am after.

Scaling with the video is annoying with small videos, as the text ends up being huge in fullscreen. I assume we're going to do scaling as well as we can, so that's not an argument in either direction.

I'll have to withdraw any opinion for now, I don't know how to best deal with this.

--
Philip Jägenstedt
Core Developer
Opera Software
