On Wed, 25 Aug 2010 17:40:08 +0200, Silvia Pfeiffer <silviapfeiff...@gmail.com> wrote:

At this point, what is your recommendation? The following ideas have been
on the table:

* Change the file extension to something other than .srt.

I don't have an opinion; browsers ignore the file extension anyway.


 Yes, I think we should definitely have a new file extension.


I'll leave this to others to decide, but since browsers have no concept
of file extensions, just using .srt will work. If the format is SRT-like
it's likely at least some files will use .srt in practice.



All SRT files in practice use the .srt extension - it is typically how
these formats are identified by applications. Just because *nix mostly
ignores file extensions for identifying file types doesn't mean that
applications do. Again, I believe strongly that re-using the same file
extension is the single biggest pain we can inflict on the community.


As shown above, several popular (?) media players ignore or give little
weight to the file extension.


I don't think that's a fair sample - as I said, on Linux and on the
command-line things are different. I have a GUI mplayer here and it reacts
like VLC - doesn't let me open .wsrt files. The vast majority of
applications on Windows and the Mac make their decision on whether they
support files based on the file extension.

That the file selection dialogs are filtered by file extensions doesn't mean that applications don't sniff the content. In fact, MPlayer, VLC and Totem will happily load and use an SRT file even if it is called foo.smi, even though SAMI is a completely incompatible format. In other words, they sniff the content as being SRT. The reason that they rely on sniffing is likely that many files use the wrong file extension (the files in my OpenSubtitles batch have no extensions, so I have no statistics on this).
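
To illustrate the kind of content sniffing described above, here is a minimal sketch in Python. It is not how MPlayer, VLC or Totem actually detect SRT; the function name, the regular expression and the 20-line limit are all assumptions made for the example. It simply looks for an SRT-style timing line near the start of the data and ignores the file name entirely.

```python
import re

# Hypothetical heuristic: an SRT-like timing line such as
# "00:00:01,000 --> 00:00:04,000" (comma or period before the milliseconds).
TIMING_LINE = re.compile(
    r"^\s*\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}"
)


def looks_like_srt(text, max_lines=20):
    """Return True if one of the first max_lines lines is a timing line.

    The file extension is never consulted, so leading garbage or a
    misleading name such as foo.smi does not affect the result.
    """
    for line in text.splitlines()[:max_lines]:
        if TIMING_LINE.match(line):
            return True
    return False
```

With a check like this, an SRT payload is recognized whatever the file is called, while a SAMI document is not, because it contains no "-->" timing line.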

Again, if we want to avoid exposing existing SRT parsers to WebSRT syntax, then the format needs to be more incompatible. File extensions will be changed, popular players rely on sniffing, some ignore leading garbage, and headers can simply be removed by naive conversion tools.

Assuming we pick the same file extension and we now have a new application
that only supports WebSRT parsing, we will make a large bunch of existing
valid SRT files invalid - not only those that are not in UTF-8, but also
those with <font>..</font> and <u>...</u>. I do wonder if the text between
the <font> start and end element and inside the <u>..</u> may even get
removed because of lack of support for these.

I've seen no application that removes everything between tags it doesn't recognize. The only things I've seen happen are treating the tags as plain text or ignoring them, much like a browser does with HTML.

* Add a header to WebSRT to make it uniquely identifiable.


The header would have to be mandatory and browsers would have to reject
files that don't have it. Such files would be compatible with some
existing software and break some, depending on how they sniff. We could
also put metadata in such a header.


Yes, I think we need to introduce a header. Maybe we can hide all the
structure in what SRT recognizes as comments (i.e. start the lines with
";"). But I believe we need some hints like the @profile to identify the
type of the cues and the <link> to link to a style sheet, and we need
metadata like the <meta> element of HTML headers.


I had no idea that semicolon was used for comments in SRT, is this usage
widespread? Does it work in most players?



I thought it was, but maybe it was just introduced for WebSRT. It is not
tested in Hixie's SRT research [2]. Can you take a quick look through your
SRT file collection to see if there are any? I'm probably wrong about
this, seeing as it's not mentioned in the wiki page for SRT [3].

[2] http://wiki.whatwg.org/wiki/SRT_research
[3] http://en.wikipedia.org/wiki/SubRip


OK, I grepped the 10000 files. Only 15 had any lines beginning with a
semicolon, and by manual inspection it doesn't look like any of them are
clearly intended as comments (it's hard to tell, all are in foreign
languages). None of them were at the very beginning of the file.
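
For anyone wanting to repeat this kind of check on their own collection, a rough Python equivalent might look like the sketch below. This is not the command that was actually run; the function name is made up for the example, and it only reports which lines of a single file's text begin with a semicolon.

```python
def semicolon_lines(text):
    """Return the 1-based numbers of lines that begin with a semicolon."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(";")
    ]
```

A file where the result includes line 1 would have a semicolon line at the very beginning, which is the case a comment-style header would depend on.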


Ah, that actually makes for another incompatibility of WebSRT and SRT: such lines are regarded as comments in WebSRT when they probably aren't in SRT.

I can't find anything about this when searching for "comment" and "semicolon" in the spec, are you sure you're not thinking of some other format than WebSRT?

It seems increasingly that the only thing that WebSRT and SRT still have in common is the "-->" character sequence. As a friend of mine in a11y recently said: "I was hoping to never have to stare at "-->" ever again..." We could
indeed go all the way and define a much more different format, though I
don't think it will create implementations as quickly as an SRT-based but
changed format.

I would prefer if we follow one of two paths:

1. Let WebSRT be maximally compatible with SRT, making it a "retro-spec" of existing SRT use with extensions that cause as little breakage as possible in the ecosystem.

2. Make something incompatible and rid ourselves of all legacy constraints. For example, there would be no need to accept both period and comma as a separator between seconds and milliseconds.
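
The separator point in option 2 can be made concrete with a small Python sketch. The function and its `separators` parameter are hypothetical, and which single separator an incompatible format would pick is not decided anywhere in this thread; the period used in the strict call below is an arbitrary choice for illustration.

```python
import re


def parse_timestamp(ts, separators=",."):
    """Parse "HH:MM:SS<sep>mmm" into a number of milliseconds.

    separators lists the characters accepted between seconds and
    milliseconds: ",." mimics the lenient legacy behaviour, while a
    single character models a strict format.
    """
    pattern = r"(\d{2}):(\d{2}):(\d{2})[" + re.escape(separators) + r"](\d{3})"
    match = re.fullmatch(pattern, ts)
    if match is None:
        raise ValueError(f"not a valid timestamp: {ts!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
```

Called with the default, both "00:01:02,500" and "00:01:02.500" parse to the same value; called with separators=".", the comma form raises ValueError instead.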

I can't see any insurmountable issues with option 1, but would want to hear from actual media player developers, not just our guesses of what they might think. Option 2 would also be fine. Something in between, where we try to make it a little bit incompatible in order to make people aware that there *might* be some compatibility issues, is not something I'm interested in.

--
Philip Jägenstedt
Core Developer
Opera Software
