On Wed, 25 Aug 2010 17:40:08 +0200, Silvia Pfeiffer
<silviapfeiff...@gmail.com> wrote:
At this point, what is your recommendation? The following ideas have been
on the table:
* Change the file extension to something other than .srt.
I don't have an opinion, browsers ignore the file extension anyway.
Yes, I think we should definitely have a new file extension.
I'll leave this to others to decide, but since browsers have no concept of
file extensions, just using .srt will work. If the format is SRT-like, it's
likely that at least some files will use .srt in practice.
All SRT files in practice use the .srt extension - it is typically how these
formats are identified by applications. Just because *nix mostly ignores file
extensions when identifying file types doesn't mean that applications do.
Again, I believe strongly that re-using the same file extension is the
single biggest pain we can inflict on the community.
As shown above, several popular (?) media players ignore or give little
weight to the file extension.
I don't think that's a fair sample - as I said, on Linux and on the
command-line things are different. I have a GUI mplayer here and it reacts
like VLC - doesn't let me open .wsrt files. The vast majority of
applications on Windows and the Mac make their decision on whether they
support files based on the file extension.
That the file selection dialogs are filtered by file extensions doesn't
mean that applications don't sniff the content. In fact, MPlayer, VLC and
Totem will happily load and use an SRT file even if it is called foo.smi,
even though SAMI is a completely incompatible format. In other words, they
sniff the content as being SRT. The reason that they rely on sniffing is
likely that many files use the wrong file extension (the files in my
OpenSubtitles batch have no extensions, so I have no statistics on this).
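To make "sniffing" concrete, here is a minimal sketch of a content check that ignores the file name entirely. It is not how MPlayer, VLC or Totem actually do it; the function name, the regular expression and the 20-line limit are all assumptions made for illustration.

```python
import re

# An SRT-style timing line, e.g. "00:00:01,000 --> 00:00:04,000".
# Both "," and "." are accepted before the milliseconds (assumption).
TIMING = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}"
)


def looks_like_srt(text, max_lines=20):
    """Return True if a timing line is among the first non-empty lines.

    Hypothetical helper: the file extension is never consulted, only
    the content, and a few lines of leading garbage are tolerated.
    """
    seen = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMING.match(line):
            return True
        seen += 1
        if seen >= max_lines:
            break
    return False
```

With a check like this, an SRT file named foo.smi is still detected as SRT, while real SAMI markup is not.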
Again, if we want to avoid exposing existing SRT parsers to WebSRT syntax,
then the format needs to be more incompatible. File extensions will be
changed, popular players rely on sniffing, some ignore leading garbage, and
headers can simply be removed by naive conversion tools.
Assuming we pick the same file extension and we now have a new application
that only supports WebSRT parsing, we will make a large bunch of existing
valid SRT files invalid - not only those that are not in UTF-8, but also
those with <font>..</font> and <u>...</u>. I do wonder if the text between
the <font> start and end element and inside the <u>..</u> may even get
removed because of lack of support for these.
I've seen no application that removes everything between tags it doesn't
recognize; the only things I've seen happen are treating it as plain
text or ignoring the tags, much like a browser does with HTML.
* Add a header to WebSRT to make it uniquely identifiable.
The header would have to be mandatory and browsers would have to reject
files that don't have it. Such files would be compatible with some existing
software and break some, depending on how they sniff. We could also put
metadata in such a header.
Yes, I think we need to introduce a header. Maybe we can hide all the
structure in what SRT recognizes as comments (i.e. start the lines with
";"). But I believe we need some hints like the @profile to identify the type
of the cues and the <link> to link to a style sheet, and we need metadata
like the <meta> element of HTML headers.
I had no idea that semicolon was used for comments in SRT, is this usage
widespread? Does it work in most players?
I thought it was, but maybe it was just introduced for WebSRT. It is not
tested in Hixie's SRT research[2]. Can you take a quick look through your
SRT file collection if there are any? I'm probably wrong about this seeing
as it's not mentioned in the wiki page for SRT [3].
[2] http://wiki.whatwg.org/wiki/SRT_research
[3] http://en.wikipedia.org/wiki/SubRip
OK, I grepped the 10000 files. Only 15 had any lines beginning with a
semicolon, and by manual inspection it doesn't look like any of them are
clearly intended as comments (it's hard to tell, all are in foreign
languages). None of them were at the very beginning of the file.
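For anyone who wants to repeat this kind of check on their own collection, a rough sketch follows. It is not the command I actually ran; the function names are made up, and it assumes the files are in an ASCII-compatible encoding so that a leading ";" is a single byte.

```python
from pathlib import Path


def semicolon_lines(text):
    """Return (line_number, line) pairs for lines that start with ';'."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(";")
    ]


def files_with_semicolon_lines(directory):
    """Map file name -> semicolon lines for every file in a directory.

    Hypothetical helper. latin-1 is used only because it maps every
    byte to a character, so decoding cannot fail on unknown encodings.
    """
    hits = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        text = path.read_bytes().decode("latin-1")
        found = semicolon_lines(text)
        if found:
            hits[path.name] = found
    return hits
```

The line numbers in the result make it easy to see whether any hit is at the very beginning of a file, which is where a header-style comment would have to be.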
Ah, that actually makes for another incompatibility of WebSRT and SRT: such
lines are regarded as comments in WebSRT when they probably aren't in SRT.
I can't find anything about this when searching for "comment" and
"semicolon" in the spec, are you sure you're not thinking of some other
format than WebSRT?
It seems increasingly that the only thing that WebSRT and SRT still have in
common is the "-->" character sequence. As a friend of mine in a11y recently
said: "I was hoping to never have to stare at "-->" ever again..." We could
indeed go all the way and define a far more different format, though I
don't think it will create implementations as quickly as an SRT-based but
changed format.
I would prefer if we follow one of two paths:
1. Let WebSRT be maximally compatible with SRT, making it a "retro-spec"
of existing SRT use with extensions that cause as little breakage as
possible in the ecosystem.
2. Make something incompatible and rid ourselves of all legacy
constraints. For example, there would be no need to accept both period and
comma as a separator between seconds and milliseconds.
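As an illustration of the legacy constraint mentioned in option 2, here is a small sketch of a timestamp parser that accepts both separators. The exact field widths are an assumption for the example, not a quote from the spec, and the function name is made up.

```python
import re

# hours:minutes:seconds, then "," or "." and exactly three digits.
# A format without legacy constraints could allow only one separator.
TIMESTAMP = re.compile(r"^(\d+):(\d{2}):(\d{2})[,.](\d{3})$")


def parse_timestamp(value):
    """Return the timestamp in milliseconds.

    Accepts either ',' or '.' between seconds and milliseconds and
    raises ValueError for anything else.
    """
    match = TIMESTAMP.match(value.strip())
    if match is None:
        raise ValueError(f"not a timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
```

Dropping the legacy requirement would mean replacing `[,.]` with a single character, which is a small change in code but a breaking one for existing files.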
I can't see any insurmountable issues with option 1, but would want to
hear from actual media player developers, not just our guesses of what
they might think. Option 2 would also be fine. Something in between, where
we try to make it a little bit incompatible in order to make people aware
that there *might* be some compatibility issues, is not something I'm
interested in.
--
Philip Jägenstedt
Core Developer
Opera Software