I searched the archives and did not find this precise issue.
I have a VOB file extracted from a DVD. Call it 0055743.vob if you like.
VLC plays this VOB fine and displays the subtitles as they should be.
I use this transcode-based command to extract the subtitle substream:
tccat -i 0055743.vob | tcextract -x ps1 -t vob -a 0x20 > 0055743.en_0.subtrack;
and subtitle2pgm to break it out into images for OCR:
subtitle2pgm -o 0055743.en_0 -c 255,0,0,255 < 0055743.en_0.subtrack
Then I run the images through various OCR engines to get an SRT file.
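Since I run this on many files, the two extraction steps above can be wrapped in a small shell function. This is only a sketch of my own pipeline: the name extract_subs, its arguments and the defaults are made up for illustration, and the tool flags are exactly the ones shown above.

```sh
#!/bin/sh
# extract_subs VOB [TRACK_ID] [SUFFIX]
# Runs the same two steps as above on a single VOB file.
# Hypothetical helper: the name, arguments and defaults are illustrative only.
extract_subs() {
    vob=$1
    track=${2:-0x20}      # value passed to tcextract -a, as in the command above
    suffix=${3:-en_0}     # naming convention used in this post
    base=${vob%.vob}.$suffix

    # Step 1: pull the subtitle substream out of the VOB.
    tccat -i "$vob" | tcextract -x ps1 -t vob -a "$track" > "$base.subtrack" || return 1

    # Step 2: split the substream into PGM images for OCR.
    subtitle2pgm -o "$base" -c 255,0,0,255 < "$base.subtrack"
}
```

A batch run would then look something like: for f in *.vob; do extract_subs "$f" || echo "failed: $f" >&2; done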
The problem is that when I follow this process, some of the timings and subs
come out wrong. Very often a sub is repeated where there should be two
different subs. This often happens where the end time of one sub is the start
time of the next. Here is an example of this type from my process:
11
00:01:24,180 --> 00:01:26,819
30 barrels of rice for land taxes.

12
00:01:26,819 --> 00:01:29,510
30 barrels of rice for land taxes.
When it should give this:
11
00:01:24,180 --> 00:01:26,819
Yoza, it seems you have collected

12
00:01:26,819 --> 00:01:29,510
30 barrels of rice for land taxes.
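To find every place where this happens in a finished SRT file, something like the following sketch could be used. It assumes the standard SRT layout (index line, timing line, text lines, and a blank line between cues); the function name find_repeated_subs is just an illustrative choice.

```sh
#!/bin/sh
# find_repeated_subs FILE.srt
# Prints the index of every cue whose text is identical to the text of
# the cue immediately before it.
# Assumes cues are separated by blank lines, as in a standard SRT file.
find_repeated_subs() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one cue per record
    {
        # Field 1 is the index, field 2 the timing, the rest is the text.
        text = ""
        for (i = 3; i <= NF; i++) text = text $i "\n"
        if (NR > 1 && text == prev)
            print $1 ": repeats previous cue (" $2 ")"
        prev = text
    }' "$1"
}
```

Run on the bad output above, it should flag cue 12. Note that it also flags legitimate repeats, so the hits still need to be checked against the video.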
Obviously the PGM images extracted by subtitle2pgm are wrong. Sometimes
there are larger errors, where a whole sequence of images is displaced
by one.
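One way to confirm that the repeats are already present in the images, before any OCR is involved, is to compare neighbouring image files byte for byte. A minimal sketch, assuming the files are passed in display order; find_identical_images is a hypothetical helper name.

```sh
#!/bin/sh
# find_identical_images FILE...
# Prints each pair of neighbouring files (in argument order) whose
# contents are byte-identical.
find_identical_images() {
    prev=
    for f in "$@"; do
        # cmp -s is silent and returns 0 only when the files are identical.
        if [ -n "$prev" ] && cmp -s "$prev" "$f"; then
            echo "$prev == $f"
        fi
        prev=$f
    done
}
```

For example: find_identical_images 0055743.en_0*.pgm (the glob is a guess at how the output files are named, and it assumes they sort in display order). If identical pairs show up, the duplication happens before the OCR step.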
My question: is this a problem with tcextract or with subtitle2pgm?
Where should I look first for a fix?
Has anybody else seen this or related problems? I can host the 4 GB VOB for
anybody to download and test their setup on.
Also, what other simple ways are there to do this? I extract a lot of subs,
so it has to be command-line based and manageable.
Thanks in advance,
Simon.