I can confirm that the fix from Rudolf (rk-com) and George Chriss (gschriss)
works. Thanks!
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/623438
Title:
Font size not correct in merged sandvich PDF
To
Many thanks to George Chriss! (see above)
My workaround, based on his description:
Modify the generated hocr with XSLT (see below), then run hocr2pdf 0.8.9; the
text boxes are then placed (almost) correctly.
$ tesseract image.tif ocr_file hocr
$ xsltproc -html -nonet -novalid -o ocr_fixed.hocr
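For reference, the three steps of this workaround (OCR, XSLT fix-up, PDF merge) can be sketched as a small Python helper. This is only a sketch under assumptions: the stylesheet path and the name of the file tesseract writes are not given above, so `stylesheet` and `raw_hocr` are hypothetical parameters, and the hocr2pdf call assumes its usual `-i image -o pdf` form with the hocr read from standard input.

```python
import subprocess


def build_commands(image, stylesheet, raw_hocr,
                   fixed_hocr="ocr_fixed.hocr", pdf="out.pdf"):
    """Build the argument lists for the three steps without running them.

    stylesheet and raw_hocr are hypothetical: the thread does not name the
    XSLT file or the file that tesseract writes.
    """
    tesseract = ["tesseract", image, "ocr_file", "hocr"]
    xslt = ["xsltproc", "-html", "-nonet", "-novalid",
            "-o", fixed_hocr, stylesheet, raw_hocr]
    # Assumption: hocr2pdf reads the hocr from stdin, so no input file here.
    hocr2pdf = ["hocr2pdf", "-i", image, "-o", pdf]
    return tesseract, xslt, hocr2pdf


def run_pipeline(image, stylesheet, raw_hocr,
                 fixed_hocr="ocr_fixed.hocr", pdf="out.pdf"):
    """Run OCR, the XSLT fix-up and the PDF merge, stopping on any failure."""
    tesseract, xslt, hocr2pdf = build_commands(
        image, stylesheet, raw_hocr, fixed_hocr, pdf)
    subprocess.run(tesseract, check=True)
    subprocess.run(xslt, check=True)
    with open(fixed_hocr, "rb") as fixed:
        subprocess.run(hocr2pdf, stdin=fixed, check=True)
```

Keeping the command construction separate from execution makes it easy to print the commands and check them by hand before anything is run.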
Treating Comment #1 as "works as intended" (with a character precision
limitation) and Bug #632524 as broken (font size/placement has no
correlation to the underlying text, plus out-of-bounds/missing/dog-piled
text), I'm happy to report the following:
While developing a new Inkscape extension to export
Link to Inkscape Extension 'Export Image Overlay Text as hOCR' mentioned
in Comment #58: https://bugs.launchpad.net/inkscape/+bug/1069248
** Changed in: exactimage (Ubuntu)
Status: New => Confirmed
On Mon, Aug 8, 2011 at 09:40, Jussi Pakkanen
jussi.pakka...@canonical.com wrote:
> I'd like to remind everyone that Cuneiform is currently unmaintained.
> No-one is working on this or any other bug.
Sad, but I had such an impression already.
As far as I can see the one and only OCR option for Linux
To be fair there are also OCRAD, GOCR, and Tesseract.
Igor
On Wed, 2011-08-10 at 08:53 +, Martin Wildam wrote:
> On Mon, Aug 8, 2011 at 09:40, Jussi Pakkanen
> jussi.pakka...@canonical.com wrote:
>> I'd like to remind everyone that Cuneiform is currently unmaintained.
>> No-one is working on
I'd like to remind everyone that Cuneiform is currently unmaintained.
No-one is working on this or any other bug.
I've installed exactimage 0.8.6 from source and verified that it still can't
cope with the new cuneiform hocr file format.
The latest cuneiform version that still outputs the old format is 0.8.0. I had to
revert to that version to get usable results.
Since both cuneiform and hocr2pdf are needed to get
I'm having a similar issue. I can confirm that it is not related to Cuneiform.
I'm using ocropus (ocroscript recognize), which uses Tesseract, and I have
checked the resulting .html (hocr), which seems valid and pixel-perfect.
However, hocr2pdf misaligns the text with its related bounding boxes.
I just recently discovered this issue and wonder what the final
disposition is. I read all the comments, but I am still unsure what is
going to happen. Has it been determined that cuneiform is producing hocr
standard compliant output and the issue is with hocr2pdf? Based on what
I have read in the
Let me first summarize the cuneiform specific issues / proposed changes
from Martin Wildam's conversation with Rene Rebe.
1) rev 413 to 415 completely changed the way bounding box info is written: there is
now one bbox per line plus an additional array of x start positions, and the y height
needed for a proper font size is missing
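To illustrate why the y extent matters for font size, here is a rough Python sketch that reads a `bbox x0 y0 x1 y1` entry from an hOCR title string and turns its height into a point size. The conversion (box height in pixels times 72 divided by the scan resolution) is an assumption made for illustration, not necessarily what hocr2pdf does, and `parse_bbox`, `font_size_pt` and the 300 dpi default are hypothetical.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) as ints, or None if no bbox is present."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def font_size_pt(bbox, dpi=300):
    """Estimate a font size in points from the bbox height in pixels.

    Assumption: the text fills the full height of its box, so only a box
    with real y coordinates (not just x start positions) can be used.
    """
    x0, y0, x1, y1 = bbox
    return (y1 - y0) * 72.0 / dpi
```

With only a per-line bbox and x start positions, every character on a line inherits the same height, which is why a missing per-word or per-character y extent makes the estimate coarse.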
I find the specification somewhat difficult to interpret at times but
it is my understanding that character bbox info goes within the
ocr_line tag element. Whether it goes before or after the textual
elements is irrelevant. E.g.
<span class='ocr_line' id='line_18' title="bbox 363 1253 581
Jakub Wilk, as you can see in any hocr output, the span is closed; I was
sloppy when I copy-pasted it into the post. I have run the hocr output produced by
cuneiform through
http://validator.w3.org/check
and it validates just fine.
As for the
<span class='ocr_line'...>Some text<span
I will have to change the ocr_cinfo span anyway, to fix the whitespace
bbox; I have also noted that cuneiform occasionally gives control
codes as part of the text. I'm not sure when I will have time to make the
changes, but in any case, we could agree on what the format should be
and then
Example:
<span class='ocr_line' id='line_1' title="bbox 0 0 45 20"><span
class='ocr_xword' id='xword_1' title="bbox 0 0 20 20"><span class='ocr_cinfo'
title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">hello</span></span><span>
</span><span class='ocr_xword' id='xword_2' title="bbox 25 0 45 20"><span
class='ocr_cinfo'
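A consumer of the nested layout proposed in this example could be sketched in Python as follows. This is a minimal sketch using only the standard library: the `XWordCollector` class and `extract_words` function are hypothetical, and the sample string completes the truncated example with an invented second word ("world") and invented x_bboxes values purely for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

# Hypothetical completion of the example; "world" and the x_bboxes
# numbers are made up for illustration.
SAMPLE = (
    "<span class='ocr_line' id='line_1' title='bbox 0 0 45 20'>"
    "<span class='ocr_xword' id='xword_1' title='bbox 0 0 20 20'>"
    "<span class='ocr_cinfo' title='x_bboxes 0 0 4 20'>hello</span></span>"
    "<span> </span>"
    "<span class='ocr_xword' id='xword_2' title='bbox 25 0 45 20'>"
    "<span class='ocr_cinfo' title='x_bboxes 25 0 29 20'>world</span></span>"
    "</span>"
)


class XWordCollector(HTMLParser):
    """Collect (text, bbox) pairs for every ocr_xword span."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished (text, bbox) pairs
        self._stack = []      # class of each currently open <span>
        self._current = None  # (text parts, bbox) while inside an ocr_xword

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        css_class = attributes.get("class") or ""
        self._stack.append(css_class)
        if css_class == "ocr_xword":
            match = BBOX_RE.search(attributes.get("title") or "")
            bbox = tuple(int(g) for g in match.groups()) if match else None
            self._current = ([], bbox)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        css_class = self._stack.pop()
        if css_class == "ocr_xword" and self._current is not None:
            parts, bbox = self._current
            self.words.append(("".join(parts), bbox))
            self._current = None

    def handle_data(self, data):
        # Text outside an ocr_xword (e.g. the whitespace span) is ignored.
        if self._current is not None:
            self._current[0].append(data)


def extract_words(hocr):
    """Return the list of (text, bbox) pairs found in an hOCR fragment."""
    collector = XWordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words
```

Calling `extract_words(SAMPLE)` yields one entry per ocr_xword span, each with the word text and its own bbox, which is the per-word y information a PDF merger needs for sizing.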
Similar problems when using Ocropus
--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
This bug also affects me. Would it be possible to add a command-line
switch which allows reverting to the older bounding box format?
Have downgraded to 0.8 for the time being...
I have got in touch with the developer. He has a lot to do, but I
sent a donation and he looked at the issue (I exchanged a few emails
with him). Here is his final response so far:
On Mon, Sep 13, 2010 at 10:28, Rene Rebe r...@exactcode.de wrote:
Dear Martin,
the problem is that the latest
** Changed in: cuneiform-linux
Status: Invalid => Confirmed
I am not entirely convinced by his arguments about UTF-8 and
whitespace (sounds like just being too lazy to adapt the parser to the hOCR
specs), but the loss of information about y-coordinates, which used to
be present in the output of the previous versions, sounds very much like
a bug (if it's indeed
How will you proceed now regarding this issue?
I reopened the bug and maybe Jussi or someone who cares will have a
look.
Here is another note from René:
On Mon, Sep 13, 2010 at 11:53, Rene Rebe r...@exactcode.de wrote:
Note that I wrote the initial hOCR annotation in cuneiform, ... :-)
If they desperately want to keep this new format, one could add 2
different hOCR formats, like hocr and hocr-detailed or so to
I am not aware of any open source OCR software that is doing multi-
column document recognition. It's more of a segmentation task, rather
than recognition itself, so it should be rather implemented in a front-
end, such as OCRopus. If you have a linear text flow, sandwich PDFs can
be read by a
** Also affects: exactimage (Ubuntu)
Importance: Undecided
Status: New
The bug against exactimage is not going to be processed, as this package
is autosynced from Debian, so the way it will work is as follows: one
day someone from Ubuntu will report it against Debian, and a few years
later a Debian Developer will try to report it to upstream.
It is possible to change
Yes, I used the company number. And I already sent them an email. So far no
response.
I have now followed your advice and subscribed to the mailing list, and I will report the
issue there - we will see if this works.
Thank you for your assistance.
Martin,
Have you tried other OCR engines which can generate hOCR output?
I'm not sure all of them can but here are a few free and open source OCR
engines I've run on Linux:
GOCR
OCRAD
Tesseract
Does this issue affect them as well?
Best,
Igor
On Fri, 2010-09-10 at 11:45 +, Martin Wildam
I don't understand your question. Can you formulate it using no more
than 75 words?
@Igor: I searched quite a while - I don't remember ocrad explicitly now,
but I am quite sure I came across it. I also found at other places (blog
posts) that cuneiform seems to be the only one producing hocr output.
I would be glad if there were more choices. I have written a common
file
Martin,
I'm not using this functionality myself, so you most likely know best,
but OCRAD produces ORF output with the -x command-line option.
According to the README, the ORF file will contain bounding boxes for OCRed
characters and lines.
Igor
On Fri, 2010-09-10 at 17:52 +, Martin Wildam wrote:
On Fri, 10 Sep 2010 Martin Wildam 623...@bugs.launchpad.net wrote:
@Igor: I searched quite a while - don't remember ocrad explicitly now
but I am quite sure I came across it. I also found at other places (blog
posts) that cuneiform seems to be the only one producing hocr output.
This was
I could not find any documentation about how to get the hocr output back
when I tested those OCR engines and after looking back now I can't find
any documentation for ocropus or tesseract on how to produce the hocr
html files.