I've merged Nick White's bugfix into hocr-tools. Thank you, Nick.
I expect most people will instead use the native PDF support
built into Tesseract henceforth, and I intend to focus most of my
time and energy there.
However, there is still some use for hocr-pdf, especially when
working with
Do you know when 3.03 will be released?!
--
--
You received this message because you are subscribed to the Google
Groups tesseract-ocr group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For
As for Arabic and other right-to-left scripts, please try using the new
native PDF capability in Tesseract instead. It is significantly more
sophisticated and I think it should work correctly.
I don't know, it is up to Ray. My guess is quite soon. In any case,
I just ran on your example images, noticed a small problem, and
fixed it. Thank you for providing them.
I should also mention that there is no need to convert your binary
images to JPEG when using Tesseract's native PDF
I can't say whether this is a bug,
but all the words added to the PDF are reversed!
When I search, or extract the text with pdftotext, all the words are
reversed, and I have to search for the reversed form to find them...
Is there any solution for this!?
I just tested hocr2pdf, and amazingly you're right, it doesn't seem
to support UTF-8. Which is pretty shocking.
Maybe you can try an alternative solution ;-) [1]. It was created by Google (I
think ;-) ), and there is a visible contributor e-mail if it does not work :-)
Also, it seems they don't answer their email either!!
That is even more shocking to me!! :(
How can we tell them about this bug!?
Also, it seems they don't answer their email either!!
That is even more shocking to me!! :(
Well they're probably just busy with other things at the moment. We
should be happy and grateful that they've released free tools which
we can use!
How can we tell them about this bug!?
They have
Thank you, Nick.
I just ran the script like this:
./hocr-pdf 1
The folder '1' contains .jpg and .hocr files with matching names.
After a lot of output it ends with:
--
0008491207 0 n
0008491485 0 n
0008491764 0 n
0008492045 0 n
0008492326 0 n
0008492607
Quote/Cytat - peiman F. uniresel...@gmail.com (Mon 27 Jan 2014
01:23:45 PM CET):
Thank you, Nick.
I just ran the script like this:
./hocr-pdf 1
The folder '1' contains .jpg and .hocr files with matching names.
After a lot of output it ends with:
--
0008491207 0 n
OK, thank you.
I got the output file, but there are 2 problems:
1- the file's page order is different from the original
2- the file is not searchable!
I will paste my steps here:
1)
Make images from the file:
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg
2)
use tesseract
Hmm, that's odd.
Can you post an example image and .html file that are included in
this process? You're right, there's no text embedded in that pdf...
On Mon, Jan 27, 2014 at 04:53:32AM -0800, peiman F. wrote:
OK, thank you.
I got the output file, but there are 2 problems:
1- the file page order
Sure.
The zip file contains the images and the *.hocr files, which were renamed
from *.html:
http://nilzoom.com/files/1.ziphttp://www.google.com/url?q=http%3A%2F%2Fnilzoom.com%2Ffiles%2Fa.pdfsa=Dsntz=1usg=AFQjCNGjdhFbzJy98ycXykuO4dnR8N7pkQ
the correct link is this:
http://nilzoom.com/files/1.zip
Great, I found the bug and submitted a patch[0]. The reason was that
all of the words are in a strong hOCR tag, which hocr2pdf wasn't
handling.
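
To illustrate the kind of bug described here, the following is a minimal Python sketch, not the actual patch. It assumes the hOCR is well-formed XML without a namespace, that each word is a span with class ocrx_word, and that the "strong hOCR tag" is a nested strong element around the word text. The sample data and the helper names naive_word_texts and word_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment: the first word is wrapped in <strong>.
SAMPLE = """<div class="ocr_line">
<span class="ocrx_word" title="bbox 10 10 50 30"><strong>Hello</strong></span>
<span class="ocrx_word" title="bbox 60 10 110 30">world</span>
</div>"""


def naive_word_texts(hocr_xml):
    """Read only the span's direct text; text inside child tags is missed."""
    root = ET.fromstring(hocr_xml)
    return [
        (span.text or "").strip()
        for span in root.iter("span")
        if span.get("class") == "ocrx_word"
    ]


def word_texts(hocr_xml):
    """Collect all text below each word span, including nested tags."""
    root = ET.fromstring(hocr_xml)
    return [
        "".join(span.itertext()).strip()
        for span in root.iter("span")
        if span.get("class") == "ocrx_word"
    ]


print(naive_word_texts(SAMPLE))  # the wrapped word comes back empty
print(word_texts(SAMPLE))        # both words are recovered
```

With the naive version, every word wrapped in a nested tag yields an empty string, which would be consistent with a PDF that ends up with no embedded text.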
I also found that it wasn't sorting the pages in any sane way, so
patched that too[1].
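
One common way to get a sane page order is a "natural" sort on the file names, sketched below in Python. This is an illustration only and not necessarily what the patch does; the helper name natural_key and the file names are hypothetical.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so that 2 sorts before 10."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


# Hypothetical page images without zero-padded numbers.
pages = ["out_10.jpg", "out_2.jpg", "out_1.jpg"]

plain_order = sorted(pages)
natural_order = sorted(pages, key=natural_key)

print(plain_order)
print(natural_order)
```

Zero-padded names such as those produced by an out_%04d.jpg pattern already sort correctly with a plain string sort, so the natural key only matters for unpadded numbering.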
As there are quite a few patches floating around for now, I'm
I am the author of the hocr2pdf utility. Thank you for the patch,
I'll merge it some time next week. This week my focus is fixing
some problem reports with the new native PDF output capability
for Tesseract.
Jeff
Thank you, Nick!
You helped me so much.
The attached file is a script to automate the OCR processing and the
compiling of the PDF...
Warm Regards
Did you ask the author of hocr2pdf this question?
Zdenko
On Sun, Jan 26, 2014 at 9:25 PM, peiman F. uniresel...@gmail.com wrote:
Hi,
I have a PDF file and I have to make it searchable.
The PDF is in Arabic.
I can OCR a single page of it with Tesseract without any problem,
but when I
Yes!
But they hadn't answered me after 1 week, so I asked here!!
Does anyone else have a problem with UTF-8!?
On Mon, Jan 27, 2014 at 12:47 AM, zdenko podobny zde...@gmail.com wrote:
Did you ask the author of hocr2pdf this question?
Zdenko
On Sun, Jan 26, 2014 at 9:25 PM, peiman F.
Maybe you can try an alternative solution ;-) [1]. It was created by Google (I
think ;-) ), and there is a visible contributor e-mail if it does not work :-)
https://code.google.com/p/hocr-tools/source/browse/hocr-pdf
Zdenko
On Sun, Jan 26, 2014 at 10:19 PM, universal reseller
This doesn't have any samples or documentation!
It doesn't seem complete or usable yet!!