Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
I've merged Nick White's bugfix into hocr-tools. Thank you, Nick. I expect most people will instead use the native PDF support built into Tesseract henceforth, and I intend to focus most of my time and energy there. However, there is still some use for hocr-pdf, especially when working with

Re: hocr2pdf and arabic language

2014-02-06 Thread universal reseller
Do you know the 3.03 release time?! -- -- You received this message because you are subscribed to the Google Groups tesseract-ocr group. To post to this group, send email to tesseract-ocr@googlegroups.com To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com For

Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
As for Arabic and other right-to-left scripts, please try using the new native PDF capability in Tesseract instead. It is significantly more sophisticated and I think it should work correctly.

Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
I don't know, it is up to Ray. My guess is quite soon. In any case, I just ran on your example images, noticed a small problem, and fixed it. Thank you for providing them. I should also mention that there is no need to convert your binary images to JPEG when using Tesseract's native PDF

Re: hocr2pdf and arabic language

2014-01-29 Thread peiman F.
I can't say this is a bug, but all words added to the PDF are reversed! When I search or extract text with pdftotext, all words are reversed, and I have to search for the reversed form to find them... Is there any solution!?
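
A possible stopgap for the reversed words, sketched in Python. This is not part of hocr-pdf or Tesseract; the function names are made up for illustration. It assumes each right-to-left word in the pdftotext output has simply had its characters reversed, and it flips those words back while leaving Latin text alone. Combining marks and mixed-direction words are not handled.

```python
import unicodedata


def is_rtl(word):
    """Return True if the word contains any right-to-left character.

    Arabic letters have bidirectional class "AL", Hebrew letters "R".
    """
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in word)


def unreverse_rtl_words(text):
    """Reverse the characters of each RTL word; leave other words as-is.

    Hypothetical helper: assumes words are separated by single spaces
    and that only the characters inside each word were reversed.
    """
    return " ".join(w[::-1] if is_rtl(w) else w for w in text.split(" "))


if __name__ == "__main__":
    logical = "\u0633\u0644\u0627\u0645"   # an Arabic word in logical order
    extracted = logical[::-1] + " pdf"     # what a reversed extraction might look like
    print(unreverse_rtl_words(extracted) == logical + " pdf")
```

The same function could be applied to a search term instead, so that the PDF itself does not need to be touched.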

Re: hocr2pdf and arabic language

2014-01-27 Thread Nick White
I just tested hocr2pdf, and amazingly you're right, it doesn't seem to support UTF-8. Which is pretty shocking.
> maybe you can try alternative solution ;-) [1]. It was created by google(I think ;-) ) and there is visible contributor e-mail if it does not work :-)

Re: hocr2pdf and arabic language

2014-01-27 Thread universal reseller
Also, it seems they don't answer their email either!! This is even more shocking to me!! :( How can we tell them about this bug!?

Re: hocr2pdf and arabic language

2014-01-27 Thread Nick White
> Also, it seems they don't answer their email either!! This is even more shocking to me!! :(
Well they're probably just busy with other things at the moment. We should be happy and grateful that they've released free tools which we can use!
> How can we tell them about this bug!?
They have

Re: hocr2pdf and arabic language

2014-01-27 Thread peiman F.
Thank you, Nick. Now I just run the script like this: ./hocr-pdf 1 The folder '1' contains .jpg and .hocr files with the same names. After a lot of output it ends with: -- 0008491207 0 n 0008491485 0 n 0008491764 0 n 0008492045 0 n 0008492326 0 n 0008492607

Re: hocr2pdf and arabic language

2014-01-27 Thread Janusz S. Bien
Quote/Cytat - peiman F. uniresel...@gmail.com (Mon 27 Jan 2014 01:23:45 PM CET): thank you nick now i just run the script like this ./hocr-pdf 1 the folder '1' contain .jpg and .hocr files with the same name after a lof of output it will end with : -- 0008491207 0 n

Re: hocr2pdf and arabic language

2014-01-27 Thread peiman F.
OK, thank you. I got the output file, but there are 2 problems: 1) the page order is different from the original; 2) the file is not searchable! I will paste my steps here: 1) make images from the file: gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg 2) use tesseract

Re: hocr2pdf and arabic language

2014-01-27 Thread Nick White
Hmm, that's odd. Can you post an example image and .html file that are included in this process? You're right, there's no text embedded in that pdf... On Mon, Jan 27, 2014 at 04:53:32AM -0800, peiman F. wrote: ok . thank you i got the out put file but there is 2 problem 1-the file page order

Re: hocr2pdf and arabic language

2014-01-27 Thread peiman F.
Sure, the zip file containing the images and the *.hocr files (renamed from *.html): http://nilzoom.com/files/1.zip

Re: hocr2pdf and arabic language

2014-01-27 Thread peiman F.
The correct link is this: http://nilzoom.com/files/1.zip

Re: hocr2pdf and arabic language

2014-01-27 Thread Nick White
Great, I found the bug and submitted a patch[0]. The reason was that all of the words are in a strong hOCR tag, which hocr2pdf wasn't handling. I also found that it wasn't sorting the pages in any sane way, so patched that too[1]. As there are quite a few patches floating around for now, I'm
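
For readers wondering how a wrapping strong tag can make words disappear, here is a small Python sketch. It is not the actual patch and not hocr-pdf's code; the sample markup and function names are assumptions made for illustration. It shows that reading only an element's direct text misses words nested inside <strong>, and it shows one way to sort page files so that page 10 does not come before page 2.

```python
import re
import xml.etree.ElementTree as ET

# Made-up hOCR fragment: the first word is wrapped in <strong>, the second is not.
HOCR_SNIPPET = """<div class='ocr_page'>
<span class='ocrx_word' title='bbox 10 10 50 30'><strong>first</strong></span>
<span class='ocrx_word' title='bbox 60 10 90 30'>second</span>
</div>"""


def naive_word_texts(hocr_xml):
    """Read only the direct text of each word span (misses nested tags)."""
    root = ET.fromstring(hocr_xml)
    return [s.text for s in root.iter("span") if s.get("class") == "ocrx_word"]


def word_texts(hocr_xml):
    """Collect all text under each word span, including nested tags."""
    root = ET.fromstring(hocr_xml)
    return ["".join(s.itertext()).strip()
            for s in root.iter("span") if s.get("class") == "ocrx_word"]


def page_sort_key(name):
    """Split a file name into text and number parts so digits sort numerically."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", name)]


if __name__ == "__main__":
    print(naive_word_texts(HOCR_SNIPPET))
    print(word_texts(HOCR_SNIPPET))
    print(sorted(["out_10.jpg", "out_2.jpg", "out_1.jpg"], key=page_sort_key))
```

With zero-padded names such as those produced by an out_%04d.jpg pattern, a plain sorted() on the file list would already give the right order; the numeric key only matters when the padding is missing.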

Re: hocr2pdf and arabic language

2014-01-27 Thread Jeff Breidenbach
I am the author of the hocr2pdf utility. Thank you for the patch, I'll merge it some time next week. This week my focus is fixing some problem reports with the new native PDF output capability for Tesseract. Jeff

Re: hocr2pdf and arabic language

2014-01-27 Thread peiman F.
Thank you, Nick! You helped me so much. The attached file is a script to automate the OCR processing and compile the PDF... Warm Regards

Re: hocr2pdf and arabic language

2014-01-26 Thread zdenko podobny
Did you ask the author of hocr2pdf this question? Zdenko On Sun, Jan 26, 2014 at 9:25 PM, peiman F. uniresel...@gmail.com wrote: Hi, I have a PDF file and I have to make it searchable. The PDF is in Arabic. I can OCR a single page of it with tesseract without any problem, but when I

Re: hocr2pdf and arabic language

2014-01-26 Thread universal reseller
Yes! But they didn't answer me after 1 week, so I asked here!! Does anyone else have problems with UTF-8!? On Mon, Jan 27, 2014 at 12:47 AM, zdenko podobny zde...@gmail.com wrote: Did you ask the author of hocr2pdf this question? Zdenko On Sun, Jan 26, 2014 at 9:25 PM, peiman F.

Re: hocr2pdf and arabic language

2014-01-26 Thread zdenko podobny
maybe you can try alternative solution ;-) [1]. It was created by google(I think ;-) ) and there is visible contributor e-mail if it does not work :-) https://code.google.com/p/hocr-tools/source/browse/hocr-pdf Zdenko On Sun, Jan 26, 2014 at 10:19 PM, universal reseller

Re: hocr2pdf and arabic language

2014-01-26 Thread universal reseller
This doesn't have any sample or documentation! It seems incomplete and not usable yet!!