Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Jeff Breidenbach
Tesseract produces searchable PDF directly. If you really want to use hOCR as an intermediate format, you can, but you will need external software. There are a couple of "hocr2pdf" programs floating around, and "OCRmyPDF" does an admirable job of tying things together. That said, going direct should
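A minimal sketch of the "going direct" route, assuming the command line quoted elsewhere in this thread (tesseract scannedFile.png ... -l eng hocr pdf). The helper names build_tesseract_command and expected_outputs are hypothetical and not part of Tesseract. The sketch only builds the argument list and predicts the output file names. Tesseract appends the extension to the output base itself, so the base is given here as "scanned" rather than "scanned.pdf".

```python
from pathlib import Path

# Extensions tesseract appends for the two config names used in this thread.
# Other config names are deliberately not covered by this sketch.
EXTENSIONS = {"hocr": ".hocr", "pdf": ".pdf"}


def build_tesseract_command(image, output_base, lang="eng", configs=("hocr", "pdf")):
    """Return the argv list for one tesseract run that writes every requested format."""
    return ["tesseract", str(image), str(output_base), "-l", lang, *configs]


def expected_outputs(output_base, configs=("hocr", "pdf")):
    """Return the files tesseract is expected to write for the given output base."""
    return [Path(str(output_base) + EXTENSIONS[name]) for name in configs]


if __name__ == "__main__":
    import shutil
    import subprocess

    cmd = build_tesseract_command("scannedFile.png", "scanned")
    print(" ".join(cmd))
    # Only run when tesseract is installed and the (assumed) input image exists.
    if shutil.which("tesseract") and Path("scannedFile.png").exists():
        subprocess.run(cmd, check=True)
        print([p.name for p in expected_outputs("scanned")])
```

Requesting both configs in one run yields a searchable PDF and a separate hOCR file. The hOCR is not embedded in the PDF, which matches the replies in this thread: the PDF carries a text layer only.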

Re: [tesseract-ocr] combine_lang_model makes no dawg file

2018-09-17 Thread Shree Devi Kumar
I use it as follows and it works. Please check that you are using correct paths for the files.
combine_lang_model \
  --input_unicharset ./layersan/san.unicharset \
  --script_dir ~/langdata \
  --words ~/langdata/san/san.wordlist \
  --numbers ~/langdata/san/san.numbers \
  --puncs ~/langdata/san/san.punc

Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Shree Devi Kumar
I think PDF creation adds a text layer only and there isn't an option to add hOCR to it. @jbreiden can confirm.
On Mon, Sep 17, 2018 at 6:10 PM, Monica wrote:
> I have tried this, but this is showing the default behaviour. I think the
> default output is overlaying on pdf instead of hocr out.

Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Monica
I have tried this, but it shows the default behaviour. I think the default output is overlaid on the PDF instead of the hOCR output.
On Mon, Sep 17, 2018 at 5:47 PM Monica wrote:
> Thanks Zdenko for your response.
> will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on
> pdf

Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Monica
Thanks, Zdenko, for your response. Will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on the PDF file?
On Mon, Sep 17, 2018 at 5:44 PM Zdenko Podobny wrote:
> Something like this?
>
> tesseract scannedFile.png scanned.pdf -l eng hocr pdf
>
> Zdenko
>
> On Mon, 17 Sep 2018 at 14:12

Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Zdenko Podobny
Something like this?
tesseract scannedFile.png scanned.pdf -l eng hocr pdf
Zdenko
On Mon, 17 Sep 2018 at 14:12, monica kumari wrote:
> for OCRing a scanned pdf,
> first it is converted to image format then OCRed and gives a temporary
> file of pdf/text format and overlays on original scanned

[tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread monica kumari
For OCRing a scanned PDF, it is first converted to an image format, then OCRed, which gives a temporary file in PDF/text format that is overlaid on the original scanned PDF. I want the output format to be hOCR. For this, I ran the command "convert scannedFile.pdf scannedFile.png" and then "tesseract