Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Shree Devi Kumar
I think pdf creation adds a text layer only and there isn't an option to
add HOCR to it.

@jbreiden can confirm.

On Mon, Sep 17, 2018 at 6:10 PM, Monica  wrote:

> I have tried this, but this is showing the default behaviour. I think the
> default output is overlaying on pdf instead of hocr out.
>
>
> On Mon, Sep 17, 2018 at 5:47 PM Monica  wrote:
>
>> Thanks Zdenko for your response.
>> will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on
>> pdf file ?
>>
>> On Mon, Sep 17, 2018 at 5:44 PM Zdenko Podobny  wrote:
>>
>>> Something like this?
>>>
>>> tesseract scannedFile.png scanned.pdf -l eng hocr pdf
>>>
>>> Zdenko
>>>
>>>
>>> On Mon, Sep 17, 2018 at 14:12 monica kumari wrote:
>>>
>>>> for OCRing a scanned pdf,
>>>> first it is converted to image format then OCRed and gives a temporary
>>>> file of pdf/text format and overlays on original scanned pdf.
>>>> I want the output format to be hocr. for this, I ran the command
>>>> "convert scannedFile.pdf scannedFile.png" and then "tesseract
>>>> scannedFile.png scanned.pdf -l eng hocr"
>>>> I got the hocr format as output.
>>>> Now I need help to overlay it on the scanned pdf file.
>>>>
>>>> Anybody have any idea about it ?




-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXUTCr-OfCd0xAC_AoJAqk6J%2B0OaJ4mR4_nyoU34qLMAQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] combine_lang_model makes no dawg file

2018-09-17 Thread Shree Devi Kumar
I use it as follows and it works. Please check that you are using correct
paths for the files.

combine_lang_model \
--input_unicharset ./layersan/san.unicharset \
--script_dir ~/langdata \
--words ~/langdata/san/san.wordlist \
--numbers ~/langdata/san/san.numbers \
--puncs ~/langdata/san/san.punc \
--output_dir ./layersan \
--lang san \
--pass_through_recoder \
--version_str ` cat ./layersan/san.new.version`

And, here is the unpacking of this traineddata file

~/tesstutorial-deva/layersan/san$ combine_tessdata -u san.traineddata ./san.

Extracting tessdata components from san.traineddata
Wrote ./san.config
Wrote ./san.lstm-punc-dawg
Wrote ./san.lstm-word-dawg
Wrote ./san.lstm-number-dawg
Wrote ./san.lstm-unicharset
Wrote ./san.lstm-recoder
Wrote ./san.version
Version string:4.0.0-beta.4-138-g2093:san:shreeshrii20180917:from:4.00.00alpha:Devanagari:synth20170629test
0:config:size=1013, offset=192
18:lstm-punc-dawg:size=5306, offset=1205
19:lstm-word-dawg:size=15123986, offset=6511
20:lstm-number-dawg:size=450, offset=15130497
21:lstm-unicharset:size=12621, offset=15130947
22:lstm-recoder:size=1552, offset=15143568
23:version:size=92, offset=15145120
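
A minimal sketch of the "check your paths" advice above, for anyone following along. It only reports which of the input files do not exist before combine_lang_model is called. The function name missing_inputs is made up for this illustration, and the paths are the ones from the working example; adjust them to your own layout.

```shell
#!/bin/sh
# Sketch only: print every path argument that does not exist.
# "missing_inputs" is a hypothetical helper, not part of Tesseract.
missing_inputs() {
    for f in "$@"; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}

# Paths taken from the working example; they are assumptions about
# your directory layout, not fixed locations.
missing=$(missing_inputs \
    ./layersan/san.unicharset \
    "$HOME/langdata/san/san.wordlist" \
    "$HOME/langdata/san/san.numbers" \
    "$HOME/langdata/san/san.punc")

if [ -n "$missing" ]; then
    echo "missing input files:" >&2
    printf '%s\n' "$missing" >&2
fi
```

If nothing is printed, all four inputs were found at the paths given.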




On Mon, Sep 17, 2018 at 4:18 PM, Hosein Khoshdel 
wrote:

> I used combine_lang_model like this:
>
> combine_lang_model --input_unicharset ../combinelangmodel/fas.lstm-unicharset \
> --script_dir ../combinelangmodel/sdir \
> --outputdir outputdir \
> --lang fas \
> --lang_is_rtl true \
> --words ..\lists\fas.wordlist \
> --puncs ..\lists\fas.punc \
> --numbers ..\lists\fas.numbers \
>
> BTW I got fas.lstm-unicharset by running combine_tessdata with -u on the
> official fas.traineddata, and I got fas.wordlist, fas.punc and fas.numbers
> from the langdata repo. Almost everything is fine, except that when I unpack
> the resulting traineddata there is no dawg file in it, although the help
> says that if the three word lists are provided the dawg files are also added
> to the traineddata file.
> Can you please help me and show me which part I am doing wrong?
> Also, the extra spaces in the command are just for better readability here.
>
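
One thing that may be worth checking, assuming the quoted command was run in a POSIX-style shell (the "\" line continuations suggest so, but that is a guess): the word list paths use Windows-style backslashes (..\lists\fas.wordlist). An unquoted backslash is treated as an escape character by such a shell and removed, so the program would receive a different path. The sketch below only demonstrates that shell behaviour; show_arg is a made-up helper.

```shell
#!/bin/sh
# Sketch only: show what a program actually receives as its argument.
# "show_arg" is a hypothetical helper for this demonstration.
show_arg() {
    printf '%s\n' "$1"
}

show_arg ..\lists\fas.wordlist      # unquoted: backslashes are consumed
show_arg '..\lists\fas.wordlist'    # single-quoted: kept literally
show_arg ../lists/fas.wordlist      # forward slashes need no quoting
```

The first call prints ..listsfas.wordlist, which is unlikely to name an existing file.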



-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Jeff Breidenbach
Tesseract produces searchable PDF directly. If you really want to use HOCR
as an intermediate format, you can, but you will need external software.
There are a couple of "hocr2pdf" programs floating around, and "OCRmyPDF"
does an admirable job of tying things together. That said, going direct
should give the best results.
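
A rough sketch of the two external routes mentioned above, under a few assumptions: the hocr2pdf shown is the one from the ExactImage package (other programs with that name may take different options), the file names are placeholders, and neither tool ships with Tesseract. The function names are made up for this illustration.

```shell
#!/bin/sh
# Sketch only: two external routes to a PDF with a text layer.
# Function names are hypothetical; file names are placeholders.

# Route 1: OCRmyPDF runs OCR itself and adds a text layer to the
# original scanned PDF.
overlay_with_ocrmypdf() {
    # $1 = scanned input PDF, $2 = output PDF
    ocrmypdf -l eng "$1" "$2"
}

# Route 2: hocr2pdf (assumed: the ExactImage tool) reads hOCR on stdin
# and combines it with the page image given by -i.
overlay_with_hocr2pdf() {
    # $1 = page image, $2 = hOCR file, $3 = output PDF
    hocr2pdf -i "$1" -o "$3" < "$2"
}

# Report which of the two tools is installed, preferring OCRmyPDF.
pick_overlay_tool() {
    if command -v ocrmypdf >/dev/null 2>&1; then
        echo ocrmypdf
    elif command -v hocr2pdf >/dev/null 2>&1; then
        echo hocr2pdf
    else
        echo none
    fi
}

echo "overlay tool available: $(pick_overlay_tool)"
```

For example, overlay_with_ocrmypdf scannedFile.pdf searchable.pdf would be the OCRmyPDF route for the file from the original question.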



On Mon, Sep 17, 2018 at 10:08 AM Shree Devi Kumar 
wrote:

> I think pdf creation adds a text layer only and there isn't an option to
> add HOCR to it.
>
> @jbreiden can confirm.
>
> On Mon, Sep 17, 2018 at 6:10 PM, Monica  wrote:
>
>> I have tried this, but this is showing the default behaviour. I think the
>> default output is overlaying on pdf instead of hocr out.
>>
>>
>> On Mon, Sep 17, 2018 at 5:47 PM Monica  wrote:
>>
>>> Thanks Zdenko for your response.
>>> will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on
>>> pdf file ?
>>>
>>> On Mon, Sep 17, 2018 at 5:44 PM Zdenko Podobny  wrote:
>>>
>>>> Something like this?
>>>>
>>>> tesseract scannedFile.png scanned.pdf -l eng hocr pdf
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Mon, Sep 17, 2018 at 14:12 monica kumari wrote:
>>>>
>>>>> for OCRing a scanned pdf,
>>>>> first it is converted to image format then OCRed and gives a temporary
>>>>> file of pdf/text format and overlays on original scanned pdf.
>>>>> I want the output format to be hocr. for this, I ran the command
>>>>> "convert scannedFile.pdf scannedFile.png" and then "tesseract
>>>>> scannedFile.png scanned.pdf -l eng hocr"
>>>>> I got the hocr format as output.
>>>>> Now I need help to overlay it on the scanned pdf file.
>>>>>
>>>>> Anybody have any idea about it ?
>
>
>
> --
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>



Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Zdenko Podobny
Something like this?

tesseract scannedFile.png scanned.pdf -l eng hocr pdf

Zdenko
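
A note on file names for the command above, offered as a sketch: Tesseract treats its second argument as an output base and appends the extension for each output config itself, so with "scanned.pdf" as the base the hOCR and PDF results would be expected under the names printed by the first call below. The helper expected_outputs is made up for this illustration and only covers the hocr and pdf configs used in this thread.

```shell
#!/bin/sh
# Sketch only: list the files a Tesseract run is expected to write
# for the "hocr" and "pdf" configs, given the output base.
# "expected_outputs" is a hypothetical helper, not a Tesseract tool.
expected_outputs() {
    base=$1
    shift
    for cfg in "$@"; do
        printf '%s.%s\n' "$base" "$cfg"
    done
}

expected_outputs scanned.pdf hocr pdf   # base as written in the command
expected_outputs scanned hocr pdf       # base without an extension
```

The second form, tesseract scannedFile.png scanned -l eng hocr pdf, avoids the doubled extension.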


On Mon, Sep 17, 2018 at 14:12 monica kumari wrote:

> for OCRing a scanned pdf,
> first it is converted to image format then OCRed and gives a temporary
> file of pdf/text format and overlays on original scanned pdf.
> I want the output format to be hocr. for this, I ran the command
> "convert scannedFile.pdf scannedFile.png" and then "tesseract
> scannedFile.png scanned.pdf -l eng hocr"
> I got the hocr format as output.
> Now I need help to overlay it on the scanned pdf file.
>
> Anybody have any idea about it ?



[tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread monica kumari
for OCRing a scanned pdf,
first it is converted to image format then OCRed and gives a temporary file
of pdf/text format and overlays on original scanned pdf.
I want the output format to be hocr. for this, I ran the command
"convert scannedFile.pdf scannedFile.png" and then "tesseract
scannedFile.png scanned.pdf -l eng hocr"
I got the hocr format as output.
Now I need help to overlay it on the scanned pdf file.

Anybody have any idea about it ?



Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Monica
Thanks Zdenko for your response.
will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on pdf
file ?

On Mon, Sep 17, 2018 at 5:44 PM Zdenko Podobny  wrote:

> Something like this?
>
> tesseract scannedFile.png scanned.pdf -l eng hocr pdf
>
> Zdenko
>
>
> On Mon, Sep 17, 2018 at 14:12 monica kumari wrote:
>
>> for OCRing a scanned pdf,
>> first it is converted to image format then OCRed and gives a temporary
>> file of pdf/text format and overlays on original scanned pdf.
>> I want the output format to be hocr. for this, I ran the command
>> "convert scannedFile.pdf scannedFile.png" and then "tesseract
>> scannedFile.png scanned.pdf -l eng hocr"
>> I got the hocr format as output.
>> Now I need help to overlay it on the scanned pdf file.
>>
>> Anybody have any idea about it ?



Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Monica
I have tried this, but this is showing the default behaviour. I think the
default output is overlaying on pdf instead of hocr out.


On Mon, Sep 17, 2018 at 5:47 PM Monica  wrote:

> Thanks Zdenko for your response.
> will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on
> pdf file ?
>
> On Mon, Sep 17, 2018 at 5:44 PM Zdenko Podobny  wrote:
>
>> Something like this?
>>
>> tesseract scannedFile.png scanned.pdf -l eng hocr pdf
>>
>> Zdenko
>>
>>
>> On Mon, Sep 17, 2018 at 14:12 monica kumari wrote:
>>
>>> for OCRing a scanned pdf,
>>> first it is converted to image format then OCRed and gives a temporary
>>> file of pdf/text format and overlays on original scanned pdf.
>>> I want the output format to be hocr. for this, I ran the command
>>> "convert scannedFile.pdf scannedFile.png" and then "tesseract
>>> scannedFile.png scanned.pdf -l eng hocr"
>>> I got the hocr format as output.
>>> Now I need help to overlay it on the scanned pdf file.
>>>
>>> Anybody have any idea about it ?
