Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-18 Thread Monica
Yes, I agree. I have tried that, but the output quality is not good enough.
Is there any other way to OCR PDFs with little or no loss of quality?
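
One possible cause of the quality loss, offered as an assumption rather than a diagnosis: ImageMagick's "convert" rasterizes PDF pages at 72 DPI unless told otherwise, which is low for OCR. The sketch below requests a higher density before the OCR step. The 300 DPI value and the helper name rasterize_page are illustrative choices, not something stated in the thread.

```shell
#!/bin/sh
# Sketch: rasterize a scanned PDF at a higher resolution before OCR.
# rasterize_page is a hypothetical helper name.
rasterize_page() {
    # $1 = scanned input PDF, $2 = output PNG
    # -density is placed before the input file so that it applies when the
    # PDF is read; 300 is an assumed value, adjust to the scan.
    convert -density 300 "$1" "$2"
}
```

Calling "rasterize_page scannedFile.pdf scannedFile.png" would replace the plain "convert scannedFile.pdf scannedFile.png" step from the original question.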


Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Jeff Breidenbach
Tesseract produces searchable PDF directly. If you really want to use HOCR as an
intermediate format, you can, but you will need external software. There are a
couple of "hocr2pdf" programs floating around, and "OCRMyPDF" does an admirable
job tying things together. That said, going direct should give the best results.
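
The two routes described above might look roughly like this. It is a sketch under assumptions: the function names are made up for illustration, the file names follow the ones used elsewhere in this thread, and OCRmyPDF is assumed to be installed with its "ocrmypdf" command available.

```shell
#!/bin/sh
# Route 1: let tesseract write the searchable PDF directly from the page image.
# direct_pdf is a hypothetical helper name.
direct_pdf() {
    # $1 = page image, $2 = output base (tesseract appends ".pdf" itself)
    tesseract "$1" "$2" -l eng pdf
}

# Route 2: hand the scanned PDF to OCRmyPDF, which writes a new PDF
# with an OCR text layer added.
# overlay_with_ocrmypdf is a hypothetical helper name.
overlay_with_ocrmypdf() {
    # $1 = scanned input PDF, $2 = searchable output PDF
    ocrmypdf -l eng "$1" "$2"
}
```

For example, "direct_pdf scannedFile.png scanned" should write scanned.pdf, while "overlay_with_ocrmypdf scannedFile.pdf searchable.pdf" works from the PDF without a separate image conversion step.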




-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAHjiUbpuHSzzsC31fN6BqmzVPb6_TJxDmFiwBiTRPEM_wnTY2A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Shree Devi Kumar
I think PDF creation adds only a text layer, and there isn't an option to
add hOCR to it.

@jbreiden can confirm.

-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Monica
I have tried this, but it only shows the default behaviour. I think the
default output is being overlaid on the PDF instead of the hOCR output.




Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Monica
Thanks, Zdenko, for your response.
Will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay the hOCR
output on the PDF file?



Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Zdenko Podobny
Something like this?

tesseract scannedFile.png scanned.pdf -l eng hocr pdf

Zdenko
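
A note on the command above, as a sketch rather than a definitive recipe: tesseract treats its second argument as an output base and appends an extension for each requested output, so passing "scanned.pdf" together with "hocr pdf" should produce scanned.pdf.hocr and scanned.pdf.pdf. The helper below derives an extension-free base from the input name instead. The name ocr_both and the choice to write into the current directory are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: run tesseract once and request both hOCR and searchable PDF output.
# ocr_both is a hypothetical helper name, not part of tesseract.
ocr_both() {
    input=$1
    # Drop the directory, then the last extension:
    # "scans/scannedFile.png" -> "scannedFile.png" -> "scannedFile".
    name=$(basename "$input")
    base=${name%.*}
    # tesseract appends ".hocr" and ".pdf" to the output base itself.
    tesseract "$input" "$base" -l eng hocr pdf
}
```

With this, "ocr_both scannedFile.png" would be expected to write scannedFile.hocr and scannedFile.pdf in the current directory.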




[tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread monica kumari
When OCRing a scanned PDF, the file is first converted to an image and then
OCRed, which gives a temporary file in PDF/text format that is overlaid on the
original scanned PDF.
I want the output format to be hOCR. For this, I ran the command
"convert scannedFile.pdf scannedFile.png" and then "tesseract
scannedFile.png scanned.pdf -l eng hocr".
I got the hOCR format as output.
Now I need help overlaying it on the scanned PDF file.

Does anybody have any idea how to do this?
