Re: [tesseract-ocr] How to overlay hocr output on original scanned pdf.

2018-09-17 Thread Jeff Breidenbach
Tesseract produces searchable PDF directly. If you really want to use HOCR as an
intermediate format, you can, but you will need external software. There are a couple
of "hocr2pdf" programs floating around, and "OCRmyPDF" does an admirable job
tying things together. That said, going direct should give the best results.
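
A minimal sketch of the direct route, assuming a single-page image and installed English language data. The function name ocr_to_pdf is made up for illustration. Note that the second argument is an output base name: Tesseract appends the extension itself, so passing "scanned" rather than "scanned.pdf" avoids ending up with a file called scanned.pdf.pdf.

```shell
# Hypothetical helper: OCR one image straight to a searchable PDF.
# $1 = input image, $2 = output base name (no extension),
# $3 = language code (optional, defaults to eng)
ocr_to_pdf() {
    tesseract "$1" "$2" -l "${3:-eng}" pdf
}

# Example: writes scanned.pdf
# ocr_to_pdf scannedFile.png scanned
```

Listing both "hocr" and "pdf" at the end of the command, as in the command quoted below, should write both outputs in one run, but the hOCR file is then a side product rather than something overlaid on the PDF.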



On Mon, Sep 17, 2018 at 10:08 AM Shree Devi Kumar wrote:

> I think pdf creation adds a text layer only and there isn't an option to
> add HOCR to it.
>
> @jbreiden can confirm.
>
> On Mon, Sep 17, 2018 at 6:10 PM, Monica wrote:
>
>> I have tried this, but it shows the default behaviour. I think the
>> default output is being overlaid on the PDF instead of the hOCR output.
>>
>>
>> On Mon, Sep 17, 2018 at 5:47 PM Monica wrote:
>>
>>> Thanks Zdenko for your response.
>>> Will "tesseract scannedFile.png scanned.pdf -l eng hocr pdf" overlay on
>>> the PDF file?
>>>
>>> On Mon, Sep 17, 2018 at 5:44 PM Zdenko Podobny wrote:
>>>
 Something like this?

 tesseract scannedFile.png scanned.pdf -l eng hocr pdf

 Zdenko


 On Mon, 17 Sep 2018 at 14:12, monica kumari wrote:

> For OCRing a scanned PDF,
> it is first converted to an image format, then OCRed, which gives a temporary
> file in PDF/text format that is overlaid on the original scanned PDF.
> I want the output format to be hOCR. For this, I ran the command
> "convert scannedFile.pdf scannedFile.png" and then "tesseract
> scannedFile.png scanned.pdf -l eng hocr".
> I got the hOCR format as output.
> Now I need help overlaying it on the scanned PDF file.
>
> Does anybody have any idea about it?
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/c5b4f9c7-67e5-41d8-8c24-b4e5e4c39ed3%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

>
>
>
> --
>
> 
> Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
>



[tesseract-ocr] Re: [tesseract-ocr/tesseract] Tag a new version for LSTM 4.0 (#995)

2017-08-27 Thread Jeff Breidenbach
Alexander Pozdnyakov has done a really good job packaging Tesseract in his
Personal Package Archive (PPA). I think it is getting to be time for wider usage,
so I'm working with him to promote these to official packages. The first step is
Debian Experimental. That's a good place to work out problems, and hopefully
something can be ready for real users within a few weeks.

https://packages.qa.debian.org/t/tesseract.html



Re: [tesseract-ocr] pdf -> searchable PDF

2017-01-20 Thread Jeff Breidenbach
There is a lengthy side discussion that is appropriate to move
back here. I've been asked to elaborate on what I mean by image extraction.

  https://github.com/tesseract-ocr/tesseract/issues/660

There are two ways to turn a PDF file into images. One is to
render it, for example using a tool like pdftoppm. This is great
if there are things like fonts involved.

But far better, for bag-of-images PDF files, such as those produced
by certain scanning machines, is to crack open the bag and
take out the images. This guarantees no rescaling, no loss
of image information, and no (possibly space-inefficient) format
conversions.

Tools for image extraction are not super common, but it sounds
from the name like podofoimgextract does it. And for a fairly limited
set of formats, so does pdfimages from poppler-utils. The best case
scenario is image extraction with no transcoding whatsoever. That's
not always possible (especially when dealing with really fancy formats
like JBIG2), but it should be fine for PDF files produced by a scanner,
and also for any PDF files produced by Tesseract.
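
To make the two approaches concrete, here is a rough sketch using the poppler-utils tools named above. The function names are invented for illustration, and the -all option of pdfimages assumes a reasonably recent poppler-utils; older builds may not have it.

```shell
# Hypothetical helpers contrasting extraction and rendering.

# Extract the embedded images as they are (no rescaling). With -all,
# pdfimages keeps the native format where it can (e.g. JPEG stays JPEG).
# $1 = input PDF, $2 = output file prefix
extract_images() {
    pdfimages -all "$1" "$2"
}

# Render every page to PNG at a chosen resolution; more appropriate
# when the PDF contains fonts or other non-image content.
# $1 = input PDF, $2 = output file prefix, $3 = dpi (optional, default 300)
render_pages() {
    pdftoppm -r "${3:-300}" -png "$1" "$2"
}

# Examples:
# extract_images scan.pdf page
# render_pages typeset.pdf page 300
```

The 300 dpi default is only a common starting point for OCR, not a recommendation from this thread.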





Re: [tesseract-ocr] makebox not working with --tessdata-dir argument

2016-08-13 Thread Jeff Breidenbach
I know from a separate email that you are using Debian GNU/Linux.
The default location on Debian is /usr/share/tesseract-ocr/tessdata.
Therefore you need to either

1) do your work inside /usr/share/tesseract-ocr/tessdata, or
2) copy everything in /usr/share/tesseract-ocr/tessdata to /home/linux/tessdata.
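
A small sketch of option 2, on the assumption that the problem is the "makebox" config file, which normally lives in the configs/ subdirectory of tessdata and so is missing from a freshly created private directory. The helper name copy_tessdata is hypothetical.

```shell
# Hypothetical helper: copy a whole tessdata tree (language files plus
# the configs/ directory) into a private directory.
# $1 = source tessdata dir, $2 = destination dir
copy_tessdata() {
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}

# Example with the paths from this thread:
# copy_tessdata /usr/share/tesseract-ocr/tessdata /home/linux/tessdata
# ls /home/linux/tessdata/configs/makebox
```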



[tesseract-ocr] Re: Tesseract For PHP error - Error in pixReadMemPng: tmpfile stream not opened

2016-07-07 Thread Jeff Breidenbach
Go ahead and take this question to the tesseract-ocr-for-php developers.
From your error messages, you are running on a platform that
doesn't support fmemopen. If Windows, then there is trouble with
Leptonica's fallback function fopenWriteWinTempfile(). If Linux, then
somehow PHP is restricting the call to tmpfile().

Anyway, start with the tesseract-ocr-for-php developers, and if they need
a change to Leptonica, have them get in direct contact with me or
the Leptonica author. This is probably not something you can solve for
yourself as a user.



[tesseract-ocr] Re: PDF/A versions

2016-01-15 Thread Jeff Breidenbach
My understanding is PDF/A requires a bit more metadata, for example some 
color profile information (ICC) and a description about where the data came 
from (XMP). Tesseract doesn't supply that, sorry. I have no reason to 
believe implementation is hard, it's just not something I'm currently 
working on. Would be happy to accept a patch. The PDF creation code in 
Tesseract is under 1000 lines long and not scary. 

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp



[tesseract-ocr] Re: append output file?

2016-01-15 Thread Jeff Breidenbach
There's the normal Linux way for appending things:

tesseract image-1.png - >> results.txt
tesseract image-2.png - >> results.txt
tesseract image-3.png - >> results.txt
...

Or perhaps you are thinking about support for streaming:

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-do-streaming
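
The repeated append commands above can be folded into a loop. This is only a sketch; the function name ocr_append is made up, and it relies on "-" as the output base sending the recognized text to stdout, as in the commands above.

```shell
# Hypothetical helper: OCR each image in turn and append the text
# to a single results file.
# $1 = results file, remaining arguments = image files
ocr_append() {
    out=$1
    shift
    for img in "$@"; do
        tesseract "$img" - >> "$out"
    done
}

# Example:
# ocr_append results.txt image-*.png
```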



Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2016-01-15 Thread Jeff Breidenbach
Hi all, I just want to mention that the copy of tesstrain.sh that ships 
with Ubuntu is slightly modified to make life a little easier. The 
very terse documentation is in the standard location.

  /usr/share/doc/tesseract/README.debian

The modification saves some typing.  This is an example of 
training for Japanese.

  git clone https://github.com/tesseract-ocr/langdata.git
  apt-get install fonts-noto-cjk fonts-japanese-mincho.ttf fonts-takao-gothic fonts-vlgothic
  tesstrain.sh --lang jpn --langdata_dir langdata

I apologize, but I don't have time to read all the questions on this
thread or provide support to people having trouble. Just wanted
folks (especially Nick White) to know that Ubuntu and similar
distributions have a few of the default parameters automatically
filled out for tesstrain.sh. We can do that because many of the directory 
locations are standardized. 



[tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

2015-09-04 Thread Jeff Breidenbach
But I would like to see an example PDF - one of the simpler ones - just to 
see how the vector graphics were done. Please do not get your hopes up.



[tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

2015-09-04 Thread Jeff Breidenbach
This would be ridiculously hard to implement.



Re: [tesseract-ocr] Successfully installed and run Tesseract on Ubuntu, but can't find baseapi.h file to include ...

2015-09-03 Thread Jeff Breidenbach
sudo apt-get install libtesseract-dev



Re: [tesseract-ocr] building tesseract on windows using cygwin

2015-07-21 Thread Jeff Breidenbach
Not to mention the data corruption problem on stdout. Maybe wait another 
week or two for anything else to come up, and then declare 3.04.01? 

(Just to be clear, it doesn't matter from Debian's perspective; the
stdout fix has already been patched there.)



Re: [tesseract-ocr] Re: Tesseract 3.04 Build Error

2015-07-18 Thread Jeff Breidenbach
Or bake some really delicious cookies for Tom Powers, who is in charge of 
Leptonica for Windows.



[tesseract-ocr] Re: Building tesseract without leptonica

2015-07-17 Thread Jeff Breidenbach
Forget it. Leptonica is a core requirement and provides the primary
in-memory image data structure, Pix.



[tesseract-ocr] Re: persian in tesseract-ocr

2015-07-17 Thread Jeff Breidenbach
I think 'fas' is the language code for Persian.



[tesseract-ocr] Re: Why is Tesseract so much more popular than Ocropus?

2015-07-17 Thread Jeff Breidenbach
Tesseract is more complete in terms of 'throw me an arbitrary document
image and produce something useful'.



[tesseract-ocr] Re: jbig2 encoding in PDF output file

2015-07-17 Thread Jeff Breidenbach
JBIG2 is a multipage image format, but is different from (for example)
multipage TIFF because the images are not independently compressed. They
share compression data, specifically a symbol dictionary.

There are three possible approaches here:

1. Have Tesseract accept JBIG2 images produced by jbig2enc and embed them
into PDF without modification.

2. Have Tesseract actually do JBIG2 compression.

3. Have Tesseract do image segmentation, compress some parts of the page
as JBIG2, other parts as JP2K, and store the results in PDF in a mixed
raster format.

I'm only going to discuss #1 because it is simplest and matches the current
'try to never transcode' philosophy. We'd need a JBIG2 decoder in Leptonica.
That's probably straightforward but still a very solid chunk of work.

Then, there is what to do in Tesseract. The PDF rendering module would need
to learn about the symbol dictionary (or dictionaries) and add it to the
collection of PDF objects. It will need a much better understanding of what's
going on than what we currently use, which is simply 'Hey, what image file
belongs to this page? Let's try to inline it unchanged.'

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L811

Now the good news is the PDF rendering module is really small and is not
cemented down by a whole bunch of unnecessary abstraction layers. And I know
it's possible because I've personally done it with colleagues elsewhere.

But it is a pretty significant effort, and I'm honestly not sure it's worth
putting inside Tesseract. Maybe a better approach is post-processing, with a
PDF to PDF converter that uses approach #3. This is the winning strategy for
Linearization, which can be done on a Tesseract-produced PDF using QPDF.
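
As an illustration of the post-processing idea, linearizing a PDF with QPDF is a single command. The wrapper name linearize_pdf is hypothetical; it only covers linearization, not the JBIG2 recompression discussed above.

```shell
# Hypothetical helper: linearize a PDF as a separate post-processing
# step. qpdf writes a new file and leaves the input untouched.
# $1 = input PDF (e.g. produced by Tesseract), $2 = output PDF
linearize_pdf() {
    qpdf --linearize "$1" "$2"
}

# Example:
# linearize_pdf out.pdf out-linearized.pdf
```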






[tesseract-ocr] Re: Text output vs. PDF

2015-06-29 Thread Jeff Breidenbach
Unfortunately, I think there is nothing we can do. I've done everything I
can to maximize compatibility with various PDF rendering engines, but Preview
uses particularly terrible text extraction heuristics. To be fair, the root
problem is the design and complexity of the PDF specification itself.



[tesseract-ocr] Re: Tesseract 3.04 Build Error

2015-06-29 Thread Jeff Breidenbach
You need version 1.71 or later. Current leptonica release is 1.72.
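
A quick way to check an installed version against that minimum before building is sketched below. The function version_ge is made up for illustration and only handles purely numeric dotted versions. The pkg-config module name "lept" in the usage comment is an assumption about how Leptonica was installed.

```shell
# Hypothetical helper: succeed if dotted version $1 is >= $2.
# Missing trailing fields count as 0, so 1.71 equals 1.71.0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "[.]"); nb = split(b, y, "[.]")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Example (assumes a pkg-config file named "lept" is installed):
# version_ge "$(pkg-config --modversion lept)" 1.71 || echo "Leptonica too old"
```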



[tesseract-ocr] Re: jbig2 encoding in PDF output file

2015-06-29 Thread Jeff Breidenbach
Not available currently, and a pretty major effort would be required to make
it happen, both in Leptonica and in Tesseract's PDF output module. No plans to
work on this. For other formats we try hard to not re-encode during PDF
generation whenever practical.



[tesseract-ocr] Re: compile error under ubuntu 14.04

2014-09-09 Thread Jeff Breidenbach
This error comes from Leptonica 1.70. Tesseract now requires Leptonica 1.71.
Leptonica 1.71 can be installed manually (but not so easily) and will ship
with Ubuntu for their 14.10 release scheduled for October 23 of this year.



[tesseract-ocr] Re: [tesseract-dev] Re: Training tools linking failure, icu_48::*

2014-08-01 Thread Jeff Breidenbach
Done. Bonus points if someone can remember to remove
the instructions when they become obsolete in October.



Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
I've merged Nick White's bugfix into hocr-tools. Thank you, Nick.
I expect most people will instead use the native PDF support
built into Tesseract henceforth, and I intend to focus most of my
time and energy there.

However, there is still some use for hocr-pdf, especially when
working with slow digitization equipment like a Linear Book
Scanner. Generating a separate HOCR file per image (then
assembling them into a PDF at the end) means you don't have
to wait for scanning to complete before beginning OCR, leading
to faster overall results.
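
A rough sketch of that per-image workflow, with a made-up helper name. It assumes each scanned page is a JPEG in one directory, that this Tesseract version writes hOCR with a .hocr extension (older releases used .html), and that hocr-pdf is invoked on a directory of matching image and hOCR files.

```shell
# Hypothetical helper: OCR one freshly scanned page to hOCR while the
# scanner keeps working. The output lands beside the image with the
# same base name.
# $1 = image file
ocr_page_to_hocr() {
    tesseract "$1" "${1%.*}" hocr
}

# As pages arrive:
# ocr_page_to_hocr scans/page-001.jpg
# When scanning is finished, assemble everything at the end:
# hocr-pdf scans > book.pdf
```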

Cheers,
Jeff

http://www.youtube.com/watch?v=4JuoOaL11bw



Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
As for Arabic and other right-to-left scripts, please try using the new
native PDF capability in Tesseract instead. It is significantly more 
sophisticated and I think it should work correctly.



Re: hocr2pdf and arabic language

2014-02-06 Thread Jeff Breidenbach
I don't know, it is up to Ray. My guess is quite soon. In any case, 
I just ran on your example images, noticed a small problem, and 
fixed it. Thank you for providing them.

I should also mention that there is no need to convert your binary 
images to JPEG when using Tesseract's native PDF capability.
This will help improve both the image quality and the filesize of the
resulting PDF.



Re: hocr2pdf and arabic language

2014-01-27 Thread Jeff Breidenbach
I am the author of the hocr2pdf utility. Thank you for the patch,
I'll merge it some time next week. This week my focus is fixing
some problem reports with the new native PDF output capability
for Tesseract.

Jeff
