Re: [tesseract-ocr] What is the state of the C and Python APIs?

2018-08-07 Thread Nick White
Hi Luke, On Mon, Aug 06, 2018 at 02:12:38PM -0700, Luke Brandl wrote: > I've been working to understand Tesseract and looking through the C and Python > API code and documentation. It looks like some of the code and documentation > are up to date, while the rest refers to 3.0.2 at least in the com

Re: [tesseract-ocr] Modyfying existing traineddata

2016-02-23 Thread Nick White
Hi Devon, On Mon, Feb 22, 2016 at 10:43:33AM -0800, Devon Yoo wrote: > I have test set that only has "uppercase English alphabets" and "numbers". But > the provided eng.traineddata returns symbols and lower case alphabets > sometimes. Is there a way to modify the existing traineddata file so that

Re: [tesseract-ocr] Using plain makefiles for fun and profit (was: Run Tesseract on linux without shared libraries)

2016-01-21 Thread Nick White
So this email prompted me to try something a little crazy, but it worked; I just built a statically linked tesseract binary :) A long time ago I wrote some plain makefiles which didn't rely on any automake / cmake stuff. The main devs weren't interested, understandably, but it was useful and fu

Re: [tesseract-ocr] Run Tesseract on linux without shared libraries

2016-01-21 Thread Nick White
Hi Łukasz, > Is it possible to run tesseract without setting up > LD_LIBRARY_PATH? Why don't you want to just use LD_LIBRARY_PATH? I suspect, to be honest, that it would be difficult to compile the leptonica library into the tesseract executable. It would be fun and interesting (to me) to tr

Re: [tesseract-ocr] Tesseract for Tibetan

2015-11-25 Thread Nick White
Hi Yizhen, On Tue, Nov 24, 2015 at 07:08:24PM -0800, Yizhen Hai wrote: > I am working on a volunteer project to digitize the Sutra and all related > materials, most of them in Tibetan. Sounds like a great project :) > Therefore, I wonder how I can get help to use Tesseract for Tibetan. (I am new

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-16 Thread Nick White
On Sun, Nov 15, 2015 at 07:00:52PM +0100, Marco Atzeri wrote: > On cygwin I already packaged the training utilities for 3.04.00. > and some training data. Ah cool, thanks Marco, sorry, I haven't kept up with everything here and had missed your messages earlier. I'm glad you've packaged it all up

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-15 Thread Nick White
On Sun, Nov 15, 2015 at 09:16:29PM +0530, Sriranga(83yrsold) wrote: > Dear nick, > kindly clarify whether "make" file will work on windows "vista" since binaries > for windows are not available for download at present? If so how to do? No, it won't work on Windows, and I have no plans to make it d

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-10 Thread Nick White
On Tue, Nov 10, 2015 at 08:59:19AM -0800, Ryan Baumann wrote: > Thanks for this, Nick. I'm just getting around to looking into moving my Latin > training into the tesstrain.sh system and this is very helpful. Great, I was planning to do that myself with your Latin training - let me know if you ne

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-09 Thread Nick White
> Dear Nick, > Awaiting your valuable guidance. Kindly treat my request as SOS due to my > overaged factor of 83+yrs old. I want to enjoy the program. > With warmest regards, > sriranga(83yrs) > > On Tue, Nov 3, 2015 at 12:19 AM, Nick White wrote: > >

Re: [tesseract-ocr] how to use tesstrain .sh etc in ubuntu 15.10

2015-11-02 Thread Nick White
Hi Sriranga, > I find there three files of '.sh - viz. > 1) language-specific.sh. (My lang is "kan") > 2)tesstrain.sh > 3)tesstrain_utils.sh. > Request for the valuable guidance how to use above .sh files ( step by step I plan to write up some proper documentation on how to use these scripts

Re: [tesseract-ocr] Ground truth files

2015-10-29 Thread Nick White
Just a note, all the .git URLs listed below are git repositories, and there isn't a web interface to them on my server, so just clone them directly like this: git clone http://ancientgreekocr.org/mignetools.git Nick On Thu, Oct 29, 2015 at 06:23:21PM +0000, Nick White wrote: >

[tesseract-ocr] Ground truth files

2015-10-29 Thread Nick White
Hi all, I recently finally got around to organising and releasing some (well, a lot of) ground truth files for the language I have been training for ages now, Ancient Greek. By "ground truth" I mean real page scans with the corresponding (hand-typed) correct text, which is essential to be able

Re: [tesseract-ocr] Tesseract option 4 bug

2015-10-23 Thread Nick White
Hi Alfred, On Fri, Oct 23, 2015 at 01:11:55AM -0700, Alfred Puca wrote: > I sent an attachment with the error using program from command line with psm > option-4 Thanks for that. The first thing I notice is that you're using an old version of tesseract (3.02). Can you update to the latest vers

Re: [tesseract-ocr] Tesseract option 4 bug

2015-10-22 Thread Nick White
Hi Alfred, On Wed, Oct 21, 2015 at 01:16:22AM -0700, Alfred Puca wrote: > I'm having problems with psm option 4 (Assume a single column of text of > variable sizes). > It seems as a bug in the application. > How is it possible to use this option? What problems are you having? Can you give an ex

Re: [tesseract-ocr] How to install on Shared Hosting

2015-10-22 Thread Nick White
Hi Avinash, On Wed, Oct 21, 2015 at 01:40:35PM -0700, Avinash Mishra wrote: > I dont have VPS can anybody tell me how to install it on shared hosting The instructions for installing without root should be what you need: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#install-elsewhere-

Re: [tesseract-ocr] Re: Why would this be? -> When I reinitialize tesseract for every call in a loop it consistently runs faster by a something like .1 second per loop iteration

2015-09-17 Thread Nick White
On Fri, Sep 11, 2015 at 12:13:02AM -0700, fsbo.cons...@gmail.com wrote: > To anyone else who may run across this, it is because of the way C++ uses > scope > to optimize the code when it compiles. Things that are within the scope of the > for loop will run faster than things that have larger scope

Re: [tesseract-ocr] Tesseract 3.04 error.

2015-09-17 Thread Nick White
On Wed, Sep 16, 2015 at 10:16:40PM +0530, ShreeDevi Kumar wrote: > If you are having trouble using it with Java, Quan maybe able to suggest a > solution. I agree, this sounds more like a Java issue to me. I don't know Java at all, but if it's treating anything that sends output to stderr as fail

Re: [tesseract-ocr] Re: Easiest way to run Tesseract from a Mac

2015-08-21 Thread Nick White
On Fri, Aug 21, 2015 at 02:13:17PM +0100, Allistair wrote: > This, I think, just illustrates there is no one-size-fits-all approach. All > methods should be enumerated for installing Tesseract for Mac. I disagree. Mac OS X is a homogenous enough system that we ought to be able to do it right, onc

Re: [tesseract-ocr] Easiest way to run Tesseract from a Mac

2015-08-20 Thread Nick White
On Thu, Aug 20, 2015 at 03:46:32PM +0100, Allistair wrote: > I had issues installed with Homebrew - it didn't install the dependencies very > well like Leptonica etc. but could just have been an issue I was having. > Conversely MacPorts worked out of the box. Interesting. Do you remember what exac

[tesseract-ocr] Easiest way to run Tesseract from a Mac

2015-08-20 Thread Nick White
Hi all, I was looking at the Tesseract wiki, and it states that "The easiest way to install Tesseract is with MacPorts." I don't think that is true any more. MacPorts requires XCode to be manually installed before it can be installed, which doesn't look like it's very simple for a non-expert.

Re: [tesseract-ocr] displayed version number of tesseract when compiled from git

2015-07-23 Thread Nick White
Hi Simon, Yes, ideally it would be good if git compiled versions printed their commit id when asked for 'tesseract -v'. I'm sure nobody would object to a patch making that so ;-) Nick On Thu, Jul 23, 2015 at 06:33:41PM +0200, Simon Eigeldinger wrote: > Hi all, > > at the moment when you compi

[tesseract-ocr] Small update on the tools I wrote

2015-04-30 Thread Nick White
Hi all, long time since I last posted here. This is just a little update about some training related tools I wrote a while ago, the 'tesstrainingtools' collection. It has largely been superseded by the training stuff that's included in Tesseract now, but maybe someone will still find it useful.

Re: [tesseract-ocr] Re: Tesseract for recognition the international phonetic transcription

2015-04-30 Thread Nick White
> > Any opinions on whether it's worth training for the phonetic alphabet or is it > going to just be too difficult to recognize even with specific training? > > Tom > > On Wednesday, January 22, 2014 at 11:55:28 AM UTC-5, Nick White wrote: > > Hi Ep

Re: [tesseract-ocr] Losing accuracy when training tessearct on fonts it already is trained on

2014-08-25 Thread Nick White
On Fri, Aug 22, 2014 at 12:42:21PM -0700, Thomas Bruno wrote: > Is this common when training from text2image output? > > > APPLY_BOXES: boxfile line 5364/748 ((1488,893),(1532,6)): FAILURE! Couldn't > find a matching blob > > FAIL! Yes, there will be some of these. Check the proportion of faili
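
A rough way to check that proportion, as a sketch only: count the "FAILURE!" messages in the captured training output and divide by the number of boxes you fed in. The function name and the idea of passing the box count in by hand are my own assumptions, not something the training tools provide.

```python
def failure_proportion(log_text, total_boxes):
    """Fraction of boxes that APPLY_BOXES reported as failures.

    log_text    -- captured training output, as one string
    total_boxes -- number of entries in the box file (supplied by the caller)
    """
    if total_boxes <= 0:
        return 0.0
    # Each failing box produces one "FAILURE!" message in the output.
    failures = log_text.count("FAILURE!")
    return failures / total_boxes


if __name__ == "__main__":
    sample = (
        "APPLY_BOXES: boxfile line 5364/748 ((1488,893),(1532,6)): "
        "FAILURE! Couldn't find a matching blob\n"
    )
    print(failure_proportion(sample, 4))
```

If only a small fraction fails, the training data is probably still usable; how small is small enough is a judgment call.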

Re: [tesseract-ocr] Re: Tesseract compilation on code blocks (gcc + mingw)

2014-08-21 Thread Nick White
On Thu, Aug 21, 2014 at 11:29:09AM -0700, shree wrote: > zdenko, > > the current problem also seems related to strtok_r > > please see > > http://stackoverflow.com/questions/12973750/ > fatal-error-strtok-r-h-no-such-file-or-directory-while-compiling-tesseract-oc > > http://sourceforge.net/p/mi

Re: [tesseract-ocr] tesseract trained successfully but gives:Tesseract Open Source OCR Engine v3.03 with Leptonica Segmentation fault (core dumped)

2014-08-21 Thread Nick White
> > same problem happen. > > > > > On Thu, Aug 21, 2014 at 4:03 PM, Nick White wrote: > > Hi Dovhani, > > Does this happen with all images when using your training, or just > one? > > Nick > > On Thu, Aug 21, 2014 at 03:03:47AM -07

Re: [tesseract-ocr] tesseract trained successfully but gives:Tesseract Open Source OCR Engine v3.03 with Leptonica Segmentation fault (core dumped)

2014-08-21 Thread Nick White
Hi Dovhani, Does this happen with all images when using your training, or just one? Nick On Thu, Aug 21, 2014 at 03:03:47AM -0700, Dovhani Foneworx wrote: > Hi guys, I have a problem, I have succesfully trained tesseract 3.03 in Ubunt > 14.04 but when i run tesseract it is giving errors on an i

Re: [tesseract-ocr] Makefile:372: recipe for target 'all' failed - using current version with leptonica 1.71 on cygwin

2014-08-21 Thread Nick White
On Thu, Aug 21, 2014 at 01:41:23PM +0530, Shree Devi Kumar wrote: > Hi Zdenko, > > ./ confusing for me :-) :-) ./ is a common idiom for unix. '.' means 'current directory', so ./ means 'in the current directory'. You have to do it to run programs in the current directory (or just do something

Re: [tesseract-ocr] Can I remove some fonts from an existing traineddata?

2014-08-21 Thread Nick White
On Wed, Aug 20, 2014 at 07:39:50PM -0700, SHEN Fei wrote: > hi Nick, > > I'm trying to use tesseract in my mobile phone so the tessdata size is > critical. > Since I only care about very few fonts, it would be convenient if I could add/ > remove a special font. > > Maybe removing some dictionary

Re: [tesseract-ocr] Best image pre-processing software

2014-08-20 Thread Nick White
Hi Chris, On Wed, Aug 20, 2014 at 11:12:50AM -0700, Chris Smeal wrote: > I've been doing some research on using Tesseract for both document scans and > text in scenery, and I was wondering what image processors are best? Given I > have a lot of images, I cannot process each batch by hand, so I wi

Re: [tesseract-ocr] Losing accuracy when training tessearct on fonts it already is trained on

2014-08-20 Thread Nick White
Hi Thomas, On Mon, Aug 18, 2014 at 02:17:19PM -0700, Thomas Bruno wrote: > Where can I find the box/tif combo for the eng.traineddata that Tessearct 3.02 > provides for download? The tif/box files used to create the eng.traineddata for 3.02 are not available, and are very unlikely to be made so

Re: [tesseract-ocr] set_unicharset_properties

2014-08-20 Thread Nick White
Hi Dovhani, On Tue, Aug 19, 2014 at 04:06:26AM -0700, Dovhani Foneworx wrote: > Hi I have a problem that when I run: > > set_unicharset_properties -U input_unicharset -O output_unicharset > --script_dir > =/home/foneworx/DM/Tesseracting/tesseract-3.03/training/langdata > > > > I get the follo

Re: [tesseract-ocr] Can I remove some fonts from an existing traineddata?

2014-08-20 Thread Nick White
Hi Shen, On Wed, Aug 20, 2014 at 01:10:30AM -0700, SHEN Fei wrote: > Can I remove some fonts from an existing traineddata file? > For example, if I only need 2 or 3 comon fonts of default eng.traineddata, is > there a way to extract them out of the original file? No, I'm afraid not, not at the m

Re: [tesseract-ocr] Re: How to disable image pre-processing?

2014-08-13 Thread Nick White
On Wed, Aug 13, 2014 at 08:39:06AM -0700, Oliver Nicolini wrote: > A little up, I can't find any doc for this topic. If anyone can help that > would > be fantastic. Did you read Paul's reply? Tesseract only does binarisation. If you don't want it to do that, binarise your image before passing it
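
To illustrate "binarise your image before passing it", here is a minimal sketch in plain Python. It uses a global Otsu threshold, which is my choice for the example and not necessarily what suits your images; the pixel data is assumed to be rows of 8-bit greyscale values already loaded by whatever image library you use.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that best separates values <= t from values > t."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate threshold
    sum_bg = 0     # sum of their grey levels
    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_var:
            best_var, best_t = variance, t
    return best_t


def binarise(rows):
    """Map each greyscale pixel to 0 (black) or 255 (white)."""
    flat = [value for row in rows for value in row]
    t = otsu_threshold(flat)
    return [[255 if value > t else 0 for value in row] for row in rows]


if __name__ == "__main__":
    print(binarise([[12, 240], [230, 20]]))
```

The output is already pure black and white, so Tesseract's own thresholding has nothing left to decide.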

Re: [tesseract-ocr] Passing RegEx to Zone Scans

2014-08-12 Thread Nick White
Hi David, You're right, that would be useful. Tesseract has a basic version of that, called "patterns"; search the manpage for a bit of information on them. However at present they can't be assigned per region, only as possible patterns for the whole OCR job. Also they aren't restrictive, but

Re: [tesseract-ocr] Re: Trying to understand custom dictionaries

2014-08-12 Thread Nick White
On Thu, Jul 24, 2014 at 05:53:56AM -0700, Victoria A. wrote: > From my experience, seeing that Tesseract's English training data can > recognize > words that are NOT contained in the dictionary, I suppose Tesseract only uses > the custom dictionary for "hints" instead of only knowing the words in

Re: [tesseract-ocr] Outreach from the Wikisource community

2014-08-12 Thread Nick White
Dear Wikisourcerers, It's good to hear from you. Wikisource is awesome, as far as I am concerned. > One > of the most serious issues was raised by the Belarusian community which uses 2 > different scripts with no commercial OCR support. This means that the > volunteers have to type each word man

Re: [tesseract-ocr] Building training tools from source

2014-08-12 Thread Nick White
Zdenko's right. To help you out more, you seem to have skipped over this part of the instructions in the Compiling wiki page: sudo apt-get install libicu-dev # (if you plan to make the training tools) Nick On Mon, Aug 11, 2014 at 07:10:47PM +0200, zdenko podobny wrote: > It looks like you

Re: [tesseract-ocr] Error when running "make" - scanutils.cpp:38:14: error: typedef redefinition with different types ('long' vs '__darwin_off_t' (aka 'long long'))

2014-08-12 Thread Nick White
On Tue, Aug 12, 2014 at 12:58:23PM +0530, Shree Devi Kumar wrote: > On Tue, Aug 12, 2014 at 4:31 AM, testing1234 > wrote: > > Note.. Step 5 above the last command should be > > "sudo make install-langs" > > Nick, it maybe helpful to add/update instructions in wiki. Cory just meant in

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-08-06 Thread Nick White
On Wed, Aug 06, 2014 at 08:50:27PM +0530, Shree Devi Kumar wrote: > My current plan for documentation is as follows: > > - Rewrite and simplify TrainingTesseract3 on the wiki > - Write manpages for each tool in training/ > - Document how each training file is used,

Re: [tesseract-ocr] Re: Get Tesseract ocr to ignore or replace images with whitespace

2014-08-06 Thread Nick White
Hi Richard, On Sun, Jul 20, 2014 at 01:51:32PM -0700, Richard Arnold wrote: > Stroke Width Transform looks very interesting. However, I have some questions > regarding its use in what I'm doing. > I'm writing a Desktop application and OpenOCR appears to use a web service > call?? Stroke Width Tr

Re: [tesseract-ocr] Failed to get the text

2014-08-06 Thread Nick White
Hi Fajar, Looks like you should try binarising the image yourself prior to handing it over to Tesseract. Nick

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-08-06 Thread Nick White
Hi Albrecht, Sorry for not replying sooner, I've been away. > Nevertheless I read a post from Ray where he says that he receives > millions of > emails and the last thing he likes to do is writing long texts (email > responses > or documentations). I think this is a fatal situation, because if

Re: [tesseract-ocr] OCR using C

2014-08-06 Thread Nick White
Hi Rara, On Thu, Jul 31, 2014 at 08:29:51AM -0700, Rara wrote: > I'm searching of a detailed guide for developpement with Tesseract and a tuto > explained how to use and test this platform with windows OS. > Looking forward to your answer ! There is an example program using the C API here: http

Re: [tesseract-ocr] Not getting accuracy with Arabic font

2014-08-06 Thread Nick White
Hi Prashant, On Wed, Aug 06, 2014 at 01:32:54AM -0700, Prashant Mahskey wrote: >I am using tesseract for my android app with arabic language. I've > copied all the files required from the language files download page. I've > tried > with gray scaling and cropping extra blank part from th

Re: [tesseract-ocr] I compiled and installed tesseract from the source on CentOS. I kept both 3.01 and 3.02 versions. I use environment path stored in bash file to point to the version in use.

2014-08-06 Thread Nick White
On Tue, Jul 22, 2014 at 11:48:21PM +0200, zdenko podobny wrote: > If you want to have several version of tesseract (e.g. you want to compare OCR > result) I would suggest you to compile them from source (e.g. in /usr/src) and > not installed them. If you want to test particular version you can run

Re: [tesseract-ocr] JTessbox Modifying the boxes

2014-07-17 Thread Nick White
On Thu, Jul 17, 2014 at 12:14:43AM -0700, Jing JC wrote: > The Ray's tutorial said the bounding box overlaps. > so when I modify the box inside JTessbox, > do I keep the overlapping boxes, > or > make the boxes non touching. That's interesting, actually; I didn't realise Tesseract did outlin

Re: [tesseract-ocr] what does "width= right -left => no silly +1/-1" mean in this tutorial?

2014-07-17 Thread Nick White
On Wed, Jul 16, 2014 at 11:17:00PM -0700, Jing JC wrote: > I am going through Ray Smith's tutorial, and don't get it? He means that as the co-ordinate system uses bottom left as the origin, you will never get a minus number co-ordinate (as you could if the origin was elsewhere).
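
A small sketch of the bottom-left convention, for anyone converting boxes from a tool that measures from the top left. The function names and the tuple order are made up for this example; only the "origin at bottom left" and "width = right - left" parts come from the thread.

```python
def to_bottom_left(left, top, right, bottom, image_height):
    """Convert a box measured from the top-left corner (y grows downwards)
    into (left, bottom, right, top) measured from the bottom-left corner."""
    return (left, image_height - bottom, right, image_height - top)


def box_size(left, bottom, right, top):
    """Width and height of a bottom-left-origin box, with no +1/-1 adjustment."""
    return (right - left, top - bottom)


if __name__ == "__main__":
    box = to_bottom_left(10, 20, 30, 50, image_height=100)
    print(box, box_size(*box))
```

With the origin at the bottom left, every co-ordinate of a box inside the image is zero or positive.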

Re: [tesseract-ocr] Re: Regarding Tesseract OCR engine for recognizing Tamil Fonts

2014-07-15 Thread Nick White
On Mon, Jul 14, 2014 at 11:36:46AM -0700, Paul wrote: > Am Montag, 14. Juli 2014 10:07:59 UTC+2 schrieb sibi kanagaraj: > But , I feel that Tamil Training is not sufficient and it > could be > streamlined . Hence I went to see if there are sufficient training > documents for Tamil

Re: [tesseract-ocr] questions when reading unicharset manual: https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html

2014-07-15 Thread Nick White
Hi, On Tue, Jul 15, 2014 at 10:04:24AM -0700, Jing JC wrote: > yep yep. > > Thanks a lot Nick. > > I tried to cancel mu post last night. > but seems I can not get access to it after posted but before approved. > > I tried to match the V2's example to V3's format. > > I figured it out late

Re: [tesseract-ocr] How to find the font properties

2014-07-15 Thread Nick White
Hi Mustak, On Tue, Jul 15, 2014 at 03:14:35AM -0700, Mustak M wrote: > I am new to tesseract. I am using tesseract 3.2. I am able to retrieve the > text > from an image. And able to get the co-ordinates for each word with "tesseract > source.jpg output hocr" command. Is there any command to reti

Re: [tesseract-ocr] How to replace libtiff, libpng, libjpeg, etc with GDI+ on Windows

2014-07-15 Thread Nick White
On Mon, Jul 14, 2014 at 03:13:43PM -0700, Albrecht Hilker wrote: > I know that this can be done with a few lines of code, but such a usefull > class is missing in the OpenCV project. My first trials showed that this is > not as trivial as it seems on the first look because a lot of conversio

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White
Sorry for the noise. I've looked into this more, and discovered more :) On Tue, Jul 15, 2014 at 10:54:06AM -0400, Nick White wrote: > On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote: > > When I download the traineddata files and extract the unicharset file from

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White
Hi again, On Mon, Jul 14, 2014 at 09:38:26AM -0700, Albrecht Hilker wrote: > After some days I came back here and I'm very surprised about your lots of > posts. > Thanks for answering and taking the time. As you may have noticed, there aren't too many people around here who are comfortable look

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-15 Thread Nick White
Hi Albrecht, On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote: > When I download the traineddata files and extract the unicharset file from > them > I notice that some are extremely different from the ones on SVN in the folder > training/langdata. > > For example: > Bengali, Hebr

Re: [tesseract-ocr] questions when reading unicharset manual: https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html

2014-07-15 Thread Nick White
Hi, The part you aren't reading closely enough from the manual page is: properties An integer mask of character properties, one per bit. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation. So ; has ispunctuation set, but none of the others,
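
The bit layout quoted from the manual page can be decoded like this. It is only a sketch: the helper names are mine, and the bit order is taken directly from the text above (least significant bit first).

```python
# Bit order as listed in the unicharset manual page, least significant first.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(mask):
    """Return the names of the properties set in an integer mask."""
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


def encode_properties(names):
    """Build the integer mask from a collection of property names."""
    return sum(1 << PROPERTY_BITS.index(name) for name in names)


if __name__ == "__main__":
    print(decode_properties(16))                     # only the fifth bit is set
    print(decode_properties(3))                      # first and second bits are set
    print(encode_properties(["isalpha", "isupper"]))
```

So a character with only ispunctuation set has the fifth bit on, which is 16 as a decimal number. As far as I can tell the unicharset file writes this field in hexadecimal, so it would appear as 10 there; treat that as an assumption and check it against your own file.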

Re: [tesseract-ocr] Re: is tesseract 3.03's source tar available? need to compile on CentOS 5.6

2014-07-14 Thread Nick White
On Mon, Jul 14, 2014 at 07:38:19AM -0700, Christopher Smeenk wrote: > I found the source for v3.03 here: http://packages.ubuntu.com/trusty/ > tesseract-ocr The version called "3.03" in Ubuntu is an -rc - there is no official 3.03 release yet. As I understand it Ray & Jeff called it 3.03 so that

Re: [tesseract-ocr] is tesseract 3.03's source tar available? need to compile on CentOS 5.6

2014-07-13 Thread Nick White
On Sun, Jul 13, 2014 at 06:38:11PM +0430, universal reseller wrote: > is google drive use tesseract 3.03 ? It's -rc1, meaning release candidate 1. So it isn't an official release, but rather a "testing preview" release, which should be to what the final 3.03 will be. > i checked one english pd

Re: [tesseract-ocr] builing the svn source code in windows is too difficult.

2014-07-13 Thread Nick White
> I build the tesseract svn source code in win8, I used the > VS2013/Cygwin/MinGW to build this, all failed. Hi, you need to give us more clues as to why it failed. What error messages did you get? > what version of leptonica the newest svn use? 1.70 or 1.71? Tesseract should work fine with e

Re: [tesseract-ocr] is tesseract 3.03's source tar available? need to compile on CentOS 5.6

2014-07-13 Thread Nick White
On Fri, Jul 11, 2014 at 04:22:41PM -0700, Jing JC wrote: > google's tesseract download page listed up 3.02 only. > > I need to compile tesseract on CentOs5.6 > where is the download link for tesseract 3.03 > > or not available yet. It isn't available yet. There is a -rc1 version that is availa

Re: [tesseract-ocr] Re: need help removing garbage characters from my OCR

2014-07-12 Thread Nick White
On Fri, Jul 11, 2014 at 03:06:29PM -0700, Alex Ryan wrote: > I wrote some simple code to preprocess the image because I realized I will be > doing basically the same image every time so its foolish to try and use > Tesseracts binaziration technique which was designed for a different and more > gen

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
0 @ p a r m F u s B » f d c h C t L ? T M y R l ~ < ® N b k [ « 1 , . ” g H $ ( + D w V £ 4 9 Q & A P ¢ ] 3 2 © 8 / > X é j ; 7 € O ¥ U x } E § = ! ’ G ) Z q { “ — Y K * W " \ ° fi ‘ _ fl /* * Copyright 2014 Nick White * * Licensed under the Apache License, Version 2.0 (the

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White
On Tue, Jul 08, 2014 at 10:36:50PM -0700, Alex Ryan wrote: > In one of the links tho I saw something about -psm setting. When I run the OCR > with -psm 6 all of a sudden it worked perfect!!! Im really not sure what that > setting does, ive tried doing some searches, but im still unclear. Can you

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
I have more thoughts to the unicharset metrics discussion. > So this example says that > the character "1" has a min_bottom value of 59 and > the character "9" has a min_bottom value of 18. > > Weird ? ? ? > Both numbers are aligned to the baseline! I am guessing now (I'll take a look at the cod

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
On Sat, Jul 05, 2014 at 03:34:05PM -0700, Albrecht Hilker wrote: > Hello zdenop > > It is clear that you are not the right person to answer this question. > If YOU would ever have looked into the source code you have seen that these > values ARE in use (in version 3.03). You're being pretty unfai

Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

2014-07-10 Thread Nick White
I'm just going to go through your numbered points here. On Fri, Jul 04, 2014 at 10:02:43AM -0700, Albrecht Hilker wrote: > 1.) > The column "other_case" should contain the ID of the other-case letter. > For the lowercase letters they point correctly to the uppercase letters. > But the uppercase le

Re: [tesseract-ocr] Any way to prevent contextual digits<->letters flipping ?

2014-07-10 Thread Nick White
Hi, I haven't tried it, but quickly grepping around the source code suggests setting the config variable "crunch_include_numerals" to true might do the job. Please let us know if that works. Nick On Wed, Jul 09, 2014 at 11:15:10PM -0700, Damien D wrote: > Hi everyone, > > tesseract seems to

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-10 Thread Nick White
Hi Alex, One quick thought, if you're still using .uzn, it's only loaded with certain psm levels (it is with -psm 6, but not -psm 3, the default). And it's loaded from .uzn. So if you have any .uzn files lying around, they will be being applied with psm 6, but not if you don't explicitly stat

Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

2014-07-09 Thread Nick White
Hi Albrecht, On Thu, Jul 03, 2014 at 09:40:51PM -0700, Albrecht Hilker wrote: > Generally it is very sad that there is no detailed documentation about > Tesseract. I agree. I do work on the documentation, but there is an awful lot missing. I appreciate you taking the time to ask questions here

Re: [tesseract-ocr] Re: Font Limit = 64 fonts in traineddata, really ??

2014-07-09 Thread Nick White
On Tue, Jul 08, 2014 at 10:49:49PM -0700, shree wrote: > My information IS dated - I haven't followed the recent changes. Please see > this thread - almost a year old which talked of the upcoming changes for > training > > https://groups.google.com/forum/#!searchin/tesseract-dev/fonts/tesse

Re: [tesseract-ocr] Is it wise to interfere with the pre-processing pipeline of Leptonica

2014-07-09 Thread Nick White
On Wed, Jul 09, 2014 at 09:50:01AM -0700, Rani Yaroshinski wrote: > In order to improve the accuracy of the OCR results ? Yes, it is, if you know more details about the images you'll be using, so can do better than Tesseract's guesses. See https://code.google.com/p/tesseract-ocr/wiki/ImproveQual

Re: [tesseract-ocr] Is there any influence of the input format of the image PNG vs TIFF

2014-07-09 Thread Nick White
On Wed, Jul 09, 2014 at 09:48:20AM -0700, Rani Yaroshinski wrote: > From the point of view of the performance measures of the OCR ? I don't think anybody has figures on this. You could do some tests yourself, and let us know the results. I would guess that file size would be a bigger slowdown th

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-09 Thread Nick White
On Wed, Jul 09, 2014 at 03:16:08AM -0700, Paul wrote: > How about using ImageJ (can be automated with macros) to create a better > binary > result of the image. Thanks for mentioning this; I hadn't heard of it and it sounds very useful. I added a link to the ImproveQuality wiki page. Nick

Re: [tesseract-ocr] need help removing garbage characters from my OCR

2014-07-08 Thread Nick White
Hi Alex, If you're up for some programming, you could recognise the squares yourself, and pass each one separately to tesseract with the PSM_SINGLE_CHAR segmentation type. That should help if Tesseract is not segmenting each whole square separately. If the board is always the same size, you co

Re: [tesseract-ocr] New language traineddata based on the existing one.

2014-07-04 Thread Nick White
On Fri, Jul 04, 2014 at 02:15:52AM -0700, Iskander Sharipov wrote: > I need to create new tessdata language, which is very similar to russian in > charset. > Every time I try to do so by training tesseract on a box containing needed > letters I get new traineddata, > which actually can recognize ne

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-04 Thread Nick White
On Fri, Jul 04, 2014 at 02:08:46AM -0700, Meenal Goyal wrote: > If you're sure that all the words you will encounter will be in the > dictionary this should help somewhat: > https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_ > increase_the_trust_in/strength_of_the_dictionary?

Re: [tesseract-ocr] Terrible results from Tesseract API

2014-07-03 Thread Nick White
Hi Elena, Just a guess, but maybe this line: > api -> SetSourceResolution(600); is the source of your troubles? Tesseract from the command line would have just been guessing it, and perhaps its guess, coupled with its ideas about different sizes of fonts, were better than yours? Nick

Re: [tesseract-ocr] How to download the Tesseract trained data for Digital display numbers ( Seven Segments Data trained data )

2014-07-03 Thread Nick White
Hi Artur, On Wed, Jul 02, 2014 at 10:18:55PM -0300, Artur Augusto wrote: > As many people ask about how to use tesseract to read 7 segments display, I > decided to publish an open source sample project. > > If someone wanna check it: https://github.com/ > arturaugusto/di

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-03 Thread Nick White
On Wed, Jul 02, 2014 at 10:26:16PM -0700, Meenal Goyal wrote: > The post about "question about training tesseract" only suggests some > pre-processing steps which include binarisation and I have already tried > them. > I wanted to know if anything can be done to improve output at later stage, >

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-02 Thread Nick White
n such > cases? Any feedback mechanism which can help improve? > > On Tuesday, July 1, 2014 8:52:35 PM UTC+5:30, Nick White wrote: > > Hi Meenal, > > On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote: > > When I try to ocr an image, it also

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-07-01 Thread Nick White
Hi Meenal, On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote: > When I try to ocr an image, it also produces some noise apart from the > meaningful words. An example output for an image is: > > All women become > > like their’ mqthers. _ ' 1"’ ' > > - —T at-{rs their tragedy. ” "R"-‘

Re: [tesseract-ocr] How to use the API in linux system

2014-07-01 Thread Nick White
Hi, On Mon, Jun 30, 2014 at 09:25:23PM -0700, 韩煦深 wrote: > I'm a Chinese student and I want to use the tesseract-ocr in our linux system. > I have Ubuntu OS and I install tesseract in my ubuntu system. > But I don't know how to use C++ API in linux system because all the examples > are based on V

Re: [tesseract-ocr] Tesseract-OCR

2014-07-01 Thread Nick White
On Mon, Jun 30, 2014 at 10:42:41PM -0700, nirali kanani wrote: > is there Tesseract - ocr v 3.03 exe available anywhere ? Tesseract v3.03 hasn't been released yet (except as a pre-release version in the latest ubuntu). The code is unlikely to change a lot from what's currently in SVN, so you co

Re: [tesseract-ocr] Advice needed on effective hexadecimal recognition

2014-06-30 Thread Nick White
Hi Scott, On Fri, Jun 27, 2014 at 09:39:21PM -0700, scott.ha...@gmail.com wrote: > Hi all. Firstly let me say I am totally blown away by Tesseract, it vastly > exceeded my expectations for an open source OCR project. I have an > application > (http://hackaday.io/project/1569-NSA-Away) that invo
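The reply above is cut off in this listing, so what follows is not the advice given in the thread. It is only a minimal sketch of one generic post-processing idea for hexadecimal OCR output: map a few look-alike characters that can never be valid hex digits onto the digits they resemble, then report anything that still is not hex. The substitution table and the function name normalize_hex are assumptions made up for this illustration.

```python
# Hypothetical look-alike table: every key is a character that is NOT a
# valid hex digit, so replacing it cannot corrupt a correct reading.
HEX_FIXES = str.maketrans({
    "O": "0", "o": "0",
    "l": "1", "I": "1",
    "S": "5", "s": "5",
    "Z": "2", "z": "2",
})

HEX_DIGITS = set("0123456789ABCDEF")


def normalize_hex(ocr_text):
    """Return (cleaned, leftovers) for a string of OCR'd hex.

    cleaned   -- upper-cased text with look-alikes replaced and whitespace removed
    leftovers -- sorted list of characters that are still not hex digits
    """
    # Translate before upper-casing so that lower-case "l" is still caught.
    fixed = ocr_text.translate(HEX_FIXES).upper()
    fixed = "".join(fixed.split())
    leftovers = sorted(set(fixed) - HEX_DIGITS)
    return fixed, leftovers


if __name__ == "__main__":
    print(normalize_hex("dead bEEf O1lS"))
    print(normalize_hex("12G4"))
```

Ambiguous pairs such as B and 8 are deliberately left alone, since both are valid hex digits and a lookup table cannot tell which one was printed. A checksum over each line would be needed to catch those.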

Re: [tesseract-ocr] retrieve words not matching the dictionary

2014-06-30 Thread Nick White
Hi Meenal, On Mon, Jun 30, 2014 at 01:40:10AM -0700, Meenal Goyal wrote: > When i run tesseract on my image, it produces some words not present in the > dictionary. Is there some way to directly get the list of these words and > prevent tesseract from showing them in the output. > Example of such
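The reply is truncated here, so this is not a quote of it. As a rough sketch of the question being asked, the words that are not in a dictionary can be separated out after recognition with a few lines of Python, independent of Tesseract itself. The function name split_by_dictionary, the tiny word set, and the sample string are all invented for the example; a real word list would be loaded from a file.

```python
def split_by_dictionary(ocr_text, dictionary):
    """Split OCR output into (known, unknown) word lists.

    dictionary is a set of lower-case words. Surrounding punctuation is
    stripped from each token before the lookup.
    """
    known, unknown = [], []
    for token in ocr_text.split():
        word = token.strip(".,;:!?\"'()[]{}-_")
        if not word:
            continue  # the token was punctuation only
        if word.lower() in dictionary:
            known.append(word)
        else:
            unknown.append(word)
    return known, unknown


if __name__ == "__main__":
    # Hypothetical word list and OCR output, for illustration only.
    words = {"all", "women", "become", "like", "their", "mothers"}
    known, unknown = split_by_dictionary("All women become like their mqthers. _ '", words)
    print("known:  ", known)
    print("unknown:", unknown)
```

Dropping the unknown list removes noise, but it also removes proper nouns and any real word missing from the list, so it may be safer to flag such words for review than to delete them.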

Re: [tesseract-ocr] Can tesseract read cursive handwriting?

2014-06-27 Thread Nick White
On Fri, Jun 27, 2014 at 04:57:30PM -0400, Nick White wrote: > On Mon, Jun 23, 2014 at 10:11:28AM -0700, Paulo Basilio wrote: > > Good day, I am trying to develop a mobile app that can read cursive > > handwriting > > (doctor's handwriting to be exact). My question

Re: [tesseract-ocr] 'BLOCK_LINE_IT' was not declared in this scope

2014-06-27 Thread Nick White
Hi Raghavan, On Tue, Jun 24, 2014 at 06:58:56AM -0700, Raghavan P wrote: > When i try to make use of tesseract classes like BLOCK_IT and BLOCK_LINE_IT, I > am getting the error "it was not declared in this scope". > May i know what header should i bring in or what am i missing here? Are you using

Re: [tesseract-ocr] Can tesseract read cursive handwriting?

2014-06-27 Thread Nick White
Hi Paulo, On Mon, Jun 23, 2014 at 10:11:28AM -0700, Paulo Basilio wrote: > Good day, I am trying to develop a mobile app that can read cursive > handwriting > (doctor's handwriting to be exact). My question is, can tesseract-ocr read > cursive handwriting? If not, can someone give me suggestion f

Re: [tesseract-ocr] Support for Sinhala

2014-06-27 Thread Nick White
Hi Sheeyam, sorry for not replying to your emails sooner. On Sun, Jun 22, 2014 at 04:43:27AM -0700, sheeyam shellvacumar wrote: > Does Tesseract support sinhala. How do u guys train them ??? Actually i am > confused help me It looks like some people have trained Tesseract for Sinhala; see http:

Re: [tesseract-ocr] question on training tesseract for arbitrary big images

2014-06-27 Thread Nick White
Hi Mori, On Fri, Jun 27, 2014 at 01:51:01AM -0700, morteza neishaboori wrote: > I want to use OCR to detect small words in images containing indoor signs and > etc > you can find some sample images in the link below to get the idea > https://drive.google.com/folderview?id=0B3dLM0w0EeD-RFZVc1NjaGN

Re: [tesseract-ocr] read multi-language ( arabic and english) image

2014-06-27 Thread Nick White
On Fri, Jun 27, 2014 at 01:48:52AM -0700, thinker wrote: > reading image with multiple language (arabic and english) by using -l > ara+eng option gives garbage output. There are currently a couple of bugs with combining Arabic and English together, so it isn't working. I'd recommend you add an

Re: [tesseract-ocr] Any suggestions on pre-processing to improve accuracy?

2014-06-26 Thread Nick White
On Mon, Jun 23, 2014 at 08:32:52AM -0700, Traun Leyden wrote: > One more thing that document should have is a mention of Stroke Width > Transform > to improve OCR recognition on images that have a lot of non-text content. Oh cool, that looks great! I definitely will add that to the wiki page so

Re: [tesseract-ocr] general tesseract help for coding newbie

2014-06-26 Thread Nick White
Hi Jack, I replied privately, but the gist is that VietOCR is a graphical program that makes Tesseract easier to use on a Mac (as well as Linux & Windows). Nick On Thu, Jun 26, 2014 at 08:55:19AM -0700, Jack Kershaw wrote: > I am an ancient greek student currently studying A levels. I have bee

Re: [tesseract-ocr] Tesseract with PHP wrapper input stream not found

2014-06-25 Thread Nick White
Hi Eddie, I'd suggest contacting the author of the PHP wrapper, that isn't something provided by the core Tesseract project, and it doesn't look like any issue with Tesseract proper, just with the caller. Nick On Wed, Jun 25, 2014 at 12:36:59AM -0700, Eddie G wrote: > I'm using the PHP wrapper

Re: [tesseract-ocr] Re: hocr2pdf

2014-06-23 Thread Nick White
Hi Amar, If you can wait for the release of Tesseract 3.03 (or compile the latest version from SVN), that has PDF output built in. Nick On Mon, Jun 23, 2014 at 12:19:52AM -0700, Amar wrote: > Hello dear friends, Is HOCR2PDF command line tool limited only to non-windows > platforms? I could not

Re: [tesseract-ocr] Any suggestions on pre-processing to improve accuracy?

2014-06-20 Thread Nick White
Hi Traun, > Any tips on doing pre-processing on the images to improve the > recognition? The place to start would be here: https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality Nick -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To

Re: [tesseract-ocr] Re: Pharmaceutics OCR recognition project

2014-06-19 Thread Nick White
On Wed, Jun 18, 2014 at 07:30:03AM -0700, Paul wrote: > That upper bound actually might be the root of your problem. If you've already > compiled Tesseract on your own, > try to use a greater number for kMaxUserDawgEdges. If you have not, you could > either reduce the number of > words in your dic

Re: [tesseract-ocr] limiting to 8 letters and numbers only (LPR)

2014-06-19 Thread Nick White
Hi Ketut, On Tue, Jun 10, 2014 at 11:30:39PM -0700, ketut ariasa wrote: > I have a very limited OCR application using tesseract, where I want to > recognize only 8 letters and numbers begin with the letter 'D'. > Is there a way to restrict tesseract to attempt to > recognize only 8 digits letter
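The answer is cut off in this listing. Separately from whatever was suggested there, a result can be checked against the expected format after recognition. The sketch below assumes the question means eight characters in total, the first being 'D' and the rest upper-case letters or digits. That reading, the pattern, and the name extract_plate are assumptions for illustration. Tesseract's tessedit_char_whitelist variable can narrow the character set during recognition, but it does not enforce length or position, so a check like this would still be needed.

```python
import re

# Assumed format: 'D' followed by exactly seven letters or digits.
PLATE = re.compile(r"D[A-Z0-9]{7}")


def extract_plate(ocr_text):
    """Return the normalised plate string, or None if the format does not match."""
    # Drop spaces, dashes, newlines and other separators, then upper-case.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", ocr_text).upper()
    return cleaned if PLATE.fullmatch(cleaned) else None


if __name__ == "__main__":
    for raw in ("dk 1234-ab\n", "B1234567", "D123"):
        print(repr(raw), "->", extract_plate(raw))
```

Rejected readings return None, so the caller can retry with different pre-processing instead of accepting a malformed result.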

[tesseract-ocr] Example C-API program outputting UZN zone files

2014-06-05 Thread Nick White
/* Copyright 2014 Nick White * * Licensed under the Apache Li

Re: [tesseract-ocr] Tesseract 3 doesn't recognize portion of the image with one word inside

2014-06-05 Thread Nick White
On Thu, Jun 05, 2014 at 01:51:24PM +0200, zdenko podobny wrote: > On Thu, Jun 5, 2014 at 12:10 PM, 'thakobyan' via tesseract-ocr > tesseract-ocr@googlegroups.com> wrote: >> >> Trying to OCR the portion of the image. For some reason if I >> cut only one word (see Fail.png and Fail2.png att
