[tesseract-ocr] Trouble with Apparently Simple Source Image
Hello, I've run into some trouble using Tesseract OCR in a python program doing some screen scraping. I can't quite wrap my head around why this one value is having so much more trouble than the others on the same page, with the same contrast and font. This is the image in question: It has been scraped from a 1080p resolution screenshot, sliced into individual images for the values in a grid, scaled up by 10x, inverted (from white-on-black to this), thresholded, and passed to Tesseract. I have also tried various Gaussian and median blurs but those seem to just make other strings fail more. I have tried most of the PSM options that make sense, and passed options with just numerals, $, comma, and decimal as allow list of characters. I've tried all the different interpolations OpenCV has to offer. Tesseract just constantly chokes on this value. It's a little frustrating because the only OCR I've found that works with this value is an A9T9 model(I think) through the free api at ocr.space ( https://ocr.space/ocrapi#ocrengine2 ). Unfortunately there doesn't appear to be a way for me to run that locally, and the string seems like it should be simple for an OCR read. Any advice on poking Tesseract in the right way to read this, or some fancy filtering I could do to help make the image clearer for it? Thanks! -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ae2ae7cd-6cd1-44ef-843e-ef10a35929c6n%40googlegroups.com.
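For anyone trying to reproduce the setup described in this message, a minimal sketch of the Tesseract call follows. It only assembles the command line and does not run it. The file name `value_cell.png` and the choice of `--psm 7` (treat the image as a single text line) are assumptions for illustration, not details from the post. `tessedit_char_whitelist` is the usual variable for restricting the character set, though how strictly the LSTM engine honours it depends on the Tesseract version.

```python
def build_tesseract_command(image_path, psm=7, whitelist="0123456789$,."):
    """Assemble an argument list for the tesseract CLI (nothing is executed here)."""
    return [
        "tesseract",
        image_path,
        "stdout",                      # print the recognised text instead of writing a file
        "--psm", str(psm),             # page segmentation mode; 7 is an assumed choice
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


# Hypothetical input file, used only to show the resulting command.
command = build_tesseract_command("value_cell.png")
print(" ".join(command))
```

The list can be handed to `subprocess.run` unchanged, which avoids shell quoting problems with the `$` in the whitelist.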
[tesseract-ocr] Specify target file name patterns?
I may have missed it in the command-line parameters, but is there any way to specify the names of the target OCR-ed PDF files, instead of having a (Windows, in my case) file copy of the original file? To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d03c3420-a936-4cd3-be8a-27245beed648n%40googlegroups.com.
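A possible approach, assuming the plain `tesseract` command-line program is being used: its second positional argument is the output base name, and Tesseract appends `.pdf` to it when the `pdf` config is given. The sketch below derives an output base from each input file; the `_ocr` suffix, the folder name and the file name are made up for illustration.

```python
from pathlib import Path


def pdf_command(image_path, output_dir, suffix="_ocr"):
    """Build a tesseract call that writes <output_dir>/<stem><suffix>.pdf."""
    image = Path(image_path)
    # Tesseract adds the ".pdf" extension itself, so only the base name is passed.
    output_base = Path(output_dir) / f"{image.stem}{suffix}"
    return ["tesseract", str(image), str(output_base), "pdf"]


# Hypothetical file and folder names.
print(pdf_command("invoice.tif", "searchable"))
```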
[tesseract-ocr] tesstrain.sh no output
Hello,

I used tesstrain with tessdata_best German (deu) and two installed fonts. I ran into some problems:

1. Which langdata do I need for this (lstm or the normal one)? I built Tesseract and the training tools from source, but I do not have a langdata folder. Which files do I need?
2. In "Phase I: Generating training images" I receive the message "Stripped 66 unrenderable words" (the number varies). What does this mean?
3. At the end it says Tesseract failed loading language 'eng', but I used deu, so I don't understand why this error occurs.

See my terminal input/output below (I forgot the Latin.unicharset):

src/training/tesstrain.sh --fonts_dir /usr/local/share/fonts --lang deu --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir ./tessdata --fontlist "Desyrel" "Journal" --output_dir ~/tesstutorial/deueval

=== Starting training for language 'deu'
[So 2. Dez 22:54:19 CET 2018] /usr/local/bin/text2image --fonts_dir=/usr/local/share/fonts --font=Desyrel --outputbase=/tmp/font_tmp.yYD7WTtIyC/sample_text.txt --text=/tmp/font_tmp.yYD7WTtIyC/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.yYD7WTtIyC
Rendered page 0 to file /tmp/font_tmp.yYD7WTtIyC/sample_text.txt.tif

=== Phase I: ===
Rendering using Desyrel
[So 2. Dez 22:54:22 CET 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.yYD7WTtIyC --fonts_dir=/usr/local/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/deu-2018-12-02.1rr/deu.Desyrel.exp0 --max_pages=0 --font=Desyrel --text=./langdata/deu/deu.training_text
Rendering using Journal
[So 2. Dez 22:54:23 CET 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.yYD7WTtIyC --fonts_dir=/usr/local/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/deu-2018-12-02.1rr/deu.Journal.exp0 --max_pages=0 --font=Journal --text=./langdata/deu/deu.training_text
Stripped 66 unrenderable words
Rendered page 0 to file /tmp/deu-2018-12-02.1rr/deu.Journal.exp0.tif
...
...
Stripped 72 unrenderable words
Rendered page 4969 to file /tmp/deu-2018-12-02.l2i/deu.Journal.exp0.tif
Rendered page 4937 to file /tmp/deu-2018-12-02.l2i/deu.Desyrel.exp0.tif
Stripped 3 unrenderable words
Rendered page 4970 to file /tmp/deu-2018-12-02.l2i/deu.Journal.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[So 2. Dez 22:04:32 CET 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/deu-2018-12-02.l2i/deu.unicharset --norm_mode 1 /tmp/deu-2018-12-02.l2i/deu.Desyrel.exp0.box /tmp/deu-2018-12-02.l2i/deu.Journal.exp0.box
Extracting unicharset from box file /tmp/deu-2018-12-02.l2i/deu.Desyrel.exp0.box
Extracting unicharset from box file /tmp/deu-2018-12-02.l2i/deu.Journal.exp0.box
Wrote unicharset file /tmp/deu-2018-12-02.l2i/deu.unicharset
[So 2. Dez 22:06:19 CET 2018] /usr/local/bin/set_unicharset_properties -U /tmp/deu-2018-12-02.l2i/deu.unicharset -O /tmp/deu-2018-12-02.l2i/deu.unicharset -X /tmp/deu-2018-12-02.l2i/deu.xheights --script_dir=./langdata
Loaded unicharset of size 117 from file /tmp/deu-2018-12-02.l2i/deu.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:./langdata/Latin.unicharset
Warning: properties incomplete for index 3 = M
...
...
Warning: properties incomplete for index 114 = "
Warning: properties incomplete for index 115 = i
Warning: properties incomplete for index 116 = €
Writing unicharset to file /tmp/deu-2018-12-02.l2i/deu.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[So 2. Dez 22:06:21 CET 2018] /usr/local/bin/tesseract /tmp/deu-2018-12-02.l2i/deu.Desyrel.exp0.tif /tmp/deu-2018-12-02.l2i/deu.Desyrel.exp0 --psm 6 lstm.train
[So 2. Dez 22:06:21 CET 2018] /usr/local/bin/tesseract /tmp/deu-2018-12-02.l2i/deu.Journal.exp0.tif /tmp/deu-2018-12-02.l2i/deu.Journal.exp0 --psm 6 lstm.train
Error opening data file ./tessdata/eng.traineddata
Error opening data file ./tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
ERROR: /tmp/deu-2018-12-02.l2i/deu.Desyrel.exp0.lstmf does not exist or is not readable
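The log above names three files that the run tried to open: `./langdata/Latin.unicharset`, `./langdata/deu/deu.training_text` and `./tessdata/eng.traineddata`. The first and last could not be loaded. A small pre-flight check along these lines might catch such gaps before a long rendering phase; it is only a sketch, the function name is hypothetical, and the list covers just the files visible in this log, not everything tesstrain.sh may need.

```python
from pathlib import Path


def missing_training_inputs(langdata_dir, tessdata_dir, lang):
    """Return the files named in the log above that are not present on disk."""
    required = [
        Path(langdata_dir) / "Latin.unicharset",             # script unicharset
        Path(langdata_dir) / lang / f"{lang}.training_text",  # text rendered in Phase I
        Path(tessdata_dir) / "eng.traineddata",               # opened in Phase E of this log
    ]
    return [str(path) for path in required if not path.is_file()]


# Directory names taken from the command line in the message.
for path in missing_training_inputs("./langdata", "./tessdata", "deu"):
    print("missing:", path)
```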
[tesseract-ocr] Training with Font Files
Hello, I want to create a traineddata file based on a few different fonts. I'm using Tesseract 4.0 with LSTM. What's the easiest way? Is there a tool to train Tesseract with font files directly (.ttf files), or do I have to create text images based on the font and then use those to train? Thanks in advance. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/79f1f5e1-7473-467e-b5e2-f468a0d24225%40googlegroups.com.
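One route that appears elsewhere in this archive is the `text2image` training tool, which renders a text file in a given font into a page image and a matching box file. The sketch below only builds such a command from the flags seen in the tesstrain.sh log earlier in this digest (`--fonts_dir`, `--font`, `--text`, `--outputbase`); the font name and paths in the usage line are placeholders, and whether this is the easiest way for a particular setup is left open.

```python
def text2image_command(font, text_file, output_base, fonts_dir):
    """Build a text2image call that renders text_file in the given font."""
    return [
        "text2image",
        f"--fonts_dir={fonts_dir}",      # directory that contains the font files
        f"--font={font}",                # font name as fontconfig knows it
        f"--text={text_file}",           # UTF-8 training text
        f"--outputbase={output_base}",   # prefix for the rendered image and box file
    ]


# Placeholder values for illustration only.
print(" ".join(text2image_command("FreeSans", "training.txt", "eng.FreeSans.exp0", "/usr/share/fonts")))
```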
[tesseract-ocr] Handwriting training
Hello everyone, I am currently working on making a scanned fillable text document machine-readable. The document can be filled in with typed text as well as with handwriting. The quality of the scanned document is good enough and the font is not too small. I'm using Ubuntu 18.04, Python 3 and Tesseract 4.0. What is the best way to recognize both kinds of writing (in particular handwriting)? Do you have some easy steps for me to achieve the training for this problem? I found https://github.com/OCR-D/ocrd-train, which seems to make the training process a lot easier, right? Thanks in advance and best wishes. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/783dc358-e7b7-47f7-9a82-06552d3af37d%40googlegroups.com.
[tesseract-ocr] Configuring blacklists in windows 7
Hi All,

I'm using Tesseract 3.02.02 on a Windows 7 computer, via the gImageReader GUI front-end (so I don't have to go into the black stuff, MS-DOS). It works well, except for the same problem as everyone else: the character sequences fi and fl are replaced by the Unicode(?) characters 0xFB01 and 0xFB02, Latin small ligatures fi and fl. The solution in a few other threads is to put a blacklist in the config file, but I've tried and not succeeded. How do you actually do that in the Windows operating system?

Firstly, there is no config file as such. Tesseract is not installed, but has its files copied across to the directory:

C:\Users\rob\AppData\Local\Tesseract-OCR

Deeper down there are 3 more directories:

1. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata, which has the files: eng.traineddata eng.cube.fold eng.cube.lm_ eng.cube.word-freq eng.cube.size eng.cube.nn eng.cube.params eng.cube.bigrams eng.cube.lm eng.tesseract_cube.nn osd.traineddata, plus 2 directories:
2. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\configs, which has the files: ambigs.train api_config bigram box.train box.train.stderr digits hocr inter kannada linebox logfile makebox quiet rebox strokewidth unlv
3. C:\Users\rob\AppData\Local\Tesseract-OCR\tessdata\tessconfigs, which has the files: batch batch.nochop matdemo msdemo nobatch segdemo

Is one of these the configuration file I need to edit?

Note also that the standard Windows editor is Notepad, which gives you the option to save text as ANSI, UTF-8, Unicode or Unicode big-endian. Which is the correct one to use? ANSI is the default, but it won't allow you to save the ligatures, so it must be one of the others. I've tried them all, editing existing files and adding new files. It always failed.

More info: I know nothing about programming and have no compiler on my computer. I downloaded working executables from SourceForge or GitHub or Google Code or somewhere, and managed to get them going without too much fuss by following the instructions. I never did any training of Tesseract; it came already trained, presumably. But I can't find any simple configuration instructions to follow to get rid of the Latin fi and fl ligatures by editing Windows files. And I want to get rid of them: convert each to two standard English letters, for saving the files as English text.

Any help appreciated,
Regards, Rob

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/eef3df68-25db-4a95-b0ef-9786edbbb99a%40googlegroups.com.
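If the blacklist cannot be made to work, the ligatures can also be expanded after recognition. The following is a minimal post-processing sketch, assuming the OCR output is available as Unicode text (for example a file saved as UTF-8); the function name is made up for illustration and nothing here touches Tesseract's configuration.

```python
# U+FB01 and U+FB02 are the two ligature code points named in the message.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}


def expand_ligatures(text):
    """Replace the fi and fl ligatures with their two-letter spellings."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


print(expand_ligatures("\ufb01nal work\ufb02ow"))
```

Python's `unicodedata.normalize("NFKC", text)` would expand these two characters as well, but it also rewrites many other compatibility characters, so the explicit table is the narrower choice.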
Re: [tesseract-ocr] mftraining core dump - Illegal malloc request size on Ubuntu...
Thanks Nick! Regarding mftraining - I just couldn't see what was wrong, I must have gone a bit code blind there. Things are working now with a simple change to that one line...

mftraining -F font_properties -U unicharset.out -O unicharset.out2 eng.FreeSans.exp0.tr

So it's on to testing to see what difference all this can make. Good idea about the make file. Thanks once more!

-- Rob
-- Texthelp Ltd is a limited company registered in Belfast, N. Ireland with registration number NI31186 having its registered office and principal place of business at Lucas Exchange, 1 Orchard Way, Antrim, N. Ireland, BT41 2RU.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/8da43571-b54d-4237-bb2a-1f1c6c418992%40googlegroups.com.
[tesseract-ocr] mftraining core dump - Illegal malloc request size on Ubuntu...
Hi!

I've been trying to train Tesseract, and after a hard day getting all the dependencies downloaded and compiled I managed to get some way down the training documentation. I'm using Ubuntu 14.04 LTS and I've downloaded and compiled leptonica-1.70. I ended up creating a shell script after compiling and installing tesseract and tesseract-training...

Start of file (called commands.sh)...

#!/bin/bash
# Get a copy of Tesseract src code...
# svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
#
# Make a folder, let's call it 'training_text'
# mkdir training_text
# cd training_text
#
# Create a '1.txt' file containing the training text. (Try the Gutenberg project).
# Copy 'font_properties' from tesseract-ocr-read-only/training/langdata...
# cp ../tesseract-ocr-read-only/training/langdata/font_properties .
#
# Run this commands file...
# commands.sh
# Remove any previously generated files (you will get errors
# if this is the first time you run this, but it's OK)...
rm eng.FreeSans.exp0.box
rm eng.FreeSans.exp0.tif
rm eng.FreeSans.exp0.tr
rm eng.FreeSans.exp0.txt
rm shapetable
rm unicharset
rm unicharset.out
# Try to generate them again...
text2image --text=1.txt -outputbase=eng.FreeSans.exp0 --font='FreeSans' --fonts_dir=/usr/share/fonts/truetype/freefont
tesseract eng.FreeSans.exp0.tif eng.FreeSans.exp0 box.train
unicharset_extractor eng.FreeSans.exp0.box
set_unicharset_properties -U unicharset -O unicharset.out --script_dir=../tesseract-ocr-read-only/training/langdata
shapeclustering -F font_properties -U unicharset eng.FreeSans.exp0.tr
#shapeclustering -F font_properties -U unicharset.out eng.FreeSans.exp0.tr
mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
#mftraining -F font_properties -U unicharset.out -O eng.FreeSans.exp0.tr
#cntraining eng.FreeSans.exp0.tr

End of file

Once I get down to shapeclustering, I can't tell from the documentation which unicharset file to use: the first one produced, or the one produced by the 'set_unicharset_properties' command. Either way mftraining usually fails. Sometimes a second attempt at running shapeclustering and mftraining outside of this shell file works, but almost every time I get the following error...

Start of Error (mftraining)
Error: Illegal malloc request size!
Fatal error encountered!
== NULL:Error:Assert failed:in file globaloc.cpp, line 75
./commands.sh: line 40: 20958 Segmentation fault (core dumped) mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
End of Error

And even worse, the cntraining command doesn't work at all...

Start of Error (cntraining)
Error: Illegal short name for a feature!
Fatal error encountered!
== NULL:Error:Assert failed:in file globaloc.cpp, line 75
Segmentation fault (core dumped)
End of Error

What am I doing wrong? Any help would be appreciated. Also, I think adding this kind of shell script (or equivalent) to a 'fast start' for training could be useful.

Rob

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/63157b27-eb70-467c-bae9-69b12931dadb%40googlegroups.com.
Re: differences between IOS version and regular version
Read my thread more carefully. I did recompile against Tesseract 3.02.

Typos courtesy of my iPhone

On Jan 4, 2014, at 6:15 PM, Benjamin Sølberg benjamin.soelb...@gmail.com wrote:

Hi Robert, you probably already know this, but your project uses an old version/snapshot of Tesseract. Just a heads-up, as I was hoping that you were using the latest code :-) There has been at least one fix regarding the OS X version. Benjamin

On Friday, 3 January 2014 at 21:20:27 UTC+1, Robert Mathews wrote:

I recompiled against the latest Tesseract and leptonica-1.69. You can see the project I used to compile here: https://github.com/robmathews/compile-tesseract

Then I updated the sample iOS app to
- use tesseract 3.02 + leptonica-1.69
- allow choosing a photo from the photo library
and checked it into this fork: https://github.com/robmathews/OCR-iOS-Example

And that's all I know.
Re: Tessarect Version - Linux ?
yum info tesseract
apt-cache show tesseract-ocr

On Oct 17, 2013 5:40 AM, Sriram Varadharajan varadhuku...@gmail.com wrote:

I have Tesseract installed on a Linux machine and wanted to find out what version it is. I tried the command line tessarect --version and it does not print the version. Please let me know if someone has encountered the same. Thanks
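Where the binary is on the PATH, the version can also be read from Tesseract itself. The sketch below assumes a build that accepts `tesseract --version`; very old releases may not, which would fit the behaviour described in the question. The banner may go to stdout or stderr depending on the version, so both are checked. The helper names are hypothetical.

```python
import re
import subprocess


def parse_tesseract_version(banner):
    """Pull the version number out of a 'tesseract --version' banner, or return None."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", banner)
    return match.group(1) if match else None


def installed_tesseract_version():
    """Run the tesseract binary and report its version, or None if that fails."""
    try:
        result = subprocess.run(["tesseract", "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None  # binary not on the PATH (or the command name is misspelled)
    return parse_tesseract_version(result.stdout + result.stderr)
```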
Re: Is there a variable for tuning character spacing?
Hi Merve,

Thank you for your reply! I think my case is slightly different. I want to adjust the spacing threshold on *input* images, not the output text. In my case, I get *no* output, whereas you get output that is spaced improperly. I see your question here: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/cfffeed5da7ab757/08e70a97c50e32e7?lnk=gstq=space+threshold#08e70a97c50e32e7 Your image said "apple" and Tesseract produced "app le". In my case, I get no output.

Here are the two images: http://imgur.com/a/KSeiW The first produces no output, the second one produces 591.

Anyone else have a suggestion?

Thanks again, Rob

On Nov 28, 8:17 am, Merve Temizer mervet2...@gmail.com wrote:

I asked a similar question a while ago, and got a reply which said: unfortunately, there is no variable to tell Tesseract the space threshold between characters.

2011/11/27 Rob r...@wholewhale.com

Greetings, is there a variable for tuning character spacing? I ran tesseract on an image with three characters and it gave no result. Then I used photoshop to add space between the characters, and it came out perfectly. Since I'm new, I'm wondering, is there a simple setting I can adjust, or is this something that would require training? Thanks! Rob
Is there a variable for tuning character spacing?
Greetings, is there a variable for tuning character spacing? I ran tesseract on an image with three characters and it gave no result. Then I used photoshop to add space between the characters, and it came out perfectly. Since I'm new, I'm wondering, is there a simple setting I can adjust, or is this something that would require training? Thanks! Rob
Re: tif to pdf
On Wed, Mar 9, 2011 at 12:46 AM, Jeffrey Ratcliffe jeffrey.ratcli...@gmail.com wrote:

On 8 March 2011 20:25, UziTech tbri...@gmail.com wrote:

is there an easy way to make the output a pdf or doc or format other than txt?

I have built this functionality into gscan2pdf. Regards Jeff

If it is already in XML, then you may want to look at the package that xmlresume uses to take XML and output to PDF, HTML, manpages, RTF...
Errors on startup after compiled in VS 2010 and Windows 7
I have successfully managed to compile Tesseract in Visual Studio 2010, but the program hits an unhandled exception as soon as it executes:

Unhandled exception at 0x00427be8 in cntraining.exe: 0xC005: Access violation reading location 0x.

I'm not sure if this has anything to do with Windows 7, but I haven't been able to find anyone else having the same problem through a Google search. Anyone have ideas on how to fix this? Thank you
Re: Compillation in Visual Studio 2010
Thank you, that was very useful.

On Jan 3, 10:49 pm, SURAJ suraj.supe...@gmail.com wrote:

Hello all,

I have tried to compile Tesseract 3.0 in Visual Studio 2010. The good news is that it compiles, but it needs 2 or 3 small changes in the code due to the new C++ specification followed in VS2010 for templates. I am using XP SP3 and VS2010 Team edition. My observations are:

1. Due to the change in the template spec, you cannot pass NULL in template calls. To overcome this problem you need to typecast NULL. Fortunately all changes are in one file only: scrollview.cpp in the Viewer project, path ..\tesseract\viewer

Line 140:
Original: std::pair<ScrollView*, SVEventType> awaiting_list_any_window(NULL, SVET_ANY);
New: std::pair<ScrollView*, SVEventType> awaiting_list_any_window((ScrollView*)NULL, SVET_ANY);
Original: waiting_for_events[ea] = std::pair<SVSemaphore*, SVEvent*>(sem, NULL);
New: waiting_for_events[ea] = std::pair<SVSemaphore*, SVEvent*>(sem, (SVEvent*)NULL);

Line 430:
Original: std::pair<ScrollView*, SVEventType> ea(NULL, SVET_ANY);
New: std::pair<ScrollView*, SVEventType> ea((ScrollView*)NULL, SVET_ANY);

Line 433:
Original: waiting_for_events[ea] = std::pair<SVSemaphore*, SVEvent*>(sem, NULL);
New: waiting_for_events[ea] = std::pair<SVSemaphore*, SVEvent*>(sem, (SVEvent*)NULL);

I hope this information is useful for developers who want to use Visual Studio 2010.

SURAJ
Compressing a sequence of spaces
Tesseract is compressing a sequence of spaces in an input TIFF into a single space in the output text. I want to preserve the original spaces. Tesseract 2.03, Debian 4 (2.6.18-5-686 kernel), libtiff-tools, libtiff-dev. I'd appreciate any advice. Thanks, Rob
Re: Simple and fast editor of box files (QT)
Nice! Thanks.
Re: Simple and fast editor of box files (QT)
1. Ergonomically speaking, if you load a box file then the corresponding image should be loaded, and vice versa. I'm not aware of any reason that someone would want to load an image without a box file, or vice versa. Since Tesseract generates a box/txt file with the same name as the image, your editor should try to load both the image and the box file at the same time by default. If both files are not in the same directory (e.g. if you keep images in one directory and box files in another), then display a file browser window to have the user select the corresponding box file or image.

2. The characters I want to use are not mapped to any known keyboard layout, so I can't type them directly. The only option is to copy/paste, which is more tedious than typing the actual Unicode hex value. Maybe you could show both the character and the hex value on your pop-up, and use the TAB key to switch into hex mode, where the user would type 4 hex digits?
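The pairing rule suggested in point 1 can be sketched in a few lines. This is only an illustration of the idea, not code from the editor under discussion: the function name is hypothetical, and the assumption that images use a `.tif` extension comes from the file names elsewhere in this archive. A `None` result is the case where the editor would fall back to a file browser.

```python
from pathlib import Path


def find_companion(path, image_suffix=".tif"):
    """Return the box file for an image (or the image for a box file) if it exists."""
    path = Path(path)
    # Swap only the final extension, so "eng.FreeSans.exp0.tif" maps to "eng.FreeSans.exp0.box".
    wanted = image_suffix if path.suffix.lower() == ".box" else ".box"
    candidate = path.with_suffix(wanted)
    return candidate if candidate.is_file() else None
```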
Re: Great tool for working with unicode
Copy and paste the following text into the basic Notepad application. It will show up as little boxes. There's a good chance that your web browser doesn't have a Unicode-enabled font, so most of the following characters will display as garbage. The following characters are: circled E, circled F, circled L, circled L, circled U, circled P, circled S, circled S, circled T, circled U

ⒺⒻⓁⓁⓊⓅⓈⓈⓉⓊ

Or you can copy/paste those into the web app and view them: http://rishida.net/scripts/uniview/uniview.php?codepoints=24BA 24BB 24C1 24C1 24CA 24C5 24C8 24C8 24C9 24CA

On May 3, 5:35 am, 74yrs old withblessi...@gmail.com wrote:

Thanks. Very good idea. Will you please upload a sample of the little box?

On Sun, May 3, 2009 at 9:21 AM, Rob H. hksny...@gmail.com wrote:

I'm training Tess to recognize letters/numbers/symbols/etc. used for geometrical tolerancing and annotations (ASME Standard Y14.5). A lot of the characters used in the ASME standard come from all over the Unicode tables (although the characters/words are from the English language). This is part of a data validation project and I'm using OCR as part of the process. Since OCR is not 100% accurate, some of the validation will need to be done by hand (hopefully as little as possible). If the person checking the annotation sees a little box (i.e. an unprintable character) then it will slow down their job. For the moment, I check unprintable characters using the web app which I posted above. Once this goes into production, there will be a font (purchased or home-brewed) which can correctly draw all the letters/numbers/symbols/etc.

On May 2, 7:04 am, 74yrs old withblessi...@gmail.com wrote:

Hi Rob, I know about conversion.php, which I have been using for a long time for the Kannada project. Will you kindly explain your experiment step by step, with a sample if any? I want to have hands-on experience. BTW, which language were you training? Regards, sriranga (76yrs old)

On Sat, May 2, 2009 at 6:37 AM, Rob H. hksny...@gmail.com wrote:

Also, I got this e-mail from someone named Albert:

=
Hi Rob, Reply to your ps: That doesn't make any sense to me. You are asking for a set of glyphs that can represent every Unicode character in existence. Not only would such a file be *HUGE* in size, but I can't see it as serving any purpose to anyone (other than you, I guess)... So you should stop looking for it. - Albert
=

Arial Unicode covers ~50K of the ~140K characters defined at unicode.org. This font file is 22 MB. Wouldn't a complete Unicode font be around 70 MB? If you need a general text viewer which can legibly show documents that contain any number of the valid ~140K characters, then a complete font would be useful. Great advice Albert...*roll eyes*... stop looking... how about something a little more constructive? Maybe you know a strategy of mixing fonts to enable an application to view all the possible Unicode characters?
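The code points listed in the reply at the top of this message can also be checked without a capable font, by asking for each character's Unicode name. A short sketch using only the Python standard library follows; the function name is made up for illustration.

```python
import unicodedata


def decode_codepoints(hex_list):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(code, 16)) for code in hex_list.split())


# The same code points as in the uniview link above.
text = decode_codepoints("24BA 24BB 24C1 24C1 24CA 24C5 24C8 24C8 24C9 24CA")
for character in text:
    # Prints lines such as "U+24BA CIRCLED LATIN CAPITAL LETTER E".
    print(f"U+{ord(character):04X} {unicodedata.name(character)}")
```

Printing the name next to the code point gives a human checker something readable even when the glyph itself shows up as a little box.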
Re: Great tool for working with Unicode
Thanks for the reply Albert. I think I'll stop looking ... for a silver bullet and create a strategy which covers my set of glyphs (maybe the PDF solution will work). I thought Unicode did specify what a character looks like (on a basic level), and then fonts were responsible for their interpretation (which can be completely off). For example, WingDings is vastly different from what Unicode shows in their PDF renderings. I assumed that the characters drawn in those Unicode files were a basic rendition of what each character should look like. Do you have any experience creating fonts? I might create one... it doesn't have to be pretty... it just needs to help the user accomplish their task of comparing text extracted from the UI vs. text extracted from the model.
Re: Not able to use it
You probably need some language data. Check the downloads page again for this. Once you've unzipped your language, there should be a directory called tessdata under which you will see files with file extensions like DangAmbigs, inttemp, pffmtable, etc... This tessdata directory would be located here (in the same sub directory as tesseract.exe): \tesseract-2.03\tessdata All languages you download, or create, will be placed in that directory.
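If it helps, here is a rough Python sketch for checking whether a language's data files are where Tesseract expects them. It assumes the files are named `<lang>.<extension>` (for example `eng.inttemp`), which is my understanding of the 2.x layout rather than something stated above. The helper name `missing_language_files` is made up for illustration, and the extension list only covers names mentioned on this list, not a complete language pack.

```python
from pathlib import Path

# Extensions mentioned on this list; a real language pack contains more files.
SUFFIXES = ("DangAmbigs", "inttemp", "pffmtable", "unicharset")


def missing_language_files(tessdata_dir, lang, suffixes=SUFFIXES):
    """Return the expected '<lang>.<suffix>' files that are not in tessdata_dir.

    Assumption: files are named '<lang>.<suffix>', e.g. 'eng.inttemp'.
    """
    base = Path(tessdata_dir)
    return [
        f"{lang}.{suffix}"
        for suffix in suffixes
        if not (base / f"{lang}.{suffix}").is_file()
    ]


# Hypothetical location: a tessdata folder next to tesseract.exe.
print(missing_language_files("tessdata", "eng"))
```

An empty list means every file it looked for is present; anything listed is probably worth re-downloading or unzipping again.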
Re: Tesseract 3.0
But seriously... I'm writing a fairly interesting application using Tesseract for my client: Gulfstream Aerospace. I have no problem testing 3.0, especially if I can get some performance gains.
Re: Ehm Ehm
Start by reading through here: http://code.google.com/p/tesseract-ocr/wiki/ReadMe You probably need Visual Studio C++ Express (I think 2005 and 2008 will work). You open the *.sln file and build the solution.
Re: Great tool for working with unicode
I'm training Tess to recognize letters/numbers/symbols/etc. used for geometrical tolerancing and annotations (ASME Standard Y14.5). A lot of the characters used in the ASME standard come from all over the Unicode tables (although the characters/words are from the English language). This is part of a data validation project and I'm using OCR as part of the process. Since OCR is not 100% accurate, some of the validation will need to be done by hand (hopefully as little as possible). If the person checking the annotation sees a little box (i.e. an unprintable character), it will slow down their job. For the moment, I check unprintable characters using the webapp which I posted above. Once this goes into production, there will be a font (purchased or home-brewed) which can correctly draw all the letters/numbers/symbols/etc.

On May 2, 7:04 am, 74yrs old withblessi...@gmail.com wrote:
Hi Rob, I know about conversion.php, which I have been using for a long time for the Kannada project. Will you kindly explain your experiment step by step, with a sample if any? I wanted to have hands-on experience. BTW, which language were you training? Regards, sriranga(76yrs old)

On Sat, May 2, 2009 at 6:37 AM, Rob H. hksny...@gmail.com wrote:
Also, I got this e-mail from someone named Albert:
=
Hi Rob, Reply to your PS: That doesn't make any sense to me. You are asking for a set of glyphs that can represent every Unicode character in existence. Not only would such a file be *HUGE* in size, but I can't see it as serving any purpose to anyone (other than you, I guess)... So you should stop looking for it.
- Albert
=
Arial Unicode covers ~50K of the ~140K characters defined at unicode.org. This font file is 22 MB. Wouldn't a complete Unicode font be around 70 MB? If you need a general text viewer which can legibly show documents that contain any number of the valid ~140K characters, then a complete font would be useful. Great advice Albert...*roll eyes*... stop looking... how about something a little more constructive? Maybe you know a strategy of mixing fonts to enable an application to view all the possible Unicode characters?
Re: Great tool for working with unicode
Well Tesseract 2.0 has support for unicode, but many times it can be hard to understand the results of the OCR because the characters are not printable in many fonts. Typically in text editors (including Notepad++, UltraEdit, MS Word, Notepad, etc.), an unrecognized character will be displayed as a simple box. This is not readable. So, to verify your results, especially while training, you need to check how accurate the results came out. So, if you are using unprintable characters and don't have a font which recognizes them correctly, then this webapp will help you know which character the OCR recognized unless you know off the top of your head what hex value matches what characters you want.
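For anyone who would rather do this check locally, here is a minimal Python sketch that does roughly what the webapp does: it lists the code point and official Unicode name of each character in a string, so a box on screen can still be identified. It only uses the standard `unicodedata` module; the function name `describe` and the sample string are just for illustration.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    rows = []
    for ch in text:
        # Characters without an assigned name get a placeholder instead of an error.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


# Sample input: circled letters, which many fonts draw as empty boxes.
sample = "\u24ba\u24bb\u24c1"
for _ch, code_point, name in describe(sample):
    print(code_point, name)
```

Feeding it the text that Tesseract wrote out (read as UTF-8) should show which characters were actually recognized, even when the editor cannot draw them.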
Re: Great tool for working with unicode
Also, I got this e-mail from someone named Albert:
=
Hi Rob, Reply to your PS: That doesn't make any sense to me. You are asking for a set of glyphs that can represent every Unicode character in existence. Not only would such a file be *HUGE* in size, but I can't see it as serving any purpose to anyone (other than you, I guess)... So you should stop looking for it.
- Albert
=
Arial Unicode covers ~50K of the ~140K characters defined at unicode.org. This font file is 22 MB. Wouldn't a complete Unicode font be around 70 MB? If you need a general text viewer which can legibly show documents that contain any number of the valid ~140K characters, then a complete font would be useful. Great advice Albert...*roll eyes*... stop looking... how about something a little more constructive? Maybe you know a strategy of mixing fonts to enable an application to view all the possible Unicode characters?
Re: What causes this error? 6 classes in inttemp while unicharset contains 7
Do you know what the problem is already? Maybe you could point me to the method which needs to be fixed, and explain the problem? PS: Is it just my VS2005 setup, or am I seeing the for/if/function statements split up over multiple rows (must be some leftover HP stuff)?
Re: My training methodology does not work :(
Have you tried building a dictionary of words (word-dawg + freq-dawg)? At least try putting those 2 words (mother india) into your dictionary. I am starting to train the OCR to recognize special characters and I've considered this single-character approach, but have not yet tried it. I am leaning towards building a page of special characters now.

On Apr 17, 3:18 pm, Debayan Banerjee debaya...@gmail.com wrote:
As much as I hate to admit it, my training methodology http://hacking-tesseract.blogspot.com/2009/04/my-old-training-methodo... of generating one image per akshar does not work. I hate to say it since I put some effort into writing the Python code that does this. Well, the reason is probably that the Tesseract OCR training code looks for characters on a single line during training, as it also extracts baseline metrics http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract for rare/strange characters like numerals. As such it may not be able to extract all the information it needs for its training. Or maybe the Tesseract OCR training code accepts only a small number of .tr files, and since my code generates thousands of .tr files, it becomes useless.

Let me show you an example of how miserably it failed. I decided to test the training on the string ভারত মাতা (Bharat Mata, which means Mother India). I generated the tiff image using Pango rendering. Then I generated 7 images per sample of ভ র ত ম and used the subsequently generated training files for OCR. The result was this: মভতভ

Yes, I know. The result is absolutely outrageous. However, what if I still auto-generate images of characters, but this time adjacent to each other on single lines? Will it work? You may go through http://hacking-tesseract.blogspot.com/ for all my work.

-- Be Intelligent, Use GNU/Linux http://debayanin.googlepages.com/ http://debayan.wordpress.com http://lug.nitdgp.ac.in
What causes this error? 6 classes in inttemp while unicharset contains 7
I've read through all the related threads on the topic, but I don't understand what causes this problem. Does this error even matter, since I can modify the unicharset file by removing the extra characters? I ask because I'm having this problem with some fonts which I am training now. I have trained 2 fonts without this problem, and then there are 2 fonts which have it. In the end, I'm wondering how good the OCR will be if I remove special Unicode characters from the unicharset which are needed in my results.

- Some analysis -

I am running with the 2.03 code, which I downloaded and compiled. Here is a sample error:

APPLY_BOXES: boxfile 1/2/h ((47,1546),(80,1594)): FAILURE! box overlaps blob in labelled word
APPLY_BOXES: ALSO ignoring corrupted char blk:1 row:1 T

When Tess generated the box file, it had created one box around 2 letters, so I modified the box file to have 2 boxes instead of 1. This error complains about one of my boxes... I noticed that the *.tr file is missing the two letters which were in these two boxes which I created. So, based on this quote from the training page, I suppose splitting a box is not supported?

"If you didn't successfully space out the characters on the training image, some may have been joined into a single box. In this case, you can either remake the images with better spacing and start again, or if the pair is common, put both characters at the start of the line, leaving the bounding box to represent them both."
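As a sanity check before re-running training, something like the following Python sketch could list the distinct symbols in a box file and flag boxes whose rectangles overlap. It assumes each box file line has the form `symbol left bottom right top` with integer pixel coordinates, which is how I read the 2.x training page. The helper names are invented, only the `h` box uses coordinates from the error above, and the other two sample lines are made up. Whether an overlap reported here is exactly what APPLY_BOXES objects to is a guess on my part.

```python
def parse_box_lines(lines):
    """Parse 'symbol left bottom right top' lines into tuples."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(value) for value in parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def overlapping_pairs(boxes):
    """Return every pair of boxes whose rectangles intersect."""
    pairs = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            overlaps_x = a[1] < b[3] and b[1] < a[3]
            overlaps_y = a[2] < b[4] and b[2] < a[4]
            if overlaps_x and overlaps_y:
                pairs.append((a, b))
    return pairs


# The 'h' box comes from the error message; the 'T' and 'e' boxes are invented.
sample = ["h 47 1546 80 1594", "T 70 1546 100 1594", "e 110 1546 140 1594"]
boxes = parse_box_lines(sample)
print(sorted({box[0] for box in boxes}))  # distinct symbols in the box file
print(overlapping_pairs(boxes))           # candidate problem boxes
```

Comparing the distinct-symbol list against the unicharset might also help narrow down which character accounts for the 6 vs 7 mismatch.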
Re: How much training data - if characters are always the same
-Ray S. I noticed in this thread: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/71a41fa5065855c9 You said: The training process usually uses a minimum of 5-10 samples of each character in each font. When my character is drawn in the exact same size/shape/etc. on the image, but in different locations, does the training still need 5-10 samples of each character? Is the goal to have the OCR understand a certain character when it is next to other characters? I'm interested in understanding why (either way)...
Re: Tesseract 3.0???
Has Version 3.0 been discussed somewhere else on this Google group? I'm curious about the upcoming features.

On Feb 25, 7:49 pm, Ray Smith theraysm...@gmail.com wrote:
If everything goes according to plan, it should be available around the end of March. I can't promise anything though, other than that it *will* be worth the wait! Ray.

On Wed, Feb 11, 2009 at 8:32 PM, bharath bhooshan abbhoos...@gmail.com wrote:
We are eagerly waiting for that.

On Wed, Feb 11, 2009 at 11:37 PM, Swistak swistak...@gmail.com wrote:
Same question. Any approximate date will be appreciated.
Re: How to decrease Tif file size
Convert to Group 4 fax compression using command-line ImageMagick?

On 3/6/09, Rags2u raghu7...@gmail.com wrote:
Hi, I'm using Tesseract2.dll for my project. Tif files with sizes in KB work fine and are converted to text files. But Tif files with sizes in MB do not work; they are not converted to text files. Can anybody help me decrease the Tif file size? Or any other suggestion for this issue? Thanks in advance. Raghu.
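In case it is useful, here is a small Python sketch of that idea, wrapping the ImageMagick command line. It assumes ImageMagick is installed and that its `convert` program is on the PATH (newer ImageMagick releases call it `magick` instead). The function names and file names are placeholders. Group 4 only stores black-and-white images, so the sketch adds `-monochrome` first, which discards any gray or color information.

```python
import subprocess


def group4_command(src, dst, program="convert"):
    """Build the ImageMagick command that rewrites src as a Group 4 TIFF."""
    # -monochrome reduces the image to black and white, as Group 4 requires.
    return [program, src, "-monochrome", "-compress", "Group4", dst]


def compress_tiff(src, dst):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(group4_command(src, dst), check=True)


# Show the command without running it; the file names are placeholders.
print(" ".join(group4_command("scan.tif", "scan_g4.tif")))
```

Whether the smaller file fixes the Tesseract2.dll problem is untested here; this only addresses the file size.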