Re: concatenating tr files

2013-04-22 Thread Shree Devi Kumar
Thanks, Zdenko. I'll change the filename and try using the /b switch with copy as suggested by Quan. I was trying to concatenate the files because http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 says: An alternative to multi-page tiffs is to create many single-page tiffs for > a s

Re: concatenating tr files

2013-04-22 Thread Quan Nguyen
.tr are binary files; as such, you should use: copy /b san.sanskrit2003.exp0*.tr san.sanskrit2003.exp2000.tr -- -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To post to this group, send email to tesseract-ocr@googlegroups.com To unsubscribe
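The `copy /b` advice above amounts to a byte-for-byte concatenation. A minimal cross-platform sketch in Python follows; the helper name `concat_binary` is made up for illustration, and lexicographic file order is an assumption, since the thread does not say whether order matters.

```python
import glob
import shutil
from pathlib import Path


def concat_binary(pattern, out_path):
    """Append every file matching `pattern` to `out_path`, byte for byte.

    Files are taken in sorted (lexicographic) name order. Returns the list
    of input files that were actually concatenated.
    """
    out = Path(out_path).resolve()
    # Skip the output file itself in case it also matches the pattern.
    parts = sorted(p for p in glob.glob(pattern) if Path(p).resolve() != out)
    with open(out, "wb") as dst:          # "wb": no newline translation
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
    return parts
```

With the file names from the message, this would be called as `concat_binary("san.sanskrit2003.exp0*.tr", "san.sanskrit2003.exp2000.tr")`.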

Re: Include Tesseract in C++ code

2013-04-22 Thread zdenko podobny
On Sat, Apr 20, 2013 at 11:06 PM, TedJ wrote: > >Have you looked at my "Using the latest Tesseract-OCR sources" page [1] > that explains how to use TortoiseSVN to get the latest sources? > > I tried installing Tortoise too. Couldn't install it either. I have > only downloaded and unzipped the

Re: concatenating tr files

2013-04-22 Thread zdenko podobny
I don't have a lot of time, so I just ran some simple tests on Linux, and here are the results: 1. Fix the name of the file: san.sanskrit2003.tr is not a correct filename. It should be something like san.sanskrit2003.exp1000.tr 2. I tried to use Linux cat instead of Windows copy (cat san.sanskrit2003.e

bbtesseract

2013-04-22 Thread Scott Guthery
Does anybody use bbtesseract? After download, execution of bbT_exe_00_06_46.7z on Windows 7 64-bit halts. The dll is present and on the path. Thanks for any insight. Cheers, Scott ** Exception Text ** System.BadImageFormatException: Could not load file or assembly 'Magi

Re: Training individual characters in an existing language

2013-04-22 Thread Attila Sukosd
Hi again, I've looked at the unicharambigs file, but I think the problem is elsewhere. In the attached image, you can see that the last word is "omkommet", but tesseract recognises it as "o

Re: Include Tesseract in C++ code

2013-04-22 Thread TP
On Sun, Apr 21, 2013 at 1:15 PM, TedJ wrote: > The following error has occurred during XML parsing: > File: I:\Android\Tesseract\tesseract-3.02.02\tesseract-ocr-3.02-API-Example-vs2008\APIExample\baseapitester\baseapitester.vcproj > Line: 27 > Column: 4 > Error Message: >

Re: Training individual characters in an existing language

2013-04-22 Thread Attila Sukosd
Wow, thank you for the detailed reply! I will give it a try! :) Best, Attila On Monday, April 22, 2013 11:04:32 AM UTC+2, sdk wrote: > > Please look at the unicharambigs file for your language. You can add these > substitutions to the same and recombine the traineddata without needing to > do

Re: Tesseract page segment mode

2013-04-22 Thread Sven Pedersen
It divides the page into segments. Sven On Sunday, April 21, 2013, Đỗ Ngọc Tuấn wrote: > I've tried using mode 2 (automatic page segmentation, but no OSD or OCR), but it did not show any results whatsoever. So what does this mode do?

How do I add this to unicharambigs file?

2013-04-22 Thread Shree Devi Kumar
While doing OCR with san.traineddata I am getting many cases where [ga ग] [virāma ्] [ZWJ], i.e. ग्‍ followed by ा, is being output instead of ग; similarly for श, ण, etc. The zero-width joiner is not a unit in the unichar file. And most half letters are shown with virāma, so I may have ग् in u

Re: Training individual characters in an existing language

2013-04-22 Thread Shree Devi Kumar
Please look at the unicharambigs file for your language. You can add these substitutions to that file and recombine the traineddata without needing to do any additional training. Please see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3, the section on "The last file (unicharambigs)"
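A substitution of this kind could be written as a unicharambigs line roughly as follows. This is a sketch that assumes the tab-separated v1 layout as I understand it from the wiki page linked above (count, space-separated units, count, space-separated units, type flag); the helper name `ambig_line` is hypothetical, and it assumes each character is a single unit in the unicharset.

```python
def ambig_line(wrong, right, mandatory=True):
    """Format one unicharambigs entry: misread sequence -> intended sequence.

    Assumed v1 layout: five tab-separated fields. Each character of `wrong`
    and `right` is treated as one unicharset unit (an assumption that may
    not hold for every script).
    """
    fields = [
        str(len(wrong)),            # number of units in the misread sequence
        " ".join(wrong),            # misread units, space-separated
        str(len(right)),            # number of units in the replacement
        " ".join(right),            # replacement units, space-separated
        "1" if mandatory else "0",  # assumed: 1 = mandatory, 0 = optional
    ]
    return "\t".join(fields)


# Example pairs taken from the misreadings reported in this thread.
lines = [ambig_line("á", "å"), ambig_line("nn", "mm", mandatory=False)]
```

A mandatory `nn` to `mm` rule would also rewrite words that legitimately contain `nn`, which is why the second example is marked optional here.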

Re: Training individual characters in an existing language

2013-04-22 Thread Shree Devi Kumar
See http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/combine_tessdata.1.html for instructions on how to unpack the unicharambigs file and how to overwrite it in the traineddata after update. Shree Devi Kumar भजन - कीर्तन - आरती
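The extract-edit-overwrite cycle described on that page can be sketched as two `combine_tessdata` invocations. The `-e` (extract) and `-o` (overwrite) options are the ones documented in the man page linked above; the helper name, the language code, and the directory layout below are assumptions for illustration only.

```python
import os


def ambigs_roundtrip_commands(lang, workdir="."):
    """Build the two combine_tessdata commands for editing unicharambigs.

    Returns (extract_cmd, overwrite_cmd) as argument lists suitable for
    subprocess.run. Nothing is executed here.
    """
    traineddata = os.path.join(workdir, f"{lang}.traineddata")
    ambigs = os.path.join(workdir, f"{lang}.unicharambigs")
    extract_cmd = ["combine_tessdata", "-e", traineddata, ambigs]
    overwrite_cmd = ["combine_tessdata", "-o", traineddata, ambigs]
    return extract_cmd, overwrite_cmd


# Typical use (not run here): execute extract_cmd, edit the extracted
# unicharambigs file by hand, then execute overwrite_cmd, e.g.
#   import subprocess
#   extract_cmd, overwrite_cmd = ambigs_roundtrip_commands("dan")
#   subprocess.run(extract_cmd, check=True)
```

Keeping a backup copy of the original traineddata before running the overwrite step is a sensible precaution.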

Training individual characters in an existing language

2013-04-22 Thread Attila Sukosd
Hi all, I'm trying to run some OCR on some old-ish Danish datasets from 1970+, and it seems like some of the characters are consistently recognized incorrectly: å => á, mm => nn, : => e, l => 1. Is there any way to improve on the recognition of these individual characters without having to retrain the c

How to deal with the part that can't be recognised?

2013-04-22 Thread Le Ji
input > tesseract img.jpg outputbase digits; output > 002249. The last digit is wrong. I know it's difficult for tesseract to recognise what it is. Can tesseract return a placeholder s

When part of the image can't be recognised, can tesseract return '#' or some other placeholder?

2013-04-22 Thread Le Ji
Example: my picture contains 6 digits, but sometimes 5 digits and a strange character. Can I ask tesseract to return a '~' or '#' when it can't recognise a character as a digit? When I send this image to
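One way to get a placeholder is to post-process the result yourself. The sketch below assumes you can already obtain a list of (character, confidence) pairs for the recognised text; how you get them is outside this example, and the function name, the 0-100 confidence scale, and the threshold of 60 are all hypothetical choices, not Tesseract defaults.

```python
def mask_uncertain(symbols, min_conf=60.0, allowed="0123456789", placeholder="#"):
    """Replace unexpected or low-confidence characters with a placeholder.

    `symbols` is an iterable of (text, confidence) pairs, one per recognised
    character. A character is kept only if it is in `allowed` and its
    confidence is at least `min_conf`.
    """
    allowed_set = set(allowed)
    out = []
    for text, conf in symbols:
        keep = text in allowed_set and conf >= min_conf
        out.append(text if keep else placeholder)
    return "".join(out)


# Made-up confidences for the six-digit example from this message.
example = [("0", 95), ("0", 93), ("2", 90), ("2", 91), ("4", 88), ("9", 31)]
masked = mask_uncertain(example)
```

The threshold would need tuning on real images; too high a value masks correct digits, too low a value lets misreadings through.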