Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-08-01 Thread Jimmy O'Regan
2010/8/1 Zdenko Podobný: > > On 28.07.2010 17:02, Jimmy O'Regan wrote: >> > I grepped the code and it seems to be looking for something called > LANG.user-words, but that didn't seem to do anything -- I got the same > garbled text when I ran Tesseract 3 the second time. >

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-08-01 Thread Zdenko Podobný
On 28.07.2010 17:02, Jimmy O'Regan wrote: > I grepped the code and it seems to be looking for something called LANG.user-words, but that didn't seem to do anything -- I got the same garbled text when I ran Tesseract 3 the second time. >> Turns out T3 does

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-28 Thread Jimmy O'Regan
On 27 July 2010 20:49, Philip Pemberton wrote: > On 27/07/10 17:30, Jimmy O'Regan wrote: >>> >>> The Ubuntu wordlist is pretty big... 921 user-added words... >> >> As wordlists go, that's tiny :) > > Aye, but it's an exceptions list :) > Seems to contain a lot of fairly technical words and abbrevi

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-28 Thread Philip Pemberton
On 27/07/10 17:30, Jimmy O'Regan wrote: >> The Ubuntu wordlist is pretty big... 921 user-added words... > As wordlists go, that's tiny :) Aye, but it's an exceptions list :) Seems to contain a lot of fairly technical words and abbreviations which I assume aren't in the Tesseract base wordlist.

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-27 Thread Jimmy O'Regan
On 27 July 2010 13:35, Philip Pemberton wrote: > On 27/07/10 12:38, Jimmy O'Regan wrote: >>> At the risk of sounding like an idiot... how do you do that? >>> I didn't see anything about a user dictionary in the documentation... >>> >> It's a plain text file, one word per line, eng.user-words > > A
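
For anyone following along, the user dictionary described above (a plain text file, one word per line, named eng.user-words) could be generated with a short script. This is only a sketch: the helper name write_user_words and the output directory are hypothetical, and the thread does not say which directory Tesseract 3 reads the file from, so the script just writes it wherever you point it.

```python
from pathlib import Path


def write_user_words(words, directory, lang="eng"):
    """Write words, one per line, to <directory>/<lang>.user-words.

    Hypothetical helper: blank entries and duplicates are dropped and the
    original order is kept. The target directory is an assumption; copy the
    file to wherever your Tesseract install expects it.
    """
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)

    path = Path(directory) / f"{lang}.user-words"
    # One word per line, with a trailing newline after the last word.
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return path
```

Calling write_user_words(["MHz", "kHz"], ".") would write ./eng.user-words with two lines. Whether Tesseract then picks the words up is a separate question, as later messages in this thread show.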

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-27 Thread Philip Pemberton
On 27/07/10 12:38, Jimmy O'Regan wrote: >> At the risk of sounding like an idiot... how do you do that? >> I didn't see anything about a user dictionary in the documentation... >> > It's a plain text file, one word per line, eng.user-words Ah, there it is. I can see it in the Ubuntu 10.04 package

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-27 Thread Jimmy O'Regan
On 27 July 2010 11:28, Philip Pemberton wrote: > On 27/07/10 09:57, Jimmy O'Regan wrote: >> Have you tried adding 'MHz' to the user dictionary? > > At the risk of sounding like an idiot... how do you do that? > I didn't see anything about a user dictionary in the documentation... > It's a plain t

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-27 Thread Philip Pemberton
On 27/07/10 09:57, Jimmy O'Regan wrote: > Have you tried adding 'MHz' to the user dictionary? At the risk of sounding like an idiot... how do you do that? I didn't see anything about a user dictionary in the documentation... >> - The top line of text sometimes gets garbled (as in, read as rand

Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-27 Thread Jimmy O'Regan
On 26 July 2010 19:21, Philip Pemberton wrote: > Problem is, Tesseract 2.04 doesn't like quoted text: > > phil...@cheetah:~/elektor$ tesseract elek0002.tif elek0002_tess2 > Tesseract Open Source OCR Engine > tesseract: unicharset.cpp:76: const UNICHAR_ID > UNICHARSET::unichar_to_id(const char*, in

Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-07-26 Thread Philip Pemberton
Hi, I'm currently working on cataloguing about 20 years' worth of electronics magazines, books and journals, down to article level. Obviously, typing in the article names, page numbers and synopses isn't an option -- for a start it'd make my hands hurt (a lot!) and take a very long time... we'