[tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-09-26 Thread James Q
Hi Vipin
I didn't get much further with that wrapper I'm afraid. In the end, I went 
for building tesseract from the C++ source code.

On Tuesday, September 25, 2018 at 6:03:19 PM UTC+1, Vipin Tom Varghese 
wrote:
>
> Hi James, my apologies for contacting you out of the blue, but I had no other 
> options left. I've been trying to get Tesseract 4 working using the 
> tesseract.net wrapper, following the wiki here 
> <https://github.com/tvn-cosine/tesseract.net/wiki/Compiling-the-Tesseract-lib-for-Windows>, 
> but I'm unable to build from source. Could you share how you got it working?
>
> Thanks
> Vipin
>
> On Monday, 8 January 2018 15:33:50 UTC+5:30, James Q wrote:
>>
>> By the way I do have the Tesseract.net nuget package working ( 
>> https://www.nuget.org/packages/tesseract.net/ ), but have 2 issues with 
>> this:
>> 1.) I need to write a separate Bitmap -> Pix converter in C#
>> 2.) I haven't yet got whitelists/blacklists working
>>
>> Neither of these were issues with the tesseract 3 Charles Weld wrapper, 
>> hence my reason for trying to get the tdhintz one working (as this is based 
>> on Charles Weld's 3 wrapper).
>> Thanks
>> James
>>
>> On Monday, January 8, 2018 at 7:49:43 AM UTC, Mohammad Mahdizadeh wrote:
>>>
>>> I have the same problem 
>>>
>>>
>>> On Friday, January 5, 2018 at 8:38:08 PM UTC+3:30, James Q wrote:
>>>>
>>>> I'm trying to use this wrapper:
>>>> https://github.com/tdhintz/tesseract4win64
>>>>
>>>> It's an x64 .Net assembly with one main DLL (Tesseract.dll) and two 
>>>> dependency DLLs (liblept1741.dll and libtesseract400.dll). To start with 
>>>> I'm just trying to get a Visual Studio console app running. I've added 
>>>> Tesseract.dll in as a reference but it fails to recognize the dependency 
>>>> DLLs, throwing a runtime DllNotFoundException: "Failed to find library 
>>>> "liblept1741.dll" for platform x64.".
>>>>
>>>> I've tried placing the DLLs in the .\bin\x64\Debug folder and elsewhere 
>>>> along the project structure but no luck! I've tried manually adding them 
>>>> to an ItemGroup in the csproj file with 'CopyToOutputDirectory Always'. I've 
>>>> also tried setting TesseractEnviornment.CustomSearchPath in my Main class, 
>>>> but although the runtime searches in the correct folders, it still doesn't 
>>>> find the DLLs. My app is for x64 so the image type should match. I can't 
>>>> think of what else to try.
>>>>
>>>> If anyone has this working I would greatly appreciate any advice.
>>>>
>>>> Thanks in advance
>>>> James
>>>>
>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8b41f8d0-7526-44f0-b2ac-f3b62e164e4d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Training with a large number of LSTMF files

2018-09-11 Thread James Q
Thank you Shree
I ran with --debug_interval -1 as you suggested and I can see one iteration 
showing one text line from a given font (lstmf) and then the next iteration 
showing one text line from the next font. This suggests that, to use all my 
training data, I would need a number of iterations equal to *[number of 
training_text lines] * [number of lstmf files]*? E.g. if my training text 
is 100 lines and I have 2000 lstmf files, I need 200,000 iterations. Is that 
right?
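
To make the arithmetic in my question concrete, here is a minimal Python 
sketch. It assumes (from what I saw in the debug output, not from any 
documentation) that one iteration consumes one text line from one lstmf 
file. The function name and the optional passes parameter are mine, purely 
for illustration.

```python
def iterations_for_full_pass(lines_per_lstmf, lstmf_files, passes=1):
    """Iterations needed to see every line once per pass.

    Assumption: each training iteration uses exactly one text line
    from one lstmf file (observed behaviour, not confirmed).
    """
    if lines_per_lstmf < 0 or lstmf_files < 0 or passes < 0:
        raise ValueError("counts must be non-negative")
    return lines_per_lstmf * lstmf_files * passes


# 100 training_text lines per font, 2000 fonts (lstmf files)
total = iterations_for_full_pass(100, 2000)
print(total)
```

If that assumption holds, the product for 100 lines and 2000 files is 
200,000 iterations for a single pass over the data.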

Apologies if I am asking silly questions - I am new to tesseract training.

On Tuesday, September 11, 2018 at 1:57:34 PM UTC+1, ProgressNotPerfection 
wrote:
>
> Hi Tesseract Group
> I am trying to train Tesseract to recognize handwritten characters and 
> have prepared several thousand lstmf files (from tif/box sets) so I can 
> finetune the best trained eng.traineddata. I read elsewhere on this forum that 
> a low number of iterations (say 300 - 400) is recommended when finetuning 
> to avoid overfitting. In my case, though, it appears that if I choose a low 
> number of iterations, only (approximately) that number of lstmf files get 
> loaded by the training process. I assumed that each iteration would be a 
> training pass over all the lstmf files. Below is my script (which assumes 
> my lstmf files are ready in trained_output_dir). How should I amend this so 
> that it loads all my lstmf files? Should the number of iterations be 
> greater than the number of lstmf files? ... or is there a maximum number of 
> lstmf files that can be used for training at once?
>
> Any help would be much appreciated
> Thanks
>
> #! /bin/bash
> #
> # Script to finetune a language traineddata file for a set of
> # pre built lstmf files and a starter traineddata
> # for tesseract4.0.0-beta
> # Modify directory paths and filenames as required for your setup.
> #
>
> Lang=eng
> bestdata_dir=~/tesseract-ocr/tessdata_best
> tesstrain_dir=~/tesseract-ocr/src/training
> trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact
>
> echo "## EXTRACT BEST LSTM MODEL ##"
> combine_tessdata -e $bestdata_dir/$Lang.traineddata \
>   $bestdata_dir/$Lang.lstm
>
> echo "## LSTM TRAINING ##"
> echo " running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #"
>
> lstmtraining \
> --continue_from  $bestdata_dir/$Lang.lstm \
> --net_spec '[1,49,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c78]' \
> --old_traineddata  $bestdata_dir/$Lang.traineddata \
> --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
> --max_iterations 400 \
> --debug_interval 0 \
> --train_listfile $trained_output_dir/$Lang.training_files.txt \
> --model_output  $trained_output_dir/finetune
>
> echo "## BUILD FINETUNED MODEL ##"
> echo " Building final trained file $Lang-finetune-$Lang.traineddata"
> lstmtraining \
> --stop_training \
> --continue_from $trained_output_dir/finetune_checkpoint \
> --old_traineddata  $bestdata_dir/$Lang.traineddata \
> --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
> --model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"
>
>
>
>





[tesseract-ocr] Re: BOX File Automatic Generation using the word coordinates

2018-08-24 Thread James Q
Correct me if I am wrong, but shouldn't each character be bound by its own 
box? Try opening this in JTessBoxEditor ( 
http://vietocr.sourceforge.net/training.html ).
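
To see quickly whether several characters share one box, a small Python 
sketch like the one below could be run over the box file. It assumes the 
usual box line layout of "symbol left bottom right top page"; the function 
names are made up for illustration. Whether shared boxes are actually the 
cause of the training failure is not something this check can tell you.

```python
from itertools import groupby


def parse_box_line(line):
    """Split one box-file line into (symbol, (left, bottom, right, top), page)."""
    parts = line.rstrip().rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("not a box line: %r" % line)
    symbol = parts[0]  # may itself be a space or a tab
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, (left, bottom, right, top), page


def shared_boxes(lines):
    """Return (box, text) for runs of consecutive glyphs that reuse one box."""
    parsed = [parse_box_line(l) for l in lines if l.strip()]
    runs = []
    for box, group in groupby(parsed, key=lambda item: item[1]):
        # Ignore space/tab entries; only count visible symbols.
        symbols = [sym for sym, _, _ in group if sym.strip()]
        if len(symbols) > 1:
            runs.append((box, "".join(symbols)))
    return runs


sample = [
    "D 1107 191 1167 209 0",
    "a 1107 191 1167 209 0",
    "t 1107 191 1167 209 0",
    "e 1107 191 1167 209 0",
    ": 1107 191 1167 209 0",
    "  1107 191 1167 209 0",
    "2 1202 192 1294 209 0",
]
for box, text in shared_boxes(sample):
    print(box, text)
```

On the first few lines of the file quoted below, this reports that the 
five symbols of "Date:" all use the same word-level box.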

On Thursday, August 23, 2018 at 12:33:07 PM UTC+1, eng.ahmed@gmail.com 
wrote:
>
> I want to train Tesseract 4 using images and ground truth text. I have 
> generated the BOX file for a page in the format below.
>
>
> D 1107 191 1167 209 0 
> a 1107 191 1167 209 0 
> t 1107 191 1167 209 0 
> e 1107 191 1167 209 0 
> : 1107 191 1167 209 0 
>   1107 191 1167 209 0 
> 2 1202 192 1294 209 0 
> 0 1202 192 1294 209 0 
> 1 1202 192 1294 209 0 
> 8 1202 192 1294 209 0 
> - 1202 192 1294 209 0 
> 1 1202 192 1294 209 0 
> - 1202 192 1294 209 0 
> 3 1202 192 1294 209 0 
>  1294 209 1295 210 0 
> W 157 237 313 323 0 
> a 157 237 313 323 0 
> l 157 237 313 323 0 
>   157 237 313 323 0 
> m 321 256 402 322 0 
>   321 256 402 322 0 
> a 406 256 454 323 0 
>   406 256 454 323 0 
> r 460 237 525 323 0 
> t 460 237 525 323 0 
>   460 237 525 323 0 
> e 967 261 1041 280 0 
> - 967 261 1041 280 0 
> S 967 261 1041 280 0 
> D 967 261 1041 280 0 
> R 967 261 1041 280 0 
>   967 261 1041 280 0 
> s 1049 261 1113 281 0 
> e 1049 261 1113 281 0 
> r 1049 261 1113 281 0 
> i 1049 261 1113 281 0 
> a 1049 261 1113 281 0 
> l 1049 261 1113 281 0 
>   1049 261 1113 281 0 
> n 1123 267 1167 281 0 
> o 1123 267 1167 281 0 
> . 1123 267 1167 281 0 
> : 1123 267 1167 281 0 
>   1123 267 1167 281 0 
>   1203 263 1372 281 0 
> C 1203 263 1372 281 0 
> A 1203 263 1372 281 0 
> 1 1203 263 1372 281 0 
> 8 1203 263 1372 281 0 
> 0 1203 263 1372 281 0 
> 1 1203 263 1372 281 0 
> 0 1203 263 1372 281 0 
> 3 1203 263 1372 281 0 
> 0 1203 263 1372 281 0 
> 6 1203 263 1372 281 0 
> 2 1203 263 1372 281 0 
> 2 1203 263 1372 281 0 
> 3 1203 263 1372 281 0 
>  1372 281 1373 282 0
>
>
> where I added the word coordinates for every letter (as for "Date:") and 
> broke the line using *\t*.
>
> Here is an example of the tif and box file. The problem is that I get a CTC 
> compute failure, and I have the same issue when I try to generate the BOX 
> file with Tesseract.
>
>
> How do I make a valid BOX file for a page?
>
>
>
>  
>
>



[tesseract-ocr] Re: why such simple word can't be recognized?

2018-08-18 Thread James Q
You're welcome. Please do reply and let us know if the fine tuning worked.

On Thursday, August 16, 2018 at 6:24:27 PM UTC+1, xll...@gmail.com wrote:
>
> Many thanks for your information.
> I have tried a lot of scaling, from factors y1.0, x0.8 to y1.0, x0.5; none 
> of them worked.
> I will try some fine-tune training with this font.
> Thanks again for letting me know this font's name :)
>
>
> On Wednesday, August 15, 2018 at 9:49:01 PM UTC+8, James Q wrote:
>>
>> It looks like you may need to fine-tune Tesseract on this 
>> particular font. From the letters in your images it looks like 'Bevan', 
>> which you can download from here:
>>
>> https://www.fontsquirrel.com/fonts/bevan
>>
>> If you are unable to train Tesseract, I have sometimes had success by 
>> stretching the image (changing its aspect ratio). In this case it is 
>> quite a fat font so stretching it taller might improve the result.
>>
>> Hope this helps
>> James
>>
>> On Tuesday, August 14, 2018 at 1:41:06 PM UTC+1, zwwts...@gmail.com 
>> wrote:
>>>
>>> It's interesting. I've tried many ways to process the image: binary inversion, 
>>> cutting, resizing. 
>>> I've tried the OEMs of 3.0.0 and 4.0.0, and PSMs 3/6/7. 
>>> I thought one of them might work, but none did, and nothing 
>>> came out.
>>> Maybe this special font just hits some weakness of Tesseract.
>>>
>>>
>>> On Tuesday, August 14, 2018 at 6:59:01 PM UTC+8, xll...@gmail.com wrote:
>>>>
>>>> I use OpenCV to extract chars from an image and combine them together, but 
>>>> Tesseract fails to recognize the result.
>>>> I have tested with the parameters "-c 
>>>> tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.-\\'"
>>>> "-psm 7" and "-psm 8", still with no luck.
>>>> Please see the attachment, ears.png.
>>>>
>>>> But some others were successful, like godmother.png.
>>>>
>>>> Could someone advise me, please?
>>>>
>>>



[tesseract-ocr] Re: why such simple word can't be recognized?

2018-08-15 Thread James Q
It looks like you may need to fine-tune Tesseract on this particular 
font. From the letters in your images it looks like 'Bevan', which you can 
download from here:

https://www.fontsquirrel.com/fonts/bevan

If you are unable to train Tesseract, I have sometimes had success by 
stretching the image (changing its aspect ratio). In this case it is 
quite a fat font so stretching it taller might improve the result.
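
As a rough illustration of what I mean by stretching taller, here is a 
minimal Python sketch using nearest-neighbour row repetition on a greyscale 
image held as a list of rows. The function name and the factor of 1.5 are 
only examples; in practice you would do this with whatever image library 
you already use, and the best factor has to be found by trial and error.

```python
def stretch_vertically(gray, factor=1.5):
    """Return a copy of the image made `factor` times taller.

    `gray` is a list of rows of pixel values. Rows are repeated
    (nearest neighbour); the width is left unchanged.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not gray:
        return []
    new_height = max(1, round(len(gray) * factor))
    last = len(gray) - 1
    return [list(gray[min(last, int(y / factor))]) for y in range(new_height)]


image = [
    [255, 0, 255],
    [0, 0, 0],
]
taller = stretch_vertically(image, factor=2.0)
print(len(image), "->", len(taller), "rows")
```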

Hope this helps
James

On Tuesday, August 14, 2018 at 1:41:06 PM UTC+1, zwwts...@gmail.com wrote:
>
> It's interesting. I've tried many ways to process the image: binary inversion, 
> cutting, resizing. 
> I've tried the OEMs of 3.0.0 and 4.0.0, and PSMs 3/6/7. 
> I thought one of them might work, but none did, and nothing came 
> out.
> Maybe this special font just hits some weakness of Tesseract.
>
>
> On Tuesday, August 14, 2018 at 6:59:01 PM UTC+8, xll...@gmail.com wrote:
>>
>> I use OpenCV to extract chars from an image and combine them together, but 
>> Tesseract fails to recognize the result.
>> I have tested with the parameters "-c 
>> tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz,.-\\'"
>> "-psm 7" and "-psm 8", still with no luck.
>> Please see the attachment, ears.png.
>>
>> But some others were successful, like godmother.png.
>>
>> Could someone advise me, please?
>>
>



[tesseract-ocr] Re: tesseract does not recognize grey colored fonts in the images..

2018-07-31 Thread James Q
It could be that a threshold operation is taking place at a lower 
brightness than your grey text. Try binarizing the image with a high 
threshold value (e.g. 200) before sending it to Tesseract; this should make 
all the text black.
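
As a minimal sketch of that idea in Python (on a greyscale image held as a 
list of rows, values 0-255; the function name is just for illustration and 
200 is only a starting value to tune):

```python
def binarize(gray, threshold=200):
    """Turn every pixel darker than `threshold` black, the rest white."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


# Light grey text (150) and black text (40) on a white background.
image = [
    [255, 255, 255, 255],
    [255, 150, 40, 255],
    [255, 255, 255, 255],
]
for row in binarize(image):
    print(row)
```

With a high threshold the light grey pixels end up black together with the 
genuinely black ones, instead of being washed out to white.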

On Saturday, July 28, 2018 at 4:00:16 PM UTC+1, Yogesh Sanchihar wrote:
>
> If the text is not black but light grey, Tesseract does not 
> recognize it.
>
> Are there any solutions to this problem?
>
> I have attached images of the sample bill.
>
> Suppose I want to extract Base Fare
>
> Base Fare  - *Rs 500*
>
> But since Base Fare is light grey, Tesseract does not recognize it at 
> all.
>
>
>



[tesseract-ocr] Creating traineddata with specific wordlist

2018-07-17 Thread James Q
Hi
I'm trying to create a traineddata with a specific word list. What I have 
done so far is:
1.) Create specific files in langdata/eng:

   - eng.wordlist (containing my specific words)
   - eng.finetune.training_text (representative text containing only chars 
   found in my words)
   - eng.numbers and eng.punc (original English versions but removing chars 
   not present in my words)

2.) Run tesstrain.sh on a couple of fonts to create a starter 
eng.traineddata, and run combine_tessdata -u to extract the new dawg files
3.) Check eng.charset_size=76.txt contains the expected chars and run 
wordlist2dawg -t to verify the wordlist matches the word-dawg
4.) Run combine_tessdata -o [best eng.traineddata] eng.word-dawg 
eng.punc-dawg eng.number-dawg eng.unicharset (to overwrite the original 
dawgs in the traineddata with my own).

At the moment I cannot get step 4 to work, the process simply adds my dawgs 
into the traineddata with shortened names alongside the original ones. I 
have tried renaming my files to match those listed by combine_tessdata -d but 
it still renames and adds them as below:



Can anyone suggest what I might be doing wrong, or how best to incorporate 
my specific dawgs into best traineddata?

Thanks
James




[tesseract-ocr] Re: Explanation for training_text and wordlist files

2018-07-06 Thread James Q
No tool I can think of. What I would do is edit the file in a large text 
file editor (such as EmEditor) to remove duplicate words. You could do this 
by replacing all spaces with newlines, then sorting and removing duplicates. 
After that you can randomize the unique list of words, add an appropriate 
distribution of punctuation characters and re-edit to create a block of 
text wrapped at, say, 100 characters. There are online tools to do the 
randomizing and wrapping.
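
If you would rather script it, the de-duplicate / shuffle / wrap steps can 
be sketched in a few lines of Python. This is only an illustration: the 
function name and the fixed seed are my own choices, and the punctuation 
step is left out because the right distribution depends on your language.

```python
import random
import textwrap


def build_training_text(raw_text, width=100, seed=0):
    """De-duplicate words, shuffle them and wrap the result to `width` columns."""
    words = sorted(set(raw_text.split()))  # unique words, in a stable order
    random.Random(seed).shuffle(words)     # repeatable shuffle
    return textwrap.fill(
        " ".join(words),
        width=width,
        break_long_words=False,  # keep every word intact
        break_on_hyphens=False,
    )


corpus = "the cat sat on the mat and the dog sat on the log"
print(build_training_text(corpus, width=20))
```

For a really large corpus you would read the input file in chunks rather 
than as one string, but the steps stay the same.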

Having said this I don't know how valuable it is to have training text 
containing specific words. I have been struggling myself to train on 
specific word lists without much success. I think training text is just 
about a representative distribution of characters. Please let me know if 
you have any insights on the wordlists in langdata as I'm a bit hazy there.

Thanks
James



On Wednesday, July 4, 2018 at 9:02:13 AM UTC+1, Dd U wrote:
>
> Hello guys.
>
>
> I want to add a new language script to Tesseract OCR and am researching 
> training data.
>
>
> I would like to know the following:
>
> 1. Is there any automatic tool that makes langdata training_text and 
>    wordlist files from a massive text?
> 2. Is there any documentation about preparing text data and an 
>    explanation of the text data files? I just saw the directory langdata/jpn/ and 
>    there are some files, but I have no idea about these files or how to 
>    create files like them. What rules should I use to create langdata files?
>
>



Re: [tesseract-ocr] Really poor performance with decimal numbers

2018-07-06 Thread James Q
Have you tried removing all surrounding whitespace from the image except 
for a thin border (say 8px thick)?
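
In case it helps, this is roughly what I mean, as a minimal Python sketch 
on a greyscale image held as a list of rows (0-255, white background). The 
function name is made up, and it treats anything darker than pure white as 
content, which a real screenshot with anti-aliasing may need loosening.

```python
def crop_to_content(gray, border=8, white=255):
    """Crop to the bounding box of non-white pixels, then add a white border."""
    rows = [y for y, row in enumerate(gray) if any(px < white for px in row)]
    if not gray or not rows:
        return [list(row) for row in gray]  # nothing to crop
    cols = [x for x in range(len(gray[0])) if any(row[x] < white for row in gray)]
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    content = [list(row[left:right + 1]) for row in gray[top:bottom + 1]]
    width = len(content[0]) + 2 * border
    pad = [white] * border
    blank_rows = lambda: [[white] * width for _ in range(border)]
    return blank_rows() + [pad + row + pad for row in content] + blank_rows()


# 20x30 white image with two dark pixels in the middle.
image = [[255] * 30 for _ in range(20)]
image[10][15] = 0
image[11][17] = 0
cropped = crop_to_content(image, border=8)
print(len(cropped), "rows x", len(cropped[0]), "cols")
```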

On Friday, July 6, 2018 at 4:52:08 PM UTC+1, Alberto Andreotti wrote:
>
> Hi,
>
> I tried it, with the same results. Also, all other cases work well.
>
> 23.78
> 15
> 1.6
> 1.7
> 1.2
> 1.3
> 1.4
> 1.8
> 1.9
>
> The only one that won't come out well is "1.5". That's pretty crazy. Is there any 
> config I could provide or something?
>
> thanks,
> Alberto.
>
> On Friday, July 6, 2018 at 11:38:45 AM UTC-3, shree wrote:
>>
>> try --psm 6
>>
>> On Fri, Jul 6, 2018 at 2:23 PM Alberto Andreotti  
>> wrote:
>>
>>> Hello,
>>>
>>> I'm having problems with the simplest image possible. 
>>> It's a screenshot from GEdit(Ubuntu's text editor), with numbers and 
>>> points. This is what I get,
>>>
>>> 23.78
>>> 15
>>> 1.6
>>> 17.6
>>> 25
>>> 225
>>> 2235
>>> 0.5
>>>
>>> Alberto
>>>
>>> version: tesseract 4.0.0-beta.1-285-g8d3f
>>> run from command line like this, tesseract test_image2.png  outputbase 
>>> --oem 1 --psm 1
>>>
>>>
>>
>>
>> -- 
>>
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>



[tesseract-ocr] Training for specific words

2018-07-04 Thread James Q
I would like to improve accuracy by training Tesseract 4 to use a 
context-specific list of words, for example countries. I have created an 
eng.finetune.training_text file containing country names as well as common 
country words (e.g. Republic, Island, New etc.). This (as far as I can tell) 
restricts the char set to those in the file and represents a reasonable 
distribution of characters used in countries. Doing this appears to improve 
accuracy in my testing so far.

What I have also tried is replacing eng.wordlist with the list of 
country-related words, but this makes accuracy worse, even though every word in my 
ground truth test set is present in that list.

Is eng.wordlist the wrong thing to change here? Is there another file (or 
combination of files) I need to put my words in?

Any help would be much appreciated.

Thanks
James



[tesseract-ocr] Re: Check validity of box and image files

2018-07-04 Thread James Q
As far as I can tell these look OK to me. They open correctly in 
JTessBoxEditor. If you are creating lstmf files for Tesseract 4, I think 
you may need space+tab in your end-of-line boxes (this is what worked for me 
anyway).

On Tuesday, July 3, 2018 at 7:15:46 AM UTC+1, chandra churh chatterjee 
wrote:
>
> We are trying to train Tesseract 4 on handwritten images and have 
> generated the following types of images and their respective box files. We 
> can't tell whether our box files are correct or not. Can anyone 
> please confirm?
>



Re: [tesseract-ocr] Re: Need Help To recognise handwriting using OCR

2018-06-28 Thread James Q
Hi Chinmay
How did you get on with this? I'd be interested to know your accuracy rate 
in interpreting block handwriting...
Thanks
James

On Tuesday, November 8, 2016 at 2:48:20 PM UTC, chinmay dhumal wrote:
>
> The handwriting would be that of a random NGO worker; the language would be English.
> And yes, the images will be available in good quality, as the user will be 
> snapping the picture at the moment he wants to get the output and send it 
> to the NGO server database.
>
> Thanks and Regards,
> Chinmay Dhumal
> +91-7755922327
>
> On Tue, Nov 8, 2016 at 3:09 PM, Tom De Costere wrote:
>
>> Can you post an image of the handwriting?
>>
>> The documents on which you will be performing OCR, are they available in 
>> good quality?
>> Otherwise you will have to perform image processing to improve the image 
>> quality (contrast / brightness / invert...)
>>
>> On Friday, November 4, 2016 at 12:04:59 PM UTC+1, chinmay dhumal wrote:
>>>
>>> Hi, I am a student. My friends and I are working on a project for NGOs 
>>> and need an OCR library that can recognize handwriting. We are 
>>> trying Tesseract OCR, but we don't know how to implement it and train 
>>> it accordingly. We would be grateful for your help and will be waiting for 
>>> your response.
>>>
>>
>
>



[tesseract-ocr] Re: Tesseract 4 Handwriting recognition

2018-06-27 Thread James Q
Hi Andreas, have you managed to get this installed on Windows 10?

On Wednesday, June 27, 2018 at 8:29:25 AM UTC+1, Andreas R wrote:
>
> Hello,
>
> is the new Tesseract 4 viable for Handwriting recognition?
>
> The FAQ says no.
> (
> https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-use-tesseract-for-handwriting-recognition
> )
>
> But the recommended Lipi Toolkit project was last updated Jun 25, 2013, 
> which gives the impression it's 
> not really up to date.
>
> So is the FAQ answer still the same for Tesseract 4?
>



Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-26 Thread James Q
How are your results so far, Navaneetha?
Mine are awful. I've trained Tesseract on 350 handwriting fonts and it 
recognizes text in those fonts well enough, but accuracy on actual 
handwriting is poor: about 10% word accuracy and 50% character accuracy. 
I'm thinking of creating some tif/box sets from actual handwriting next.
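
For anyone wanting to compare numbers, here is a minimal Python sketch of 
one common way to compute word and character accuracy (1 minus edit 
distance divided by the ground truth length). I am not claiming this is 
exactly how my figures above were produced; the function names and sample 
strings are purely illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def accuracy(ref, hyp):
    """1 - distance / len(ref), clamped at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


truth = "hello world"
ocr = "hallo world"
char_acc = accuracy(truth, ocr)
word_acc = accuracy(truth.split(), ocr.split())
print("char accuracy:", round(char_acc, 3), "word accuracy:", word_acc)
```

A single wrong letter costs a whole word, which is why word accuracy drops 
so much faster than character accuracy.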

Have you found some fonts to be better than others? I've tried to stick to 
mainly non-cursive block handwriting styles so far.
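
For anyone wanting to compare figures like the word and character accuracy above, here is a minimal sketch of one common way to compute them, using edit distance against a ground-truth transcription. This is an assumption about the metric, not necessarily how the numbers quoted in this thread were measured; the function names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def accuracy(truth, ocr, by_words=False):
    """Return 1 - (edit distance / reference length), floored at 0."""
    ref = truth.split() if by_words else truth
    hyp = ocr.split() if by_words else ocr
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))
```

Calling accuracy(truth, ocr) gives a character-level score and accuracy(truth, ocr, by_words=True) a word-level one, so both can be tracked from the same pair of strings.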



On Thursday, June 21, 2018 at 9:19:38 AM UTC+1, Navaneetha Bitla wrote:
>
> Yes, I've tried to train with these images, but it gives DPI and other errors. 
>
> Then I moved to TTF fonts, converted the TTF files to TIFF and trained on 
> that data, but the output is very bad. I don't know whether the bad results 
> come from the training process or the dataset.
>
> Still trying to make progress.
>
> On Thu, Jun 21, 2018 at 12:24 PM, chandra churh chatterjee <
> chandrachurh...@gmail.com > wrote:
>
>> Excuse me @Shree Devi Kumar, can you please tell me whether training data 
>> for Tesseract 4.0 would work better as images of handwritten paragraphs 
>> or as images of single characters, as follows?
>>
>> On Wed, Jun 20, 2018 at 9:00 PM Shree Devi Kumar wrote:
>>
>>> You will have better control on training if you use tesstrain.sh 
>>> provided with tesseract.
>>>
>>> On Wed, Jun 20, 2018 at 8:52 PM Navaneetha Bitla wrote:
>>>
>>>> http://www.1001fonts.com/handwritten-fonts.html.
>>>>
>>>> The above link has 1900+ fonts. I downloaded the TTF files of the fonts 
>>>> from that site and converted them to TIFF files online.
>>>>
>>>> Then I trained on the TIFF files (fonts) using Serak trainer.
>>>>
>>>>
>>>> If you get good accuracy, please forward the results so everyone can 
>>>> know and follow you.
>>>>
>>>> Thank you
>>>>
>>>> On Wed, Jun 20, 2018 at 3:13 PM, James Q wrote:
>>>>
>>>>> I'm going to be using tesseract 4 and using the tesstrain.sh script. 
>>>>> If I come across things that improve accuracy though I will let you know.
>>>>>
>>>>> Where did you find 1300 handwriting fonts?
>>>>>
>>>>> On Tuesday, June 19, 2018 at 5:19:54 PM UTC+1, Navaneetha Bitla wrote:
>>>>>>
>>>>>> I'm using Serak trainer to train Tesseract 3.5.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 19, 2018 at 9:29 PM, James Q  
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Navaneetha
>>>>>>> I am also looking to start training tesseract using handwritten 
>>>>>>> fonts and am about to start setting up my training environment. Are you 
>>>>>>> training tesseract 4 by following the guide at 
>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 
>>>>>>> ?
>>>>>>>
>>>>>>> If so are you fine tuning the existing english model, retraining 
>>>>>>> just the top layer(s) or training from scratch with your additional 
>>>>>>> fonts?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Jim
>>>>>>>
>>>>>>> On Tuesday, June 19, 2018 at 10:30:30 AM UTC+1, Navaneetha Bitla 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi, this is Navaneetha.
>>>>>>>>
>>>>>>>> I'm working on a handwritten character recognition project.
>>>>>>>>
>>>>>>>> I have trained 1300 different English handwriting fonts and moved 
>>>>>>>> the files into the tessdata directory.
>>>>>>>>
>>>>>>>> I tested Tesseract using the commands below:
>>>>>>>>
>>>>>>>> $ convert -density 300 input.png -depth 8 -strip -background white 
>>>>>>>> -alpha off out.tiff
>>>>>>>>
>>>>>>>> $ tesseract out.tiff eng
>>>>>>>>
>>>>>>>> The input.png uses the Alanis Handa font, which I have trained, but 
>>>>>>>> I'm not getting even 40% accuracy.
>>>>>>>>
>>>>>>>> Can someone help me?
>>

[tesseract-ocr] Tesseract Training using basic characters only

2018-06-25 Thread James Q
The text I want Tesseract to read will only contain the most basic 
characters. Is there a way of fine-tuning it so that it only includes 
basic upper/lower case letters, digits and punctuation marks? That way I 
could avoid 'c' getting misinterpreted as '¢', etc. Would simply passing 
a new 'training_text' and 'wordlist' into tesstrain.sh achieve this?
Thanks
James
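
To make the idea concrete, here is a minimal sketch of preparing a training text restricted to basic characters. It only filters the input file; whether tesstrain.sh then yields a model limited to these characters is exactly the open question above. The definition of "basic" (ASCII letters, digits, punctuation and space) and the function names are assumptions for illustration.

```python
import string

# Assumed definition of "basic": ASCII letters, digits, punctuation and space.
BASIC_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " ")


def keep_basic_lines(lines):
    """Keep only non-empty lines made up entirely of BASIC_CHARS."""
    kept = []
    for line in lines:
        text = line.rstrip("\n")
        if text and all(ch in BASIC_CHARS for ch in text):
            kept.append(text)
    return kept


def filter_training_text(src_path, dst_path):
    """Write the basic-only lines of src_path to dst_path; return how many were kept."""
    with open(src_path, encoding="utf-8") as src:
        kept = keep_basic_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(kept) + ("\n" if kept else ""))
    return len(kept)
```

The same filter could be applied to a wordlist file, since it works line by line.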



Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread James Q
Hi Shree, I'm trying out the script you posted earlier which is great so 
thank you! I was wondering how many fonts I can specify at once in the 
'fonts_for_training' list. I have run it with 9 fonts at once and that 
seems fine but I would like to do 100s or even 1000s if I can. Is this the 
best way or would I be better off creating the lstmf files in batches first?
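
For reference, a rough sketch of the batching alternative: split the font list into chunks and build one tesstrain.sh invocation per chunk, each writing its lstmf files to its own output directory. The script posted earlier in the thread is not reproduced here, so this uses tesstrain.sh's --fontlist option directly; the directory arguments and the batch naming scheme are placeholders.

```python
def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def tesstrain_commands(fonts, batch_size, fonts_dir, langdata_dir,
                       tessdata_dir, out_root):
    """Build one tesstrain.sh argument list per batch of fonts."""
    commands = []
    for n, batch in enumerate(batches(fonts, batch_size)):
        commands.append([
            "tesstrain.sh",
            "--lang", "eng",
            "--linedata_only",
            "--noextract_font_properties",
            "--fonts_dir", fonts_dir,
            "--langdata_dir", langdata_dir,
            "--tessdata_dir", tessdata_dir,
            "--output_dir", f"{out_root}/batch_{n:03d}",  # hypothetical layout
            "--fontlist", *batch,
        ])
    return commands
```

Each argument list can be passed to subprocess.run, so a failure in one batch does not lose the lstmf files already produced by the others.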

On Thursday, June 21, 2018 at 1:05:42 PM UTC+1, shree wrote:
>
> > Quite a few of these handwriting fonts are uppercase letters only (so 
> > lowercase letters come out as uppercase when typed). What is the best type 
> > of [lang].training_text data to use for training these - is it uppercase only? 
>
> It would depend on the application where training is being used.
>
> If you want support for both upper case and lower case, then make a list 
> of fonts that have only uppercase letters and create LSTMF files for those 
> with a training text that has only capitals. For the rest of the fonts, use a 
> normal training text with both upper and lower case. When running 
> lstmtraining, use both sets of LSTMF files.
>
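
A small sketch of the two-set approach described above, under the assumption that each tesstrain.sh run leaves behind a list file with one .lstmf path per line, and that lstmtraining is then given a single combined list (for example via its --train_listfile option). The helper names are made up for illustration.

```python
def uppercase_text(text):
    """Upper-case a training text for fonts that only have capital glyphs."""
    return text.upper()


def merge_listfiles(*listfile_contents):
    """Merge several lstmf list files, dropping blank lines and duplicate paths."""
    seen = set()
    merged = []
    for content in listfile_contents:
        for line in content.splitlines():
            path = line.strip()
            if path and path not in seen:
                seen.add(path)
                merged.append(path)
    return merged
```

The merged list keeps the original order, so the uppercase-only set and the mixed-case set simply end up one after the other in the combined file.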



Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-21 Thread James Q
Quite a few of these handwriting fonts are uppercase letters only (so 
lowercase letters come out as uppercase when typed). What is the best type of 
[lang].training_text data to use for training these - is it uppercase only?

On Thursday, June 21, 2018 at 10:24:11 AM UTC+1, shree wrote:
>
> I tried training with the handwriting font you mentioned in your first 
> message. 
>
> I think that font has the same shapes for capitals and lower case 
> letters.
>
> So recognition rates will be lower for it.
>
> On Thu 21 Jun, 2018, 1:49 PM Navaneetha Bitla wrote:
>
>> Yes, I've tried to train with these images, but it gives DPI and other errors. 
>>
>> Then I moved to TTF fonts, converted the TTF files to TIFF and trained on 
>> that data, but the output is very bad. I don't know whether the bad results 
>> come from the training process or the dataset.
>>
>> Still trying to make progress.
>>
>> On Thu, Jun 21, 2018 at 12:24 PM, chandra churh chatterjee <
>> chandrachurh...@gmail.com > wrote:
>>
>>> Excuse me @Shree Devi Kumar, can you please tell me whether training data 
>>> for Tesseract 4.0 would work better as images of handwritten paragraphs 
>>> or as images of single characters, as follows?
>>>
>>> On Wed, Jun 20, 2018 at 9:00 PM Shree Devi Kumar wrote:
>>>
>>>> You will have better control on training if you use tesstrain.sh 
>>>> provided with tesseract.
>>>>
>>>> On Wed, Jun 20, 2018 at 8:52 PM Navaneetha Bitla wrote:
>>>>
>>>>> http://www.1001fonts.com/handwritten-fonts.html.
>>>>>
>>>>> The above link has 1900+ fonts. I downloaded the TTF files of the 
>>>>> fonts from that site and converted them to TIFF files online.
>>>>>
>>>>> Then I trained on the TIFF files (fonts) using Serak trainer.
>>>>>
>>>>>
>>>>> If you get good accuracy, please forward the results so everyone can 
>>>>> know and follow you.
>>>>>
>>>>> Thank you
>>>>>
>>>>> On Wed, Jun 20, 2018 at 3:13 PM, James Q wrote:
>>>>>
>>>>>> I'm going to be using tesseract 4 and using the tesstrain.sh script. 
>>>>>> If I come across things that improve accuracy though I will let you know.
>>>>>>
>>>>>> Where did you find 1300 handwriting fonts?
>>>>>>
>>>>>> On Tuesday, June 19, 2018 at 5:19:54 PM UTC+1, Navaneetha Bitla wrote:
>>>>>>>
>>>>>>> I'm using Serak trainer to train Tesseract 3.5.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 19, 2018 at 9:29 PM, James Q  
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Navaneetha
>>>>>>>> I am also looking to start training tesseract using handwritten 
>>>>>>>> fonts and am about to start setting up my training environment. Are 
>>>>>>>> you 
>>>>>>>> training tesseract 4 by following the guide at 
>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 
>>>>>>>> ?
>>>>>>>>
>>>>>>>> If so are you fine tuning the existing english model, retraining 
>>>>>>>> just the top layer(s) or training from scratch with your additional 
>>>>>>>> fonts?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Jim
>>>>>>>>
>>>>>>>> On Tuesday, June 19, 2018 at 10:30:30 AM UTC+1, Navaneetha Bitla 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi, this is Navaneetha.
>>>>>>>>>
>>>>>>>>> I'm working on a handwritten character recognition project.
>>>>>>>>>
>>>>>>>>> I have trained 1300 different English handwriting fonts and 
>>>>>>>>> moved the files into the tessdata directory.
>>>>>>>>>
>>>>>>>>> I tested Tesseract using the commands below:
>>>>>>>>>
>>>>>>>>> $ convert -density 300 input.png -depth 8 -strip -background white 
>>>>>>>>> -

Re: [tesseract-ocr] Re: tesseract-ocr

2018-06-20 Thread James Q
I'm going to be using tesseract 4 and using the tesstrain.sh script. If I 
come across things that improve accuracy though I will let you know.

Where did you find 1300 handwriting fonts?

On Tuesday, June 19, 2018 at 5:19:54 PM UTC+1, Navaneetha Bitla wrote:
>
> I'm using Serak trainer to train Tesseract 3.5.
>
>
>
> On Tue, Jun 19, 2018 at 9:29 PM, James Q wrote:
>
>> Hi Navaneetha
>> I am also looking to start training tesseract using handwritten fonts and 
>> am about to start setting up my training environment. Are you training 
>> tesseract 4 by following the guide at 
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ?
>>
>> If so are you fine tuning the existing english model, retraining just the 
>> top layer(s) or training from scratch with your additional fonts?
>>
>> Thanks
>> Jim
>>
>> On Tuesday, June 19, 2018 at 10:30:30 AM UTC+1, Navaneetha Bitla wrote:
>>>
>>> Hi, this is Navaneetha.
>>>
>>> I'm working on a handwritten character recognition project.
>>>
>>> I have trained 1300 different English handwriting fonts and moved 
>>> the files into the tessdata directory.
>>>
>>> I tested Tesseract using the commands below:
>>>
>>> $ convert -density 300 input.png -depth 8 -strip -background white -alpha 
>>> off out.tiff
>>>
>>> $ tesseract out.tiff eng
>>>
>>> The input.png uses the Alanis Handa font, which I have trained, but 
>>> I'm not getting even 40% accuracy.
>>>
>>> Can someone help me?
>>>
>>>
>>> Thanks in advance.
>>>


[tesseract-ocr] Re: tesseract-ocr

2018-06-19 Thread James Q
Hi Navaneetha
I am also looking to start training tesseract using handwritten fonts and 
am about to start setting up my training environment. Are you training 
tesseract 4 by following the guide 
at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ?

If so are you fine tuning the existing english model, retraining just the 
top layer(s) or training from scratch with your additional fonts?

Thanks
Jim

On Tuesday, June 19, 2018 at 10:30:30 AM UTC+1, Navaneetha Bitla wrote:
>
> Hi, this is Navaneetha.
>
> I'm working on a handwritten character recognition project.
>
> I have trained 1300 different English handwriting fonts and moved the 
> files into the tessdata directory.
>
> I tested Tesseract using the commands below:
>
> $ convert -density 300 input.png -depth 8 -strip -background white -alpha 
> off out.tiff
>
> $ tesseract out.tiff eng
>
> The input.png uses the Alanis Handa font, which I have trained, but I'm 
> not getting even 40% accuracy.
>
> Can someone help me?
>
>
> Thanks in advance.
>



[tesseract-ocr] Re: Unable to read the circled text from an image

2018-02-16 Thread James Q
Hi Mateusz
After cleaning out the coloured circle, you could try a second Tesseract 
read with PSM Single Line. This should hopefully give you just Text1, 
allowing you to filter it from the 'Text1 Text2' read by replacing Text1 
with an empty string.

Thanks
James
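
The filtering step could look something like this minimal sketch. It assumes the two OCR results are already available as strings; the function name is made up, and real OCR output may need fuzzier matching than an exact substring.

```python
def remove_known_text(full_read, known):
    """Remove the first occurrence of `known` from `full_read` and tidy the spacing."""
    known = known.strip()
    if not known:
        return full_read.strip()
    remainder = full_read.replace(known, "", 1)
    # Collapse the whitespace left behind by the removal.
    return " ".join(remainder.split())
```

With the single-line read as `known` and the cleaned-image read as `full_read`, whatever is left over is the text that was inside the circle.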

On Friday, February 16, 2018 at 4:22:34 PM UTC, Mateusz Dudek wrote:
>
> Hello,
> In brief, I want to extract only "text2" from this image. Using 
> "-psm" did not improve the result.
>
> Now I'm doing it by:
> 1) Using Tesseract to extract text from the image, so I have: "Text 1 @" <- 
> This @ is probably because of the circle.
> 2) Then I use ImageMagick to clear the colours from the image.
> 3) I use Tesseract again, and I have "Text1 Text2".
> 4) I compare both results.
> 5) I use text replacement to delete the @.
>
> Is there a shorter way to extract only "text2"?
>
> On Monday, 12 February 2018 11:08:13 UTC+1, Mateusz 
> Dudek wrote:
>>
>> Hello,
>> I have a question about Tesseract: is it possible to read text when it's 
>> inside a circle?
>> Like here:
>>
>> 
>>
>> I tried using a monochrome filter, but then I get two words, Text1 and Text2. 
>> The point is to get only "text2".
>>
>>
>>
>>
>>



[tesseract-ocr] Re: tesseract to recognize the cropped digits

2018-02-14 Thread James Q
Tesseract prefers no noise around the image so you'll need to pre-process 
this image. For example, if you are using opencv, you could find contours 
(within the centre that have the appropriate height/width to be a digit), 
draw those onto a blank mat and send that to tesseract.
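
As a rough illustration of the filtering part only, here is a sketch that takes bounding boxes (for example from OpenCV's boundingRect on each contour) and keeps the ones that are central and digit-shaped. The thresholds are guesses that would need tuning per image set, and the function name is made up.

```python
def digit_like_boxes(boxes, image_w, image_h,
                     min_h_frac=0.3, max_h_frac=0.9,
                     min_aspect=0.2, max_aspect=1.0,
                     centre_frac=0.6):
    """Keep (x, y, w, h) boxes that are central and have digit-like proportions."""
    # Margins around the central region covering `centre_frac` of each dimension.
    margin_x = image_w * (1 - centre_frac) / 2
    margin_y = image_h * (1 - centre_frac) / 2
    kept = []
    for x, y, w, h in boxes:
        if w <= 0 or h <= 0:
            continue
        cx, cy = x + w / 2, y + h / 2
        in_centre = (margin_x <= cx <= image_w - margin_x and
                     margin_y <= cy <= image_h - margin_y)
        tall_enough = min_h_frac * image_h <= h <= max_h_frac * image_h
        digit_shaped = min_aspect <= w / h <= max_aspect
        if in_centre and tall_enough and digit_shaped:
            kept.append((x, y, w, h))
    return kept
```

The surviving boxes identify which contours to draw onto the blank image before passing it to Tesseract.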

On Wednesday, February 14, 2018 at 10:15:01 AM UTC, abhishek gupta wrote:
>
> I want to read the digit in this image, but Tesseract reports it as an empty 
> page. I tried:
>
> tesseract i.png stdout digits
>



[tesseract-ocr] Re: Unable to read the circled text from an image

2018-02-12 Thread James Q


On Monday, February 12, 2018 at 10:08:13 AM UTC, Mateusz Dudek wrote:
>
> Hello,
> I have a question about Tesseract: is it possible to read text when it's 
> inside a circle?
> Like here:
>
> 
>
> I tried using a monochrome filter, but then I get two words, Text1 and Text2. 
> The point is to get only "text2".
>

If I understand you correctly, you are successfully getting rid of the red 
circle but Tesseract is only reading the first line, not the second? If so, 
try using a different 'Page Segmentation Mode'. Try specifying 
'SPARSE_TEXT' or 'SINGLE_BLOCK'.



[tesseract-ocr] Tesseract 4 B -> R and E -> F

2018-02-05 Thread James Q
I've noticed on Tesseract 4 that on some occasions, if the first letter of 
a word is 'B' it gets interpreted by Tesseract as 'R', and if the first 
letter of a word is 'E' it gets interpreted by Tesseract as 'F'. It's as if 
the bottom horizontal stroke of the character is getting lost/ignored. Has 
anyone else noticed this and is there any way of working around it?



[tesseract-ocr] Re: Inconsistent results with slashes, even on same line

2018-02-02 Thread James Q
Assuming you are using eng.traineddata - have you tried using it with the 
dictionary off or just using osd.traineddata ?

On Friday, February 2, 2018 at 8:56:23 AM UTC, Scott Stekel wrote:
>
> In the attached images (original and preprocessed before OCR), I have some 
> lines of text which include the following:
>
>
> 
>
>    S/A  2/2
>    Map  G/1
>
> Using tesseract 3.02 (under the covers of MATLAB R2017b), when this image 
> is analyzed as a block, I get inconsistent recognition of the slashes:
>
>    SIA 2/2
>    Map GI1
>
> I find it interesting that the first slash in S/A is interpreted 
> differently from the slash in 2/2. 
>
> Any suggestions for how to get the slashes to be recognized correctly 
> everywhere?
>
> Thank you.
>
>



[tesseract-ocr] Re: tessdata_best traineddata FIles

2018-02-01 Thread James Q

Thanks Shree, So presumably then there is no Latin Script traineddata for 
Tesseract_Only mode?



[tesseract-ocr] Re: tessdata_best traineddata FIles

2018-02-01 Thread James Q


On Thursday, February 1, 2018 at 11:01:08 AM UTC, James Q wrote:
>
> The following appear to be both Latin, so can anyone tell me what the 
> difference is between:
> Latin.traineddata
> and:
> lat.traineddata
> apart from the fact that the first one is 10 times bigger?
>
> Thanks
> James
>



[tesseract-ocr] tessdata_best traineddata FIles

2018-02-01 Thread James Q
The following appear to be both Latin, so can anyone tell me what the 
difference is between:
Latin.traineddata
and:
lat.traineddata
apart from the fact that the first one is 10 times bigger?

Thanks
James



[tesseract-ocr] Re: Why the SetVariable don't work normally?

2018-01-31 Thread James Q
If you are using Tesseract 4 then whitelists/blacklists do not yet work (at 
least not in LSTM mode). I also get the impression that the 'Control 
Parameters' list you obtain by typing 'tesseract --print-parameters' on the 
command line has not been updated to reflect what Tesseract 4 actually 
supports. My advice would be to test a particular variable with 
command-line Tesseract to determine whether it is supported before trying 
to set it via an API.

The whitelists/blacklists are supposed to be supported in Tesseract 4 in 
Tesseract mode I believe, but I haven't managed to get these working at all.

Please let me know how you get on.
Thanks
James
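
One way to run that kind of command-line check, sketched in Python: build a tesseract invocation that sets the variable with -c, run it with and without the variable, and compare the text. This assumes the tesseract executable is on the PATH; the function names are made up. Identical output does not prove a variable is unsupported (it may just not matter for that image), so treat the result as a hint only.

```python
import subprocess


def tesseract_command(image, variables, psm=None, oem=None, lang="eng"):
    """Build a tesseract CLI argument list that sets config variables with -c."""
    cmd = ["tesseract", image, "stdout", "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    for name, value in variables.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


def variable_changes_output(image, name, value, **kwargs):
    """Run tesseract with and without one variable; report whether the text differs."""
    base = subprocess.run(tesseract_command(image, {}, **kwargs),
                          capture_output=True, text=True, check=True).stdout
    test = subprocess.run(tesseract_command(image, {name: value}, **kwargs),
                          capture_output=True, text=True, check=True).stdout
    return base != test
```

For the whitelist case in this thread, the variables argument would be {"tessedit_char_whitelist": "0123456789"}, tried once with oem=0 and once with oem=1.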
  


On Friday, January 26, 2018 at 7:35:37 PM UTC, 朱裕清 wrote:
>
> This is my image
>
>
> 
>
>
> And this is my code
>
> #include <iostream>
> #include <tesseract/baseapi.h>
> #include <leptonica/allheaders.h>
>
> int main()
> {
>     tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
>     api->Init(".\\tessdata", "eng");
>     Pix *image = pixRead("image.png");
>     api->SetImage(image);
>     api->SetPageSegMode(tesseract::PSM_SINGLE_CHAR);
>
>     api->SetSourceResolution(300);
>     api->SetVariable("classify_bln_numeric_mode", "1");
>     //api->SetVariable("tessedit_char_whitelist", "0123456789");
>     api->SetRectangle(61, 4, 38, 22);
>     char *outText = api->GetUTF8Text();
>     std::cout << outText << std::endl;
>     api->End();
>
>     return 0;
> }
>
>
> I get the character *A*. Why doesn't *SetVariable("classify_bln_numeric_mode", 
> "1");* work as expected? Even when I use 
> *SetVariable("tessedit_char_whitelist", "0123456789");* the result is still 
> the same. How can I read a digit from a specified rectangle?
>



[tesseract-ocr] Re: How to extract character by character using tesseract and pass it to other engine for detection.

2018-01-18 Thread James Q
I haven't done this myself, but I believe you should be able to generate a 
box file from the source image and use this to crop character subimages 
from that source image. Tesseract won't always get the boxes right though.
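
A minimal sketch of the box-file half of that idea, assuming the usual box file layout of one "<symbol> <left> <bottom> <right> <top> <page>" entry per line, with coordinates measured from the bottom-left corner of the image. It converts each entry into a top-left-origin crop rectangle of the kind most imaging libraries expect; the function names are made up.

```python
def parse_box_line(line):
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"not a box line: {line!r}")
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, left, bottom, right, top, page


def crop_rectangles(box_lines, image_height):
    """Convert box entries to (symbol, (left, upper, right, lower)), origin top-left."""
    rects = []
    for line in box_lines:
        if not line.strip():
            continue
        symbol, left, bottom, right, top, _page = parse_box_line(line)
        # Flip the y axis: box files count up from the bottom edge.
        rects.append((symbol, (left, image_height - top, right, image_height - bottom)))
    return rects
```

Each rectangle can then be used to crop a character sub-image to hand to the other engine, bearing in mind that the boxes themselves may be wrong.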

On Thursday, January 18, 2018 at 12:49:22 PM UTC, Hardik Sutaria wrote:
>
> How do I extract one character at a time and pass it to another engine, 
> say a CNN, for OCR detection? Any help would be appreciated.
> Thanks in advance
> Avinash Tiwari
>



[tesseract-ocr] Re: Criminal record JPGs: Improving image quality

2018-01-18 Thread James Q
In my experience Tesseract gives poor results with lines within the text. 
You can test this by manually whiting out the lines in a paint editor and 
retrying Tesseract with the new image. If the results improve then you 
will likely need to do this programmatically. This is not straightforward, 
though, since the lines are touching the text, but you could remove at least 
some parts of them using OpenCV methods.
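
To show the kind of operation meant, here is a pure-Python sketch that whitens long horizontal runs of black pixels in an already binarised image. It is a stand-in for what would normally be done with OpenCV morphology, not a tested recipe for these JPGs: the image representation (rows of 0/255 values) and the run-length threshold are assumptions, and any character stroke lying along a removed run gets erased with it.

```python
def remove_horizontal_runs(image, min_run):
    """Whiten horizontal runs of black pixels at least `min_run` long.

    `image` is a list of rows; each pixel is 0 (black) or 255 (white).
    Returns a new image; the input is left unchanged.
    """
    cleaned = [row[:] for row in image]
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x] == 0:
                start = x
                while x < len(row) and row[x] == 0:
                    x += 1
                # Only runs long enough to be a ruled line are removed.
                if x - start >= min_run:
                    for i in range(start, x):
                        cleaned[y][i] = 255
            else:
                x += 1
    return cleaned
```

Setting min_run well above the widest character keeps ordinary strokes intact while still catching form rules that span most of the page.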

On Thursday, January 18, 2018 at 12:49:22 PM UTC, brad.sol...@gmail.com 
wrote:
>
> Hello--I am attempting to pull full text from a few hundred JPGs that 
> contain information on death row executions hosted by the Texas Department 
> of Criminal Justice (TDCJ).
>
> Here's one example: 
> http://www.tdcj.state.tx.us/death_row/dr_info/ruizroland.jpg; another: 
> http://www.tdcj.state.tx.us/death_row/dr_info/rodrigezlionell.jpg.
>
> In raw form, the images are mostly ~840x1100, 139 KB, grayscale, with a 
> fair amount of whitespace.  
>
> Tesseract has been able to capture the field names quite well, but has had 
> trouble with the values/sequences corresponding to each field/key.  For 
> example, on the jpg above, I get:
>
> *Co-Defendants'*
> *U-l {IAIN .I'i. ‘ III! [.03 'I‘ I - I95 w. I .-II vII A I I*
> *II I U i I I o. '4 I99 0' .1“, DA. 3 I I ‘ v 9 3.), I .‘aI vlh. I*
> *II M I. {?HJI 0 I: III; '403‘I0 v. IIJ' HI. I IO.“ I II I-!*
> *{.A.‘l. .' I Ilu 'J: -. I' 3. I IIvIII I .III II*
> *0 Inn . I II I*
>
> What I have tried thus far:
> - Increasing image size & dpi significantly.
> - Pixel thresholding (from opencv)
> - Median blurring (from opencv)
> - both through the Python interface
> - Went through the Improve Quality page, 
> but it is clear I am flailing around helplessly.
>
> Appreciate any suggestions for next steps; based on the characteristics of 
> the jpgs, what transformations would be most or least useful?
>
> Thank you.
>
>



[tesseract-ocr] Re: Variables having no effect on C# Tesseract.net 4.0.0.6 wrapper

2018-01-18 Thread James Q
I think there are 9 DLL files that come with that package, beginning 
"pvt.cppan.demo...". I experimented with placing them in various locations along 
the execution path until the app worked. On my project they are now in 
'.\lib\x64' and in '..\bin\Debug'.

On Wednesday, January 10, 2018 at 1:07:28 PM UTC, James Q wrote:
>
> Here is my code:
> string text = "";
>
> string tessDataPath = ConfigurationManager.AppSettings["TessPath"];
> using (var engine = new TessBaseAPI(@tessDataPath, @"eng"))
> {
>     engine.SetVariable("tessedit_ocr_engine_mode", "0");
>     engine.SetPageSegMode(PageSegmentationMode.SINGLE_LINE);
>     engine.SetVariable("tessedit_char_blacklist", type.GetTesseractOptions().Blacklist());
>     engine.SetVariable("tessedit_char_whitelist", type.GetTesseractOptions().Whitelist());
>     engine.Process(imageFileName, false);
>     text = engine.GetUTF8Text();
> }
>
> I'm sending images which represent one or a few words on a single line, 
> but in the above code, the SetPageSegMode(..) method has no effect. On the 
> command line I can use:
> 
> tesseract.exe input.png result -l eng --psm 7 --oem 1
>
> on the same images and see clearly better results with psm 7. Does anyone 
> know how to configure this option via the wrapper, or is it just not 
> supported?
>
> Also, blacklists and whitelists are having no effect in the wrapper. 
> Whilst I understand that these are not supported in Tesseract 4 LSTM mode 
> yet, they should still work in 'Tesseract Only' mode right? I know the 
> SetVariable method works (as I see its effect on engine mode). Is there 
> another way of setting blacklists and whitelists through this wrapper?
>
> Thanks
> James 
>



[tesseract-ocr] Re: Variables having no effect on C# Tesseract.net 4.0.0.6 wrapper

2018-01-15 Thread James Q
Note that the Charles Weld Tesseract 3 wrapper works well with varying 
these values, so I am trying to get the Tesseract 4 version of that working 
which basically has the same API. For now though the Tesseract.net 4.0.0.6 
one is the only 4.0 wrapper that works for me, hence this post.

On Thursday, January 11, 2018 at 7:15:58 PM UTC, James Q wrote:
>
> Is anyone else using tesseract 4.0alpha from C# ?
>
> On Wednesday, January 10, 2018 at 1:07:28 PM UTC, James Q wrote:
>>
>> Here is my code:
>> string text = "";
>>
>> string tessDataPath = ConfigurationManager.AppSettings["TessPath"];
>> using (var engine = new TessBaseAPI(@tessDataPath, @"eng"))
>> {
>>     engine.SetVariable("tessedit_ocr_engine_mode", "0");
>>     engine.SetPageSegMode(PageSegmentationMode.SINGLE_LINE);
>>     engine.SetVariable("tessedit_char_blacklist", type.GetTesseractOptions().Blacklist());
>>     engine.SetVariable("tessedit_char_whitelist", type.GetTesseractOptions().Whitelist());
>>     engine.Process(imageFileName, false);
>>     text = engine.GetUTF8Text();
>> }
>>
>> I'm sending images which represent one or a few words on a single line, 
>> but in the above code, the SetPageSegMode(..) method has no effect. On the 
>> command line I can use:
>> 
>> tesseract.exe input.png result -l eng --psm 7 --oem 1
>>
>> on the same images and see clearly better results on psm 7. Does anyone 
>> know how to configure this option via the wrapper or is it just not 
>> supported?
>>
>> Also, blacklists and whitelists are having no effect in the wrapper. 
>> Whilst I understand that these are not supported in Tesseract 4 LSTM mode 
>> yet, they should still work in 'Tesseract Only' mode right? I know the 
>> SetVariable method works (as I see its effect on engine mode). Is there 
>> another way of setting blacklists and whitelists through this wrapper?
>>
>> Thanks
>> James 
>>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/21b1fafd-b122-43d8-bc79-613b19a066e6%40googlegroups.com.


[tesseract-ocr] Re: Need help to improve quality

2018-01-15 Thread James Q
Have you tried using the OEM_TESSERACT_CUBE_COMBINED engine mode?

On Sunday, January 14, 2018 at 7:01:17 AM UTC, conman wrote:
>
> Hello!
>
> After trying out a lot I would like to ask for help on improving my OCR 
> results.
>
> I am using tesseract 3.05.01 and have experimented with different PSMs - 
> results vary from "not usable at all" to "going in the right direction but 
> still not ok".
>
> For example with psm 11 result is:
>
> Amsterdam
> Boston
> Ball
> Auckland
> Amalya
>
> -> many lines are missing.
>
> I have attached the tessinput.tif 
>
> Thanks for your support!
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/eb623f52-2f57-43bd-bdcb-48d1c33e7bbb%40googlegroups.com.


[tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-15 Thread James Q
You are correct, the runtime version is 140. That doesn't appear to be my 
problem though as x64 Dependency Walker finds this DLL. It fails to find 
several DLLs though which begin 'API-MS-WIN-CORE...'. I would have expected 
these to be present by way of the Win10 SDK but they are not. I tried 
Charles Weld's newer version from Git and got the same issues with that but 
I haven't tried the trace feature in Charles' version yet so will try that.

Thanks for responding to this - much appreciated.
James

On Friday, January 12, 2018 at 2:45:06 PM UTC, THintz wrote:
>
> I built those DLLs with VS 2017.  I think the run-time lib version is 140.
>
> There are 3 main reasons the libs fail to load.
>
> 1. The DLLs are in the wrong folders.  The correct folders are:
>  The .Net wrapper DLL assembly is placed in the exact same folder you 
> run your app from, and the other 2 are placed in a folder x86 or x64 
> located in the app's folder.  The DLLs I created are only x64.
>
> 2. You are missing a dependency.  See 
> https://github.com/charlesw/tesseract/issues/363 for examples tracing 
> this.  If you must resort to procmon to figure this out then you need to be 
> prepared to read tea leaves.
>
> 3. The wrong .Net wrapper is used.  Mr. Weld's wrapper has a trace feature 
> that enables you to see a mismatch between the wrapper and the DLLs.  A mismatch 
> occurs when the wrapper tries to bind to the public interface of Leptonica 
> and the two differ.
>
> The DLLs I created have been superseded in a branch of charlesw/tesseract 
> on github.  There is actually not much functional difference between the 
> two, at this time, but you might find it easier to work with the newer 
> branch. 
>
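
The folder layout from reason 1 above can be checked with a few lines of C#. This is only a sketch under stated assumptions: the class and method names (`NativeDllLayout`, `FindMissing`) are made up for illustration, and the DLL file names are the ones mentioned in this thread. It only tests that the files are present; it cannot detect a missing transitive dependency (reason 2) or a wrapper mismatch (reason 3).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any Tesseract wrapper.
public static class NativeDllLayout
{
    // Native DLL names as quoted in this thread; other builds may differ.
    private static readonly string[] NativeDlls =
    {
        "liblept1741.dll",
        "libtesseract400.dll"
    };

    // Returns the full paths of the files that are NOT where the layout
    // described above expects them: Tesseract.dll next to the executable,
    // the native DLLs in an x64 (or x86) subfolder of that directory.
    public static List<string> FindMissing(string appDir, string platformFolder = "x64")
    {
        var missing = new List<string>();

        string wrapper = Path.Combine(appDir, "Tesseract.dll");
        if (!File.Exists(wrapper))
        {
            missing.Add(wrapper);
        }

        foreach (string dll in NativeDlls)
        {
            string path = Path.Combine(appDir, platformFolder, dll);
            if (!File.Exists(path))
            {
                missing.Add(path);
            }
        }

        return missing;
    }
}
```

Calling `NativeDllLayout.FindMissing(AppDomain.CurrentDomain.BaseDirectory)` at startup and logging the result should show quickly whether the problem is the layout or something deeper.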

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/81dcb485-146e-46b0-a5ad-fc5c858a0e78%40googlegroups.com.


[tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-12 Thread James Q
Thanks for the reply. In my project I have tried all 3 DLLs in all 
potential folders as follows:

D:\csharp\repos\cwtess4
|__ D:\csharp\repos\cwtess4\cwtess4
|   |__ liblept1741.dll
|   |__ libtesseract400.dll
|   |__ Tesseract.dll
|__ D:\csharp\repos\cwtess4\cwtess4\bin
|   |__ liblept1741.dll
|   |__ libtesseract400.dll
|   |__ Tesseract.dll
|__ D:\csharp\repos\cwtess4\cwtess4\bin\x64
|   |__ liblept1741.dll
|   |__ libtesseract400.dll
|   |__ Tesseract.dll
|__ D:\csharp\repos\cwtess4\cwtess4\bin\x64\Release
|   |__ liblept1741.dll
|   |__ libtesseract400.dll
|   |__ Tesseract.dll
|__ D:\csharp\repos\cwtess4\cwtess4\bin\x64\Debug
    |__ liblept1741.dll
    |__ libtesseract400.dll
    |__ Tesseract.dll

This exception appears at runtime:
System.Reflection.TargetInvocationException: 'Exception has been thrown by 
the target of an invocation.'
Inner Exception
DllNotFoundException: Failed to find library "liblept1741.dll" for platform 
x64.

I see on the page you mention that a common reason might be "The Visual 
Studio 2015 C++ runtime" not being installed, but I am using Visual Studio 
2017 which doesn't allow me to install the earlier version of the runtime.


On Thursday, January 11, 2018 at 8:54:19 PM UTC, THintz wrote:
>
> See https://github.com/charlesw/tesseract/wiki/Error-2
>
> The Tesseract.dll goes in the folder with your binary and the other two 
> dlls go in either an x64 or an x86 folder below that.
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/9e923af7-d81c-4e7a-8424-dc0dfe3539e0%40googlegroups.com.


[tesseract-ocr] Re: Variables having no effect on C# Tesseract.net 4.0.0.6 wrapper

2018-01-11 Thread James Q
Is anyone else using tesseract 4.0alpha from C# ?

On Wednesday, January 10, 2018 at 1:07:28 PM UTC, James Q wrote:
>
> Here is my code:
> string text = "";
>
> string tessDataPath = ConfigurationManager.AppSettings["TessPath"];
> using (var engine = new TessBaseAPI(@tessDataPath, @"eng"))
> {
>     engine.SetVariable("tessedit_ocr_engine_mode", "0");
>     engine.SetPageSegMode(PageSegmentationMode.SINGLE_LINE);
>     engine.SetVariable("tessedit_char_blacklist", type.GetTesseractOptions().Blacklist());
>     engine.SetVariable("tessedit_char_whitelist", type.GetTesseractOptions().Whitelist());
>     engine.Process(imageFileName, false);
>     text = engine.GetUTF8Text();
> }
>
> I'm sending images which represent one or a few words on a single line, 
> but in the above code, the SetPageSegMode(..) method has no effect. On the 
> command line I can use:
> 
> tesseract.exe input.png result -l eng --psm 7 --oem 1
>
> on the same images and see clearly better results on psm 7. Does anyone 
> know how to configure this option via the wrapper or is it just not 
> supported?
>
> Also, blacklists and whitelists are having no effect in the wrapper. 
> Whilst I understand that these are not supported in Tesseract 4 LSTM mode 
> yet, they should still work in 'Tesseract Only' mode right? I know the 
> SetVariable method works (as I see its effect on engine mode). Is there 
> another way of setting blacklists and whitelists through this wrapper?
>
> Thanks
> James 
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/bbd58c56-8ba2-414b-b834-6775c1b49565%40googlegroups.com.


[tesseract-ocr] Re: Very inaccurate output - need help

2018-01-11 Thread James Q
This image is quite skewed. I suggest you straighten it and binarize it 
before passing it to tesseract.
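
To make the binarization step concrete, here is a minimal sketch. It assumes the image has already been converted to 8-bit grayscale and is available as a flat `byte[]`, and it uses Otsu's method to pick a global threshold. That choice is an assumption for illustration, not something prescribed in this thread; the names `Binarizer`, `OtsuThreshold` and `Binarize` are hypothetical. Straightening (deskewing) is not covered here.

```csharp
using System;

// Hypothetical helper operating on raw 8-bit grayscale pixel values.
public static class Binarizer
{
    // Returns the threshold (0-255) that maximizes the between-class
    // variance of the two pixel groups (Otsu's method).
    public static int OtsuThreshold(byte[] gray)
    {
        var hist = new long[256];
        foreach (byte p in gray)
        {
            hist[p]++;
        }

        long total = gray.Length;
        double sumAll = 0;
        for (int i = 0; i < 256; i++)
        {
            sumAll += i * (double)hist[i];
        }

        double sumDark = 0;
        double bestVariance = -1;
        long weightDark = 0;
        int best = 0;

        for (int t = 0; t < 256; t++)
        {
            weightDark += hist[t];
            if (weightDark == 0) continue;      // no pixels at or below t yet

            long weightLight = total - weightDark;
            if (weightLight == 0) break;        // all pixels are at or below t

            sumDark += t * (double)hist[t];
            double meanDark = sumDark / weightDark;
            double meanLight = (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            double variance = (double)weightDark * weightLight * diff * diff;

            if (variance > bestVariance)
            {
                bestVariance = variance;
                best = t;
            }
        }

        return best;
    }

    // Maps every pixel above the Otsu threshold to 255 and the rest to 0.
    public static byte[] Binarize(byte[] gray)
    {
        int threshold = OtsuThreshold(gray);
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
        {
            result[i] = gray[i] > threshold ? (byte)255 : (byte)0;
        }
        return result;
    }
}
```

A single global threshold tends to work on evenly lit scans; unevenly lit photos may need an adaptive method instead.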

On Thursday, January 11, 2018 at 10:14:13 AM UTC, Zaheer Javi wrote:
>
> Hi,
>
> I'm trying to apply tesseract with the attached file, however I get back 
> extremely low accuracy. Have tried increasing dpi to 300 as well as tried 
> changing the image to binary but no luck, can you please assist...
>
> This is what I get as output...
>
> Notes lo the Annual Financial Statements
>
> Page 12
>
> Group Company
> 21117 zma 21:17 201 s
> a I: a n
> 2 Pvvpelly. pm: .114 eqmpmenl
> Gvaup 21:11 zm s
> cm Acumwlalad Carrymg value Cast ;1m,.11..1a1aa Carrymg vaiue
> dupvemahon depmu::a11ov1
> Eqmpmen|—bu1Wnw5 21.551515 17s.a17m1 14 1401727 as 9551957 Lss,sa7 435) 
> 153214.531
> Fumilure ar1d11xIun:s 1.:«an,3ns 11.m,sns1 wee mmma (1 135312941 12.914
> MaIa(veNc\es 51.512 1s1,s1s1 7 51,515 (511618) 7
> 11 euulpvvvem 221.900 .z1s.ses1 mm 22119111: 12m us) may
> beasal1n\d1mp:memems 4a , 4 1151.22‘; 7 631124 14511224; 7
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6ef66e08-e769-42d7-9239-e60e7d070016%40googlegroups.com.


[tesseract-ocr] Variables having no effect on C# Tesseract.net 4.0.0.6 wrapper

2018-01-10 Thread James Q
Here is my code:
string text = "";

string tessDataPath = ConfigurationManager.AppSettings["TessPath"];
using (var engine = new TessBaseAPI(@tessDataPath, @"eng"))
{
    // 0 = legacy "Tesseract only" engine (no LSTM)
    engine.SetVariable("tessedit_ocr_engine_mode", "0");
    // SINGLE_LINE corresponds to --psm 7 on the command line
    engine.SetPageSegMode(PageSegmentationMode.SINGLE_LINE);
    engine.SetVariable("tessedit_char_blacklist", type.GetTesseractOptions().Blacklist());
    engine.SetVariable("tessedit_char_whitelist", type.GetTesseractOptions().Whitelist());
    engine.Process(imageFileName, false);
    text = engine.GetUTF8Text();
}

I'm sending images which represent one or a few words on a single line, but 
in the above code, the SetPageSegMode(..) method has no effect. On the 
command line I can use:

tesseract.exe input.png result -l eng --psm 7 --oem 1

on the same images and see clearly better results on psm 7. Does anyone 
know how to configure this option via the wrapper or is it just not 
supported?

Also, blacklists and whitelists are having no effect in the wrapper. Whilst 
I understand that these are not supported in Tesseract 4 LSTM mode yet, 
they should still work in 'Tesseract Only' mode right? I know the 
SetVariable method works (as I see its effect on engine mode). Is there 
another way of setting blacklists and whitelists through this wrapper?

Thanks
James 

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/19eaed8d-7fdc-4b6c-b803-5d23cb4dd49a%40googlegroups.com.


Re: [tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-08 Thread James Q
Thanks for the reply ShreeDevi, I haven't found much in the way of 
documentation to say which options are supported in 4.0alpha compared to 
3.0x. I ran "tesseract.exe --print-parameters" and both 
"tessedit_char_whitelist" / "tessedit_char_blacklist" were still in the 
list. I therefore assumed they were still supported. 

Could you please let me know how to find out which options are still 
supported in 4.0?

Is there an alternative option to tell tesseract to exclude certain 
characters? In my case I have a number format which has numbers and letters 
but never capital O.

Thanks
James
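
One possible fallback while the engine ignores these variables is to filter the recognized text afterwards. The sketch below is only an illustration of that idea; `OcrTextFilter` and `Filter` are hypothetical names, not part of Tesseract or any wrapper. Note that this is weaker than a real blacklist: the engine would normally pick its next-best character, whereas this simply drops the disallowed one.

```csharp
using System;
using System.Linq;

// Hypothetical post-processing helper applied to the OCR result string.
public static class OcrTextFilter
{
    // Keeps a character only if it is in the whitelist (when one is given)
    // and not in the blacklist (when one is given).
    public static string Filter(string text, string whitelist = null, string blacklist = null)
    {
        if (text == null)
        {
            throw new ArgumentNullException(nameof(text));
        }

        var kept = text.Where(c =>
            (string.IsNullOrEmpty(whitelist) || whitelist.IndexOf(c) >= 0) &&
            (string.IsNullOrEmpty(blacklist) || blacklist.IndexOf(c) < 0));

        return new string(kept.ToArray());
    }
}
```

For the number format described above, `OcrTextFilter.Filter(text, blacklist: "O")` would strip any capital O from the result; whether a dropped character or a substitution is the better repair depends on the data.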

On Monday, January 8, 2018 at 11:26:56 AM UTC, shree wrote:
>
> tesseract 4 alpha does not support whitelist/blacklist.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Mon, Jan 8, 2018 at 4:52 PM, ShreeDevi Kumar wrote:
>
>> please see https://github.com/charlesw/tesseract/issues/306
>>
>> maybe the fix there will help.
>>
>> ShreeDevi
>> ____
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Jan 8, 2018 at 3:33 PM, James Q wrote:
>>
>>> By the way I do have the Tesseract.net nuget package working ( 
>>> https://www.nuget.org/packages/tesseract.net/ ), but have 2 issues with 
>>> this:
>>> 1.) I need to write a separate Bitmap -> Pix converter in C#
>>> 2.) I haven't yet got whitelists/blacklists working
>>>
>>> Neither of these were issues with the tesseract 3 Charles Weld wrapper, 
>>> hence my reason for trying to get the tdhintz one working (as this is based 
>>> on Charles Weld's 3 wrapper).
>>> Thanks
>>> James
>>>
>>> On Monday, January 8, 2018 at 7:49:43 AM UTC, Mohammad Mahdizadeh wrote:
>>>>
>>>> I have the same problem 
>>>>
>>>>
>>>> On Friday, January 5, 2018 at 8:38:08 PM UTC+3:30, James Q wrote:
>>>>>
>>>>> I'm trying to use this wrapper:
>>>>> https://github.com/tdhintz/tesseract4win64
>>>>>
>>>>> It's an x64 .Net assembly with one main DLL (Tesseract.dll) and two 
>>>>> dependency DLLs (liblept1741.dll and libtesseract400.dll). To start with 
>>>>> I'm just trying to get a Visual Studio console app running. I've added 
>>>>> Tesseract.dll in as a reference but it fails to recognize the dependency 
>>>>> DLLs, throwing a runtime DllNotFoundException: "Failed to find library 
>>>>> "liblept1741.dll" for platform x64.".
>>>>>
>>>>> I've tried placing the DLLs in the .\bin\x64\Debug folder and 
>>>>> elsewhere along the project structure but no luck! I've tried manually 
>>>>> adding them to an ItemGroup in the csproj file with 
>>>>> 'CopyToOutputDirectory Always'. I've also tried setting 
>>>>> TesseractEnviornment.CustomSearchPath in my Main class, but although 
>>>>> the runtime searches in the correct folders, it still doesn't find the 
>>>>> DLLs. My app is for x64 so the image type should match. I can't think 
>>>>> of what else to try.
>>>>>
>>>>> If anyone has this working I would greatly appreciate any advice.
>>>>>
>>>>> Thanks in advance
>>>>> James
>>>>>
>>>>>
>>>
>>
>>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/2f27ffd6-b3e5-4e56-bd35-58741e924c27%40googlegroups.com.


[tesseract-ocr] Re: I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-08 Thread James Q
By the way I do have the Tesseract.net nuget package working ( 
https://www.nuget.org/packages/tesseract.net/ ), but have 2 issues with 
this:
1.) I need to write a separate Bitmap -> Pix converter in C#
2.) I haven't yet got whitelists/blacklists working

Neither of these were issues with the tesseract 3 Charles Weld wrapper, 
hence my reason for trying to get the tdhintz one working (as this is based 
on Charles Weld's 3 wrapper).
Thanks
James

On Monday, January 8, 2018 at 7:49:43 AM UTC, Mohammad Mahdizadeh wrote:
>
> I have the same problem 
>
>
> On Friday, January 5, 2018 at 8:38:08 PM UTC+3:30, James Q wrote:
>>
>> I'm trying to use this wrapper:
>> https://github.com/tdhintz/tesseract4win64
>>
>> It's an x64 .Net assembly with one main DLL (Tesseract.dll) and two 
>> dependency DLLs (liblept1741.dll and libtesseract400.dll). To start with 
>> I'm just trying to get a Visual Studio console app running. I've added 
>> Tesseract.dll in as a reference but it fails to recognize the dependency 
>> DLLs, throwing a runtime DllNotFoundException: "Failed to find library 
>> "liblept1741.dll" for platform x64.".
>>
>> I've tried placing the DLLs in the .\bin\x64\Debug folder and elsewhere 
>> along the project structure but no luck! I've tried manually adding them to 
>> an ItemGroup in the csproj file with 'CopyToOutputDirectory Always'. I've 
>> also tried setting TesseractEnviornment.CustomSearchPath in my Main class, 
>> but although the runtime searches in the correct folders, it still doesn't 
>> find the DLLs. My app is for x64 so the image type should match. I can't 
>> think of what else to try.
>>
>> If anyone has this working I would greatly appreciate any advice.
>>
>> Thanks in advance
>> James
>>
>>
>>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/13d63957-ecfc-4451-833f-ad6d23b76b01%40googlegroups.com.


[tesseract-ocr] Re: why tesseract always detect number 6 to number 8 in a good image??

2018-01-05 Thread James Q
I had the same problem. I edited the csproj file so that the content item 
for each DLL is always copied to the output directory, like this:

<Content Include="...">
  <CopyToOutputDirectory>Always</CopyToOutputDirectory>
</Content>

After that it worked for me, but I still have trouble converting a Bitmap to a 
Pix.
Thanks
James

On Tuesday, January 2, 2018 at 7:27:54 AM UTC, Bau auto wrote:
>
> How do I use Tesseract 4.0 in C#?
> I set it up following https://www.nuget.org/packages/tesseract.net/4.0.0.6 
> but it failed to add a reference to 
> 'pvt.cppan.demo.danbloomberg.leptonica-1.74.4'.
>
> On Friday, December 29, 2017 at 15:53:40 UTC+7, pranaya mhatre wrote:
>>
>> Hi,
>>
>> Tesseract 4 eng.traineddata is giving correct result.
>>
>> eng.traineddata link :
>> https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata 
>>
>>
>> Thank you
>>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/c02723f6-0819-497e-8663-53f15123cdd1%40googlegroups.com.


[tesseract-ocr] I Need help getting Tesseract 4.0 C# .Net Wrapper working please!

2018-01-05 Thread James Q
I'm trying to use this wrapper:
https://github.com/tdhintz/tesseract4win64

It's an x64 .Net assembly with one main DLL (Tesseract.dll) and two 
dependency DLLs (liblept1741.dll and libtesseract400.dll). To start with 
I'm just trying to get a Visual Studio console app running. I've added 
Tesseract.dll in as a reference but it fails to recognize the dependency 
DLLs, throwing a runtime DllNotFoundException: "Failed to find library 
"liblept1741.dll" for platform x64.".

I've tried placing the DLLs in the .\bin\x64\Debug folder and elsewhere 
along the project structure but no luck! I've tried manually adding them to 
an ItemGroup in the csproj file with 'CopyToOutputDirectory Always'. I've 
also tried setting TesseractEnviornment.CustomSearchPath in my Main class, 
but although the runtime searches in the correct folders, it still doesn't 
find the DLLs. My app is for x64 so the image type should match. I can't 
think of what else to try.

If anyone has this working I would greatly appreciate any advice.

Thanks in advance
James


To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1cf4bd7d-5e41-46cd-8443-44cd2f391492%40googlegroups.com.