Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
> The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly. Oh, I remember now. I had changed that for ease in renaming files for some reason. > In this way can I train a model that, for example, only recognizes uppercase characters, or numbers

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
I think I found the problem. Running the new Makefile directly, I got this error: make: *** No rule to make target 'data/train/alexis_ruhe01_1852_0018_022.box', needed by 'data/all-boxes'. Stop. The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script dir
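
[Editor's sketch] The fix described above amounts to renaming the ground-truth files so they end in ".gt.txt" instead of "-gt.txt". A minimal Python sketch of that rename follows. The function name `fix_gt_suffix` and the directory argument are illustrative assumptions, not part of ocrd-train or of the Makefile discussed in this thread.

```python
from pathlib import Path


def fix_gt_suffix(train_dir, old_suffix="-gt.txt", new_suffix=".gt.txt"):
    """Rename every '*-gt.txt' file in train_dir to '*.gt.txt'.

    Hypothetical helper: returns the new file names, in sorted order.
    Files whose target name already exists are left untouched.
    """
    renamed = []
    for path in sorted(Path(train_dir).glob(f"*{old_suffix}")):
        # Strip the old suffix from the file name and append the new one.
        target = path.with_name(path.name[: -len(old_suffix)] + new_suffix)
        if target.exists():
            continue  # never overwrite an existing ground-truth file
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

Called as `fix_gt_suffix("data/train")`, it would turn a name like alexis_ruhe01_1852_0018_022-gt.txt into alexis_ruhe01_1852_0018_022.gt.txt, assuming the ground-truth files live in that directory.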

[tesseract-ocr] Extracting some text and numbers from pdf

2018-06-29 Thread May
Hello, the following are the word and number combinations that I am trying to extract from a pdf that has many other words and data. Sometimes,

Re: [tesseract-ocr] wron Characters in LibreOffice Writer with German spezial Characters

2018-06-29 Thread Zdenko Podobny
This is not a Tesseract problem: https://ask.libreoffice.org/en/question/97993/why-doesnt-lo-writer-open-and-save-text-documents-encoded-in-utf-8-without-bom-any-plans-to-fix-this-soon/ Tesseract output is UTF-8 encoded. Zdenko On Fri, 29 Jun 2018 at 19:37, Martin Jenniges wrote: > Hello, > > whe
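
[Editor's sketch] The linked question concerns UTF-8 text files that carry no byte order mark (BOM). One possible workaround, sketched below in Python, is to prepend a UTF-8 BOM to the Tesseract output before opening it in Writer. That Writer will then detect the encoding is an assumption based on the linked question, and the function name `add_utf8_bom` is hypothetical.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src, dst):
    """Copy src to dst, prepending a UTF-8 BOM if one is not already present.

    Raises UnicodeDecodeError if src is not valid UTF-8.
    """
    data = Path(src).read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        data.decode("utf-8")  # validate only; the bytes themselves are kept
        data = codecs.BOM_UTF8 + data
    Path(dst).write_bytes(data)
```

For example, `add_utf8_bom("out.txt", "out_bom.txt")` leaves the text itself unchanged and only adds the three marker bytes at the start of the file.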

[tesseract-ocr] wron Characters in LibreOffice Writer with German spezial Characters

2018-06-29 Thread Martin Jenniges
Hello, when I open the TXT file, which Tesseract created in the Windows command prompt, with LibreOffice Writer, the German special characters (ü, ö, ä, etc.) are wrong. As a workaround, I open the TXT file in Notepad++ and copy and paste the text into Writer. Can I do anything so that LibreOffice Writer op

Re: [tesseract-ocr] How to improve quality?

2018-06-29 Thread Dattatraya Tembare
You can also use - import java.awt.Rectangle; public String ocrText(File file, String lang, ImageGeometry geometry) { String resultText = null; Tesseract instance = getTesseractInstance("TesseractEnvPath", "eng"); // define an equal or smaller region of interest on the image. Follow: // x-scale, y

Re: [tesseract-ocr] How come tesseract 4.0 misses, what am I missing here?

2018-06-29 Thread Dattatraya Tembare
You could do the image editing using ImageMagick (command line or Java API). On Thursday, June 28, 2018 at 4:42:55 PM UTC-4, cohen...@gmail.com wrote: > > Thank you Shree!! :) > > OK, after rotating it, > tesseract hasn't succeeded in retrieving the text. > > *BUT* I kept experimenting with convert app (part o

Re: [tesseract-ocr] How to improve quality?

2018-06-29 Thread Dattatraya Tembare
"C" is missing in the text because tesseract doesn't have enough margin to read the text. Require proper margin. On Friday, June 29, 2018 at 12:39:06 PM UTC-4, Dattatraya Tembare wrote: > > Hello Hari, > I faced the same problem. > > When there are 2 different type of fonts, Tesseract doesn't

Re: [tesseract-ocr] How to improve quality?

2018-06-29 Thread Dattatraya Tembare
Hello Hari, I faced the same problem. When there are two different types of fonts, Tesseract doesn't recognize the text properly. It recognizes the first text and ignores the next text if its font size is bigger than that of the first one. I resolved it by cropping the image into 2 pieces. I'm using ImageMagick (Java API) to
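
[Editor's sketch] The cropping step above can be illustrated by computing the two crop regions as ImageMagick geometry strings (WxH+X+Y). This Python sketch only builds the strings and does not call ImageMagick. The horizontal split, the function name `split_crop_geometries` and the `split_y` parameter are assumptions for illustration, not the poster's actual code.

```python
def split_crop_geometries(width, height, split_y):
    """Return two ImageMagick-style geometry strings (WxH+X+Y).

    The first covers the region above the horizontal line y = split_y,
    the second the region below it.
    """
    if not 0 < split_y < height:
        raise ValueError("split_y must lie strictly inside the image height")
    top = f"{width}x{split_y}+0+0"
    bottom = f"{width}x{height - split_y}+0+{split_y}"
    return top, bottom
```

Each string could then be passed to ImageMagick's -crop option, for example `convert input.png -crop 800x60+0+0 +repage top.png`, and the two pieces run through Tesseract separately.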

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
You should be able to use the new makefile after you make changes for all the directory locations to match your setup. Change the language from frk to eng, though the sample training text seems to be non-English. In that case it is better for you to use the appropriate language traineddata, e.g. te

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
Hi Shree, thanks for your answer. I tried the script setting: TESSDATA=extracted # here I have the eng.lstm and eng.traineddata LANGDATA=langdata-master # all langdata downloaded by OCR-D MODEL_NAME = eng CONTINUE_FROM = eng First I ran the old Makefile to create the boxes.

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
I modified the makefile for ocrd-train to do fine-tuning. It is pasted below: export SHELL := /bin/bash LOCAL := $(PWD)/usr PATH := $(LOCAL)/bin:$(PATH) HOME := /home/ubuntu TESSDATA = $(HOME)/tessdata_best LANGDATA = $(HOME)/langdata # Name of the model to be built MODEL_NAME = frk # Name of

[tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
Hi, I'm trying to do fine tuning of an existing model using line images and text labels. I'm running this version: tesseract 4.0.0-beta.3-56-g5fda leptonica-1.76.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0