Re: [tesseract-ocr] Re: Training from Scratch

2023-11-27 Thread Lorenzo Bolzani
Hi Simon, yes, I think the instructions you can give to the segmentation step are quite limited, mostly the PSM parameter and I suppose a few minor ones. There is something about tables but I've never used it and yours might be too small for this to work. Yes, you should be able to see what is happ

Re: [tesseract-ocr] Re: Training from Scratch

2023-11-24 Thread Lorenzo Bolzani
Hi Simon, if I understand correctly how tesseract works, it follows these steps: - it segments the image into lines of text - it then takes each individual line and slides a small window, 1px wide I think, over it, from one end to the other. For each step the model outputs a prediction. The model,

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-09-13 Thread Lorenzo Bolzani
I'm not 100% sure but, if I remember correctly, one iteration, in this context, means one image so with 15 iterations you did not even use the whole dataset. Also, especially when training from scratch, you likely need to pass over the whole dataset multiple times. You should let the training
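To make the arithmetic concrete, here is a small sketch. It assumes, as the message suggests but does not confirm, that one iteration consumes exactly one training image. The dataset size of 1000 images is a made-up number, and passes_over_dataset is a hypothetical helper, not part of any Tesseract tool.

```python
def passes_over_dataset(iterations: int, num_images: int) -> float:
    """Return how many times the training set has been seen.

    Assumes one iteration processes exactly one image (an assumption
    taken from the message above, not a documented guarantee).
    """
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return iterations / num_images


# Hypothetical dataset of 1000 line images.
few = passes_over_dataset(15, 1000)      # far less than one full pass
many = passes_over_dataset(10000, 1000)  # ten full passes
print(few, many)
```

Under this assumption, 15 iterations touch only a small fraction of the images, which is why letting the training run much longer is the first thing to try.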

Re: [tesseract-ocr] Re: Trying to understand why Tesseract-ocr fails on some images

2023-07-27 Thread Lorenzo Bolzani
Hi Nor, I would crop the text as tight as possible, in this way you control exactly the text region (see the attached image). Also try adding a white border of 1 or 2 pixels later, and see if this works best. The image you sent is not pure black and white, so maybe the automatic cropping gets confus

Re: [tesseract-ocr] Russian + English characters recognition

2023-02-28 Thread Lorenzo Bolzani
Hi, try rus+eng as a language or eng+rus and see what works best. You can also use more than two languages. Or run both languages separately and keep the result with the highest confidence score. You could also consider the location of the text on the page to decide. Lorenzo On Tue 28
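The "run each language separately and keep the most confident result" idea can be sketched as below. The recognizer is passed in as a callable with a hypothetical signature (language string in, text and confidence out), so the sketch does not assume any particular binding; fake_recognize and its scores are invented purely to show the control flow.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical signature: takes a language string, returns (text, confidence).
Recognizer = Callable[[str], Tuple[str, float]]


def best_by_confidence(recognize: Recognizer,
                       languages: Iterable[str]) -> Tuple[str, str, float]:
    """Run OCR once per language setting and keep the most confident result."""
    best = None
    for lang in languages:
        text, conf = recognize(lang)
        if best is None or conf > best[2]:
            best = (lang, text, conf)
    if best is None:
        raise ValueError("no languages given")
    return best


# Stand-in recognizer with made-up results, only for illustration.
def fake_recognize(lang: str) -> Tuple[str, float]:
    canned = {
        "rus": ("result A", 91.0),
        "eng": ("result B", 42.0),
        "rus+eng": ("result C", 93.5),
    }
    return canned[lang]


print(best_by_confidence(fake_recognize, ["rus", "eng", "rus+eng"]))
```

In a real setup the callable would wrap whatever API is in use and return its mean confidence for the page or region.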

Re: [tesseract-ocr] Re: Optimal image resolution (dpi/ppi) for Tesseract 4.0.0 and eng.traineddata?

2023-02-22 Thread Lorenzo Bolzani
Looks like the "fast" models are better than or on par with the "best" ones and more robust. Or is there a difference in the 20-40 range that is not visible from the chart at this resolution? Thanks, Lorenzo On Tue, 21 Feb 2023 at 22:22, wil...@gmail.com wrote: > Sorry it took a

Re: [tesseract-ocr] Difficult image, any tips would be appreciated

2022-11-13 Thread Lorenzo Bolzani
I did it by hand with Gimp. The code depends on what you know about the image. If it is fixed size and fixed location you can easily do this, for example, with python and opencv: crop, invert header, two different thresholds. If the size/alignment are not fixed you could use SIFT to align the ima

Re: [tesseract-ocr] Difficult image, any tips would be appreciated

2022-11-13 Thread Lorenzo Bolzani
Hi Chris, you should try to get something like this: [image: temp2b.jpg] I inverted the headers section and then applied two different thresholds, one on each part. If you are not interested in the titles you can just crop them out. The image is blurry, maybe it was upscaled a little? If so, try differe

Re: [tesseract-ocr] Improve text extraction

2022-07-22 Thread Lorenzo Bolzani
Hi Atef, I think your best option is to generate a lot of images as bad as this one and use them for training. So you take the good images (with the corresponding text), thousands, and ruin/blur them in many different ways. In this way, for example, from 1000 good images you get 5000/1 bad ima

Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

2022-06-24 Thread Lorenzo Bolzani
Hi Yash, please see the example at the bottom of this page: https://github.com/sirfz/tesserocr and this issue about the versions (I think you need version 5.x): https://github.com/sirfz/tesserocr/issues/166 If you have problems with tesserocr make sure it matches the tesseract version it was c

Re: [tesseract-ocr] Tesseract confused between a character and a digit which look-alike

2022-06-07 Thread Lorenzo Bolzani
Hi Yash, in my experience you are going to see a lot of these errors on similar characters. Given only the preprocessed text, I might make the same mistake myself. What I do is fix these letters according to a pattern, in this case WDDD, and I replace: S <-> 8 O <-> 0 I <-> 1 i <-> 1
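A minimal sketch of the pattern-based fix described above. The reading of the pattern (W = a letter position, D = a digit position) is an assumption, since the message does not spell it out; the substitution pairs are the ones listed. fix_by_pattern is a hypothetical helper name.

```python
# Look-alike pairs listed in the message. Reading "W" as a letter position
# and "D" as a digit position is an assumption.
TO_DIGIT = {"S": "8", "O": "0", "I": "1", "i": "1"}
# Reverse direction; "1" is mapped to "I" here, which is an arbitrary choice
# between "I" and "i".
TO_LETTER = {"8": "S", "0": "O", "1": "I"}


def fix_by_pattern(text: str, pattern: str) -> str:
    """Force each character to the class the pattern expects at that position."""
    if len(text) != len(pattern):
        return text  # unexpected length: leave the OCR output untouched
    fixed = []
    for ch, kind in zip(text, pattern):
        if kind == "D" and not ch.isdigit():
            ch = TO_DIGIT.get(ch, ch)
        elif kind == "W" and ch.isdigit():
            ch = TO_LETTER.get(ch, ch)
        fixed.append(ch)
    return "".join(fixed)


print(fix_by_pattern("8O1S", "WDDD"))
```

Characters that have no look-alike in the tables are left as they are, so a wrong but unambiguous character is not silently changed.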

Re: [tesseract-ocr] Doubt about using 5.0.0-beta-20210916 before release version is available

2021-10-19 Thread Lorenzo Bolzani
Hi Merlijn, out of curiosity, did you notice an improvement over the previous version? Thanks Lorenzo On Tue, 19 Oct 2021 at 16:05, Merlijn B.W. Wajer < merl...@archive.org> wrote: > Hi, > > On 19/10/2021 11:08, juan carlos hernández wrote: > > Hello > > I'm working in a project th

Re: [tesseract-ocr] The pictures captured by the camera did not identify well after preprocessing

2021-09-16 Thread Lorenzo Bolzani
Hi Vli, I think you should test this on something similar to your actual text, not on the alphabet or random strings. With real text you are not going to see () or <> that may be mistaken for an O. The sequence of characters may influence the output, in other words try it on real text. You can als

Re: [tesseract-ocr] Improve OCR Accuracy

2021-03-26 Thread Lorenzo Bolzani
Hi Hamzeh, next time please explain exactly where the problem is so that people here do not have to manually check all the numbers to spot the mistakes. Try to threshold the image and upscale it more, I would start with four times. Bye Lorenzo On Fri, 26 Mar 2021 at 14:46, Hamzeh a

Re: [tesseract-ocr] Pytesseract processing images already in memory

2021-03-25 Thread Lorenzo Bolzani
Try tesserocr, a real binding library. Bye Lorenzo On Thu, 25 Mar 2021 at 05:44, Alex Zetaeffesse wrote: > Hi all, > > I'm already using a python library (pyvips) for cropping images with text > inside. > Is there a way to have Pytesseract process images in memory without the

Re: [tesseract-ocr] Tesseract Performance

2020-12-24 Thread Lorenzo Bolzani
If the results are exactly the same the most likely explanation is that you are still using the old model. Try to move or rename the new model and see if something changes. Did you see an improvement during the training? Mean rms, char train, word train, etc. Bye Lorenzo On Thu, 24 Dec

Re: [tesseract-ocr] Recognising numbers in sudoku

2020-11-16 Thread Lorenzo Bolzani
Use hough lines detector to detect the lines and draw a thick white line over them. https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_houghlines/py_houghlines.html Or use MSER to detect blobs of the appropriate size. https://answers.opencv.org/question/19015/how-

Re: [tesseract-ocr] Guidance for not recognized text

2020-10-01 Thread Lorenzo Bolzani
Invert the image. On Thu, 1 Oct 2020, 14:58, Jean-Marc Spaggiari wrote: > Hi, > > I'm playing around with Tesseract to try to do some OCR on screen captures. > > My picture looks like this: > [image: name.png] > > But is recognized like this: > Eglise Chrétienne Evangélique de > sy oan 8)=1

Re: [tesseract-ocr] Tesseract makes different predictions on seemingly equal images. How to make it more robust?

2020-07-15 Thread Lorenzo Bolzani
I think the reason is that your input is bad so the model is confused and a few pixels are enough to see an extra letter. Your input is "bad" because it is different from the one used to train the neural network. The difference between the two images is small but the difference from the training

Re: [tesseract-ocr] Pytesseract cant read my image(close letter problem)

2020-05-31 Thread Lorenzo Bolzani
Hi, first invert the image. $ tesseract -l eng test2.png - FUTLutz FUTSalkay FUTLovazin FUTRaum Also upscale the image to twice the size to get text height about 30/50 pixels, this fixes the wrong letter. Lorenzo On Sun, 31 May 2020 at 21:19, Dtractus wrote: > I think pytes

Re: [tesseract-ocr] Re: What is the "Confidence"value returned by Tesseract and how it is calculated?

2020-05-05 Thread Lorenzo Bolzani
Hi, I think the confidence score is returned by the neural network itself. In my experience values below 95 are usually unusable. Above 99 is usually correct. I would set the threshold somewhere between 97.5 and 98.5 depending on your requirements. The lowest value I have ever seen is 75 but anyth
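A small sketch of applying such a cut-off to per-word results. The default of 98.0 is just one value inside the 97.5-98.5 range mentioned above, and the sample words and scores are invented; split_by_confidence is a hypothetical helper, not a Tesseract function.

```python
from typing import Iterable, List, Tuple


def split_by_confidence(words: Iterable[Tuple[str, float]],
                        threshold: float = 98.0) -> Tuple[List[str], List[str]]:
    """Separate (word, confidence) pairs into accepted and rejected lists."""
    accepted: List[str] = []
    rejected: List[str] = []
    for word, conf in words:
        (accepted if conf >= threshold else rejected).append(word)
    return accepted, rejected


# Made-up OCR output: (word, confidence on a 0-100 scale).
sample = [("Invoice", 99.4), ("N0.", 93.1), ("12345", 98.2)]
ok, doubtful = split_by_confidence(sample)
print(ok, doubtful)
```

Rejected words can then be sent to manual review or re-run with different preprocessing, depending on the requirements.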

Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

2020-04-10 Thread Lorenzo Bolzani
I thought this may lead to some insights useful for the OP but as the matter gets more mysterious I'm opening a new thread not to hijack this. Lorenzo On Fri, 10 Apr 2020 at 17:27, Lorenzo Bolzani < l.bolz...@gmail.com> wrote: > Hi, > I started writing this em

Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

2020-04-10 Thread Lorenzo Bolzani
Hi, I started writing this email thinking that removing some characters should not make any real difference: I think the model parameters do not change with fine tuning and even when removing a few layers the bulk of the model remains the same. I decided to test it and I found a very strange thin

Re: [tesseract-ocr] fine tuning from traineddata_best

2020-04-03 Thread Lorenzo Bolzani
Hi, tesstrain (https://github.com/tesseract-ocr/tesstrain) works very well. It is not the same thing as tesstrain.sh, it was called ocr-d before. tesstrain works only with single lines. You need only the images and the corresponding gt.txt files, it will create the tiff, box files and lstmf, unich

Re: [tesseract-ocr] Unable detect number in box

2020-04-03 Thread Lorenzo Bolzani
Yes, I think this kind of box may be a problem for tesseract. But the script I posted removes the box and solves the problem. To remove the box you fill the area around the box with the same color as the box so they merge. It's easier if you do it on a thresholded image. Bye Lorenzo Il gi

Re: [tesseract-ocr] Re: Looking to hire a pytesseract consultant via skype

2020-04-02 Thread Lorenzo Bolzani
Hi, you could try to look at the distances between the symbol boxes, see the attached script. It's not very reliable as it depends very much on how you preprocess the text and you have to find the magic threshold. I'm using the 4.0 version, symbol boxes were improved in the 4.1 version, it could w
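The attached script is not part of this archive. As a rough illustration of the idea only, the sketch below groups symbol boxes by the horizontal gap between neighbours. The box layout (left, top, right, bottom) and the max_gap value are assumptions; max_gap is the "magic threshold" that has to be tuned per dataset.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout


def group_by_gap(boxes: List[Box], max_gap: int) -> List[List[Box]]:
    """Split left-to-right symbol boxes into groups at gaps wider than max_gap."""
    groups: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        # Gap between this box's left edge and the previous box's right edge.
        if groups and box[0] - groups[-1][-1][2] <= max_gap:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


# Made-up boxes: two symbols close together, then one far to the right.
symbols = [(0, 0, 10, 20), (40, 0, 50, 20), (12, 0, 22, 20)]
print(group_by_gap(symbols, max_gap=5))
```

Because the gaps depend on the preprocessing and on the box quality of the Tesseract version in use, the threshold found on one batch may not carry over to another.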

Re: [tesseract-ocr] Unable detect number in box

2020-03-27 Thread Lorenzo Bolzani
Hi, an easy trick to remove closed borders is to fill the outside area with the border color and then with the opposite one. See the attached example. For image 2 it is more complex. You can crop the image a little to remove the external borders and paint a rectangle over the middle line if the lo
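The two-fill trick can be sketched on a tiny binary grid (0 = black, 1 = white). This is an illustration in plain Python, not the attached example; it assumes that pixel (0, 0) lies outside the box, that the image is already thresholded, and that the content does not touch the border (otherwise it would be erased together with it).

```python
from collections import deque
from typing import List

Grid = List[List[int]]  # 0 = black, 1 = white


def flood_fill(img: Grid, start_row: int, start_col: int, new: int) -> None:
    """Recolor, in place, the 4-connected region containing the start pixel."""
    old = img[start_row][start_col]
    if old == new:
        return
    rows, cols = len(img), len(img[0])
    queue = deque([(start_row, start_col)])
    img[start_row][start_col] = new
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and img[nr][nc] == old:
                img[nr][nc] = new
                queue.append((nr, nc))


def remove_closed_border(img: Grid, border: int = 0) -> None:
    """Erase a closed box: merge the outside with the border, then clear both."""
    flood_fill(img, 0, 0, border)      # outside now has the border color
    flood_fill(img, 0, 0, 1 - border)  # outside + border become background


# White page, black box, one black "ink" pixel in the middle of the box.
page = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
remove_closed_border(page)
for row in page:
    print(row)
```

With OpenCV the same two steps would normally be done with its flood fill function on the thresholded image; the pure-Python version here is only meant to show why the order of the two fills matters.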

Re: [tesseract-ocr] Re: How to prepare fonts folder to train from scratch

2020-03-25 Thread Lorenzo Bolzani
I think fine tuning may work very well in this case, no need to train from scratch. Training from scratch does not guarantee better results, especially if you don't do it correctly. I suggest to try fine tuning first and see if the results are good enough for you. In this way you get comfortable w

Re: [tesseract-ocr] How to prepare fonts folder to train from scratch

2020-03-25 Thread Lorenzo Bolzani
Why do you want to do this? On Tue, 24 Mar 2020, 21:05, Essam Zaky wrote: > Hi Dears , > > I would like to build *.traindata from scratch specially for English and > Arabic > > So lets talk about English as example > my question how to prepare fonts folder? > > i read the > https://github.com

Re: [tesseract-ocr] Tesseract not recognizing ancient language's code

2020-03-15 Thread Lorenzo Bolzani
Common fonts do not cover every unicode symbol (about 10). If one font works and another does not, the text is correct and you just need to find fonts covering that language. Lorenzo On Sat, 14 Mar 2020, 23:34, aby tesh wrote: > Even google's Noto font doesn't show glyphs while opening

Re: [tesseract-ocr] Best filter/preprocess for these type of images?

2020-02-24 Thread Lorenzo Bolzani
Do a threshold (otsu), count the white and black pixels, this will tell you if you have white text on dark background or the opposite. If necessary, negate the image so as to have dark text on a bright background. The images are very small, you want at least 35/50px. Try to have them larger if possib
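The polarity check can be sketched as follows, on a flat list of 8-bit grayscale values so that it runs without any image library. It assumes the text covers fewer pixels than the background, which is what makes the majority color the background. In practice OpenCV's cv2.threshold with the cv2.THRESH_OTSU flag would replace the hand-written threshold function.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def needs_inversion(pixels: List[int]) -> bool:
    """True when dark pixels are the majority (light text on dark background)."""
    t = otsu_threshold(pixels)
    dark = sum(1 for p in pixels if p <= t)
    return dark > len(pixels) - dark


def invert(pixels: List[int]) -> List[int]:
    """Negate an 8-bit grayscale image given as a flat list."""
    return [255 - p for p in pixels]


# Made-up image: 90 dark background pixels and 10 bright text pixels.
img = [10] * 90 + [240] * 10
if needs_inversion(img):
    img = invert(img)
print(needs_inversion(img))
```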

Re: [tesseract-ocr] approches used for language detection on images ...

2020-02-01 Thread Lorenzo Bolzani
You can try some machine learning based text detection, like this one for example: https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/ https://github.com/argman/EAST It's not so easy to use because, as you can see in the images, you are going to get multiple boxes.

Re: [tesseract-ocr] Exception while using tesseract with CentOS Linux

2020-01-02 Thread Lorenzo Bolzani
Hi, leptonica is a little strange about versions. For example leptonica 1.76 means library version 5. The library I use is liblept.so.5 under /usr/local/lib/, /usr/lib/x86_64-linux-gnu/ (with Mint 19, Ubuntu based) I suppose you already have this if you correctly installed leptonica. You can che

Re: [tesseract-ocr] Re: Force Tesseract to do individual character OCR only

2019-10-31 Thread Lorenzo Bolzani
Hi Dave, are you sure the parameters are being used? For example setting lstm_choice_mode to an invalid number or lstm_choice_iterations to zero should at least produce some errors. With lstm_choice_mode > 0 you should get the extra matches in the HOCR. About the boxes, these are a problem in deco

Re: [tesseract-ocr] Force Tesseract to do individual character OCR only

2019-10-30 Thread Lorenzo Bolzani
Hi, first crop the white border around the text. In this way I get the correct result. Try this on a large batch of data and see what works best, no border, one pixel border, etc. Also try different text sizes, from 30 to 50, just upscale the image. If this does not help have a look here: https

Re: [tesseract-ocr] Is it possible to pass a numpy array to Tesseract, instead of saving it to the disk.

2019-10-30 Thread Lorenzo Bolzani
Hi, using the API through tesserocr I use api.SetImageBytes(raw_img.tobytes(), raw_img.shape[1], raw_img.shape[0], 1, raw_img.shape[1]) I recommend using this over pytesseract even if the installation sometimes may be a little more complex. Lorenzo On Wed, 30 Oct 2019 at 06:29, Ayu
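The five arguments in that call are the raw bytes, the width, the height, the bytes per pixel and the bytes per line. The sketch below only builds and sanity-checks that argument tuple for a tightly packed buffer, so it runs without tesserocr or numpy; image_bytes_args is a hypothetical helper, and the assumption is that rows have no padding.

```python
from typing import Tuple


def image_bytes_args(raw: bytes, width: int, height: int,
                     bytes_per_pixel: int = 1) -> Tuple[bytes, int, int, int, int]:
    """Build (data, width, height, bytes_per_pixel, bytes_per_line) for a packed buffer."""
    bytes_per_line = width * bytes_per_pixel  # assumes rows are not padded
    if len(raw) != bytes_per_line * height:
        raise ValueError("buffer size does not match the given dimensions")
    return raw, width, height, bytes_per_pixel, bytes_per_line


# Made-up 4x2 grayscale image: one byte per pixel, all white.
args = image_bytes_args(bytes([255] * 8), width=4, height=2)
print(args[1:])
# With a tesserocr API object this would be passed on as:
#   api.SetImageBytes(*args)
```

For a grayscale numpy array, width and height correspond to shape[1] and shape[0], as in the call quoted above.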

Re: [tesseract-ocr] Tesseract Strangely Thinks Text is Upside Down - ACCURACY

2019-10-17 Thread Lorenzo Bolzani
Maybe a problem with the exif rotation data? On Thu, 17 Oct 2019 at 20:12, Umut Barış Korkut < umut.kor...@gamyte.com> wrote: > Default psm works with these two pages but it does not work with the other > pages of the document because they have tables and vertical text. > > Is it

Re: [tesseract-ocr] Mute "Empty page!!" print when using libtesseract?

2019-10-09 Thread Lorenzo Bolzani
nding (if not the best, at least most >active C# tesseract solution) >2. Tesseract library prints output to stderr and stdout. Check the >source. > > Zdenko > > Dňa ut 8. 10. 2019, 14:52 Lorenzo Bolzani > napísal(a): > >> Hi, >> I suspect what you ar

Re: [tesseract-ocr] Mute "Empty page!!" print when using libtesseract?

2019-10-08 Thread Lorenzo Bolzani
I'm not a C# developer but I suppose you can just use the C++ library as is. Lorenzo On Tue, 8 Oct 2019 at 15:07, MPursche wrote: > I think you might be right, I just installed this one from NuGet: > https://github.com/charlesw/tesseract/ > > Do you know of a C# binding you wou

Re: [tesseract-ocr] Mute "Empty page!!" print when using libtesseract?

2019-10-08 Thread Lorenzo Bolzani
Hi, I suspect what you are using is not a real API binding but more of a command line wrapper. This is very slow and inconvenient to use. I would simply use the API, probably even the plain tesseract libs as you are using C#. The API does not write anything to the console. Lorenzo Il giorno ma

[tesseract-ocr] Parameter name to fine tune beam search added in 4.1?

2019-10-04 Thread Lorenzo Bolzani
Hi, I remember reading of an option/parameter to fine tune the beam search step (IIRC). Maybe it was here on the mailing list or in a bug report but I can't find it anymore. It was related to the bounding boxes problem where a character is split in multiple parts. Can anybody give me some p

Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-21 Thread Lorenzo Bolzani
If you are not sure if you have a single line or a single block use psm 6. See tesseract --help-extra Psm 6 generally works fine for single lines too. If you have full pages and single lines mixed you need a pre processing step (threshold, morphology, etc.) to understand what psm is the correct

Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-19 Thread Lorenzo Bolzani
I tried to upscale, downscale, with and without the white border and I always get Calibrations. I even tried a few psm modes. I'm using: tesseract 4.0.0 leptonica-1.76.0 libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 What I would do is this: - prepare a test se

Re: [tesseract-ocr] Could anyone help me about pytessract?

2019-09-19 Thread Lorenzo Bolzani
Try to invert the images. Lorenzo On Thu, 19 Sep 2019 at 05:52, luffy monky wrote: > Hi ALL > I try to use any sample code from google. > But it's show no thing in my code > Could I trouble you for any advice?? > Here is my sample code > > import py

Re: [tesseract-ocr] Re: problems with upper-case character

2019-09-19 Thread Lorenzo Bolzani
You say that both letters look the same (same height too?) and that it is not possible to do it in processing as both spellings are possible. How is tesseract, or a human, supposed to tell them apart? Can you please share a sample? Maybe using a smaller/bigger image is enough. Or maybe the image

[tesseract-ocr] Small script to generate all boxes for ocrd-train

2019-09-18 Thread Lorenzo Bolzani
Hi, I wrote this small script to speed up OCRD-train training startup. It generates the boxes for all the images provided on the command line (it works only for single line images). It is a simple conversion of the generate_line_box.py from ocrd-train. I used

Re: [tesseract-ocr] Is there any way to load model(tesseract custom model) at ones instead of loading every time?

2019-09-18 Thread Lorenzo Bolzani
Yes, you can create an instance and reuse it. You call Init once and just reuse it. Performance does improve. If you have multiple threads see this: https://stackoverflow.com/questions/4827924/is-tesseractan-ocr-engine-reentrant For multi threading I created a pool of instances. On each instance
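A pool of reusable instances can be sketched with the standard library alone. The factory is whatever creates and initializes one engine (for example, building an API object and calling Init once); FakeEngine below is a stand-in so the sketch runs without Tesseract installed, and InstancePool is a hypothetical class, not something provided by any binding.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Generic, Iterator, TypeVar

T = TypeVar("T")


class InstancePool(Generic[T]):
    """Fixed-size pool; each worker thread borrows one instance at a time."""

    def __init__(self, factory: Callable[[], T], size: int) -> None:
        self._free: "queue.Queue[T]" = queue.Queue()
        for _ in range(size):
            self._free.put(factory())  # expensive init happens once per instance

    @contextmanager
    def borrow(self) -> Iterator[T]:
        instance = self._free.get()  # blocks until an instance is free
        try:
            yield instance
        finally:
            self._free.put(instance)  # always hand the instance back


class FakeEngine:
    """Stand-in for a real OCR API object; counts how often it is built."""

    created = 0

    def __init__(self) -> None:
        FakeEngine.created += 1


pool = InstancePool(FakeEngine, size=2)
for _ in range(5):
    with pool.borrow() as engine:
        pass  # set the image and run recognition on `engine` here
print(FakeEngine.created)
```

Since an instance is only ever held by one thread at a time, the sketch does not rely on the engine itself being reentrant.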

Re: [tesseract-ocr] Getting started with tesseract-ocr in a web app.

2019-09-14 Thread Lorenzo Bolzani
I googled "node.js tesseract". Try one of these: https://www.npmjs.com/package/node-tesseract-ocr https://www.npmjs.com/package/node-tesseract https://www.npmjs.com/package/tesseract https://www.npmjs.com/search?q=Tesseract Just follow the instructions. Tesseract does not run as a server, is a

Re: [tesseract-ocr] Getting started with tesseract-ocr in a web app.

2019-09-13 Thread Lorenzo Bolzani
Hi Clint, after you install tesseract and tesseract libraries and tesseract dev libraries (using packages or from source) you can call it from your program. If you are using C++ just call it, if you are using other languages you need to find a wrapper/bindings library (for example tesserocr for pyt

Re: [tesseract-ocr] Fine tuning existing model

2019-09-08 Thread Lorenzo Bolzani
>>>>>> # Name of the model to continue from >>>>>>>> CONTINUE_FROM = frk >>>>>>>> >>>>>>>> # Normalization Mode - see src/training/language_specific.sh for >>>>>>>> details

Re: [tesseract-ocr] Fine tuning existing model

2019-09-06 Thread Lorenzo Bolzani
ated files" >>>>>> @echo "" >>>>>> @echo " Variables" >>>>>> @echo "" >>>>>> @echo "MODEL_NAME Name of the model to be built" >>>>>> @echo "CORES

Re: [tesseract-ocr] Fine tuning existing model

2019-09-06 Thread Lorenzo Bolzani
gt; >>>> # Create unicharset >>>> unicharset: data/unicharset >>>> >>>> # Create lists of lstmf filenames for training and eval >>>> lists: $(ALL_LSTMF) data/list.train data/list.eval >>>> >>>> data/list.train: $(A

Re: [tesseract-ocr] my scan of alphanumeric data needs TLC

2019-08-27 Thread Lorenzo Bolzani
Try to manually clean the images with Gimp, remove the black noise and see if it helps. Also try to remove the white border. After each step run tesseract again to see if the problem was there. Also try to downscale the images so that the text is 40/60 px tall, try different sizes and see what work

Re: [tesseract-ocr] Tesseract 4.0 LSTM comparing with other OCR engines

2019-07-25 Thread Lorenzo Bolzani
> 1. Tesseract is failing on recognizing reverse video text(Black backround and White foreground). Yes, it works better black on white I think the training is black on white only (but I'm not 100% sure). This cases are easy to detect (count the ratio of pixels above/below a low threshold) and to i

Re: [tesseract-ocr] Tesseract 4.0 LSTM comparing with other OCR engines

2019-07-24 Thread Lorenzo Bolzani
If you cannot share even a few words, lines or fragments I think there is not much to tell other than this: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality Google vision does a lot of complex pre-processing, tesseract does none, it has to be done by the user. I suggest to man

Re: [tesseract-ocr] Tesseract 4.0 LSTM comparing with other OCR engines

2019-07-24 Thread Lorenzo Bolzani
Hi Prasad, please post a few samples of normal and poor quality images and details of any preprocessing you did on these images before calling the OCR, if any. Bye Lorenzo On Wed, 24 Jul 2019 at 13:09, prasad nemmikanti < prasadn...@gmail.com> wrote: > Recently I have started

Re: [tesseract-ocr] Trained data for E13B font

2019-07-19 Thread Lorenzo Bolzani
PSM 7 was a partial solution for my specific case, it improved the situation but did not solve it. Also I could not use it in some other cases. The proper solution is very likely doing more training with more data, some data augmentation might probably help if data is scarce. Also doing less train

Re: [tesseract-ocr] Trained data for E13B font

2019-07-17 Thread Lorenzo Bolzani
Phantom characters here for me too: https://github.com/tesseract-ocr/tesseract/issues/1778 Are you using 4.1? Bounding boxes were fixed in 4.1 maybe this was also improved. I wrote some code that uses symbols iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. B
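The code mentioned above is not shown in the archive. As a rough illustration of the kind of filter described (too small, overlapping), the sketch below works on plain bounding boxes; the box layout and both thresholds are hypothetical values that would need tuning, and in a real setup the boxes would come from the symbol iterator.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout


def area(box: Box) -> int:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    smaller = min(area(a), area(b))
    return area(inter) / smaller if smaller else 0.0


def drop_phantoms(boxes: List[Box], min_height: int = 8,
                  max_overlap: float = 0.5) -> List[Box]:
    """Keep symbols that are tall enough and not mostly covered by a kept one."""
    kept: List[Box] = []
    for box in boxes:
        if box[3] - box[1] < min_height:
            continue  # too small to be a real character
        if any(overlap_ratio(box, k) > max_overlap for k in kept):
            continue  # mostly overlaps an already accepted symbol
        kept.append(box)
    return kept


# Made-up symbols: a real one, a duplicate inside it, a neighbour, a speck.
found = [(0, 0, 10, 20), (2, 0, 10, 20), (12, 0, 20, 20), (22, 15, 25, 18)]
print(drop_phantoms(found))
```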

Re: [tesseract-ocr] How to achieve very high fine-tuning accuracy on a particular font of english? (requirement: char error rate < 0.1%)

2019-07-11 Thread Lorenzo Bolzani
Hi, a few things I would try (*I never trained on cursive fonts*): - I would use a stable tesseract version (4.1 right now) - 0.7 is not a very good score for a text this clean - I think 6000 lines is not much, hard to tell if it is enough, this is not a classic font - data pre processing may help

Re: [tesseract-ocr] Train Tesseract to ignore music?

2019-06-28 Thread Lorenzo Bolzani
Hi Sara, can you please post a sample picture? You could probably detect the staff (hough lines with very tight parameters, custom horizontal lines detection) and just replace it with a white rectangle. Lorenzo On Fri, 28 Jun 2019 at 07:15, Sara Palmer wrote: > I'd like to p

Re: [tesseract-ocr] Re: Trouble reading text "in between lines"

2019-06-26 Thread Lorenzo Bolzani
I was referring to the image sample you posted where there are three columns. Regarding the new diagrams, I do not know what information you need and if all the diagrams have the same layout. Anyway I would first cut individual boxes from the bottom right table or at least three columns. I would

Re: [tesseract-ocr] Re: Trouble reading text "in between lines"

2019-06-26 Thread Lorenzo Bolzani
Cut the image in half with gimp and try to see if it is the case. Each image will be smaller so, if you discard empty white borders it could even be faster. I do it in my application with no problems. I do not understand why you need overlap. Maybe you cannot cut the image in the way I would expect

Re: [tesseract-ocr] Re: Trouble reading text "in between lines"

2019-06-26 Thread Lorenzo Bolzani
Can you cut the image vertically in a simple way? Lorenzo On Wed, 26 Jun 2019 at 11:08, 'Hu gePanic' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > I have "sort of" solved the problem. > > I run tesseract 2 times. > After the first run I delete all the text already

Re: [tesseract-ocr] OCR pipeline with OpenCV

2019-06-19 Thread Lorenzo Bolzani
Hi Nicolas, I think what you did is good, you just need to play with pre-processing more. I usually process the images with Gimp until I can get good results, then I try to do the same processing with opencv/PIL. You do not strictly need to threshold the image, a very very strong contrast is en

Re: [tesseract-ocr] Re: FontAwesome and Tesseract

2019-06-18 Thread Lorenzo Bolzani
How many different chars do you need to detect? What is the size range (in pixels)? What kind of images, scans, smartphone pictures, screenshots? If you just want to locate the symbols something like opencv matchTemplate may

Re: [tesseract-ocr] Tesseract does not give good output we need some suggestion.

2019-06-11 Thread Lorenzo Bolzani
Try to straighten the text: https://www.pyimagesearch.com/2017/02/20/text-skew-correction-opencv-python/ (I suspect you are already doing this) Small dots will give you problems with this method, so first make a copy of the image, run a light close/erode (google: morphology transformation) to re

Re: [tesseract-ocr] Bounding box

2019-06-09 Thread Lorenzo Bolzani
the link about "no need" of bounding boxes of > every unit but rather the whole line > > On Sun, Jun 9, 2019 at 2:52 PM Lorenzo Bolzani > wrote: > >> I think you are talking about preparing the training data. With >> tesseract 4.x you do not need to define the

Re: [tesseract-ocr] Bounding box

2019-06-09 Thread Lorenzo Bolzani
I think you are talking about preparing the training data. With tesseract 4.x you do not need to define the boxes for each character, just one big box for the whole line. Bye Lorenzo On Sun, 9 Jun 2019 at 10:50, Jennil Thiyam < thiyamjen...@gmail.com> wrote: > ই 110 4657 137 47

Re: [tesseract-ocr] MRZ/MRP (Machine-readable zone/passport) dataset for tesseract v4

2019-05-29 Thread Lorenzo Bolzani
Hi Mamadou, this sounds very interesting. How did you do the training and accuracy measurements? What parameters did you use for the model? Thanks, bye Lorenzo On Mon, 27 May 2019 at 07:38, Mamadou wrote: > Hello, > > We have open sourced (BSD license) MRZ/MRP (Machine-readabl

Re: [tesseract-ocr] unicharset_extractor error

2019-05-24 Thread Lorenzo Bolzani
Also try: locate tesseract ldconfig -p | grep tesseract ls -l /usr/local/lib/libtesseract* and run: sudo ldconfig after you uninstall tesseract (or even right now). On Fri, 24 May 2019 at 15:37, anne < christineannecatu...@gmail.com> wrote: > These are what I get > *ldd /u

Re: [tesseract-ocr] Black & white comic text recognition

2019-05-24 Thread Lorenzo Bolzani
Hi, I do not think tesseract page segmentation can handle this kind of layout. It's more oriented towards paragraphs, tables and classic text layouts. And I think page segmentation is not based on neural networks. I would try something like opencv EAST

Re: [tesseract-ocr] OCRing simple numbers unreliable

2019-05-22 Thread Lorenzo Bolzani
Hi, try these (in any combination): psm 6 or 7 remove white border (all or most) downscale so that the font is 20/50px tall fine tune a model to recognize only numbers threshold Otherwise post more details about how you are using tesseract. Bye Lorenzo On Wed, 22 May 2019 at 11:

Re: [tesseract-ocr] Tire DOT OCR - Black Text, Black Background

2019-05-21 Thread Lorenzo Bolzani
Hi, this looks hard. You have two problems here: straightening the text and cleaning it up. Once you have straightened the text to something like this: [image: 8829199908894_crop.jpg] google vision api recognizes it correctly. So it can be done. I do not know how they

Re: [tesseract-ocr] Recommendation on how to best train Tesseract for new UTF-8 symbols

2019-05-21 Thread Lorenzo Bolzani
Hi, when you fine tune the model (maybe with ocrd-train) you can choose to restrict the model output to a smaller set of characters. No need to blacklist or anything else. If you just want to locate the symbols something like opencv matchTemplate

Re: [tesseract-ocr] How to extract text for processing by tesseract v4?

2019-05-20 Thread Lorenzo Bolzani
I just found this: https://www.quora.com/How-do-I-fill-holes-in-image-using-image-processing/answer/V-Sri-Chakra-Kumar On Wed, 8 May 2019 at 09:57, Lorenzo Bolzani wrote: > Hi, > you can try a few things, but you need to write a small script (python, > etc.) or use im

Re: [tesseract-ocr] looking for URLs in screen shots

2019-05-15 Thread Lorenzo Bolzani
Hi, if you are willing to program a little this is what I would try: - opencv template matching : extract a few frame fragments containing "https://" from the video then look for it in all frames (or maybe one frame out of).

Re: [tesseract-ocr] Processing an image batch from the API

2019-05-09 Thread Lorenzo Bolzani
Like a dozen JPEGs. I can do a for loop, but I'm looking for something like a setImages() giving me a list of results. On Thu, 9 May 2019 at 16:35, Zdenko Podobny wrote: > What do you mean by " batch of images "? Tiff? > > Zdenko > > > št 9

[tesseract-ocr] Processing an image batch from the API

2019-05-09 Thread Lorenzo Bolzani
Hi, is there a way to process a batch of images with a single api call? By looking at the api I'm quite sure you cannot, but maybe I'm missing something. Thanks, Lorenzo -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from th

Re: [tesseract-ocr] How to extract text for processing by tesseract v4?

2019-05-08 Thread Lorenzo Bolzani
Hi, you can try a few things, but you need to write a small script (python, etc.) or use imagemagick. I suggest to first try with gimp, find what works best, and then write the code. You want dark text on clear background. For white text on red: 1. Invert the image. Desaturate. Increase contrast.

Re: [tesseract-ocr] OCR Failing to Consistenly Recongnize the single digit in my screenshot

2019-05-07 Thread Lorenzo Bolzani
This is where you need to improve contrast. https://pillow.readthedocs.io/en/stable/reference/ImageEnhance.html You need to play a little with PIL to find out what works best for your data. Lorenzo On Tue, 7 May 2019 at 21:21, Sean Connell < nightfire120sla...@gmail.com> wrote:

Re: [tesseract-ocr] OCR Failing to Consistenly Recongnize the single digit in my screenshot

2019-05-07 Thread Lorenzo Bolzani
Hi, try to invert the images (black text on white) and use psm 6 or 7. Increasing contrast may also help. Lorenzo On Tue, 7 May 2019, 08:49, Sean Connell wrote: > Currently my program searches for the picture of the word Opponents on the > screen then moves a bit a takes a picture of the

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Lorenzo Bolzani
Shree, thanks for the clarification. On Fri, 3 May 2019 at 11:59, Shree Devi Kumar < shreesh...@gmail.com> wrote: > >There are three model sizes: best, normal and fast. Each of these can > also be converted to an integer model. > > Only `best` can be converted to integer and in fa

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Lorenzo Bolzani
s://github.com/tesseract-ocr/tessdata). >>>> >>>> >>>> >>>> One more question: I wanted to check if the output character set of the >>>> new and old model differ. I used: >>>> >>>> combine_tessdata -u eng.tra

Re: [tesseract-ocr] Fine tuning existing model

2019-05-02 Thread Lorenzo Bolzani
lse to configure? >> Thanks, bye >> Lorenzo >> 2018-06-29 18:27 GMT+02:00 Shree Devi Kumar: >>> You should be able to use the new makefile after you make changes for >>> all the directory locations to match your

Re: [tesseract-ocr] Fails to recognize seemingly simple text

2019-05-02 Thread Lorenzo Bolzani
Hi, use psm 6 (or 7). Also try to crop to a single line, if possible. Black text on a white background is better. You should be able to isolate the text in this way: https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/ Lorenzo On Thu, 2 May 2019 at 16:15, Arjun Bk

Re: [tesseract-ocr] Simple image FAIL fails

2019-04-29 Thread Lorenzo Bolzani
Hi, inverting the image gives the correct results. Cropping the image just around the text also works. Lorenzo On Mon, 29 Apr 2019 at 19:11, Jason wrote: > Apologies for such a simple question but this is a super simple test case > and I don't understand why it isn't working. T
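Cropping just around the text can be automated roughly as follows. This is a sketch that assumes dark text on a light background (so invert first if your image is the other way round); the function name, the `dark_threshold` of 128 and the small white `margin` are illustrative choices, not values from the thread.

```python
from PIL import Image, ImageOps


def crop_to_text(img: Image.Image, margin: int = 2, dark_threshold: int = 128) -> Image.Image:
    """Crop a grayscale copy of img to the bounding box of its dark pixels."""
    gray = img.convert("L")
    # Mask is 255 where a pixel is dark (assumed to be text), 0 elsewhere.
    mask = gray.point(lambda value: 255 if value < dark_threshold else 0)
    bbox = mask.getbbox()  # None when there are no dark pixels at all
    if bbox is None:
        return gray
    cropped = gray.crop(bbox)
    # Put a thin white border back around the tight crop.
    return ImageOps.expand(cropped, border=margin, fill=255)
```

On noisy images a single dark speck will enlarge the bounding box, so some denoising before this step is probably needed.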

Re: [tesseract-ocr] Re: Recognition of "5" instead of "S"

2019-04-28 Thread Lorenzo Bolzani
I think the problem is also that the network does not expect a mix of letters and numbers. The text is processed as a continuous stream and not as individual characters. This is good for text but not for codes. So if you want to fine tune you need to provide similar mixed sequences. Also, if poss

Re: [tesseract-ocr] small image and OCR

2019-04-23 Thread Lorenzo Bolzani
Hi, I suspect you did a cut and paste or some edits and now you have some non-printable characters in your command line (the question mark boxes). Write it again from scratch. And you are missing one parameter in the command line, the output file, you can use "-" for standard output. $ tesseract

Re: [tesseract-ocr] is there a way to scan only first word of a page?

2019-04-19 Thread Lorenzo Bolzani
Hi, if the page has a fixed simple format you can crop the image leaving only the upper part. You can use ImageMagick or a Python script, etc. Lorenzo On Fri, 19 Apr 2019 at 14:49, Vikas Sharma <vikasharma2...@gmail.com> wrote: > Hello guys, > > I am trying to identify page cate
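A minimal Python version of "leave only the upper part", using Pillow. The function name and the default of keeping the top 20% are assumptions; the right fraction depends on where the first word sits in your page layout.

```python
from PIL import Image


def crop_top(img: Image.Image, fraction: float = 0.2) -> Image.Image:
    """Return only the top `fraction` of the image (full width)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    width, height = img.size
    # The crop box is (left, upper, right, lower) in pixels.
    return img.crop((0, 0, width, max(1, round(height * fraction))))
```

The cropped image can then be saved and passed to tesseract instead of the full page.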

Re: [tesseract-ocr] How to choose the stop condition of LSTM training

2019-04-18 Thread Lorenzo Bolzani
> >> There is no existing utility to do that. However, Ray had dumped the info >> for tessdata_fast (and partly for tessdata_best) which has been posted in >> the wiki at >> >> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast >> >> >

Re: [tesseract-ocr] How to choose the stop condition of LSTM training

2019-04-17 Thread Lorenzo Bolzani
Split the data set into two parts (80/20 for example), use the larger one for training and the other for evaluation. Train for a few epochs (100 or 1000 depending on how much data you have), stop it and check with lstmeval if the *eval score* is improving. Restart the training adding 100/1000 to the
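The 80/20 split itself is easy to script. A small sketch using only the standard library; the function name and the fixed seed are illustrative, and how the two lists are then handed to the training and to lstmeval depends on your setup.

```python
import random


def split_train_eval(items, eval_fraction=0.2, seed=0):
    """Shuffle items reproducibly and return (train, eval) lists."""
    shuffled = list(items)
    # A fixed seed keeps the same split across restarts of the training,
    # so the eval set never leaks into the training set.
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

For example, `train, held_out = split_train_eval(sorted(lstmf_paths))`, where `lstmf_paths` is a hypothetical list of your training files.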

Re: [tesseract-ocr] Tips and advice for preprocessing images before feeding them to tesseract.

2019-04-15 Thread Lorenzo Bolzani
This is very hard to do reliably for general images. You may use something like EAST to detect text regions, then a few tests to understand if it's black on white text or the opposite. Then you can crop the image and rescale it to a standard size (this may not be the final size you'll feed to tess

Re: [tesseract-ocr] small image and OCR

2019-04-14 Thread Lorenzo Bolzani
Hi Alex, you need to preprocess the image a little. First negate it: tesseract expects dark text on a white background. Then use --psm 6 to tell tesseract that this is a single block of text and not a complex page to split into paragraphs. Also try psm 7, single line. tesseract --psm 6 cropped_image
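If you call tesseract from a script, trying both page segmentation modes can look roughly like this. The helper below only builds the argument list (image, output base, then options); the function name is made up, and `-` as output base sends the text to standard output.

```python
def tesseract_command(image_path, psm=6, output_base="-"):
    """Build the argument list for one tesseract run."""
    return ["tesseract", str(image_path), output_base, "--psm", str(psm)]


# Example use, assuming tesseract is installed and on the PATH:
#   import subprocess
#   for psm in (6, 7):
#       result = subprocess.run(tesseract_command("cropped_image.png", psm),
#                               capture_output=True, text=True)
#       print(psm, result.stdout.strip())
```

Passing the arguments as a list rather than a single shell string also avoids stray or non-printable characters sneaking into the command line.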

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-08 Thread Lorenzo Bolzani
om deep learning. We can get any > complicated feature from convolution. So theoretically, there is no need to do > such preprocessing. What do you think about this? > > > On Wed, Apr 3, 2019 at 21:17 Lorenzo Bolzani wrote: > >> Hi, I train with real data. I use grayscale im

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-08 Thread Lorenzo Bolzani
a script for image pre-processing? Please share, if possible. > It will be helpful to many. > > On Wed, Apr 3, 2019 at 6:47 PM Lorenzo Bolzani > wrote: > >> Hi, I train with real data. I use grayscale images, I think color makes >> no difference. >> >> I do

Re: [tesseract-ocr] confuse whether Otsu Thresholding affects lstm training

2019-04-03 Thread Lorenzo Bolzani
Hi, I train with real data. I use grayscale images, I think color makes no difference. I do a very thorough image cleanup: background removal, denoise, straightening, sharpening, illumination correction, contrast stretching, etc. before passing the text to tesseract. This part is likely better done o

Re: [tesseract-ocr] General strategies for dealing with problem images

2019-03-23 Thread Lorenzo Bolzani
On Tue, 19 Mar 2019 at 06:03, Jonathan Muller <jmul...@pukogames.com> wrote: > 5 - Create a whitelist based on the zone of probable characters (this one > improves accuracy a lot!) > How do you do whitelisting with tesseract 4.x? As far as I know it is not yet supported. I do the

[tesseract-ocr] Does the psm value used to generate lstmf files influence the training?

2019-03-21 Thread Lorenzo Bolzani
Hi, I keep having problems with duplicated letters with custom fine-tuned models. For example an M becomes MH. I'm using ocrd-train with actual crops and I noticed that the lstmf files are generated with psm 6. At runtime I use psm 7. Do you think this may make a difference? From a quick test it

Re: [tesseract-ocr] How to choose a suitable threshold for Binarisation

2019-03-08 Thread Lorenzo Bolzani
If someone wants to try this and is looking for a python implementation here is one: http://scikit-image.org/docs/dev/auto_examples/segmentation/plot_niblack_sauvola.html https://github.com/scikit-image/scikit-image/pull/905/files/bb6af8ec723776fc821654847aec04a652f70042 binary_phansalkar = thr
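For reference, local thresholding with scikit-image looks roughly like this. Note that this sketch uses Sauvola (`threshold_sauvola`, shown in the first link), not the Phansalkar variant from the pull request; the function name, the window size of 15 and `k=0.2` are assumptions to tune on your own images.

```python
import numpy as np
from skimage.filters import threshold_sauvola


def binarize_sauvola(gray: np.ndarray, window_size: int = 15, k: float = 0.2) -> np.ndarray:
    """Return a boolean image: True for background, False for dark text."""
    # One threshold per pixel, computed from the local mean and std. deviation.
    thresh = threshold_sauvola(gray, window_size=window_size, k=k)
    return gray > thresh
```

The result can be turned back into an 8-bit image for tesseract with `(binary * 255).astype(np.uint8)`.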

Re: [tesseract-ocr] [4.00] Extra symbols produced

2019-03-01 Thread Lorenzo Bolzani
Yes, I have the same problem: some characters are split, and sometimes from one character you even get three ("O0O" for example). https://github.com/tesseract-ocr/tesseract/issues/1778 I wrote some quite complex code to try to limit the problem (with psm 13). The idea is this: Process each symbol indi
