Re: [tesseract-ocr] How to generate training images with noise

2023-10-13 Thread Shree Devi Kumar
what I understand, > tesseract requires on the order of 10K images and box files to train on. > However, unless I am missing something, what I read at > https://github.com/tesseract-ocr/tesstrain assumes the ground truth > (images + box files) already exist. > > On Fri, Oct 13, 2023 at

Re: [tesseract-ocr] How to generate training images with noise

2023-10-13 Thread Shree Devi Kumar
com/tesseract-ocr/tesstrain assumes the ground truth > (images + box files) already exist. > > On Fri, Oct 13, 2023 at 1:00 AM Shree Devi Kumar > wrote: > >> Have you looked at >> >> https://github.com/tesseract-ocr/tesstrain >> >> >> >> On Thu

Re: [tesseract-ocr] How to generate training images with noise

2023-10-12 Thread Shree Devi Kumar
Have you looked at https://github.com/tesseract-ocr/tesstrain On Thu, Oct 12, 2023, 11:45 PM Keith Smith wrote: > Hello, > > I am trying to use tesseract to OCR the MICR line of checks (i.e. the > micr-e13b font). The training data that I found at > https://github.com/BigPino67/Tesseract-MIC

Re: [tesseract-ocr] Does the checkpoint_name contain the number of iterations

2023-09-20 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#iterations-and-checkpoints On Wed, Sep 20, 2023, 2:53 AM Des Bw wrote: > I couldn't understand what the numbers on the checkpoint_names are. > I looked at this one: but clear to me. > > https://github.com/tesseract-ocr

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
ory of components from the .traineddata file. *-l* *.traineddata* *FILE*...: List the network information. On Sat, Sep 16, 2023, 2:11 PM Shree Devi Kumar wrote: > The language name headings seem to be missing from the tessdoc page for > tessdata_fast > > Please revert to an older ve

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
The language name headings seem to be missing from the tessdoc page for tessdata_fast Please revert to an older version of page from history On Sat, Sep 16, 2023, 2:08 PM Shree Devi Kumar wrote: > > https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md >

Re: [tesseract-ocr] How to get the net_spec

2023-09-16 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_best.md https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-tessdata_fast.md Version string : 4.00.00alpha : [Network specification] for tessdata_best tessdata_best models - *incomplete list*, only till Kannad

Re: [tesseract-ocr] accuracy problem after trained in fine-tune

2023-08-09 Thread Shree Devi Kumar
Include the default fonts also in your fine-tuning list of fonts and see if that helps. On Wed, Aug 9, 2023, 2:27 PM Ali hussain wrote: > I have trained some new fonts by fine-tune methods for the Bengali > language in Tesseract 5 and I have used all official trained_text and > tessdata_best and

Re: [tesseract-ocr] Tesseract training for New font/language

2023-04-01 Thread Shree Devi Kumar
Aurebesh seems to be different symbols mapped to the English alphabet rather than a new font for English, hence training would need to be for a new language rather than just fine-tuning. On Sat, Apr 1, 2023, 10:47 Ali Abedian wrote: > Hello, > > Thank you for providing the references, but I'm st

Re: [tesseract-ocr] Re: Kurdish traineddata

2022-10-17 Thread Shree Devi Kumar
ble on > > https://github.com/KurdishBLARK/KurdishOCR > > On Sun, Oct 16, 2022 at 20:59 Shree Devi Kumar > wrote: > >> Thank you for sharing information regarding successful training of >> Kurdish traineddata for Tesseract. >> >> Please also let us know wheth

[tesseract-ocr] Re: Kurdish traineddata

2022-10-16 Thread Shree Devi Kumar
Thank you for sharing information regarding successful training of Kurdish traineddata for Tesseract. Please also let us know whether the traineddata is available for others to use. You may want to contribute to the tess_contrib repo. Let us know whether the recognition covers 0-9 digits in Arabi

Re: [tesseract-ocr] Running Tesseract 5 on Linux

2022-04-03 Thread Shree Devi Kumar
Have you tried instructions on https://tesseract-ocr.github.io/tessdoc/Installation.html On Sun, Apr 3, 2022, 22:08 'Peter Kronenberg' via tesseract-ocr < tesseract-ocr@googlegroups.com> wrote: > Has anyone had any luck installing Tesseract 5 on Linux? It doesn’t seem > to be available in any of

Re: [tesseract-ocr] Ubuntu : Unable to locate package libleptonica-dev

2022-03-31 Thread Shree Devi Kumar
ionic main" and paste it as shown below on the next line. If you are using a different release of ubuntu, then replace bionic with the respective release name. deb http://archive.ubuntu.com/ubuntu bionic universe On Fri, Apr 1, 2022, 11:49 Shree Devi Kumar wrote: > https://packages.ubu
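The sources entry quoted above changes only in its release name. A minimal sketch of that substitution follows; the function name is hypothetical, and the archive URL and `universe` component are taken from the message itself.

```python
def universe_source_line(release: str) -> str:
    """Build the apt sources entry from the message for a given Ubuntu release.

    `release` is the release code name (for example "bionic" or "focal").
    This helper is illustrative only and is not part of any Ubuntu tool.
    """
    release = release.strip().lower()
    if not release.isalpha():
        raise ValueError("release should be a code name such as 'bionic'")
    return f"deb http://archive.ubuntu.com/ubuntu {release} universe"


if __name__ == "__main__":
    print(universe_source_line("bionic"))
    print(universe_source_line("focal"))
```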

Re: [tesseract-ocr] Ubuntu : Unable to locate package libleptonica-dev

2022-03-31 Thread Shree Devi Kumar
https://packages.ubuntu.com/focal/libleptonica-dev On Fri, Apr 1, 2022, 11:07 polki paul wrote: > Hello, > > how to install libleptonica-dev on Ubuntu 20.04 ? > > > *sudo apt-get update sudo apt-get install libleptonica-dev* > > > > > *Reading package lists... Done Building dependency tree Reading

Re: [tesseract-ocr] Tesseract 4.0 - Multiline text

2022-03-23 Thread Shree Devi Kumar
Use the hocr option. On Thu, Mar 24, 2022, 10:52 Muraliraj DK wrote: > I am not sure if you have looked at the image. What i meant on Multi line > text is when the sentence is wrapped to next line i would like to extract > as single sentence instead of 2 lines (paragraph). > > Single line is - s

Re: [tesseract-ocr] compile tessract 5.0 in win10

2022-01-01 Thread Shree Devi Kumar
I have also posted in vcpkg repo for them to update the official package to 5.0.0. https://github.com/microsoft/vcpkg/issues/16019 On Sat, Jan 1, 2022, 17:20 Shree Devi Kumar wrote: > You can download windows binaries from > https://github.com/UB-Mannheim/tesseract/wiki > > > >

Re: [tesseract-ocr] compile tessract 5.0 in win10

2022-01-01 Thread Shree Devi Kumar
You can download windows binaries from https://github.com/UB-Mannheim/tesseract/wiki On Sat, Jan 1, 2022, 16:54 杜德銘 wrote: > the original : > > vcpkg install tesseract:x64-windows for 64-bit. Use –head for the master > branch. > > is not 5.0, is 4.1. > > can update this command? > > reply by

Re: [tesseract-ocr] training scripts in 5.0.0

2021-12-04 Thread Shree Devi Kumar
Please see the tesstrain repo. Python version of tesstrain.sh etc have been moved there. On Sat, Dec 4, 2021, 22:37 Marco Atzeri wrote: > Hi, > > I am updating the cygwin package from 4.1.1 to 5.0.0 > and I noticed that 3 scripts > >language-specific.sh >tesstrain.sh >tesstrain_utils

Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-21 Thread Shree Devi Kumar
Also see the Technical Information section in https://tesseract-ocr.github.io/tessdoc/ On Mon, Nov 22, 2021, 01:36 Peter Geraghty wrote: > Thank you!!! will do! > > On Sunday, November 21, 2021 at 12:51:51 AM UTC-6 shree wrote: > >> Please see https://github.com/tesseract-ocr/tesstrain/wiki for

Re: [tesseract-ocr] Re: Using Tesseract for Handwriting..

2021-11-20 Thread Shree Devi Kumar
Please see https://github.com/tesseract-ocr/tesstrain/wiki for detailed examples of tesseract training for handwritten texts. On Sat, Nov 20, 2021 at 11:53 AM Peter Geraghty wrote: > sorry, by word recognition, I meant word and character localization. > > On Friday, November 19, 2021 at 11:04:38

Re: [tesseract-ocr] Doubt about using 5.0.0-beta-20210916 before release version is available

2021-10-19 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/3331#issuecomment-946532564 On Tue, Oct 19, 2021, 16:26 juan carlos hernández < juan.carlos.h.c.valen...@gmail.com> wrote: > Hello > I'm working in a project that needs OCR and we have choosed to use > Tesseract. We would like to use v5.0.0, b

Re: [tesseract-ocr] What are Langdata repository given for retraining Tesseract

2021-04-15 Thread Shree Devi Kumar
Use langdata_lstm repo for LSTM training. That has larger training text. On Thu, Apr 15, 2021, 00:52 Venkatapathy S wrote: > Hi, > I want to retrain Tesseract from the scratch for a particular language(I > have read as many resources as possible, including warnings, from the > Tutorial

Re: [tesseract-ocr] What do iteration numbers mean in the train logging?

2021-04-14 Thread Shree Devi Kumar
https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#iterations-and-checkpoints Epoch size depends on your training data. If you have 1000 lines of training data, then 1 epoch is 1000 iterations. If you have 5 lines of training text, 1 epoch is 5 iterations. On Wed,
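The epoch arithmetic in the reply above can be sketched in a few lines of Python. The function name is hypothetical, and the code assumes, as the reply states, that one iteration processes one line of training data.

```python
def iterations_for_epochs(num_training_lines: int, epochs: int = 1) -> int:
    """Return the number of iterations that cover `epochs` passes over the data.

    Assumes one iteration processes exactly one training line, as described
    in the reply above. Illustrative helper, not part of tesseract.
    """
    if num_training_lines < 0 or epochs < 0:
        raise ValueError("counts must be non-negative")
    return num_training_lines * epochs


if __name__ == "__main__":
    print(iterations_for_epochs(1000))  # 1000 lines, 1 epoch
    print(iterations_for_epochs(5))     # 5 lines, 1 epoch
```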

Re: [tesseract-ocr] What is Max Iterations & Epochs in tesstrain Makefile

2021-04-14 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#lstmtraining-command-line Epoch has been recently added to the tesstrain makefile and converts to number of iterations based on amount of training data. On Wed, Apr 14, 2021, 01:36 GCP COGNEXT wrote: > Hi All, > > I w

Re: [tesseract-ocr] Unable to understand Iterations?

2021-04-14 Thread Shree Devi Kumar
It has seen only 600 lines of data of which only 300 have been used for learning. Iterations are different from an epoch which is going through all training data. On Wed, Apr 14, 2021, 01:36 GCP COGNEXT wrote: > What does *At Iteration 300/600/600.* > > Let's assume I have 10k data and I want
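The three numbers in a log line such as `At iteration 300/600/600` can be pulled apart with a short script. The field names `learning`, `training` and `sample` are an assumption based on how the reply describes the counters (lines used for learning versus lines seen); the parser itself is a hypothetical helper, not a tesseract API.

```python
import re
from typing import NamedTuple, Optional


class IterationCounts(NamedTuple):
    learning: int  # lines that were actually used for learning
    training: int  # lines processed so far
    sample: int    # position in the stream of training samples


_PATTERN = re.compile(r"At iteration (\d+)/(\d+)/(\d+)")


def parse_iteration_line(line: str) -> Optional[IterationCounts]:
    """Extract the three counters from an lstmtraining log line, if present."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return IterationCounts(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    print(parse_iteration_line("At iteration 300/600/600, Mean rms=0.144%"))
```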

Re: [tesseract-ocr] Lstm training parameters

2021-04-04 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_best.md On Mon, Apr 5, 2021, 00:52 Adriana Camilleri wrote: > By any chance, is there any information out there about the training > p

Re: [tesseract-ocr] tesseract WIndows 10 Newbie here

2021-04-01 Thread Shree Devi Kumar
Check that the tesseract directory is added to Path so that it can be found. On Fri, Apr 2, 2021, 11:03 Gianfranco Dy wrote: > One of the requirements of alimranahmed/LaraOCR is that the Tesseract > command is accessible > > [image: Capture.PNG] > after installing tesseract-ocr-w64-setup-v5.0.

Re: [tesseract-ocr] tesseract failing on extremely simple example

2021-03-30 Thread Shree Devi Kumar
k you so much! > > What hyperparameters did you use for training? number of pages? epochs? > > Which model did you start with? your file seems smaller than other > eng.traineddata files. > > Thanks, > ~Marvin > > On Sun, Mar 28, 2021 at 10:16 AM Shree Devi Kumar > wrote

Re: [tesseract-ocr] tesseract failing on extremely simple example

2021-03-27 Thread Shree Devi Kumar
Do you have the font used in the sample? Do you only need to recognise numbers in it? On Sat, Mar 27, 2021, 16:10 Marvin Thielk wrote: > I've tried a variety of pre-processing attempts and different configs, but > this feels like it should be an easy detection task. > > I've tried with several d

Re: [tesseract-ocr] pytesseract having high accuracy but performing very very slow

2021-03-25 Thread Shree Devi Kumar
Try with newer version of tesseract. On Thu, Mar 25, 2021, 13:19 Vidya Chitragar < vidya.chitra...@lucidatechnologies.com> wrote: > Hi Every one. > I am using pytesseract with tesseract-ocr version 3.05.02 for conversion > of scanned pdf document of 1000k pages to searchable pdf document but my >

Re: [tesseract-ocr] Installing tesseract 5 via vcpkg

2021-03-24 Thread Shree Devi Kumar
Yes, -head doesn't work with vcpkg. You can install the dependencies via vcpkg and then build tesseract. See https://github.com/tesseract-ocr/tesseract/actions/runs/681261367/workflow for the steps On Wed, Mar 24, 2021, 19:44 Fábio Ramos wrote: > Hello, I've tried using vcpkg to install tesse

Re: [tesseract-ocr] downgrade to last tessract alpha version tesseract 5.0.0-alpha-20201231-246-gfe61

2021-03-23 Thread Shree Devi Kumar
@AlexanderP/tesseract-debian Is there a way to use older ppa versions? On Tue, Mar 23, 2021, 13:46 Kumar Rajwani wrote: > The latest push is working fine but when image is blury or have some noise > it can't able to pass the image. it shows Detected 12 diacritics . The > previous version was w

Re: [tesseract-ocr] downgrade to last tessract alpha version tesseract 5.0.0-alpha-20201231-246-gfe61

2021-03-23 Thread Shree Devi Kumar
Please report as issue in tesseract repo. On Tue, Mar 23, 2021, 13:46 Kumar Rajwani wrote: > The latest push is working fine but when image is blury or have some noise > it can't able to pass the image. it shows Detected 12 diacritics . The > previous version was working fine with my images. > >

Re: [tesseract-ocr] Properly Insert OCR Into Separate Columns

2021-03-21 Thread Shree Devi Kumar
Please see the newly added table detector to the master branch https://github.com/tesseract-ocr/tesseract/pull/3330 On Mon, Mar 22, 2021, 10:53 Daniel Lu wrote: > Hi, > > I am trying to read hundreds of pages of information like the picture > below into a CSV file. For us humans, it is very cle

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-20 Thread Shree Devi Kumar
Yes, finetuning can be done. Please see https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#tutorial-guide-to-lstmtraining If you already have scanned images and their box files you can also try makefile based training using the tesstrain repo. On Fri, Mar 19, 2021 at 2:31 PM avi

Re: [tesseract-ocr] Training Tessearct for custom data --Urgent Help Required

2021-03-13 Thread Shree Devi Kumar
You have not stated the version of tesseract that you are using. >We downloaded some online training data available for the language Malayalam You have not mentioned from where you got it. Are these the official traineddata files? >we found that few special characters in the language are not pic

Re: [tesseract-ocr] Re: To make traineddata file non-traineable

2021-02-24 Thread Shree Devi Kumar
Yes. Usage for compacting LSTM component to int: combine_tessdata -c traineddata_file On Wed, Feb 24, 2021 at 10:56 PM Jennil Thiyam wrote: > HI shree, so by running this command, the model will be in its > integer/fast version? > > On Wed, Feb 24, 2021 at 10:27 AM shree wrote: > >> You ca

Re: [tesseract-ocr] Tesseract output text and symbol

2021-02-18 Thread Shree Devi Kumar
See https://tesseract-ocr.github.io/tessdoc/FAQ.html#what-page-separators-are-used-in-txt-output-by-tesseract-400 On Thu, Feb 18, 2021, 12:15 J Cassar wrote: > Good Day, > > I've used tesseract on a number of jpeg images ( see input image attached) > and it works fine as it outputs the text. How

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
Training won't fix that. See https://www.pyimagesearch.com/2020/09/07/ocr-a-document-form-or-invoice-with-tesseract-opencv-and-python/ https://stackoverflow.com/questions/61265666/how-to-extract-data-from-invoices-in-tabular-format On Fri, Feb 5, 2021 at 6:14 PM Kumar Rajwani wrote: > i have t

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
I see the tabular image that you shared. I don't think training is going to help you in this. eng.traineddata should be able to recognize it quite well. You should select the different areas of interest and just OCR those sections. On Fri, Feb 5, 2021 at 5:33 PM Kumar Rajwani wrote: > i have tr

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
On Fri, Feb 5, 2021 at 4:44 PM Kumar Rajwani wrote: > hi, > i have tried minus 1 and got following result > Iteration 0: GROUND TRUTH : ) @® > Iteration 0: BEST OCR TEXT : Yo > File eng.arial.exp0.lstmf line 0 : > > What's your version of tesseract? What o/s? > Without your files, it's diffic

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
Have you tried with value -1?? minus 1 On Fri, Feb 5, 2021 at 4:37 PM Kumar Rajwani wrote: > hi, > i have tried that it's shows following output > Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar > ./ScrollView.jar & wait" > ScrollView: Waiting for server... > Error: Unable t

Re: [tesseract-ocr] not training on image after loading data

2021-02-05 Thread Shree Devi Kumar
Add the following to your lstmtraining command and see. --debug_interval -1 On Fri, Feb 5, 2021 at 4:05 PM Kumar Rajwani wrote: > HI, > i am trying to finetune eng.traindata as per my images i have tried to > train but all time i am stuck somewhere can you tell me how can i procced > further.

Re: [tesseract-ocr] Training tesseract, APPLY_BOXES: ... FAILURE! Couldn't find a matching blob for BENGALI language.

2021-01-28 Thread Shree Devi Kumar
For Bengali, you need to train the LSTM model. Legacy model training won't work. On Thu, Jan 28, 2021, 22:32 Boring Guy69 wrote: > > Hello i am new to tesseract. i am working on bengali language [kalpurush > font]. > I got lots of error when i make TR files. if i describe my work flow > At first

Re: {EXTERNAL}[tesseract-ocr] Installing tessdata

2021-01-27 Thread Shree Devi Kumar
Please see https://tesseract-ocr.github.io/tessdoc/Data-Files.html Also the readme files in the three repos https://github.com/tesseract-ocr/tessdata_fast On Thu, Jan 28, 2021, 03:20 Peter Kronenberg wrote: > Hi, can someone help with these questions? Just trying to understand > better how

Re: [tesseract-ocr] New release for tessdata_{fast,best}?

2021-01-27 Thread Shree Devi Kumar
>The Internet Archive has switched to using Tesseract for all our OCR, I am so happy to hear this. It will be great to have the Indic languages that were marked as non-ocrable so far be converted to text correctly on Internet Archive. Is there any page with instructions to do this? Can a language

Re: [tesseract-ocr] Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata

2021-01-19 Thread Shree Devi Kumar
>*wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata * That is not correct. You need to get the `raw` file. https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata *wget https://githu
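The difference between the two URLs is only the `blob` versus `raw` path segment. A minimal sketch of that rewrite follows, assuming the usual `/owner/repo/blob/branch/file` layout of GitHub file links; the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def github_raw_url(url: str) -> str:
    """Turn a GitHub 'blob' page URL into the matching 'raw' file URL.

    Assumes the path looks like /owner/repo/blob/branch/path/to/file.
    URLs that already use 'raw' are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    segments = parts.path.split("/")
    # segments[0] is empty because the path starts with '/'
    if len(segments) > 3 and segments[3] == "blob":
        segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    print(github_raw_url(
        "https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata"
    ))
```

The resulting URL is the one to pass to `wget`; downloading the `blob` URL saves the HTML page rather than the traineddata file.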

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-12 Thread Shree Devi Kumar
Unicharset is extracted from training text, because those are the samples that will be used for training. Why do you want to use a different unicharset? On Tue, Jan 12, 2021, 23:47 Kamui 7 wrote: > > > Great! The PR that you submitted fixed issue #3. All that's left is the > encoding string pr

Re: [tesseract-ocr] make training does nothing when run

2021-01-08 Thread Shree Devi Kumar
w up on any of the (4) Compiling and > Installation pages. There's lots of mention about installing the > dependencies to support training, but no mention about actually installing > it. > > Do you think that's worthy of filing an issue? > > I'm probably not

Re: [tesseract-ocr] make training does nothing when run

2021-01-08 Thread Shree Devi Kumar
>After placing the groundtruth files in a folder called *data/foo-ground-truth* inside the main *tesseract *repo folder, data/foo-ground-truth needs to be under the tesstrain folder not tesseract folder. You can use ground-truth in a different location, in that case you have to refer to it whi

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Are any of these vertical fonts? Encoding errors could be if the characters in training text are not in the unicharset. On Fri, Jan 8, 2021, 00:46 Kamui 7 wrote: > Looks like that fixed bug #1. Now it is able to successfully create 400 > pages. Do you have any ideas as to why the other 2 errors

Re: [tesseract-ocr] Easily readable Russian not recognized in language app screenshot

2021-01-07 Thread Shree Devi Kumar
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract rus.png - -l rus+eng --tessdata-dir ~/tessdata_best D 20:22 Э 5IN AROW 5IN AROW 5IN AROW Translate this sentence Translate this sentence Translate this sentence (0) Вопросы есть? (0) Вопросы есть? Вопросы есть? Апу questions Any questions Any questi

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Your training text file is only 175 lines, so the rendered image fits in 4 pages. You need to use a larger text if you want more pages. Also check that your fonts support both English and Japanese as the text seems to have samples of both languages. On Thu, Jan 7, 2021, 22:40 Kamui 7 wrote: > I

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
Or you may have an old version of data/ben/checkpoints/ben_checkpoint

Re: [tesseract-ocr] Numerous different bugs while training jpn

2021-01-07 Thread Shree Devi Kumar
Old versions of tesstrain.sh used to limit training to 3 pages. Looks like you may have an old version in the path somewhere. On Thu, Jan 7, 2021 at 10:17 PM Kamui 7 wrote: > I have a script to train tesseract and I ran it on Arch Linux, Debian, and > even a docker container and they all produce

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
eed to change >> it to the makefile from ben branch instead? >> >> On Thu, Jan 7, 2021 at 5:26 PM Shree Devi Kumar >> wrote: >> >>> ModuleNotFoundError: No module named 'bidi >>> >>> Install python-bidi >>> >>> On Thu, J

Re: [tesseract-ocr] Tesseract Performance

2021-01-07 Thread Shree Devi Kumar
> `ben` branch in my fork of tesstrain repo. See >> >> https://github.com/Shreeshrii/tesstrain/tree/ben >> and >> >> https://github.com/Shreeshrii/tesstrain/commit/a6474ef2dbbac47803d13b6f92fdcf8c9dc3107b >> >> Results for the validation data (not seen by l

Re: [tesseract-ocr] Tesseract Performance

2021-01-01 Thread Shree Devi Kumar
ri, Jan 1, 2021 at 12:09 PM Soumik Ranjan Dasgupta < > ranjansou...@gmail.com> wrote: > >> Hi Shreeshrii, >> >> Can you please tell me the training command used? Also, how can I create >> the graphs and these other documents? >> >> On Sat, 26 Dec 2020

Re: [tesseract-ocr] Tesseract Performance

2021-01-01 Thread Shree Devi Kumar
Shreeshrii, > > Can you please tell me the training command used? Also, how can I create > the graphs and these other documents? > > On Sat, 26 Dec 2020, 18:37 Shree Devi Kumar, wrote: > >> Soumik, >> >> I used your groundtruth and trained using ben as the START

Re: [tesseract-ocr] Tesseract Performance

2020-12-24 Thread Shree Devi Kumar
>testing an unseen image, the performance was exactly the same. Can you share the image (preferably a page) and expected result? On Thu, Dec 24, 2020 at 8:36 PM Soumik Ranjan Dasgupta < ranjansou...@gmail.com> wrote: > Hi everyone, > I wanted to do fine-tune the ben.traineddata model by using so

Re: [tesseract-ocr] Diacriticals Training

2020-12-14 Thread Shree Devi Kumar
Appreciate your offer to help and provide feedback as well as training data. Let me try to answer your queries: 1. > I have been using san. But was unaware that you can also use Devanagari. What is the difference? san has been trained for Sanskrit. But it is missing certain Devanagari character

Re: [tesseract-ocr] tesseract-ocr for train persian language

2020-12-11 Thread Shree Devi Kumar
I don't think jTessBoxEditor supports RTL languages like Persian. You can try using tesstrain.sh On Fri, Dec 11, 2020 at 8:57 PM alireza m wrote: > hi i want to train persian language by b nazanin font but jTessBoxEditor > doesn't have b nazanin font how can i add new font??? > > -- > You receiv

Re: [tesseract-ocr] Diacriticals Training

2020-11-12 Thread Shree Devi Kumar
On Thu, Nov 12, 2020 at 4:08 PM Shree Devi Kumar wrote: > Please see tesseract-ocr/tesstrain repo > > You need line images and their groun

Re: [tesseract-ocr] Diacriticals Training

2020-11-12 Thread Shree Devi Kumar
> and achieve accuracy too or the alternative ways to do achieve this ? > > Regards, > > On Thu, Nov 5, 2020 at 8:15 PM Shree Devi Kumar > wrote: > >> Legacy engine training won't work for Devanagari. The cube engine which >> was used in tesseract for Hindi has been

Re: [tesseract-ocr] Low tesseract accuracy

2020-11-11 Thread Shree Devi Kumar
Suggest you pre-process images instead of training. See https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html On Tue, Nov 10, 2020 at 12:14 PM Dinesh Yakkanti wrote: > Hello Everyone, >I am trying to build custom tesseract-ocr model. I am getting > high error rate even if i kep

Re: [tesseract-ocr] Diacriticals Training

2020-11-05 Thread Shree Devi Kumar
ratch. ,. shapetable, tr etc are all files for legacy engine, 3.0x and before. It is supported in tesseract4 with --oem 0 On Thu, Nov 5, 2020, 17:14 Shree Devi Kumar wrote: > Are you trying to train for the legacy tesseract engine? > > On Thu, Nov 5, 2020, 16:46 shreyansh dwivedi >

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-11-01 Thread Shree Devi Kumar
Invert the image. Results using tessdata_best/eng - LSTM engine $ tesseract legacy-invert.jpg - --psm 6 063.433 $ tesseract legacy-300.jpg - --psm 6 063.433 $ tesseract legacy-144.jpg - --psm 6 063.433 On Sun, Nov 1, 2020 at 8:37 PM Cailey McVay wrote: > Here is an example of the sample image

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-10-31 Thread Shree Devi Kumar
>When we use tesseract on the images without the trained language we receive outputs that are accurate about 50% of the time. You haven't shared a sample image. Sometimes preprocessing the images, using a whitelist in case of limited character set can be the solution rather than training. On Sun,

Re: [tesseract-ocr] URGENT DEADLINE: NEED HELP WITH NEW LANGUAGE, PLEASE RESPOND

2020-10-31 Thread Shree Devi Kumar
Are you trying to train for the legacy tesseract engine? On Sun, Nov 1, 2020, 03:29 Cailey McVay wrote: > Hello! > I am working on a project that is trying to read borehole video depths. We > trained a new language to read these numbers called NTS. When we use > tesseract on the images without t

Re: [tesseract-ocr] Fwd: Training tesseract OCR

2020-10-30 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesstrain On Sat, Oct 31, 2020 at 9:54 AM bosh sherikar wrote: > Please Reply back > > -- Forwarded message - > From: bosh sherikar > Date: Tue, Oct 13, 2020 at 10:42 PM > Subject: Training tesseract OCR > To: > > > Dear community, > > I ha

Re: [tesseract-ocr] add new characters

2020-10-24 Thread Shree Devi Kumar
Ray has suggested using plus-minus type of training for adding a couple of characters to the traineddata. Did you try that? Please share the training data you used (box/tiff pairs or lstmf files). I have done replace a layer training for Sanskrit. It adds the two characters you want (in addition

Re: [tesseract-ocr] training doubts

2020-10-20 Thread Shree Devi Kumar
For English, most of the time, preprocessing your images and using official traineddata will give better results than trying to do training. For finetuning, ( https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#fine-tuning-for-impact) what is recommended is using the existing trai

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-10-11 Thread Shree Devi Kumar
Tesseract will make a checkpoint, if needed, every 100 iterations, so I suggest a minimum 50-100 line images to test finetuning. Also, one of your image samples has a lot of noise on the right side. Crop all extra parts. Also for `ben` you should choose the Indic language option in tesstrain. On S

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-10-10 Thread Shree Devi Kumar
000.lstmf line 0 (Perfect): >>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0% >>>>>> 2 Percent improvement time=4, best error was 100 @ 0 >>>>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char >>>>>>

Re: [tesseract-ocr] Diacriticals Training

2020-10-08 Thread Shree Devi Kumar
On Mon, Sep 28, 2020 at 12:19 PM Shree Devi Kumar wrote: > I am currently running a training run based on synthetic training data for > Sanskrit to support both Devanagari script with vedic

Re: [tesseract-ocr] PyTesseract not recognizing decimal points

2020-10-06 Thread Shree Devi Kumar
Have you tried cropping the image to remove the arrowhead to see if that improves the result? On Tue, Oct 6, 2020 at 9:42 AM Andrew wrote: > As per my question on StackOverflow: PyTesseract not recognizing decimals >

Re: [tesseract-ocr] Tesseract failing for very clear image

2020-10-05 Thread Shree Devi Kumar
Try to add a little bit of white border to image and see. Try --psm 6 On Mon, Oct 5, 2020, 11:00 Guillaume Bersac wrote: > Hello, > > ### Environment > **Tesseract Version**: > tesseract 4.0.0 > leptonica-1.76.0 > libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : > libtiff 4.

Re: [tesseract-ocr] Diacriticals Training

2020-10-01 Thread Shree Devi Kumar
to train new > characters for the tesseract engine.Procedural approach will make the > things better for understanding. Thank you! Regards > > Shreyansh Dwivedi > > On Mon, Sep 28, 2020 at 12:19 PM Shree Devi Kumar > wrote: > >> I am currently running a training run

Re: [tesseract-ocr] unable to install tesseract-ocr in RHEL

2020-09-29 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/wiki#rhelcentosscientific-linux-fedora-opensuse-packages On Tue, Sep 29, 2020, 12:53 Yeshwant Kumar wrote: > Hi folks, > > We are having a problem in building docker image. > > > > When base image of docker is *ubuntu:focal* we are able to install

Re: [tesseract-ocr] Diacriticals Training

2020-09-27 Thread Shree Devi Kumar
I am currently running a training run based on synthetic training data for Sanskrit to support both Devanagari script with vedic accents as well as iAST (Roman with diacritics support). I will share the traineddata for you and others who are interested to test how well it works with real life image

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-27 Thread Shree Devi Kumar
>>>> Iteration 401: GROUND TRUTH : 696,969 >>>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect): >>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0% >>>>> Iteration 402: GROUND TRUTH : 71,000,000 >>>

Re: [tesseract-ocr] Making a serachable PDF.

2020-09-25 Thread Shree Devi Kumar
Try to use a gui frontend such as gimagereader or ocrmypdf. Tesseract does not take pdf as input. On Fri, Sep 25, 2020, 12:58 Arvind Mahesh wrote: > > Complete programming noob, so please pardon my ignorance. > I really want to convert this PDF into a searchable pdf but I barely > understand any

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-20 Thread Shree Devi Kumar
mf line 0 (Perfect): >>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0% >>> Iteration 405: GROUND TRUTH : 4,500,000 >>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect): >>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), s

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar wrote: > > Each of my PNG files have file names that indicate ground truth, and I > have a little script that generat

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
> Each of my PNG files have file names that indicate ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names. Please review your script. I notice a number of file names ending with -2. The gt.txt files for the same also contain -2 while the image only

Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

2020-09-19 Thread Shree Devi Kumar
Please share your training data so that we can test. Thanks.

Re: [tesseract-ocr] building tesseract for online hosting

2020-09-09 Thread Shree Devi Kumar
Thanks, Alex. I suggest that you also add this to tesseract documentation, tessdoc repo. On Wed, Sep 9, 2020, 23:30 Александр Поздняков wrote: > Hi. > Alternatively, use AppImage (Ubuntu >= 16.04) > 1. Download > >> wget >> https://github.com/AlexanderP/tesseract-appimage/releases/download/v5.0

Re: [tesseract-ocr] bash: training/lstmtraining: No such file or directory during tesstutorial

2020-08-26 Thread Shree Devi Kumar
Did you install the tesseract training tools? Try the following commands: lstmtraining --version; which lstmtraining; text2image --version; lstmeval --version On Tue, Aug 25, 2020 at 1:15 PM Theo M-Z wrote: > I followed the tesstutorial, creating base traineddata, but at this point, > the log file
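
The same check can be scripted. This is a minimal sketch, assuming only that the tools, if installed, are on the PATH; the function name is hypothetical. A value of None means the tool was not found, which would explain a "No such file or directory" error when the tutorial calls it by a relative path.

```python
import shutil

# Tool names taken from the commands suggested in the message.
TRAINING_TOOLS = ["tesseract", "lstmtraining", "text2image", "lstmeval"]


def find_tools(names=TRAINING_TOOLS):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```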

Re: [tesseract-ocr] Tesseract could give me the position of the output characters.

2020-08-23 Thread Shree Devi Kumar
Try tsv or hocr output. Also do a Google search for receipt recognition with tesseract. I have seen a few examples where item names and prices are being recognised. You could try something similar with nutritional information. On Thu, Aug 20, 2020, 21:40 ELIANA MARTINEZ CORTES wrote: > > Hi! I am working on a p
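
As an illustration of what the tsv output offers, here is a minimal sketch that pulls word positions out of it. It assumes the usual TSV column layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 rows are words. The sample text is made up for the example, not real Tesseract output, and the function name is hypothetical.

```python
import csv
import io


def parse_tsv_words(tsv_text):
    """Return (text, left, top, width, height) for every word row in TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; lower levels describe pages, blocks, paragraphs, lines.
        if row["level"] == "5" and text:
            words.append((text, int(row["left"]), int(row["top"]),
                          int(row["width"]), int(row["height"])))
    return words


# Made-up sample in the TSV layout, for illustration only.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tSugar\n"
    "5\t1\t1\t1\t1\t2\t110\t92\t40\t24\t91.0\t12g\n"
)

for word in parse_tsv_words(SAMPLE_TSV):
    print(word)
```

The TSV file itself would come from a call such as tesseract image.png out tsv, which writes out.tsv.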

Re: [tesseract-ocr] Re: Extraction of two different language text from single image using tesseract

2020-08-19 Thread Shree Devi Kumar
For multiple languages the standard invocation is to join the language codes with a + sign, e.g. -l ara+eng or -l eng+jpn. Alternatively, you can also try the script traineddata files, e.g. Devanagari includes eng+hin+san+mar+nep. However, multi-language recognition takes more time and is not perfe
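
A minimal sketch of the + convention, built as an argument list and not executed. The function name and file names are hypothetical; it assumes the traineddata files for the listed languages are installed.

```python
def multilang_command(image, output_base, languages):
    """Build a tesseract command line that combines several languages with '+'."""
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


print(" ".join(multilang_command("mixed.png", "out", ["ara", "eng"])))
# A script model is passed the same way, as a single name.
print(" ".join(multilang_command("mixed.png", "out", ["Devanagari"])))
```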

Re: [tesseract-ocr] Re: Help: lstmtraining not found

2020-08-07 Thread Shree Devi Kumar
The number of iterations for training from scratch needs to be much larger: hundreds of thousands. 5000 is used in the tutorial to give an idea of the training process. You need to train until the error rate is close to 0.01. On Fri, Aug 7, 2020, 14:24 minh...@gmail.com wrote: > Could you also please advise f

Re: [tesseract-ocr] Re: Help: lstmtraining not found

2020-08-06 Thread Shree Devi Kumar
If you have tesseract and all training tools installed, you should be able to use tesseract, lstmtraining, etc. without giving the path. What's the output of: which tesseract; tesseract -v; which lstmtraining; lstmtraining -v On Fri, Aug 7, 2020, 01:13 minh...@gmail.com wrote: > Sorry that I forgot

Re: [tesseract-ocr] building tir.traineddata from scratch

2020-08-04 Thread Shree Devi Kumar
Please see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-tessdata_fast.html Version string:4.00.00alpha:tir:synth20170629 LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1], flags=41, iteration=10498000, sample_iteration=10498000, null_char=267, learning_rat

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-14 Thread Shree Devi Kumar
/tessdata_Arabic_Numbers/blob/master/ara_number.traineddata> >>>> it >>>> is working for number but unable to get date format with slash >>>> and also searched for similar issue here >>>> <https://github.com/tesseract-ocr/tesseract/issues/1193> here >

Re: [tesseract-ocr] How to get linewise/ row-wise output rather than column wise in hOCR output

2020-07-13 Thread Shree Devi Kumar
Use page segmentation mode 6 (--psm 6) instead of the default. On Mon, Jul 13, 2020, 22:05 Deepak Sen wrote: > Hi, > I am using latest tesseract version and getting the hOCR output of a table > where line no of (column2, row1) is not line-1 so what I want is tesseract > first goes through all the ro
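
A minimal sketch of the suggested invocation, built as an argument list and not executed. Mode 6 tells Tesseract to assume a single uniform block of text, which tends to keep table rows together instead of splitting the page into columns. The function name and file names are hypothetical.

```python
def hocr_command(image, output_base, psm=6):
    """Build a tesseract command line that writes hOCR with a given page segmentation mode."""
    # Options come before the config name (hocr), which selects the output format.
    return ["tesseract", image, output_base, "--psm", str(psm), "hocr"]


print(" ".join(hocr_command("table.png", "table")))
```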

Re: [tesseract-ocr] How to exclude some symbols from recognizing?

2020-07-13 Thread Shree Devi Kumar
Search for whitelist / blacklist On Mon, Jul 13, 2020, 17:24 Владимир Калачихин wrote: > Subj > Numbers, for example. > > -- > You received this message because you are subscribed to the Google Groups > "tesseract-ocr" group. > To unsubscribe from this group and stop receiving emails from it, se
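
For reference, a minimal sketch of how the whitelist and blacklist variables are usually passed, built as an argument list and not executed. It assumes the config variables tessedit_char_whitelist and tessedit_char_blacklist; as far as I know the whitelist was not honoured by the LSTM engine in 4.0 and works again from 4.1, so results depend on the version in use. The function name and file names are hypothetical.

```python
def char_filter_command(image, output_base, whitelist=None, blacklist=None):
    """Build a tesseract command line with optional character whitelist/blacklist."""
    cmd = ["tesseract", image, output_base]
    if whitelist:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    if blacklist:
        cmd += ["-c", f"tessedit_char_blacklist={blacklist}"]
    return cmd


# Exclude digits from recognition, as asked in the thread.
print(" ".join(char_filter_command("page.png", "out", blacklist="0123456789")))
```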

Re: [tesseract-ocr] Looking for segmentation algorithm implementations and (G)UIs

2020-07-13 Thread Shree Devi Kumar
Good collection of segmentation algorithms. Dan Bloomberg updated the segmentation algorithms in leptonica some time back. You may want to take a look at those too. Tesseract also uses leptonica, but older algorithms, I think. On Sat, Jul 11, 2020 at 9:19 PM Rainer Verteidiger < materialdefen

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
[image: date.jpg] > > > > > On Sunday, July 12, 2020 at 4:27:07 PM UTC+3, shree wrote: >> >> See https://github.com/tesseract-ocr/tesseract/issues/758 and other >> similar issues >> >> On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar >> wrote:

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/758 and other similar issues On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar wrote: > @Eliyaz What version of tesseract are you using? Which traineddata? > > >Always the letter "لا" is predicted as "ال" .

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
@Eliyaz What version of tesseract are you using? Which traineddata? >Always the letter "لا" is predicted as "ال" . I think this was fixed by Ray Smith in 2017 and should be ok in the traineddata files in tessdata_fast and tessdata_best repos. On Sun, Jul 12, 2020 at 6:45 PM Rainer Verteidiger <
